HTML parser (again) (again)

Sean P. DeNigris
Administrator
All the threads in the mailing list seem to die off unresolved.  What are the options available in current Squeak, and what are the differences?

Here's my experience:
1. HTML (Squeaksource) - parser pulled out of Scamper.  Loads in Squeak 4.1, seems awkward, but maybe I don't know how to use it.  Here's a snippet I wrote that worked:
  doc := HtmlParser parse: htmlString.
  thread := doc body subEntities detect: [ :e | e id = 'thread' ].
  subject_header := thread contents detect: [ :e | e id = 'message_heading' ].
  subject := (subject_header contents at: 1) text.
2. HTML & CSS Validating Parser (Squeaksource) - It loads, but I don't have the slightest clue how to use it.  I found references to people using it.  They must be Alan Kay's close relatives, or live in machine world like the Lawnmower Man because I couldn't find a shred of documentation or even one class that looked plausible as a starting point.
3. Soup
  * In 4.1
     - "Installer squeaksource project: 'Soup'; install: 'Soup'" ->"depends on... RxMatcher" warning
     - "Installer squeaksource project: 'Soup'; install: 'Network-Protocols'", then Soup -> "NonBooleanReceiver: proceed for truth" (See [1] for log).

Also, two general points that would put turbo boosters behind the community:
<rant>
1. I know this is totally ungrateful, but please, if you design a library and are nice enough to release it for free to the community, take 10 seconds and at least add an XxxInfo category and class with a simple example in the class comment (if you don't have time for HelpSystem, etc.).  It can be worse than not having a library, to spend hours trying to use one, and never have it work.
2. If you ask questions, and the community rallies behind you on the mailing list, IRC, etc., and you solve the problem - hooray!  Please show your gratitude and pay it forward by sharing the solution on the list, so that others who have the same situation can benefit, and we can all learn without exhausting the time and energy of the gurus having to explain the same thing over and over.
</rant>

Thank you.
Sean

[1] Log: NonBooleanReceiver: proceed for truth.
28 October 2010 9:31:15.328 pm

VM: Mac OS - Smalltalk
Image: Squeak4.1 [latest update: #9957]

SecurityManager state:
Restricted: false
FileAccess: true
SocketAccess: true
Working Dir /Users/sean/Squeak/Fresh Images/Squeak4.1
Trusted Dir /foobar/tooBar/forSqueak/bogus
Untrusted Dir /Users/sean/Library/Preferences/Squeak/Internet/My Squeak

HTTPSocket(Object)>>mustBeBooleanIn:
        Receiver: a HTTPSocket[connected]
        Arguments and temporary variables:
                context: [] in HTTPSocket class>>httpGetDocument:args:accept:request:
                proceedValue: nil
        Receiver's instance variables:
                semaphore: a Semaphore()
                socketHandle: #[183 120 202 76 0 0 0 0 144 223 51 0]
                readSemaphore: a Semaphore()
                writeSemaphore: a Semaphore()
                primitiveOnlySupportsOneSemaphore: false
                headerTokens: nil
                headers: nil
                responseCode: nil

HTTPSocket(Object)>>mustBeBoolean
        Receiver: a HTTPSocket[connected]
        Arguments and temporary variables:

        Receiver's instance variables:
                semaphore: a Semaphore()
                socketHandle: #[183 120 202 76 0 0 0 0 144 223 51 0]
                readSemaphore: a Semaphore()
                writeSemaphore: a Semaphore()
                primitiveOnlySupportsOneSemaphore: false
                headerTokens: nil
                headers: nil
                responseCode: nil

[] in HTTPSocket class>>httpGetDocument:args:accept:request:
        Receiver: HTTPSocket
        Arguments and temporary variables:
<<error during printing>
        Receiver's instance variables:
                superclass: Socket
                methodDict: a MethodDictionary(#contentType->(HTTPSocket>>#contentType "a Compi...etc...
                format: 146
                instanceVariables: #('headerTokens' 'headers' 'responseCode')
                organization: ('accessing' contentType contentType: contentsLength: getHeader: ...etc...
                subclasses: nil
                name: #HTTPSocket
                classPool: a Dictionary(#HTTPBlabEmail->'' #HTTPPort->80 #HTTPProxyCredentials-...etc...
                sharedPools: nil
                environment: Smalltalk globals "a SystemDictionary with lots of globals"
                category: #'Network-Protocols'

SmallInteger(Integer)>>timesRepeat:
        Receiver: 3
        Arguments and temporary variables:
                aBlock: [closure] in HTTPSocket class>>httpGetDocument:args:accept:request:
                count: 1
        Receiver's instance variables:
3

--- The full stack ---
HTTPSocket(Object)>>mustBeBooleanIn:
HTTPSocket(Object)>>mustBeBoolean
[] in HTTPSocket class>>httpGetDocument:args:accept:request:
SmallInteger(Integer)>>timesRepeat:
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
HTTPSocket class>>httpGetDocument:args:accept:request:
HTTPSocket class>>httpGet:args:accept:request:
HTTPSocket class>>httpGet:args:user:passwd:
[] in MCHttpRepository>>allFileNames
[] in [] in MCHttpRepository>>displayProgress:during:
BlockClosure>>on:do:
[] in MCHttpRepository>>displayProgress:during:
[] in [] in ProgressInitiationException>>defaultMorphicAction
BlockClosure>>on:do:
[] in ProgressInitiationException>>defaultMorphicAction
BlockClosure>>ensure:
ProgressInitiationException>>defaultMorphicAction
ProgressInitiationException>>defaultAction
UndefinedObject>>handleSignal:
MethodContext(ContextPart)>>handleSignal:
MethodContext(ContextPart)>>handleSignal:
MethodContext(ContextPart)>>handleSignal:
ProgressInitiationException(Exception)>>signal
ProgressInitiationException>>display:at:from:to:during:
ProgressInitiationException class>>display:at:from:to:during:
MorphicUIManager>>displayProgress:at:from:to:during:
MCHttpRepository>>displayProgress:during:
MCHttpRepository>>allFileNames
MCHttpRepository(MCFileBasedRepository)>>allFileNamesOrCache
MCHttpRepository(MCFileBasedRepository)>>readableFileNames
InstallerMonticello>>mcThing
[] in InstallerMonticello>>basicInstall
[] in BlockClosure>>valueSupplyingAnswers:
BlockClosure>>on:do:
BlockClosure>>valueSupplyingAnswers:
BlockClosure>>valueSuppressingMessages:supplyingAnswers:
InstallerMonticello(Installer)>>withAnswersDo:
InstallerMonticello>>basicInstall
[] in InstallerMonticello(Installer)>>installLogging
InstallerMonticello(Installer)>>logErrorDuring:
InstallerMonticello(Installer)>>installLogging
InstallerMonticello(Installer)>>install
InstallerMonticello(Installer)>>install:
UndefinedObject>>DoIt
Compiler>>evaluate:in:to:notifying:ifFail:logged:
[] in SmalltalkEditor(TextEditor)>>evaluateSelection
BlockClosure>>on:do:
SmalltalkEditor(TextEditor)>>evaluateSelection
SmalltalkEditor(TextEditor)>>doIt
SmalltalkEditor(TextEditor)>>doIt:
...etc...
Cheers,
Sean

Re: HTML parser (again) (again)

Peter Kenny
Sean P. DeNigris wrote:
> All the threads in the mailing list seem to die off unresolved.  What are the options available in current Squeak, and what are the differences?
>
> 2. HTML & CSS Validating Parser (Squeaksource) - It loads, but I don't have the slightest clue how to use it.  I found references to people using it.  They must be Alan Kay's close relatives, or live in machine world like the Lawnmower Man because I couldn't find a shred of documentation or even one class that looked plausible as a starting point.
> Sean

I can't tell you about the others, but in my opinion this is a brilliant parser. The starting point is the class HtmlValidator; read the class comment to see how to get started. Maybe the clue is meant to be in the name; this is a *validating* parser.

Just a couple of points:

1. It will work fine if you are loading from the web; it will load any relevant CSS and take it into account. This uses the onUrl: method. It will parse a string of HTML from your system, using the on: method, but this falls over if the HTML is a downloaded web page including a reference to CSS on the web. There may be a way round this, but I haven't found it.

2. It will fail in the current Squeak 4.1, because this version has fouled up the concatenation of strings. I have been arguing with the Squeak maintainers that what they have done is nuts, but they are sticking with their changes. The details are on Mantis issue no. 7564, if you have access to that. There are two possible workarounds:

a. If you have access to Mantis (http://bugs.squeak.org/view_all_bug_page.php), go to the details of issue 7564, download the change set posted by Andreas Raab and file it into Squeak 4.1.

b. Having loaded the parser, find the method HtmlDOMNode>>parseContents: and edit it as follows: find the two occurrences of the expression ('/', Character separators) and change each of them to (Character separators, '/'). I know each of these is valid Smalltalk and they should have the same effect, but in Squeak 4.1 they don't; that's why I say it's nuts.
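
For point 1 above, a minimal workspace sketch of the two entry points might look like the following. The `on:` form is the one quoted later in this thread; the `onUrl:` form is an assumption (class-side, taking a String, answering something that understands `dom`), so check the HtmlValidator class comment before relying on it. The example HTML is made up for illustration.

```smalltalk
| fromWeb fromString |

"Fetch and parse a live page. Assumption: onUrl: is sent to the class
 and accepts the address as a String."
fromWeb := (HtmlValidator onUrl: 'http://www.google.com') dom.

"Parse HTML that is already held in a String."
fromString := (HtmlValidator on: '<html><body><p>Hello</p></body></html>') dom.
```

In Squeak 4.1, either expression would presumably need one of the two workarounds above applied first.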

Your rant may have some validity, but we must be realistic; if you have written a package, you know how to use it, and writing detailed instructions for someone else is a pain. Todd Blanchard wrote this in about 2006, and it has been on Squeaksource ever since. I downloaded it then, adapted it to run in Dolphin Smalltalk, and have used it ever since.

Any questions, ask again.

Peter Kenny

Re: HTML parser (again) (again)

Sean P. DeNigris
Administrator
Peter Kenny wrote:
> It will parse a string of HTML from your system, using the on: method, but this falls over if the HTML is a downloaded web page including a reference to CSS on the web. There may be a way round this, but I haven't found it.

Of course this is what I needed to use it for, lol.

Peter Kenny wrote:
> b. Having loaded the parser, find the method HtmlDOMNode>>parseContents: and edit it as follows: find the two occurrences of the expression ('/', Character separators) and change each of them to (Character separators, '/').

This worked, and I parsed google.com in 4.1. So I can't use it because of the on: limitation with strings, but it's nice to know it's an option in the future.

Peter Kenny wrote:
> if you have written a package, you know how to use it, and writing detailed instructions for someone else is a pain.

That's why I said that if one doesn't have time to do it properly, throw a paragraph into an Info package class comment. Compared to the cost of developing the library, this is negligible. It probably takes longer to create the project on SqS and upload :)

Thanks.
Sean
Cheers,
Sean

Re: HTML parser (again) (again)

Göran Krampe
Hi!

For webscraping etc, take a look at WebRobot at SqueakSource:

http://www.squeaksource.com/WebRobot.html

...I hastily broke out that code from a robot system I wrote for a
client, so no guarantees. But it shows how to use Todd's HTML parser:

http://map.squeak.org/package/e5f9003d-a8ea-47fc-b9a1-adf04a47aefa

...with HTTPClient:

http://map.squeak.org/packagebyname/HTTPClient

..as it involves a WRDummyLoader etc in order to make it work.

regards, Göran

Re: HTML parser (again) (again)

Göran Krampe
...and if anyone wants more details I can take a look in the image.

The robot has been running fine (for lots of concurrent users driving it
from a Seaside "wizard") and does fairly advanced "remoting" with form
POSTs etc, all using stunnel to access an HTTPS secure site.

regards, Göran

Re: HTML parser (again) (again)

Peter Kenny
In reply to this post by Sean P. DeNigris
Sean P. DeNigris wrote:
> Peter Kenny wrote:
>> It will parse a string of HTML from your system, using the on: method, but this falls over if the HTML is a downloaded web page including a reference to CSS on the web. There may be a way round this, but I haven't found it.
> Of course this is what I needed to use it for, lol.
> Sean

Because this is obviously a deal-breaker for you, I have looked a bit more closely at the effect of the on: method. My comments may have been coloured by my earlier experiences with my Dolphin adaptation; I have only recently started experimenting with the parser on Squeak/Pharo, and I have not tried it as widely. I have today experimented on Pharo (I'm sure Squeak would show the same), and I can say that my statement above is too strong. The parser *may* fall over in some circumstances, but it will work in many cases. Specifically, it should work OK if any CSS references in the text are to a full absolute URL (i.e. everything from the http: onward). This makes sense to me; if it is a relative address, the parser would not know the absolute root to base it on.

I think it would be worth your while to try out some of your HTML strings with '(HtmlValidator on: aString) dom' and see if they work. If you are using relative addresses for CSS files, it could be worthwhile editing them to the full version just to get it to parse.
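
As a rough sketch of that editing step, a relative stylesheet reference can be rewritten to an absolute one with a plain string replacement before the text is handed to the parser. The page source and the base address http://www.example.com/ are made up for illustration; substitute the address the page was actually downloaded from. The replacement is deliberately crude: it touches every href that starts with a slash, not only stylesheet links.

```smalltalk
| html fixed dom |

"Hypothetical page source with a relative stylesheet reference."
html := '<html><head><link rel="stylesheet" href="/css/site.css"></head><body><p>Hello</p></body></html>'.

"Turn root-relative hrefs into absolute URLs. The base address is an assumption."
fixed := html copyReplaceAll: 'href="/' with: 'href="http://www.example.com/'.

"Parse the edited string with the expression quoted above."
dom := (HtmlValidator on: fixed) dom.
```

Whether the parse then succeeds still depends on the stylesheet being reachable at the rewritten address.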

BTW, if you think of trying the parser in Pharo, note that you will first have to patch Pharo as suggested in http://code.google.com/p/pharo/issues/detail?id=2797.

As to your point on documentation, I agree that it is useful to give a basic pointer to how to use the package. My Dolphin version of the parser has a package comment which quotes the full description from Squeaksource and adds a pointer to the HtmlValidator class comment. There does not seem to be the same adoption of package comments with Squeak packages; the ones I have seen seem to be just a few words explaining what is changed in this version. Is there any way to give an extended description in a Squeak package?

Hope this helps.

Peter Kenny

Re: HTML parser (again) (again)

Göran Krampe
On 10/30/2010 06:22 PM, Peter Kenny wrote:

> Sean P. DeNigris wrote:
>> Peter Kenny wrote:
>>>
>>> It will parse a string of HTML from your system, using the on: method,
>>> but this falls over if the HTML is a downloaded web page including a
>>> reference to CSS on the web. There may be a way round this, but I haven't
>>> found it.
>>>
>> Of course this is what i needed to use it for, lol.
>>

If it is about preventing loading referenced stuff etc, then my WebRobot
code does that using a DummyLoader class.

regards, Göran


Re: HTML parser (again) (again)

Peter Kenny
Göran Krampe wrote:
> If it is about preventing loading referenced stuff etc, then my WebRobot
> code does that using a DummyLoader class.
>
> regards, Göran

Göran,

Actually, the idea I had was that it would be *necessary* to load the referenced CSS files, in case the parser depended on them to correctly parse the HTML. Your comment implies that this may not be so. Nevertheless, I was interested in finding the conditions in which Sean could use the parser 'as is' for his application. If that does not meet what he needs, he could no doubt look at your code.

Peter Kenny

Re: HTML parser (again) (again)

EntheoWizard
In reply to this post by Göran Krampe
Hello Göran,

I am rekindling my 30+ year old love affair with Smalltalk and I came across your post regarding WebRobot and your offer "...and if anyone wants more details I can take a look in the image" :-)

I'm trying WRExampleRobot and I'm stuck right out of the gate... :-(

I would really appreciate it if you could point me in the right direction to solve this - thanks :-)


I changed the URL in login to www.google.com and then tried:
t := WRExampleRobot new.
t login.

which results in:

BlockClosure(Object)>>doesNotUnderstand: #deferredValue

while trying to process the send message:

send
        "Send the request with no entity (typical for a GET Request).
        This method returns immediately.
        You can either:
                -test with #isReady (and when true access the #responseBody).
                -wait (block) on #waitOnReady or  #waitOnReadyCancelling (and then access #responseBody)"

        "Note: If you are debuging an exception that occurs in the deferred send, you can
        uncomment the 'self halt' to open a debugger on the problem"

        deferredSend :=
                        ["["self sendMessages"] on: Error
                                do:
                                        [:e |
                                       
                                        e]"]
                                        deferredValue
