Parsing and navigating Html

Stéphane Ducasse
Hi

I would like to extract partial information from HTML pages.
So I started to play with HTMLParser and fix it.

First, does anybody know of alternatives?

Second, has anybody developed a library to query HTML structure, as with the XMLSupport package?

Stef

Re: Parsing and navigating Html

Sean P. DeNigris
Administrator
I recently got the following very detailed response on the Squeak list - http://forum.world.st/HTML-parser-again-again-td3018595.html

In addition to what's discussed there, I've been successfully using PetitParser with/in lieu of HTML parsers for partial parsing.

Cheers,
Sean

Re: Parsing and navigating Html

Schwab,Wilhelm K
In reply to this post by Stéphane Ducasse
Stef,

Are the HTML pages produced under your control? I ask because a lot of HTML on the net is potentially malformed, which is reportedly a big source of trouble for creators of web browsers.

Bill


________________________________________
From: [hidden email] [[hidden email]] On Behalf Of Stéphane Ducasse [[hidden email]]
Sent: Saturday, December 04, 2010 3:48 PM
To: [hidden email] Development
Subject: [Pharo-project] Parsing and navigating Html

Hi

I would like extract partial information from HTML pages.
So I started to play and fix HTMLParser.

        Now does anybody know alternatives?

Second, does anybody develop a lib to query HTML structure as with the XMLSupport package?

Stef


Re: Parsing and navigating Html

Johan Brichau-2
In reply to this post by Sean P. DeNigris
SeasideTesting uses the XMLParser for parsing HTML files and has support for querying them.
It's not complete, but it covers quite a lot of the cases needed in testing.

On 04 Dec 2010, at 22:15, Sean P. DeNigris wrote:

>
> I recently got the following very detailed response on the Squeak list -
> http://forum.world.st/HTML-parser-again-again-td3018595.html
>
> In addition to what's discussed there, I've been successfully using
> PetitParser with/in lieu of HTML parsers for partial parsing.
>
> Cheers,
> Sean
> --
> View this message in context: http://forum.world.st/Parsing-and-navigating-Html-tp3072743p3072777.html
> Sent from the Pharo Smalltalk mailing list archive at Nabble.com.
>



Re: Parsing and navigating Html

Stéphane Ducasse
In reply to this post by Sean P. DeNigris
Thanks.

- I looked at HtmlParser. (I would like to have a visitor so that all the Morphic rendering code is removed from the nodes.)

- I was planning to look at CSS/parser validator.

- I will look at SOUP and webrobots

thanks for the pointers.

Stef





Re: Parsing and navigating Html

Stéphane Ducasse
In reply to this post by Schwab,Wilhelm K
No, just some applications I do not control. :)

On Dec 5, 2010, at 3:21 AM, Schwab,Wilhelm K wrote:

> Stef,
>
> Are the html pages produced under your control?  I ask because a lot of html on the net is potentially mal-formed, which is reportedly a big source of trouble for creators of web browsers.  
>
> Bill



Re: Parsing and navigating Html

Stéphane Ducasse
In reply to this post by Johan Brichau-2

On Dec 5, 2010, at 12:13 PM, Johan Brichau wrote:

> SeasideTesting uses the XMLParser for parsing html files and has support for querying them.
> It's not complete but, for example, covers quite a lot of cases needed in testing, obviously

Good to know.
However, I tried to parse some pages with XMLParser before trying HTMLParser, and it did not succeed.

Stef




Re: Parsing and navigating Html

Paul DeBruicker
In reply to this post by Stéphane Ducasse
On 12/05/2010 06:02 AM, [hidden email] wrote:

> From: Stéphane Ducasse<[hidden email]>
> Subject: [Pharo-project] Parsing and navigating Html
> To:"[hidden email] Development"
> <[hidden email]>
> Message-ID:<[hidden email]>
> Content-Type: text/plain; charset=us-ascii
>
> Hi
>
> I would like extract partial information from HTML pages.
> So I started to play and fix HTMLParser.
>
> Now does anybody know alternatives?
>
> Second, does anybody develop a lib to query HTML structure as with the XMLSupport package?
>
> Stef


The Soup package on SqueakSource.com works OK.


Re: Parsing and navigating Html

Schwab,Wilhelm K
In reply to this post by Stéphane Ducasse
Stef,

Do you at least have a list of specific applications producing the data?  Do you know and trust the people running them?  That might be a nice position compared to data mining the net (like Google) where you have no idea what might arrive, how well it might comply with standards, all the time having to worry about malicious intent.  For example, I do not recommend putting untrusted data through a Microsoft parsing library  ~:0

Bill


________________________________________
From: [hidden email] [[hidden email]] On Behalf Of Stéphane Ducasse [[hidden email]]
Sent: Sunday, December 05, 2010 10:21 AM
To: [hidden email]
Subject: Re: [Pharo-project] Parsing and navigating Html

no just some applications I do not control. :)



Re: Parsing and navigating Html

Stéphane Ducasse
In reply to this post by Paul DeBruicker
Thanks Paul, I will have a look now.

Stef



> On 12/05/2010 06:02 AM, [hidden email] wrote:
>> From: Stéphane Ducasse<[hidden email]>
>> Subject: [Pharo-project] Parsing and navigating Html
>> To:"[hidden email] Development"
>> <[hidden email]>
>> Message-ID:<[hidden email]>
>> Content-Type: text/plain; charset=us-ascii
>>
>> Hi
>>
>> I would like extract partial information from HTML pages.
>> So I started to play and fix HTMLParser.
>>
>> Now does anybody know alternatives?
>>
>> Second, does anybody develop a lib to query HTML structure as with the XMLSupport package?
>>
>> Stef
>
>
> The Soup package on SqueakSource.com works OK.
>



Re: Parsing and navigating Html

Stéphane Ducasse
In reply to this post by Paul DeBruicker
Hi Paul

There is not a single class comment :(
Do you have an example?
Two tests are red for me; is that normal?

I'm sad, because this is really not how we will attract people to Smalltalk.

Stef






Re: Parsing and navigating Html

Stéphane Ducasse
In reply to this post by Paul DeBruicker
Hi Paul

I started to write some comments and more tests, as well as a small piece of documentation,
which may be turned into a Help topic later.

Now a question: I tried to write a comment for string, and I was wondering why it does not return '' instead of nil.
That would avoid having isNil checks everywhere. I did not touch that code for now.

string
        ^ (children size = 1 and: [(children at: 1) isString])
                                ifTrue: [(children at: 1) contents]
                                ifFalse: [nil]


testString
        | soup |
        soup := Soup fromString: '<b>foo</b>'.
        self assert: (soup b string = 'foo').
        soup := Soup fromString: '<b>f<i>e</i>o</b>'.
        self assert: (soup b string) isNil
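
As a caller-side illustration of the trade-off raised above (this is not a change to the Soup package): as long as string answers nil for a tag with mixed content, a client that only wants text could substitute a default with ifNil:. The sketch relies only on Soup fromString: and the b / string accessors used in testString; the variable names are made up for the example, and it assumes the Soup package is loaded.

```smalltalk
"Workspace sketch, assuming the Soup package is loaded."
| soup label |
soup := Soup fromString: '<b>f<i>e</i>o</b>'.

"<b> has more than one child here, so #string answers nil (see testString).
 ifNil: swaps in an empty string, so later code needs no isNil check."
label := soup b string ifNil: [ '' ].
label isEmpty
```

This keeps nil as the library's signal for "no single string child" while sparing individual callers the explicit isNil test.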

Stef

Re: Parsing and navigating Html

Sean P. DeNigris
Administrator
Stéphane Ducasse wrote
> Now a question: I tried to write a comment for string, and I was wondering
> why it does not return '' instead of nil.

Ten years after your question, I stumbled upon a SO answer [1] that seems to
explain this behavior. Apparently `string` returns a string-like object
*that retains tree navigation ability* e.g. sibling, parent, where `text`
returns a bare string. From this perspective, `nil` seems appropriate to
indicate that such a navigable object was not available. That said, it
certainly can make the library harder to deal with. What about a Null Object
Pattern that returns an object polymorphic with SoupString and that
implements appropriate no-ops?

1. https://stackoverflow.com/a/25328374
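
A rough sketch of what such a Null Object might look like, purely as an illustration. NullSoupString is a hypothetical class that does not exist in the Soup package, and the selectors it implements (contents, isEmpty, parent) are assumptions about what an object polymorphic with SoupString would need; only contents appears in the code quoted earlier in the thread. Methods are listed in the usual Class >> selector form and would be filed into the class, not evaluated as a script.

```smalltalk
"Hypothetical class, not part of Soup. Depending on the Squeak/Pharo
 version, the class-creation message may take package: instead of category:."
Object subclass: #NullSoupString
	instanceVariableNames: ''
	classVariableNames: ''
	category: 'Soup-NullObjectSketch'

NullSoupString >> contents
	"Answer an empty string instead of nil, so callers can skip isNil checks."
	^ ''

NullSoupString >> isEmpty
	"Let callers detect the 'no single string child' case explicitly."
	^ true

NullSoupString >> parent
	"Assumed navigation selector; a no-op that answers the receiver
	 so that chained navigation does not hit nil."
	^ self
```

With something like this, the ifFalse: branch of string could answer NullSoupString new in place of nil. Whether that fits depends on what string is meant to answer in the success case.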



-----
Cheers,
Sean
--
Sent from: http://forum.world.st/Pharo-Smalltalk-Developers-f1294837.html
