Parsing and navigating Html

Stéphane Ducasse

Parsing and navigating Html

Hi

I would like to extract partial information from HTML pages.
So I started to play with and fix HTMLParser.

Does anybody know of alternatives?

Second, has anybody developed a library to query HTML structure, as one can with the XMLSupport package?

Stef

Sean P. DeNigris

Re: Parsing and navigating Html

I recently got the following very detailed response on the Squeak list - http://forum.world.st/HTML-parser-again-again-td3018595.html

In addition to what's discussed there, I've been successfully using PetitParser with/in lieu of HTML parsers for partial parsing.
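
As a rough illustration of the partial-parsing idea, here is a minimal sketch that pulls only the page title out of a document and ignores everything before it. It assumes PetitParser (version 1) is loaded; the variable names and the sample input are made up for the example, and real-world pages would need more care (attributes, case differences, whitespace).

```smalltalk
| title skipToTitle |

"Parser for the one 'island' of interest: the text between the title tags.
 'negate star flatten' consumes characters up to the closing tag and
 answers them as a single string."
title := ('<title>' asParser
	, '</title>' asParser negate star flatten
	, '</title>' asParser)
		==> [ :nodes | nodes second ].

"Skip everything that is not the start of the title, then parse the title."
skipToTitle := ('<title>' asParser negate star , title)
	==> [ :nodes | nodes last ].

skipToTitle parse: '<html><head><title>Pharo</title></head><body/></html>'
```

The point of the design is that the rest of the page never has to be well-formed HTML, since the parser only describes the fragment being extracted.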

Cheers,
Sean

Schwab,Wilhelm K

Re: Parsing and navigating Html

In reply to this post by Stéphane Ducasse

Stef,

Are the HTML pages produced under your control? I ask because a lot of HTML on the net is potentially malformed, which is reportedly a big source of trouble for creators of web browsers.

Bill

Johan Brichau-2

Re: Parsing and navigating Html

In reply to this post by Sean P. DeNigris

SeasideTesting uses the XMLParser for parsing HTML files and has support for querying them.
It's not complete, but it covers quite a lot of the cases needed in testing.

Stéphane Ducasse

Re: Parsing and navigating Html

In reply to this post by Sean P. DeNigris

Thanks.

- I looked at HtmlParser (I would like to have a visitor so that all the Morphic code for the rendering is removed from the nodes).

- I was planning to look at CSS/parser validator.

- I will look at SOUP and webrobots.

Thanks for the pointers.

Stef

Stéphane Ducasse

Re: Parsing and navigating Html

In reply to this post by Schwab,Wilhelm K

No, just some applications I do not control. :)

On Dec 5, 2010, at 3:21 AM, Schwab,Wilhelm K wrote:

> Stef,
>
> Are the html pages produced under your control? I ask because a lot of html on the net is potentially mal-formed, which is reportedly a big source of trouble for creators of web browsers.
>
> Bill

Stéphane Ducasse

Re: Parsing and navigating Html

In reply to this post by Johan Brichau-2

On Dec 5, 2010, at 12:13 PM, Johan Brichau wrote:

> SeasideTesting uses the XMLParser for parsing html files and has support for querying them.
> It's not complete but, for example, covers quite a lot of cases needed in testing, obviously

Good to know.
However, I tried to parse some pages with XMLParser before trying HTMLParser, and it did not succeed.

Stef

Paul DeBruicker

Re: Parsing and navigating Html

In reply to this post by Stéphane Ducasse

The Soup package on SqueakSource.com works OK.

Schwab,Wilhelm K

Re: Parsing and navigating Html

In reply to this post by Stéphane Ducasse

Stef,

Do you at least have a list of specific applications producing the data? Do you know and trust the people running them? That might be a nice position compared to data mining the net (like Google) where you have no idea what might arrive, how well it might comply with standards, all the time having to worry about malicious intent. For example, I do not recommend putting untrusted data through a Microsoft parsing library ~:0

Bill

Stéphane Ducasse

Re: Parsing and navigating Html

In reply to this post by Paul DeBruicker

Thanks, Paul. I will have a look now.

Stef

Stéphane Ducasse

Re: Parsing and navigating Html

In reply to this post by Paul DeBruicker

Hi Paul

There is not a single class comment :(
Do you have an example?
I got two red tests; is that normal?

I'm sad because this is really not how we will attract people to Smalltalk.

Stef

Stéphane Ducasse

Re: Parsing and navigating Html

In reply to this post by Paul DeBruicker

Hi Paul

I started to write some comments and more tests, as well as a small piece of documentation.
It may be turned into a Help topic later.

Now a question: I tried to write a comment for the method below, and I was wondering why it does not return '' instead of nil.
That would avoid having isNil checks everywhere. I have not touched that code for now.

string
    ^ (children size = 1 and: [(children at: 1) isString])
        ifTrue: [(children at: 1) contents]
        ifFalse: [nil]

testString
    | soup |
    soup := Soup fromString: '<b>foo</b>'.
    self assert: (soup b string = 'foo').
    soup := Soup fromString: '<b>f<i>e</i>o</b>'.
    self assert: (soup b string) isNil
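
For comparison, a minimal sketch of the variant being asked about, in which `string` answers an empty string instead of nil. This is only an illustration of the question, not a change that exists in the Soup package; the second method is a hypothetical caller added to show the effect.

```smalltalk
"Hypothetical variant: answer '' when the tag does not hold exactly
 one string child, so callers never receive nil."
string
	^ (children size = 1 and: [ (children at: 1) isString ])
		ifTrue: [ (children at: 1) contents ]
		ifFalse: [ '' ]

"Hypothetical caller: no isNil guard is needed before sending string messages."
titleOrDefault: aTag
	^ aTag string isEmpty
		ifTrue: [ 'untitled' ]
		ifFalse: [ aTag string ]
```

The trade-off is that a caller can no longer tell "no single string child" apart from "a single, empty string child".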

Stef

Sean P. DeNigris

Re: Parsing and navigating Html

Administrator

Stéphane Ducasse wrote
> Now a question, I tried to write a comment for and I was wondering why not
> returning '' instead of nil in string.

Ten years after your question, I stumbled upon a SO answer [1] that seems to
explain this behavior. Apparently `string` returns a string-like object
*that retains tree navigation ability* e.g. sibling, parent, where `text`
returns a bare string. From this perspective, `nil` seems appropriate to
indicate that such a navigable object was not available. That said, it
certainly can make the library harder to deal with. What about a Null Object
Pattern that returns an object polymorphic with SoupString and that
implements appropriate no-ops?

1. https://stackoverflow.com/a/25328374
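
A minimal sketch of what such a null object could look like. Everything here is an assumption made for illustration: the class name SoupNullString and the package name are invented, and the selectors (`contents`, `parent`, `nextSibling`, `previousSibling`) are assumed to mirror the protocol of SoupString rather than checked against it. Methods are shown in the usual `Class >> selector` notation.

```smalltalk
"Hypothetical null object; not part of the Soup package."
Object subclass: #SoupNullString
	instanceVariableNames: ''
	classVariableNames: ''
	package: 'Soup-Sketch'.

SoupNullString >> contents
	"Behave like an empty string for callers that only want the text."
	^ ''

SoupNullString >> isString
	^ true

SoupNullString >> isEmpty
	^ true

SoupNullString >> parent
	"Navigation is a no-op: keep answering the null object so that
	 chained sends stay safe."
	^ self

SoupNullString >> nextSibling
	^ self

SoupNullString >> previousSibling
	^ self
```

With something like this, `string` could answer `SoupNullString new` in its false branch, and code such as `soup b string contents` would answer '' instead of failing on nil.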

Cheers,
Sean