Hi
I would like to extract partial information from HTML pages, so I started to play with and fix HTMLParser.

Does anybody know of alternatives? Second, has anybody developed a library to query HTML structure, as with the XMLSupport package?

Stef
I recently got the following very detailed response on the Squeak list - http://forum.world.st/HTML-parser-again-again-td3018595.html
In addition to what's discussed there, I've been successfully using PetitParser with, or in lieu of, HTML parsers for partial parsing.
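As a rough illustration of what partial parsing with PetitParser can look like, here is a minimal sketch. It assumes a PetitParser version that provides `starLazy:` and `matchesIn:`; the choice of the `<title>` element and the sample input are only for illustration and are not taken from the thread.

```smalltalk
| title titles |
"Match <title>...</title> and answer only the text between the tags."
title := ('<title>' asParser ,
    (#any asParser starLazy: '</title>' asParser) flatten ,
    '</title>' asParser) ==> [:nodes | nodes second].

"matchesIn: searches the whole input and collects every match,
so the rest of the page does not need to be well-formed."
titles := title matchesIn: '<html><head><title>Hello</title></head><body>...</body></html>'.
```

The point of the design is that the grammar only describes the fragment of interest, which is why it can cope with pages a strict XML parser rejects.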
Cheers,
Sean
In reply to this post by Stéphane Ducasse
Stef,
Are the HTML pages produced under your control? I ask because a lot of HTML on the net is potentially malformed, which is reportedly a big source of trouble for creators of web browsers.

Bill

________________________________________
From: [hidden email] [[hidden email]] On Behalf Of Stéphane Ducasse [[hidden email]]
Sent: Saturday, December 04, 2010 3:48 PM
To: [hidden email] Development
Subject: [Pharo-project] Parsing and navigating Html

Hi

I would like to extract partial information from HTML pages.
So I started to play with and fix HTMLParser.

Does anybody know of alternatives?

Second, has anybody developed a library to query HTML structure, as with the XMLSupport package?

Stef
In reply to this post by Sean P. DeNigris
SeasideTesting uses the XMLParser for parsing HTML files and has support for querying them.
It's not complete, but it covers quite a lot of the cases needed in testing, obviously.

On 04 Dec 2010, at 22:15, Sean P. DeNigris wrote:

> I recently got the following very detailed response on the Squeak list -
> http://forum.world.st/HTML-parser-again-again-td3018595.html
>
> In addition to what's discussed there, I've been successfully using
> PetitParser with/in lieu of HTML parsers for partial parsing.
>
> Cheers,
> Sean
> --
> View this message in context: http://forum.world.st/Parsing-and-navigating-Html-tp3072743p3072777.html
> Sent from the Pharo Smalltalk mailing list archive at Nabble.com.
In reply to this post by Sean P. DeNigris
Thanks.
- I looked at HtmlParser. I would like to have a visitor so that all the Morphic rendering code is removed from the nodes.
- I was planning to look at the CSS parser/validator.
- I will look at SOUP and webrobots; thanks for the pointers.

Stef
In reply to this post by Schwab,Wilhelm K
No, just some applications I do not control. :)

On Dec 5, 2010, at 3:21 AM, Schwab,Wilhelm K wrote:

> Stef,
>
> Are the HTML pages produced under your control? I ask because a lot of HTML on the net is potentially malformed, which is reportedly a big source of trouble for creators of web browsers.
>
> Bill
In reply to this post by Johan Brichau-2
On Dec 5, 2010, at 12:13 PM, Johan Brichau wrote:

> SeasideTesting uses the XMLParser for parsing HTML files and has support for querying them.
> It's not complete, but it covers quite a lot of the cases needed in testing, obviously.

Good to know. I tried to parse some pages with XMLParser before trying HTMLParser, and it did not succeed.

Stef
In reply to this post by Stéphane Ducasse
The Soup package on SqueakSource.com works OK.
In reply to this post by Stéphane Ducasse
Stef,
Do you at least have a list of the specific applications producing the data? Do you know and trust the people running them? That might be a nice position compared to data mining the net (like Google), where you have no idea what might arrive or how well it might comply with standards, all the time having to worry about malicious intent. For example, I do not recommend putting untrusted data through a Microsoft parsing library ~:0

Bill

________________________________________
From: [hidden email] [[hidden email]] On Behalf Of Stéphane Ducasse [[hidden email]]
Sent: Sunday, December 05, 2010 10:21 AM
To: [hidden email]
Subject: Re: [Pharo-project] Parsing and navigating Html

No, just some applications I do not control. :)
In reply to this post by Paul DeBruicker
Thanks Paul, I will have a look now.

Stef
In reply to this post by Paul DeBruicker
Hi Paul

There is not a single class comment :( Do you have an example? Two tests are red for me; is that normal? I'm sad, because this is really not how we will attract people to Smalltalk.

Stef

On Dec 5, 2010, at 5:19 PM, Paul DeBruicker wrote:

> The Soup package on SqueakSource.com works OK.
In reply to this post by Paul DeBruicker
Hi Paul

I started to write some comments and more tests, as well as a small piece of documentation, maybe to be turned into a Help topic later.

Now a question: I tried to write a comment for string, and I was wondering why it does not return '' instead of nil. That would avoid having isNil checks everywhere. I did not touch that code for now.

string
    ^ (children size = 1 and: [(children at: 1) isString])
        ifTrue: [(children at: 1) contents]
        ifFalse: [nil]

testString
    | soup |
    soup := Soup fromString: '<b>foo</b>'.
    self assert: (soup b string = 'foo').
    soup := Soup fromString: '<b>f<i>e</i>o</b>'.
    self assert: (soup b string) isNil

Stef
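Until the method itself changes, one way to avoid scattering isNil checks is to substitute a default at the call site. This is only a sketch of a caller-side workaround. It uses the `Soup fromString:`, `b` and `string` messages from the test above plus the standard `ifNil:` message; the variable names are made up for illustration.

```smalltalk
| soup label |
soup := Soup fromString: '<b>f<i>e</i>o</b>'.

"string answers nil here because <b> has more than one child,
so fall back to an empty string instead of testing isNil later."
label := soup b string ifNil: [''].
```

This keeps the library untouched, at the cost of repeating the fallback wherever `string` is used.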
Stéphane Ducasse wrote
> Now a question: I tried to write a comment for string, and I was wondering why not
> returning '' instead of nil in string.

Ten years after your question, I stumbled upon a SO answer [1] that seems to explain this behavior. Apparently `string` returns a string-like object *that retains tree navigation ability* (e.g. sibling, parent), whereas `text` returns a bare string. From this perspective, `nil` seems appropriate to indicate that such a navigable object was not available.

That said, it certainly can make the library harder to deal with. What about a Null Object pattern that returns an object polymorphic with SoupString and that implements appropriate no-ops?

1. https://stackoverflow.com/a/25328374
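To make the Null Object idea concrete, here is a rough sketch. Everything in it is an assumption for illustration: `SoupNullString` is a hypothetical class that is not part of the Soup package, the `parent` selector stands in for whatever navigation protocol SoupString really has, and the variant of `string` answers the child node itself rather than its contents, which differs from the current implementation.

```smalltalk
"Hypothetical class, not part of the Soup package."
Object subclass: #SoupNullString
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Soup-Sketch'.

"Methods are shown in the usual Class >> selector notation."

SoupNullString >> contents
    "Answer an empty string so that callers need no nil test."
    ^ ''

SoupNullString >> isString
    ^ true

SoupNullString >> parent
    "Navigation is a no-op: answer the null object itself."
    ^ self

"Hypothetical variant of the string method discussed in this thread:
answer the navigable child node, or the null object when there is none."
string
    ^ (children size = 1 and: [(children at: 1) isString])
        ifTrue: [children at: 1]
        ifFalse: [SoupNullString new]
```

With something like this, `soup b string contents` would answer '' for mixed content instead of raising an error on nil, while still signalling through the null object's class that no real text node was found.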
Cheers,
Sean