HTML parser in GST

HTML parser in GST

Andrei Stebakov
I've noticed there are a number of XML parsers in the package. I
wonder if I can use one of them as an HTML parser (similar to Soup,
http://news.squeak.org/2009/01/19/soup-for-squeak/).
Are there any examples of using them that way? The task is simple: retrieve
a web page and parse it, hunting for some tag with a given value.

Thank you,
Andrei

_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk

Re: HTML parser in GST

Holger Freyther
On 06/04/2010 11:56 PM, Andrei Stebakov wrote:
> I've noticed there are a number of XML parsers in the package. I
> wonder if I can use it as an HTML parser (similar to Soup
> http://news.squeak.org/2009/01/19/soup-for-squeak/)
> Are there any examples using it? The task is a simple web page
> retrieval and parsing, hunting for some tag with a value.

Well,
HTML parsers are a funny thing... The best thing to do is to take the
HTML5 parser specification and implement it from scratch. To my
knowledge, it is the first specification that says how to handle
missing tags (e.g. how many elements to close, aka tag priorities).
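
To make the "how many elements to close" question concrete, here is a rough sketch in Python (not GST, and not the HTML5 algorithm itself). It assumes the parser keeps a stack of open element names and, for an end tag, closes everything down to the nearest matching element. The function name elements_to_close is made up for illustration; the real specification adds many more rules (scopes, special elements, formatting elements).

```python
def elements_to_close(open_elements, end_tag):
    """Return how many open elements an end tag closes.

    open_elements lists the currently open element names, outermost first.
    A result of 0 means no matching element is open, so the stray end tag
    would simply be ignored in this simplified model.
    """
    # Walk from the innermost open element outwards.
    for depth, name in enumerate(reversed(open_elements), start=1):
        if name == end_tag:
            return depth
    return 0


# "<div><p><b>text</div>": the </div> also closes the forgotten <b> and <p>.
stack = ["html", "body", "div", "p", "b"]
to_close = elements_to_close(stack, "div")
remaining = stack[:len(stack) - to_close]
print(to_close, remaining)
```

A parser built this way never produces an unbalanced tree, but it will not always agree with what browsers do, which is exactly the gap the HTML5 specification fills.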


Re: HTML parser in GST

Paolo Bonzini-2
On 06/04/2010 06:11 PM, Holger Hans Peter Freyther wrote:
> On 06/04/2010 11:56 PM, Andrei Stebakov wrote:
>> I've noticed there are a number of XML parsers in the package. I
>> wonder if I can use it as an HTML parser (similar to Soup
>> http://news.squeak.org/2009/01/19/soup-for-squeak/) Are there any
>> examples using it? The task is a simple web page retrieval and
>> parsing, hunting for some tag with a value.

If Soup has some kind of SAX interface, it would be easy to use it to
build a DOM and then query that with XPath.
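
As an illustration of that pipeline (SAX-style events, then a DOM, then an XPath query), here is a minimal sketch in Python rather than GST, using only the standard library. html.parser.HTMLParser plays the role of the event source and xml.etree.ElementTree supplies the tree and a limited XPath subset. The class DomBuilder, the helper parse_html and the sample page are made up for this example; the tag recovery is deliberately naive and assumes a single root element.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take an end tag in HTML, so they are closed at once.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomBuilder(HTMLParser):
    """Turn HTMLParser's SAX-like callbacks into an ElementTree DOM."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # names of the elements that are still open

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value); value is None for bare attributes.
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag (or end of a void element): ignore it
        # Close every element down to and including the matching one.
        while self.open_tags:
            current = self.open_tags.pop()
            self.builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # drop text outside the root element
            self.builder.data(data)

    def close(self):
        super().close()  # flush any buffered text
        while self.open_tags:  # close whatever the page left open
            self.builder.end(self.open_tags.pop())
        return self.builder.close()  # the root Element


def parse_html(text):
    parser = DomBuilder()
    parser.feed(text)
    return parser.close()


PAGE = """<html><body>
<p class="intro">Hello<br>world
<ul><li><a href="/a">First</a><li><a href="/b">Second</a></ul>
</body></html>"""

root = parse_html(PAGE)
# ElementTree understands a subset of XPath, enough to hunt for a tag.
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]
print(hrefs)
```

The resulting tree nests elements differently from what a browser would build (for example, the second li ends up inside the first), which is where a parser following the HTML5 specification would do better.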

> Well, HTML parsers are a funny thing... the best thing to do is to
> use the HTML5 parser specification and implement it from scratch, to
> my knowledge it is the first time that there is a specification on
> how to handle missing tags (e.g. how many elements to close, aka tag
> priorities).

Agreed.

Paolo
