HTML Parser w. custom nodes

Sean P. DeNigris
Administrator
Now that we have the cool new Catalog Browser, I see we have at least 4 HTML parsing options - cool! Do any of these allow one to inject custom node classes? Kind of like Zinc's converter support, but for individual nodes. When using Soup, I've often thought something like, "Gee, I wish I could have told it that a tr with a certain background color was really a ProjectDescriptionRow!" It seems clunky to have to parse it once and then query it again (usually procedurally) to make domain sense of the data, no?
Cheers,
Sean

Re: HTML Parser w. custom nodes

Peter Kenny
Sean

XMLHTMLParser is a subclass of XMLDOMParser, which allows the specification
of a node factory to provide custom handling of nodes. Depending on what you
want to achieve, this might help.
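
A minimal sketch of what this could look like, assuming the XMLParser package's `XMLPluggableElementFactory` with its `handleElement:withClass:` message is available in the loaded version (worth checking against the class comment). `ProjectDescriptionRow` and the package name `Scraping-Sketch` are hypothetical names used only for illustration.

```smalltalk
"Step 1 (evaluate first): a hypothetical domain-aware element class."
XMLElement subclass: #ProjectDescriptionRow
	instanceVariableNames: ''
	classVariableNames: ''
	package: 'Scraping-Sketch'.

"Step 2 (evaluate separately): ask the parser to build every tr element
 as a ProjectDescriptionRow instead of a plain XMLElement."
| factory document |
factory := XMLPluggableElementFactory new
	handleElement: 'tr' withClass: ProjectDescriptionRow;
	yourself.
document := (XMLHTMLParser on: '<table><tr><td>Pharo</td></tr></table>')
	nodeFactory: factory;
	parseDocument.

"Collect the tr elements; with the factory installed they should be
 ProjectDescriptionRow instances."
document allElementsNamed: 'tr'
```

This maps elements by name only. Selecting rows by an attribute such as the background color would need a more specific rule, so it is worth checking which matching options the factory in your image actually supports.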

Best wishes

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:[hidden email]] On Behalf Of
Sean P. DeNigris
Sent: 12 June 2015 23:45
To: [hidden email]
Subject: [Pharo-users] HTML Parser w. custom nodes

--
View this message in context:
http://forum.world.st/HTML-Parser-w-custom-nodes-tp4832169.html
Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.



Re: HTML Parser w. custom nodes

Sean P. DeNigris
Administrator
Peter Kenny wrote:
> allows the specification of a node factory to provide custom handling of nodes
Sounds like exactly what I had in mind! Although, now that I've played with it, I realize that most of the web scraping use cases that I've encountered seem like they could really benefit from a PetitParser-style tool, where one could specify the output for each rule (e.g. `dataRowRule ==> [ "code translating the node/element into a domain object" ]`), and so parse the markup directly to domain objects...
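
To make the idea concrete, here is a rough sketch using plain PetitParser on raw markup, assuming PetitParser is loaded. It is not an existing scraping tool, and it only handles attribute-free `<tr>`/`<td>` tags with no whitespace between them. The `cell` rule and the Association standing in for a domain object are illustrative choices; only `dataRowRule` comes from the example above.

```smalltalk
| cell dataRowRule |

"A cell: everything between <td> and </td>, answered as a String."
cell := ('<td>' asParser , '</td>' asParser negate star flatten , '</td>' asParser)
	==> [ :nodes | nodes second ].

"A row: one or more cells. The action block is where the markup gets
 translated into a domain object; an Association (first cell -> last cell)
 stands in for a real domain class here."
dataRowRule := ('<tr>' asParser , cell plus , '</tr>' asParser)
	==> [ :nodes | nodes second first -> nodes second last ].

dataRowRule parse: '<tr><td>Pharo</td><td>Smalltalk</td></tr>'
```

Real-world HTML is rarely this regular, so a grammar like this would probably work better over an already parsed and cleaned-up node tree than over raw text.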
Cheers,
Sean