What is the most current HTML parser to date? My research turned up [1]. All of the discussions were in 2010, so the situation might have changed. Is there anything newer or better? Anything based on PetitParser?
thanks,
Norbert

[1] http://www.squeaksource.com/htmlcssparser.html
http://www.squeaksource.com/Soup.html
It is a port of the Beautiful Soup Python library. I don't know if it runs on the latest Pharo versions though...

-- http://tulipemoutarde.be
Def works in 1.4... Soup is a must if you may have to deal with ill-formed HTML (i.e. web scraping in general), because it's the only library I know of that handles it robustly. I've used it a lot and it's pretty straightforward.

HTH,
Sean
On 01/17/2013 11:38 PM, Sean P. DeNigris wrote:
> fstephany wrote
>> http://www.squeaksource.com/Soup.html
>
> Def works in 1.4... Soup is a must if you may have to deal with ill-formed
> HTML (i.e. web scraping in general) because it's the only library I know of
> that handles it robustly. I've used it a lot and it's pretty
> straightforward.

Not sure if it works in Pharo etc., but this one used to be very good at dealing with baaad HTML:

http://www.squeaksource.com/htmlcssparser.html

I used it successfully in a scraper robot.

regards,
Göran
On 17.01.2013 at 23:38, Sean P. DeNigris <[hidden email]> wrote:
> fstephany wrote
>> http://www.squeaksource.com/Soup.html
>
> Def works in 1.4... Soup is a must if you may have to deal with ill-formed
> HTML (i.e. web scraping in general) because it's the only library I know of
> that handles it robustly. I've used it a lot and it's pretty
> straightforward.

OK, thanks for the update. I'm not sure that handling ill-formed HTML is a major requirement, but it is good to have. Do you know if HTML5 would be treated as ill-formed? Apart from that, I'm interested in whether some kind of document model is emitted, or what it does. Well, I'll have a look.

thanks,
Norbert
Norbert,

Soup produces a limited AST containing all the information, but not really a full and nice AST for HTML.

Stef
On 19.01.2013 at 09:45, Stéphane Ducasse <[hidden email]> wrote:
> Norbert
>
> soup produces a limited ast containing all the information and not really a ful and nice AST for html

Thanks! If something like an AST is emitted, that will help a lot. I don't need a full HTML AST right now (well, at the moment I think I don't). But Pharo will need one in the mid-term, right?

thanks,
Norbert
On Jan 19, 2013, at 11:59 AM, Norbert Hartl wrote:
> Thanks! If something like an AST is emitted that will help a lot. I don't need a full HTML AST right now (well, at the moment I think I don't).

With Soup you get an AST and a query system. I used it to scrape magic cards :)

> But pharo will need one in the mid-term, right?

Yes, it would be good.
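
For readers wondering what "a limited AST and a query system" for ill-formed HTML might look like, here is a minimal sketch in Python using only the standard library's `html.parser`. It is not Soup's API or implementation (Soup is a Smalltalk port of Beautiful Soup); the names `Node`, `TreeBuilder`, `parse`, and `find_all` are hypothetical and exist only for this illustration. The sketch tolerates unclosed and stray tags but does not try to repair nesting, which is roughly what makes the resulting tree "limited".

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical, deliberately small tree node."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""

    def find_all(self, tag):
        """Return every descendant with the given tag name, in document order."""
        found = []
        for child in self.children:
            if child.tag == tag:
                found.append(child)
            found.extend(child.find_all(tag))
        return found


class TreeBuilder(HTMLParser):
    """Builds a Node tree while tolerating unclosed or stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent  # implicitly closes anything left open
        # Otherwise it is a stray end tag and is ignored.

    def handle_data(self, data):
        self.current.text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Ill-formed input: unclosed <li> and <a>, plus a stray </b>.
doc = parse('<ul><li><a href="/a">One<li><a href="/b">Two</a></ul></b>')
links = [a.attrs["href"] for a in doc.find_all("a")]
print(links)
```

The query still finds both links even though the input never closes the first `<li>` or `<a>`. A full HTML5 parser would additionally apply the specification's error-recovery rules to fix the nesting; this sketch leaves the second `<li>` nested inside the first `<a>`.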