HTML parser

NorbertHartl
What is the most current HTML parser to date? My research turned up [1], but all of the discussions were from 2010, so the situation might have changed. Is there anything newer or better? Anything based on PetitParser?

thanks,

Norbert

[1] http://www.squeaksource.com/htmlcssparser.html

Re: HTML parser

fstephany
http://www.squeaksource.com/Soup.html

It is a port of the Beautiful Soup Python library. I don't know if it runs
on the latest Pharo, though...

--
http://tulipemoutarde.be
+32 65 709 131

Re: HTML parser

Sean P. DeNigris
Def works in 1.4... Soup is a must if you may have to deal with ill-formed HTML (i.e. web scraping in general) because it's the only library I know of that handles it robustly. I've used it a lot and it's pretty straightforward.

HTH,
Sean

Re: HTML parser

Göran Krampe
On 01/17/2013 11:38 PM, Sean P. DeNigris wrote:
> fstephany wrote
>> http://www.squeaksource.com/Soup.html
>
> Def works in 1.4... Soup is a must if you may have to deal with ill-formed
> HTML (i.e. web scraping in general) because it's the only library I know of
> that handles it robustly. I've used it a lot and it's pretty
> straightforward.

Not sure if it works in Pharo etc, but this one was very good earlier at
dealing with baaad HTML:

http://www.squeaksource.com/htmlcssparser.html

I used it in a scraper robot successfully.

regards, Göran

Re: HTML parser

NorbertHartl
In reply to this post by Sean P. DeNigris

On 17.01.2013 at 23:38, Sean P. DeNigris <[hidden email]> wrote:

> fstephany wrote
>> http://www.squeaksource.com/Soup.html
>
> Def works in 1.4... Soup is a must if you may have to deal with ill-formed
> HTML (i.e. web scraping in general) because it's the only library I know of
> that handles it robustly. I've used it a lot and it's pretty
> straightforward.
>
OK, thanks for the update. I'm not sure handling ill-formed input is a major requirement, but it is good to have. Do you know whether HTML5 would be treated as ill-formed?
Apart from that, I'm interested in whether some kind of document model is emitted, or what it produces instead. Well, I'll have a look.

thanks,

Norbert

Re: HTML parser

Stéphane Ducasse
Norbert

Soup produces a limited AST containing all the information, but not really a full and nice AST for HTML.

Stef

Re: HTML parser

NorbertHartl


On 19.01.2013 at 09:45, Stéphane Ducasse <[hidden email]> wrote:

> Norbert
>
> Soup produces a limited AST containing all the information, but not really a full and nice AST for HTML.
>
Thanks! If something like an AST is emitted, that will help a lot. I don't need a full HTML AST right now (well, at the moment I think I don't). But Pharo will need one in the mid-term, right?

thanks,

Norbert

Re: HTML parser

Stéphane Ducasse

On Jan 19, 2013, at 11:59 AM, Norbert Hartl wrote:

>
>
> On 19.01.2013 at 09:45, Stéphane Ducasse <[hidden email]> wrote:
>
>> Norbert
>>
>> Soup produces a limited AST containing all the information, but not really a full and nice AST for HTML.
>>
> Thanks! If something like an AST is emitted that will help a lot. I don't need a full HTML AST right now (well, at the moment I think I don't).

With Soup you get an AST and a query system. I used it to scrape Magic cards :)
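
For anyone wondering what "a tree plus a query system" over ill-formed HTML looks like in practice, here is a rough sketch in Python using only the standard library's `html.parser`. It is not Soup's API and not the Pharo code; the `Node`, `TreeBuilder`, `parse` and `find_all` names are made up for illustration, and the recovery rule (ignore stray end tags, let an end tag close whatever is still open beneath it) is a simplifying assumption, not how Soup or HTML5 actually recover.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element in a simplified tree (hypothetical, not Soup's model)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Node objects
        self.text = []      # text chunks found directly inside this element

    def find_all(self, tag):
        """Return every descendant element with the given tag name."""
        found = []
        for child in self.children:
            if child.tag == tag:
                found.append(child)
            found.extend(child.find_all(tag))
        return found


class TreeBuilder(HTMLParser):
    """Builds a Node tree and tolerates unclosed or stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name; if none is
        # open, the end tag is stray and is ignored.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()  # flush any trailing text
    return builder.root


# Ill-formed input: <li>, <b> and <p> are never closed, </i> was never opened.
tree = parse('<ul><li><a href="/a">First</a><li><b>Second</i></ul><p>Done')
links = [a.attrs["href"] for a in tree.find_all("a")]
items = tree.find_all("li")
```

The parser never raises on the broken markup, and `find_all` still reaches the link and both list items. The second `<li>` ends up nested inside the first in this sketch; a real library would apply smarter nesting rules.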

> But pharo will need one in the mid-term, right?

Yes, it would be good.