Parsing HTML Recommendation


Parsing HTML Recommendation

Kurt Thams
Of the various Squeak parsers that will parse HTML, does anyone have any
recommendations as to what is the best at this point?



Re: Parsing HTML Recommendation

Klaus D. Witzel
The one I use is

- http://www.squeaksource.com/htmlcssparser.html

it interprets the DTDs from W3C and handles bad HTML and CSS.

/Klaus

On Thu, 23 Aug 2007 21:02:12 +0200, Kurt Thams wrote:

> Of the various Squeak parsers that will parse HTML, does anyone have any  
> recommendations as to what is the best at this point?




Re: Parsing HTML Recommendation

Jimmie Houchin-3
In reply to this post by Kurt Thams
Kurt Thams wrote:
> Of the various Squeak parsers that will parse HTML, does anyone have any
> recommendations as to what is the best at this point?

Hello,

I too was going to post about this.

I have been working with both HTML-Parser, written by Julian Fitzell, and the Html+CSS Validating Parser by Todd Blanchard (according to the SqueakMap entries).

I've also been using BeautifulSoup, which is an excellent Python HTML parser.

I would prefer to use Squeak, but BeautifulSoup so far has a few features that have made my job easier - just a few things I haven't quite worked out as easily in either of the Squeak tools. I know someone proficient in Squeak could have done it just as easily.

But these two lines give me the headers of my table's columns:
     itemlist = soup.find('table', id=True)
       # gives me the only table with an ID
     headers = itemlist.findAll('th')
       # gives me the headers of that table

And this parses the table's cells without recursing through the nested tables:
     rows = mytable.findAll('td', recursive=False)
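Since these BeautifulSoup calls come up again later in the thread, here is a self-contained sketch of them (assuming the third-party bs4 package is installed; the HTML string is a made-up miniature stand-in for the SharePoint markup):

```python
# A made-up miniature of the broken markup: one table with an id, whose
# data row holds a plain cell and a cell wrapping a nested table.
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = """
<table id="items">
  <tr><th>Name</th><th>Price</th></tr>
  <tr>
    <td>Widget</td>
    <td><table><tr><td>1</td><td>2</td></tr></table></td>
  </tr>
</table>
<table><tr><td>no id on this one</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

itemlist = soup.find("table", id=True)       # the only table with an id attribute
headers = [th.get_text() for th in itemlist.findAll("th")]

row = itemlist.findAll("tr")[1]              # the data row
direct = row.findAll("td", recursive=False)  # 2 cells: stops at the nested table
all_tds = row.findAll("td")                  # 4 cells: recursion descends into it

print(headers)                    # ['Name', 'Price']
print(len(direct), len(all_tds))  # 2 4
```

Passing `id=True` matches any value of the attribute, and `recursive=False` restricts the search to direct children, which is what keeps the nested tables' cells out of the result.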

The HTML is broken and has hundreds of tables. There are something like
6 nested tables in each of the primary table's rows. This is from an MS
SharePoint website. The markup is awful.

I'm sure there is an easy way in Squeak to do the above, but I haven't
spent enough time to master it.

A problem I've had with both of the above, and one which makes them hard
to use for me, is that they have both popped up modal dialogs which I had
to click on in order to proceed.

They have fairly different APIs.

The HTML-Parser popped up a box for every tag without a closing tag.
The Html+CSS Validator popped up a box, it seemed, when it couldn't
connect to a site. I guess it was attempting to retrieve the CSS in
order to validate?

I would love to know if there is a way to silence the dialogs while
proceeding through the parsing. Yes, I know the markup is awful. Do the
best you can, and it may be good enough for me to do my job. Apparently I
can do that if I click, click, click. But I would just like to just
doIt.

I've been doing the work in Python/BeautifulSoup but do not enjoy the
dead system and would rather be working in a live environment.

Hopefully, I'll get smart enough to use Squeak effectively. :)

Any wisdom on this subject greatly appreciated. I have lots of html
scraping to do. I don't really need or care about validation and a rich
interface for this purpose would be great.

Thanks,

Jimmie


Reply | Threaded
Open this post in threaded view
|

Re: Parsing HTML Recommendation

Matthias Berth-2
There is a way to provide predefined answers to the popup dialogs; the
Installer package does that. See the bottom of:
  http://wiki.squeak.org/squeak/742

Chasing around a bit in the image leads me to:

BlockContext>>valueSuppressingMessages: aListOfStrings supplyingAnswers: aListOfPairs

BlockContext>>valueSupplyingAnswers: aListOfPairs
        "evaluate the block using a list of questions / answers that might be
        called upon to automatically respond to Object>>confirm: or
        FillInTheBlank requests"

So I guess you have a block with whatever you want to do and some
predefined answers:

[ myUrl extractInfo
  "This may trigger some popups"
] valueSuppressingMessages: {'css error*'. 'doctype error'}
supplyingAnswers: {{'input doctype'. 'XHTML 4.1'}}

The first list contains string patterns for messages that will be
answered with OK; the second is a list of pairs of the form (pattern,
answer), where pattern is a string and answer is any object.
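That matching rule can be sketched outside Squeak as well. This hypothetical Python analogue (names invented, not part of any library) uses glob-style patterns via the standard fnmatch module: the suppress list auto-confirms a prompt, and the (pattern, answer) pairs supply a canned reply:

```python
# Hypothetical analogue of valueSuppressingMessages:supplyingAnswers: -
# glob patterns in `suppress` are auto-confirmed; (pattern, answer)
# pairs in `answers` supply a canned reply for matching prompts.
from fnmatch import fnmatch

def auto_answer(prompt, suppress, answers):
    if any(fnmatch(prompt, pat) for pat in suppress):
        return True          # behaves like clicking OK on the dialog
    for pat, answer in answers:
        if fnmatch(prompt, pat):
            return answer    # canned reply for a fill-in-the-blank prompt
    return None              # no match: the dialog would still appear

suppress = ['css error*', 'doctype error']
answers = [('input doctype', 'XHTML 4.1')]

print(auto_answer('css error in line 3', suppress, answers))  # True
print(auto_answer('input doctype', suppress, answers))        # XHTML 4.1
```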

I hope this helps


Matthias

On 8/23/07, Jimmie Houchin <[hidden email]> wrote:

> I would love to know if there is a way to silence the dialogs while
> proceeding through the parsing. Yes, I know the markup is awful. Do the
> best you can and it may be good enough for me to do my job. Apparently I
> can do that if if I click, click, click. But I would just like to just
> doit.
>
> I've been doing the work in Python/BeautifulSoup but do not enjoy the
> dead system and would rather be working in a live environment.
>
> Hopefully, I'll get smart enough to use Squeak effectively. :)
>
> Any wisdom on this subject greatly appreciated. I have lots of html
> scraping to do. I don't really need or care about validation and a rich
> interface for this purpose would be great.
>
> Thanks,
>
> Jimmie


Re: Parsing HTML Recommendation

Jimmie Houchin-3
Matthias Berth wrote:

> There is a way to provide predefined answers to the popup dialogs, the
> Installer package does that, see the bottom of:
>   http://wiki.squeak.org/squeak/742
>
> Chasing around a bit in the image leads me to:
>
> BlockContext>>valueSuppressingMessages: aListOfStrings
> supplyingAnswers: aListOfPairs
>
> BlockContext>>valueSupplyingAnswers: aListOfPairs
> "evaluate the block using a list of questions / answers that might be
> called upon to
> automatically respond to Object>>confirm: or FillInTheBlank requests"
>
> So I guess you have a block with whatever you want to do and some
> predefined answers:
>
> [ myUrl extractInfo
>   "This may trigger some popups"
> ] valueSuppressingMessages: {'css error*'. 'doctype error'}
> supplyingAnswers: {{'input doctype'. 'XHTML 4.1'}}
>
> The first list contains string patterns for messages that will be
> answered with OK, the second is a list of pairs of the form (pattern,
> answer) where pattern is a string and answer is any object.
>
> I hope this helps
>
>
> Matthias

Hello Matthias,

Wow! Thanks for the information. Squeak is full of treasures like this.
Part of the trick of mastering Squeak is learning how to discover all of
these treasures. :)

This will definitely be of tremendous help. I really, really, really
want to use Squeak for as much as I can, but a popup was going to be a
big stumbling block. I was looking in the classes to see if there was
another way to use them that I was missing - something like
doitAllAndAnswerYes. :)

This is great. I can write my own doitAll method - and use it for the
many other times this pops up.

This helps tremendously. Now I can dig into Todd's Html+Css validator
and master it, and if I want some convenience methods to make web
scraping easier, I can write them. :)

This is a great help. Thank you very, very much.

Have a great one.

Jimmie

[snip original message]


Re: Parsing HTML Recommendation

tblanchard
> But these two lines give me the headers of my table's columns.
>      itemlist = soup.find('table', id=True)
>        #gives me the only table with an ID
>      headers = itemlist.findAll('th')
>        #gives me the headers of that table.
>
> and to parse the table rows with recursing through the nested tables.
>      rows = mytable.findAll('td', recursive=False)


In the HTML CSS parser you want to look at tagsNamed:

For instance, dom tagsNamed: 'table' will return a collection of table
nodes that are children of the receiver.

Look at the implementation of that in HtmlDOMNode - it uses a method
called nodesCollect: that takes an arbitrary block and returns all
subnodes for which the block evaluates to true. It is very similar.
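For readers more at home in Python than Smalltalk, the nodesCollect: idea - walk the node tree and keep every subnode for which an arbitrary predicate answers true - can be sketched like this (the Node class and the little tree are invented for illustration):

```python
# Sketch of the nodesCollect: pattern: recursively collect every node
# in a tree for which a caller-supplied predicate returns True.
class Node:
    def __init__(self, tag, attrs=None, children=()):
        self.tag = tag
        self.attrs = attrs or {}
        self.children = list(children)

def nodes_collect(node, pred):
    hits = [node] if pred(node) else []
    for child in node.children:
        hits.extend(nodes_collect(child, pred))
    return hits

dom = Node("html", children=[
    Node("table", {"id": "items"}, [Node("tr", children=[Node("td")])]),
    Node("table"),   # no id
])

tables = nodes_collect(dom, lambda n: n.tag == "table")                       # like tagsNamed: 'table'
with_id = nodes_collect(dom, lambda n: n.tag == "table" and "id" in n.attrs)  # arbitrary predicate

print(len(tables), len(with_id))  # 2 1
```

The second call shows why an arbitrary predicate is more general than a tag-name search: it answers Jimmie's "the one table with an ID" question directly.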

> The html is broken and has hundreds of tables. There are something like
> 6 nested tables in each of the primary tables rows. This is from a MS
> SharePoint website. The markup is awful.

HtmlCSSParser was designed to deal with just such markup (and tries to explain what is wrong with it).

> I'm sure there is an easy way in Squeak to do the above, but I haven't
> spent enough time to master it.
>
> A problem I've had with both of the above and which makes them a problem
> for me, is that they have both popped up modal dialogs which I had to
> click on in order to proceed.
>
> They have fairly different APIs.
>
> The HTML-Parser popped up a box for every tag without a closing tag.
> The Html+CSS Validator popped a box it seemed when it couldn't connect
> to a site. I guess it was attempting to retrieve the CSS, in order to
> validate?

That would be the underlying transport layer - HtmlCSSParser never tries to interact with the user.

You don't have to validate.

dom := (HtmlValidator onUrl: 'http://something.com') dom.

Cheers,
-Todd



Re: Parsing HTML Recommendation

Jimmie Houchin-3
Hello Todd,

Thanks for the reply.

Todd Blanchard wrote:

>> But these two lines give me the headers of my table's columns.
>>      itemlist = soup.find('table', id=True)
>>        #gives me the only table with an ID
>>      headers = itemlist.findAll('th')
>>        #gives me the headers of that table.
>>
>> and to parse the table rows with recursing through the nested tables.
>>      rows = mytable.findAll('td', recursive=False)
>
> In the HTML CSS parser - you want to look at tagsNamed:
>
> for instance - dom tagsNamed: 'table'
> will return a collection of table nodes that are children of the receiver.

Yes, I've been doing that. But my problems have been:

1. Out of 1000+ tables, I am looking for the one which has an 'ID' attribute.
      In BeautifulSoup it is:  bs.findAll('table', id=True)

    I haven't yet figured out how to do that.

2. I haven't spent enough time with your parser yet, but my one table
comprises 331 rows, each with 6 nested tables.

    When I build a dom and query it with tagsNamed: 'tr', does it
return 331 or 1000+ rows?

    I want the 331. I want to be able to understand which 'column' I am
in so that I can build objects out of the data. The columns represent
object attributes. Some of the columns have tables as their td.

> Look at the implementation of that in HtmlDOMNode - it uses a method
> called nodesCollect:
> that will take an arbitrary block and returns all subnodes for which the
> block evaluates to true. It is very similar.

Does this return the 331 or 1000+ rows (nodes)?

>> The html is broken and has hundreds of tables. There are something like
>> 6 nested tables in each of the primary tables rows. This is from a MS
>> SharePoint website. The markup is awful.
>
> HtmlCSSParser was designed to deal with just such markup (and tries to
> explain what is wrong with it).

In my case I am happy that HtmlCSSParser can deal with it, but it
doesn't matter to me what is wrong. I just want the data.

[snip]
>>
>> The HTML-Parser popped up a box for every tag without a closing tag.
>> The Html+CSS Validator popped a box it seemed when it couldn't connect
>> to a site. I guess it was attempting to retrieve the CSS, in order to
>> validate?
>
> That would be the underlying transport layer - HtmlCSSParser never tries
> to interact with the user.

Okay.

> You don't have to validate.
>
> dom := (HtmlValidator onUrl: 'http://something.com') dom.

Okay. This is what I am doing.
dom := (HtmlValidator on: myHtmlString) dom.

But when I got the popups, I thought that the validation was going awry.

Again, thanks for your help. And thank you for providing this tool.

Jimmie



Re: Parsing HTML Recommendation

tblanchard

On Aug 25, 2007, at 5:29 AM, Jimmie Houchin wrote:

> Hello Todd,

> Yes, I've been doing that. But my problems have been:
>
> 1. Out of 1000+ tables I am looking for one which has an 'ID'  
> attribute.
>      In BeautifulSoup it is:  bs.findAll('table', id=True)

dom nodesCollect: [:ea | ea tag = 'table' and: [ea id notNil]]

if there is a specific id you want - make it ea id = 'theId'

>    I haven't yet figured out how to do that.
>
> 2. I haven't spent enough time with your parser yet, but my one  
> table is a table comprised of 331 rows each with 6 nested tables.
>
>    When I build a dom with the tagsNamed: 'tr',
>    Does it return 331 or 1000+ rows?

You need to get the right table, then send it tagsNamed: or
nodesCollect: to search within it. Assuming that there is exactly one
table in the whole document with an id, you could do this:

rows := (dom nodesCollect: [:each | each tag = 'table' and: [each id notNil]]) first tagsNamed: 'tr'

Assuming these all contain fields that are plain text, you can get the
data as a list of lists by doing:

"convert rows list to list of lists of TD nodes"
data := rows collect: [:row | row tagsNamed: 'td'].

"convert rows list to list of lists of text - stripping all markup"
rows := data collect: [:row | row collect: [:cell | String streamContents: [:s |
        (cell nodesCollect: [:n | n isCDATA]) do: [:cdata | s nextPutAll: cdata asString]]]].

From here you can get the text of a cell with:

string := (rows at: r) at: c

If your cell is a table itself, lather rinse repeat.
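For comparison, the same rows-to-list-of-lists-of-text pipeline can be done on the BeautifulSoup side of the thread (assuming the third-party bs4 package; the two-row table is a made-up example):

```python
# Pick the one table with an id, then turn its rows into a list of
# lists of cell text, markup stripped - the bs4 counterpart of the
# Smalltalk data/rows pipeline above.
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = """
<table id="items">
  <tr><td>Widget</td><td><b>9.95</b></td></tr>
  <tr><td>Gadget</td><td>3.50</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id=True)
rows = [[td.get_text(strip=True) for td in tr.findAll("td")]
        for tr in table.findAll("tr")]

print(rows)        # [['Widget', '9.95'], ['Gadget', '3.50']]
print(rows[0][1])  # like "(rows at: r) at: c" - the <b> markup is gone
```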

> Okay. This is what I am doing.
> dom := (HtmlValidator on: myHtmlString) dom.
>
> But when I got the popups, I thought that the validation was going  
> awry.

In the interest of performance, the parser fetches CSS files in LINK
tags by queueing them in a separate thread as soon as the href is
encountered. Since you don't need this behavior, go into
HtmlLINKNode>>parseContents: and comment out the line:

self loader queueUrl: href. "Start download in another thread"

> Again, thanks for your help. And thank you for providing this tool.

I'm glad somebody found it useful.

-Todd


Re: Parsing HTML Recommendation

Jimmie Houchin-3
Hi Todd,

Thanks for the information.

I figured I could probably do it with something like that, but I haven't
had the time to dig back into it.

I appreciate the help.

Jimmie

Todd Blanchard wrote:
[big snip of useful information. :) ]