Of the various Squeak parsers that will parse HTML, does anyone have any
recommendations as to what is the best at this point?
The one I use is

 - http://www.squeaksource.com/htmlcssparser.html

It interprets the DTDs from the W3C and handles bad HTML and CSS.

/Klaus

On Thu, 23 Aug 2007 21:02:12 +0200, Kurt Thams wrote:
> Of the various Squeak parsers that will parse HTML, does anyone have any
> recommendations as to what is the best at this point?
In reply to this post by Kurt Thams
Kurt Thams wrote:
> Of the various Squeak parsers that will parse HTML, does anyone have any
> recommendations as to what is the best at this point?

Hello,

I too was going to post about this.

I have been working with both HTML-Parser, written by Julian Fitzell, and Html+CSS Validating Parser, by Todd Blanchard (according to the SqueakMap entries).

I've also been using BeautifulSoup, which is an excellent Python HTML parser. I would prefer to use Squeak, but BeautifulSoup so far has a few features that have made my job easier: just a few things I haven't worked out how to do as easily in either of the Squeak tools. Now, I know someone proficient in Squeak could have done it just as easily.

But these two lines give me the headers of my table's columns:

    itemlist = soup.find('table', id=True)  # gives me the only table with an ID
    headers = itemlist.findAll('th')        # gives me the headers of that table

and this parses the table rows without recursing through the nested tables:

    rows = mytable.findAll('td', recursive=False)

The HTML is broken and has hundreds of tables. There are something like 6 nested tables in each of the primary table's rows. This is from a MS SharePoint website. The markup is awful.

I'm sure there is an easy way in Squeak to do the above, but I haven't spent enough time to master it.

A problem I've had with both of the Squeak parsers, and one that makes them hard for me to use, is that both have popped up modal dialogs which I had to click on in order to proceed. They have fairly different APIs.

The HTML-Parser popped up a box for every tag without a closing tag.
The Html+CSS Validator popped up a box, it seemed, when it couldn't connect to a site. I guess it was attempting to retrieve the CSS in order to validate?

I would love to know if there is a way to silence the dialogs while proceeding through the parsing. Yes, I know the markup is awful. Do the best you can and it may be good enough for me to do my job. Apparently I can do that if I click, click, click. But I would like to just doit.

I've been doing the work in Python/BeautifulSoup but do not enjoy the dead system and would rather be working in a live environment.

Hopefully, I'll get smart enough to use Squeak effectively. :)

Any wisdom on this subject greatly appreciated. I have lots of HTML scraping to do. I don't really need or care about validation, and a rich interface for this purpose would be great.

Thanks,

Jimmie
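For readers without BeautifulSoup at hand, the difference between a recursive search and a direct-children search described above can be sketched with Python's standard library alone. This is only an illustration, not BeautifulSoup's implementation: the `Node`, `TreeBuilder`, `parse`, and `find_all(..., has_id=...)` names are made up for this sketch, and the sample markup is invented and far cleaner than the SharePoint pages in question.

```python
from html.parser import HTMLParser

# Elements that never have content, so they must not become the "current" node.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """Hypothetical minimal DOM node: a tag, its attributes, and its children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Nodes and plain text strings, in order

    def find_all(self, tag, recursive=True, has_id=False):
        """Return matching descendants, or only direct children if recursive=False."""
        found = []
        for child in self.children:
            if not isinstance(child, Node):
                continue  # skip text
            if child.tag == tag and (not has_id or "id" in child.attrs):
                found.append(child)
            if recursive:
                found.extend(child.find_all(tag, recursive, has_id))
        return found

    def text(self):
        """Concatenate all text below this node, with the markup stripped."""
        parts = (c if isinstance(c, str) else c.text() for c in self.children)
        return "".join(parts).strip()


class TreeBuilder(HTMLParser):
    """Builds a Node tree; stray end tags are ignored rather than reported."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Invented sample: one layout table, then the table of interest with a nested table.
SAMPLE = """
<table><tr><td>layout</td></tr></table>
<table id="items">
  <tr><th>Name</th><th>Owner</th></tr>
  <tr><td>Report</td><td><table><tr><td>Ann</td></tr></table></td></tr>
</table>
"""

doc = parse(SAMPLE)
item_table = doc.find_all("table", has_id=True)[0]        # the only table with an id
headers = [th.text() for th in item_table.find_all("th")]
direct_rows = item_table.find_all("tr", recursive=False)  # rows of this table only
all_rows = item_table.find_all("tr")                      # also rows of nested tables
```

In this sample the direct-children search yields the two rows of the outer table, while the recursive search also picks up the row of the nested table, which is the 331-versus-1000+ distinction that matters for the SharePoint page. Note that a real browser-grade parser may insert `tbody` elements, in which case rows are no longer direct children of the table.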
There is a way to provide predefined answers to the popup dialogs; the Installer package does that, see the bottom of:
http://wiki.squeak.org/squeak/742

Chasing around a bit in the image leads me to:

    BlockContext>>valueSuppressingMessages: aListOfStrings supplyingAnswers: aListOfPairs

    BlockContext>>valueSupplyingAnswers: aListOfPairs
        "evaluate the block using a list of questions / answers that might be called upon to
        automatically respond to Object>>confirm: or FillInTheBlank requests"

So I guess you have a block with whatever you want to do and some predefined answers:

    [ myUrl extractInfo
        "This may trigger some popups"
    ] valueSuppressingMessages: {'css error*'. 'doctype error'}
      supplyingAnswers: {{'input doctype'. 'XHTML 4.1'}}

The first list contains string patterns for messages that will be answered with OK; the second is a list of pairs of the form (pattern, answer), where pattern is a string and answer is any object.

I hope this helps

Matthias

On 8/23/07, Jimmie Houchin <[hidden email]> wrote:
> I would love to know if there is a way to silence the dialogs while
> proceeding through the parsing. Yes, I know the markup is awful. Do the
> best you can and it may be good enough for me to do my job. Apparently I
> can do that if if I click, click, click. But I would just like to just
> doit.
>
> I've been doing the work in Python/BeautifulSoup but do not enjoy the
> dead system and would rather be working in a live environment.
>
> Hopefully, I'll get smart enough to use Squeak effectively. :)
>
> Any wisdom on this subject greatly appreciated. I have lots of html
> scraping to do. I don't really need or care about validation and a rich
> interface for this purpose would be great.
>
> Thanks,
>
> Jimmie
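The two lists in that call can be read as a small lookup: patterns that are simply acknowledged, and (pattern, answer) pairs. The Python sketch below is only an analogy for that lookup, written to make the idea concrete. It is not how Squeak implements it; `canned_answer` and `UnansweredPrompt` are hypothetical names, and the shell-style `*` matching of `fnmatchcase` merely stands in for Squeak's own pattern matching.

```python
from fnmatch import fnmatchcase


class UnansweredPrompt(Exception):
    """Raised when no pattern covers a prompt, i.e. a real dialog would appear."""


def canned_answer(prompt, suppress=(), answers=()):
    """Return the predefined response for a prompt text.

    suppress: patterns for messages that are acknowledged (answered with OK).
    answers:  (pattern, answer) pairs; answer may be any object.
    """
    for pattern in suppress:
        if fnmatchcase(prompt, pattern):
            return True  # equivalent to clicking OK
    for pattern, answer in answers:
        if fnmatchcase(prompt, pattern):
            return answer
    raise UnansweredPrompt(prompt)


# The same illustrative patterns as in the Smalltalk example above.
SUPPRESS = ["css error*", "doctype error"]
ANSWERS = [("input doctype", "XHTML 4.1")]

ok = canned_answer("css error: stylesheet unreachable", SUPPRESS, ANSWERS)
doctype = canned_answer("input doctype", SUPPRESS, ANSWERS)
```

Any prompt that matches neither list raises `UnansweredPrompt` in this sketch, which corresponds to the case where the dialog would still pop up and wait for a click.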
Matthias Berth wrote:
> There is a way to provide predefined answers to the popup dialogs, the
> Installer package does that, see the bottom of:
> http://wiki.squeak.org/squeak/742
>
> Chasing around a bit in the image leads me to:
>
> BlockContext>>valueSuppressingMessages: aListOfStrings
> supplyingAnswers: aListOfPairs
>
> BlockContext>>valueSupplyingAnswers: aListOfPairs
> "evaluate the block using a list of questions / answers that might be
> called upon to
> automatically respond to Object>>confirm: or FillInTheBlank requests"
>
> So I guess you have a block with whatever you want to do and some
> predefined answers:
>
> [ myUrl extractInfo
> "This may trigger some popups"
> ] valueSuppressingMessages: {'css error*'. 'doctype error'}
> supplyingAnswers: {{'input doctype'. 'XHTML 4.1'}}
>
> The first list contains string patterns for messages that will be
> answered with OK, the second is a list of pairs of the form (pattern,
> answer) where pattern is a string and answer is any object.
>
> I hope this helps
>
>
> Matthias

Hello Matthias,

Wow! Thanks for the information. Squeak is full of treasures like this. Part of the trick of mastering Squeak is learning how to discover all of these treasures. :)

This will definitely be of tremendous help. I really, really, really want to use Squeak for as much as I can. But a popup was going to be a big stumbling block. I was looking through the classes to see if there was another way to use them that I was missing. Something like doitAllAndAnswerYes. :)

This is great. I can write my own doitAll method, and use it the many other times a dialog like this pops up. This helps tremendously.

Now I can dig into Todd's Html+CSS validator and master it, and if I want some convenience methods to make web scraping easier, I can write them. :)

This is a great help. Thank you very, very much.

Have a great one.

Jimmie

[snip original message]
In the HTML CSS parser - you want to look at tagsNamed:

for instance - dom tagsNamed: 'table'
will return a collection of table nodes that are children of the receiver.

Look at the implementation of that in HtmlDOMNode - it uses a method called nodesCollect: that takes an arbitrary block and returns all subnodes for which the block evaluates to true. It is very similar.
HtmlCSSParser was designed to deal with just such markup (and tries to explain what is wrong with it).
That would be the underlying transport layer - HtmlCSSParser never tries to interact with the user.
Hello Todd,
Thanks for the reply.

Todd Blanchard wrote:
>> But these two lines give me the headers of my table's columns.
>> itemlist = soup.find('table', id=True)
>> #gives me the only table with an ID
>> headers = itemlist.findAll('th')
>> #gives me the headers of that table.
>>
>> and to parse the table rows without recursing through the nested tables.
>> rows = mytable.findAll('td', recursive=False)
>
> In the HTML CSS parser - you want to look at tagsNamed:
>
> for instance - dom tagsNamed: 'table'
> will return a collection of table nodes that are children of the receiver.

Yes, I've been doing that. But my problems have been:

1. Out of 1000+ tables I am looking for one which has an 'ID' attribute.
In BeautifulSoup it is: bs.findAll('table', id=True)
I haven't yet figured out how to do that.

2. I haven't spent enough time with your parser yet, but my one table is a table comprised of 331 rows, each with 6 nested tables.

When I build a dom and send it tagsNamed: 'tr', does it return 331 or 1000+ rows? I want the 331. I want to be able to understand which 'column' I am in so that I can build objects out of the data. The columns represent object attributes. Some of the columns have tables as their td.

> Look at the implementation of that in HtmlDOMNode - it uses a method
> called nodesCollect:
> that will take an arbitrary block and returns all subnodes for which the
> block evaluates to true. It is very similar.

Does this return the 331 or 1000+ rows (nodes)?

>> The html is broken and has hundreds of tables. There are something like
>> 6 nested tables in each of the primary tables rows. This is from a MS
>> SharePoint website. The markup is awful.
>
> HtmlCSSParser was designed to deal with just such markup (and tries to
> explain what is wrong with it).

In my case I am happy that HtmlCSSParser can deal with it, but it doesn't matter to me what is wrong. I just want the data.

[snip]

>> The HTML-Parser popped up a box for every tag without a closing tag.
>> The Html+CSS Validator popped a box it seemed when it couldn't connect
>> to a site. I guess it was attempting to retrieve the CSS, in order to
>> validate?
>
> That would be the underlying transport layer - HtmlCSSParser never tries
> to interact with the user.

Okay.

> You don't have to validate.
>
> dom := (HtmlValidator onUrl: 'http://something.com') dom.

Okay. This is what I am doing:

    dom := (HtmlValidator on: myHtmlString) dom.

But when I got the popups, I thought that the validation was going awry.

Again, thanks for your help. And thank you for providing this tool.

Jimmie
On Aug 25, 2007, at 5:29 AM, Jimmie Houchin wrote:

> Hello Todd,
> Yes, I've been doing that. But my problems have been:
>
> 1. Out of 1000+ tables I am looking for one which has an 'ID'
> attribute.
> In BeautifulSoup it is: bs.findAll('table', id=True)

    dom nodesCollect: [:ea | ea tag = 'table' and: [ea id notNil]]

If there is a specific id you want - make it ea id = 'theId'

> I haven't yet figured out how to do that.
>
> 2. I haven't spent enough time with your parser yet, but my one
> table is a table comprised of 331 rows each with 6 nested tables.
>
> When I build a dom with the tagsNamed: 'tr',
> Does it return 331 or 1000+ rows?

You need to get the right table - then send it tagsNamed: or nodesCollect: to search within it.

Assuming that there is exactly one table in the whole document with an id, you could do this:

    rows := (dom nodesCollect: [:each | each tag = 'table' and: [each id notNil]]) first tagsNamed: 'tr'.

Assuming these all contain fields that are plain text - you can get the data as a list of lists by doing:

    "convert rows list to list of lists of TD nodes"
    data := rows collect: [:row | row tagsNamed: 'td'].

    "convert that list of lists of TD nodes to a list of lists of text - stripping all markup"
    rows := data collect: [:row |
        row collect: [:cell |
            String streamContents: [:s |
                (cell nodesCollect: [:n | n isCDATA])
                    do: [:cdata | s nextPutAll: cdata asString]]]].

From here you can get the text of a cell with

    string := (rows at: r) at: c

If your cell is a table itself, lather rinse repeat.

> Okay. This is what I am doing.
> dom := (HtmlValidator on: myHtmlString) dom.
>
> But when I got the popups, I thought that the validation was going
> awry.

In the interest of performance, the parser fetches CSS files in LINK tags by queueing them in a separate thread as soon as the href is encountered. Since you don't need this behavior - go into HtmlLINKNode>>parseContents: and comment out the line:

    self loader queueUrl: href. "Start download in another thread"

> Again, thanks for your help. And thank you for providing this tool.

I'm glad somebody found it useful.

-Todd
Hi Todd,

Thanks for the information. I figured I could probably do it with something like that, but I haven't had the time to dig back into it.

I appreciate the help.

Jimmie

Todd Blanchard wrote:
[big snip of useful information. :) ]
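For comparison with the list-of-lists recipe Todd gives earlier in the thread, here is a rough Python sketch of the same idea using only the standard library. It is not a port of HtmlCSSParser: `OuterTableGrid` and `table_to_grid` are hypothetical names, the sample markup is invented, and the sketch assumes the input is the markup of the single table of interest and that cells at the outer level are properly closed. It keeps a count of open `table` elements so that only the outer table's rows and cells define the grid, while text from nested tables is folded into the enclosing outer cell with the markup stripped.

```python
from html.parser import HTMLParser


class OuterTableGrid(HTMLParser):
    """Collect the outermost table's cell text as a list of rows (lists of strings)."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of <table> elements currently open
        self.in_cell = False  # inside a td/th that belongs to the outer table
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        elif self.depth == 1:
            if tag == "tr":
                self.rows.append([])
            elif tag in ("td", "th") and self.rows:
                self.rows[-1].append("")
                self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1
        elif tag in ("td", "th") and self.depth == 1:
            self.in_cell = False

    def handle_data(self, data):
        # Text at any nesting depth is credited to the enclosing outer cell.
        if self.in_cell:
            self.rows[-1][-1] += data


def table_to_grid(html):
    parser = OuterTableGrid()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the markup's indentation.
    return [[" ".join(cell.split()) for cell in row] for row in parser.rows]


# Invented sample: the second cell of the second row is itself a table.
SAMPLE = """
<table id="items">
  <tr><th>Name</th><th>Owner</th></tr>
  <tr><td>Report</td><td><table><tr><td>Ann</td></tr></table></td></tr>
</table>
"""

grid = table_to_grid(SAMPLE)
cell = grid[1][1]  # row and column are 0-based here, unlike Smalltalk's at:
```

Because nested rows never start a new entry in `rows`, the grid has one list per outer row, so a column index maps directly onto an object attribute as Jimmie wanted. Cells whose nested tables need to be kept structured, rather than flattened to text, would need the "lather rinse repeat" step on that cell's own markup.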