Hi All
In another context (news:[hidden email]... in thread '"Failed to open walkback dialog" due to exhausted window handles?') I mentioned that I have partially ported the Squeak HTML parser to Dolphin, but wanted to do some further testing before releasing it to anyone else. I have now done as much testing as I propose to do, so I shall pass it on to Bill Schwab and take up his offer to mount it on his web site (if he is still interested and has survived Ivan).

My tests have not been concerned with whether the Squeak code has been ported correctly - I think I had established that some days ago - but with whether the Squeak code correctly parses the input. The main test I used was to parse an original HTML file, send #asHtml to the resulting HtmlDocument object and display the output in Internet Explorer - if the parse is correct the result should be close to that obtained from the original. With the code as in Squeak the results could be very different; with a few changes I have made they are closer, though not yet identical. I think with some further work the parsing rules could be modified to give even better results, but what I now have is good enough for my purposes.

Wondering what it could be useful for, I thought about Chris Uppal's wish for a lightweight parser and renderer. This is certainly lightweight (the package size is 134KB, while the MSHTML package is nearly 10MB!), but it is not a renderer - getting the formatted output to work would have meant importing a lot from the Squeak output side, and I did not need that hassle. My immediate need was to deconstruct web pages and get convenient access to components - the text of a web newspaper story, for example, or the table of recent updates from the AVG free virus checker - so that I may use them elsewhere in Dolphin programs. This I can now do well enough, so I am doing no more work on the parser.

However, there is a possibility of using the HtmlEntity classes as a framework for developing HTML pages; as far as I can see the #asHtml method always works correctly, so a coherent structure inside an HtmlDocument should lead straight to valid HTML code. I know there are plenty of web page design programs out there, but a simple one that I can tailor to my needs looks like fun.

If anyone is interested in taking this work further, the (rather lengthy) package comment gives the details of what I have done and what I know about what remains to do. I am happy to discuss by e-mail if anyone is interested in more detail.

Best wishes

Peter Kenny
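P.S. For the curious, the round-trip test amounts to something like this in a workspace. Treat it as a sketch rather than a recipe - the #parse: entry point and the file handling are from memory and may need adjusting to the package as shipped:

    | source doc regenerated |
    source := (FileStream read: 'page.html') upToEnd.       "the original page, read as text"
    doc := HtmlParser parse: source readStream.              "parse into an HtmlDocument"
    regenerated := doc asHtml.                                "regenerate HTML from the parse tree"
    (FileStream write: 'page-regenerated.html')
        nextPutAll: regenerated;
        close.
    "now open page-regenerated.html in Internet Explorer next to the original and compare"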
"Peter Kenny" <[hidden email]> wrote in message news:<[hidden email]>...
> My tests have not been concerned with whether the Squeak code has been
> ported correctly - I think I had established that some days ago - but with
> whether the Squeak code correctly parses the input. The main test I used was
> to parse an original HTML file, send #asHtml to the resulting HtmlDocument
> object and display the output in Internet Explorer - if the parse is correct
> the result should be close to that obtained from the original. With the code
> as in Squeak the results could be very different; with a few changes I have
> made they are closer, though not yet identical. I think with some further
> work the parsing rules could be modified to give even better results, but
> what I now have is good enough for my purposes.

Depending on what you count as "correctly" parsing the input, you may find that the HTML-Parser package on SqueakMap (http://map1.squeakfoundation.org/sm/package/69778b2e-3884-4490-b18a-1b9a86a201ec) works better for you. It follows the HTML4 spec exactly, which by and large gives much better results for things like legally omitted close tags than the ad-hoc rules in the standard Squeak parser. However, it doesn't try to make any intelligent guesses about truly broken HTML the way browsers do.

Avi
"Avi Bryant" <[hidden email]> wrote in message
news:[hidden email]...

> Depending on what you count as "correctly" parsing the input, you may
> find that the HTML-Parser package on SqueakMap
> (http://map1.squeakfoundation.org/sm/package/69778b2e-3884-4490-b18a-1b9a86a201ec)
> works better for you. It follows the HTML4 spec exactly, which by
> and large gives much better results for things like legally omitted
> close tags than the ad-hoc rules in the standard Squeak parser.
> However, it doesn't try to make any intelligent guesses about truly
> broken HTML the way browsers do.
>
> Avi

Avi

Thanks for this. I shall investigate it. I may well stick with the standard parser, since it meets my limited purposes, but it is useful to know what else is available. I wonder whether the SqueakMap parser is what Bill Schwab had in mind when he recommended a Squeak parser to Yar Hwee Boon? - he said it worked well, and the standard Squeak parser certainly makes a lot of errors.

Best wishes

Peter Kenny
In reply to this post by Avi Bryant-3
"Avi Bryant" <[hidden email]> wrote in message
news:[hidden email]...

> Depending on what you count as "correctly" parsing the input, you may
> find that the HTML-Parser package on SqueakMap
> (http://map1.squeakfoundation.org/sm/package/69778b2e-3884-4490-b18a-1b9a86a201ec)
> works better for you.

Avi

I have now got the SqueakMap parser working correctly under Dolphin (at least, all the unit tests work correctly). It certainly parses more of my test pages in a "reasonable" way than the standard Squeak parser did, but it is far from perfect. Every test page I tried produced a string of "DanglingCloseTag" exceptions, and most of these, when investigated, were due to the parser disallowing an inclusion earlier on in the parse. It will allow a <table> to be embedded in a <td>, which the standard parser will not, but it objects to anything other than a <tr> being embedded in a <table>. However, if I fix it to ignore all the exceptions it gives me a parse I can use for my purposes, so I shall continue to use it instead of the standard parser; thanks again for the suggestion.

I am not sure whether this is likely to be of interest to anyone else. It would now take a bit of effort to tidy it up and document it, so I shall just keep it for my own use unless anyone signals an interest.

Peter Kenny
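P.S. In case anyone wonders, the "fix to ignore all the exceptions" is nothing clever - roughly the following, assuming the DanglingCloseTag exception is resumable (the #parse: selector and the file handling are again from memory and may need adjusting to what the package actually provides):

    | source doc |
    source := (FileStream read: 'page.html') upToEnd.
    doc := [HTMLParser parse: source readStream]
            on: DanglingCloseTag
            do: [:ex | ex resume].    "carry on parsing and accept whatever tree results"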
"Peter Kenny" <[hidden email]> wrote in message news:<[hidden email]>...
> I have now got the SqueakMap parser working correctly under Dolphin (at
> least, all the unit tests work correctly). It certainly parses more of my
> test pages in a "reasonable" way than the standard Squeak parser did, but it
> is far from perfect. Every test page I tried produced a string of
> "DanglingCloseTag" exceptions, and most of these, when investigated, were
> due to the parser disallowing an inclusion earlier on in the parse. It will
> allow a <table> to be embedded in a <td>, which the standard parser will
> not, but it objects to anything other than a <tr> being embedded in a
> <table>.

Well, presumably it would also allow <thead> and <tbody> - AFAIK those three tags are the only ones that are legal to be directly inside a table.

It sounds like you're dealing with some somewhat broken HTML - have you considered running it through something like HTMLtidy before parsing it?

At any rate, if you ignore the exceptions, I believe that parser should at least have the property you were looking for, that the output string will always be the same as the input string, no matter how convoluted a tree it has to construct to follow the rules of the spec.
"Avi Bryant" <[hidden email]> wrote in message
news:[hidden email]...

> Well, presumably it would also allow <thead> and <tbody> - AFAIK those
> three tags are the only ones that are legal to be directly inside a
> table.
>
> It sounds like you're dealing with some somewhat broken HTML - have
> you considered running it through something like HTMLtidy before
> parsing it?
>
> At any rate, if you ignore the exceptions, I believe that parser
> should at least have the property you were looking for, that the
> output string will always be the same as the input string, no matter
> how convoluted a tree it has to construct to follow the rules of the
> spec.

Avi

I have gone carefully through the parse output from one page (actually a story from the newspaper 'Die Welt') to see where the exceptions came from. There were three inclusions that caused most, or maybe all, of the dozen or so exceptions (in each case 'inside' means 'immediately inside'):

a. <img> inside <noscript>;
b. <form> inside <table> (with <tr> inside <form> and <td> inside <tr>);
c. <meta> inside <td>.

These are all probably violations of the HTML rules, but browsers seem to interpret them as valid inclusions. The SqueakMap parser interprets each as a case of a missing end tag for the outer construction, which it therefore closes. So if I print out the parse tree, these end tags appear in the wrong place, and the structure is changed. In case c. above, for example, what followed the <meta> inside the <td> was another <table>; the parser generated end tags for the <td> and its containing <tr> and <table>, and then put the included table as the next element in the outer-level construct (actually the <body>).

However, none of this matters too much to me as long as it operates consistently (which I suppose means that the page originators use their peculiar constructs consistently :-) ). I just want to be able to find the text of web newspaper stories and handle it inside my own programs, so as long as I know where to look in the parse tree I am OK. As I said some way back, this parser is adequate for my needs, and is certainly much easier to use than the MSHTML stuff. So thanks again for the suggestion.

Peter Kenny
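P.S. For the record, "looking in the parse tree" is just a recursive walk over the entities, roughly as below. The selectors #name, #contents and #textualContents are guesses at the entity protocol, and the filename is only an example, so adjust both to whatever the classes actually answer to:

    | doc paragraphs visit |
    doc := HTMLParser parse: (FileStream read: 'welt-story.html') upToEnd readStream.
    paragraphs := OrderedCollection new.
    visit := nil.
    visit := [:entity |
        "collect the text of every <p> element; recurse into everything else"
        (entity name = 'p')
            ifTrue: [paragraphs add: entity textualContents]
            ifFalse: [entity contents do: [:child | visit value: child]]].
    visit value: doc.
    paragraphs inspect    "the story text, one paragraph per element"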
In reply to this post by Peter Kenny-2
Peter (or Bill?), may I know if the port is available online? I am interested in having a look at it, if possible. Thanks.

--
Regards
HweeBoon
MotionObj
"Yar Hwee Boon" <[hidden email]> wrote in message
news:[hidden email]...

> Peter (or Bill?), may I know if the port is available online? I am interested
> in having a look at it, if possible. Thanks.
>
> --
> Regards
> HweeBoon
> MotionObj

HweeBoon

Things have got a bit messy since my original post in this thread, because Avi Bryant pointed out that there is another Squeak parser available for download on SqueakMap. I downloaded that and the necessary support files, and the work I have done recently all uses that. However, the SqueakMap parser also needs the same tokenizer as the standard Squeak parser, which in turn needs some loose methods I imported from Squeak into the Dolphin base classes. To avoid a bit of sorting out, I put everything into the one Dolphin package, which now contains three parsers:

a. HtmlParser, the standard parser from the Squeak distribution, but with some parse rules modified by me to avoid rejection of common inclusions; I decided this needs too much more work, and have largely given up on it.

b. HTMLParser (note the upper case!), the SqueakMap parser, which I have ported without changing the parse rules, though this also rejects quite a few common inclusions (see my exchange with Avi in this thread).

c. XMLParser, also from SqueakMap, which is included because HTMLParser depends on it in various ways - I have not used it directly.

Because the situation was still evolving, and I was not yet certain what else I would do with it, I have not yet taken up Bill's offer to mount it on his web site. I am hesitant about letting it out into the world, because it is really work in progress and I am unhappy with its messy state. However, if you would be willing to play with it on that understanding, I could e-mail a copy direct to you. Let me know (either direct or through this group) if you want it. The package is about 250KB, but could of course be zipped.

Best wishes

Peter