
Squeak HTML Parser - Partial port to Dolphin

8 messages

Peter Kenny-2

Squeak HTML Parser - Partial port to Dolphin

Hi All

In another context (news:[hidden email]... in thread
'"Failed to open walkback dialog" due to exhausted window handles?') I
mentioned that I have partially ported the Squeak HTML parser to Dolphin,
but wanted to do some further testing before releasing it to anyone else. I
have now done as much testing as I propose to do, so I shall pass it on to
Bill Schwab and take up his offer to mount it on his web site (if he is
still interested and has survived Ivan).

My tests have not been concerned with whether the Squeak code has been
ported correctly - I think I had established that some days ago - but with
whether the Squeak code correctly parses the input. The main test I used was
to parse an original HTML file, send #asHtml to the resulting HtmlDocument
object and display the output in Internet Explorer - if the parse is correct
the result should be close to that obtained from the original. With the code
as in Squeak the results could be very different; with a few changes I have
made they are closer, though not yet identical. I think with some further
work the parsing rules could be modified to give even better results, but
what I now have is good enough for my purposes.
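
The round-trip check described above can be sketched roughly as follows. This is an illustration in Python using only the standard library `html.parser`, not the Squeak or Dolphin code; `Node`, `TreeBuilder`, `round_trip` and the `VOID` set are hypothetical names, and the tree builder is deliberately naive (it applies no rules for omitted close tags).

```python
import html
from html.parser import HTMLParser

# Elements that never have a close tag (illustrative subset).
VOID = {"br", "img", "hr", "meta", "input", "link"}

class Node:
    def __init__(self, tag, attrs=()):
        self.tag, self.attrs, self.children = tag, list(attrs), []

    def as_html(self):
        # Serialise this node and its children back to markup.
        attrs = "".join(
            f" {k}" if v is None else f' {k}="{html.escape(v)}"'
            for k, v in self.attrs
        )
        inner = "".join(
            html.escape(c, quote=False) if isinstance(c, str) else c.as_html()
            for c in self.children
        )
        if self.tag is None:            # the document root has no tag of its own
            return inner
        if self.tag in VOID:
            return f"<{self.tag}{attrs}>"
        return f"<{self.tag}{attrs}>{inner}</{self.tag}>"

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node(None)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open element; stray close tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)

def round_trip(source):
    """Parse the source and serialise the resulting tree back to markup."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root.as_html()
```

For well-formed input, `round_trip(source) == source` holds. For input with omitted close tags the naive builder changes the structure: `round_trip('<p>one<p>two')` gives `<p>one<p>two</p></p>`, which is the kind of difference the test is meant to expose.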

Wondering what it could be useful for, I thought about Chris Uppal's wish
for a lightweight parser and renderer. This is certainly lightweight (the
package size is 134KB, while the MSHTML package is nearly 10MB!), but it is
not a renderer - getting the formatted output to work would have meant
importing a lot from the Squeak output side, and I did not need that hassle.
My immediate need was to deconstruct web pages and get convenient access to
components - the text of a web newspaper story, for example, or the table of
recent updates from the AVG free virus checker - so that I may use them
elsewhere in Dolphin programs. This I can now do well enough, so I am doing
no more work on the parser. However, there is a possibility of using the
HtmlEntity classes as a framework for developing HTML pages; as far as I can
see the #asHtml method always works correctly, so a coherent structure
inside an HtmlDocument should lead straight to valid HTML code. I know there
are plenty of web page design programs out there, but a simple one that I
can tailor to my needs looks like fun.
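
To make the "get convenient access to components" use case concrete, here is a minimal sketch that pulls the paragraph text out of a page. It is written in Python with the standard library `html.parser` and is only an illustration; `ParagraphText` is a hypothetical name and nothing here comes from the HtmlEntity classes.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect the text of every <p> element, ignoring markup nested inside it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()               # an unclosed <p> ends where the next begins
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)

    def _flush(self):
        # Join the collected pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []
        self._in_paragraph = False

    def close(self):
        super().close()
        self._flush()                   # pick up a final paragraph left unclosed

def story_paragraphs(source):
    extractor = ParagraphText()
    extractor.feed(source)
    extractor.close()
    return extractor.paragraphs
```

Calling `story_paragraphs` on `<h1>Title</h1><p>First <b>story</b> paragraph.</p><p>Second one` returns `['First story paragraph.', 'Second one']`.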

If anyone is interested in taking this work further, the (rather lengthy)
package comment gives the details of what I have done and what I know about
what remains to do. I am happy to discuss by e-mail if anyone is interested
in more detail.

Best wishes

Peter Kenny

Avi Bryant-3

Re: Squeak HTML Parser - Partial port to Dolphin

"Peter Kenny" <[hidden email]> wrote in message news:<[hidden email]>...

> My tests have not been concerned with whether the Squeak code has been
> ported correctly - I think I had established that some days ago - but with
> whether the Squeak code correctly parses the input. The main test I used was
> to parse an original HTML file, send #asHtml to the resulting HtmlDocument
> object and display the output in Internet Explorer - if the parse is correct
> the result should be close to that obtained from the original. With the code
> as in Squeak the results could be very different; with a few changes I have
> made they are closer, though not yet identical. I think with some further
> work the parsing rules could be modified to give even better results, but
> what I now have is good enough for my purposes.

Depending on what you count as "correctly" parsing the input, you may
find that the HTML-Parser package on SqueakMap
(http://map1.squeakfoundation.org/sm/package/69778b2e-3884-4490-b18a-1b9a86a201ec)
works better for you. It follows the HTML4 spec exactly, which by
and large gives much better results for things like legally omitted
close tags than the ad-hoc rules in the standard Squeak parser.
However, it doesn't try to make any intelligent guesses about truly
broken HTML the way browsers do.
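
As a rough illustration of what handling "legally omitted close tags" involves (a Python sketch using the standard library `html.parser`, not the HTML-Parser package itself): each element whose end tag may be omitted is closed implicitly when certain start tags arrive. The `IMPLIED_END` table is a small hand-picked subset chosen for the example, and `Normaliser` and `normalise` are hypothetical names; attributes are dropped for brevity.

```python
from html.parser import HTMLParser

# For a few elements whose end tag may be omitted, the start tags that
# implicitly close them. Illustrative subset only.
IMPLIED_END = {
    "p": {"p", "ul", "ol", "table", "div"},
    "li": {"li"},
    "td": {"td", "th", "tr"},
    "th": {"td", "th", "tr"},
    "tr": {"tr"},
}
VOID = {"br", "img", "hr", "meta", "input", "link"}

class Normaliser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open = []      # stack of currently open tags
        self.out = []       # markup with every close tag made explicit

    def _close_top(self):
        self.out.append(f"</{self.open.pop()}>")

    def handle_starttag(self, tag, attrs):
        # Close any open elements whose end tag is implied by this start tag.
        while self.open and tag in IMPLIED_END.get(self.open[-1], ()):
            self._close_top()
        self.out.append(f"<{tag}>")
        if tag not in VOID:
            self.open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open:
            return                      # stray close tag: ignored in this sketch
        while self.open[-1] != tag:     # closing a parent closes its open children
            self._close_top()
        self._close_top()

    def handle_data(self, data):
        self.out.append(data)

def normalise(source):
    n = Normaliser()
    n.feed(source)
    n.close()
    while n.open:                       # close whatever is still open at the end
        n._close_top()
    return "".join(n.out)
```

With this table, `normalise('<ul><li>one<li>two</ul>')` yields `<ul><li>one</li><li>two</li></ul>`. A parser that follows the spec derives such rules from the DTD rather than from a hand-written table.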

Avi

Peter Kenny-2

Re: Squeak HTML Parser - Partial port to Dolphin

"Avi Bryant" <[hidden email]> wrote in message
news:[hidden email]...

> Depending on what you count as "correctly" parsing the input, you may
> find that the HTML-Parser package on SqueakMap
> (http://map1.squeakfoundation.org/sm/package/69778b2e-3884-4490-b18a-1b9a86a201ec)
> works better for you. It follows the HTML4 spec exactly, which by
> and large gives much better results for things like legally omitted
> close tags than the ad-hoc rules in the standard Squeak parser.
> However, it doesn't try to make any intelligent guesses about truly
> broken HTML the way browsers do.
>
> Avi

Avi

Thanks for this. I shall investigate it. I may well stick with the standard
parser, since it meets my limited purposes, but it is useful to know what
else is available.

I wonder whether the SqueakMap parser is what Bill Schwab had in mind when
he recommended a Squeak parser to Yar Hwee Boon? He said it worked well,
and the standard Squeak parser certainly makes a lot of errors.

Best wishes

Peter Kenny

Peter Kenny-2

Re: Squeak HTML Parser - Partial port to Dolphin

In reply to this post by Avi Bryant-3

"Avi Bryant" <[hidden email]> wrote in message
news:[hidden email]...

> Depending on what you count as "correctly" parsing the input, you may
> find that the HTML-Parser package on SqueakMap
> (http://map1.squeakfoundation.org/sm/package/69778b2e-3884-4490-b18a-1b9a86a201ec)
> works better for you.

Avi

I have now got the SqueakMap parser working correctly under Dolphin (at
least, all the unit tests work correctly). It certainly parses more of my
test pages in a "reasonable" way than the standard Squeak parser did, but it
is far from perfect. Every test page I tried produced a string of
"DanglingCloseTag" exceptions, and most of these, when investigated, were
due to the parser disallowing an inclusion earlier on in the parse. It will
allow a <table> to be embedded in a <td>, which the standard parser will
not, but it objects to anything other than a <tr> being embedded in a
<table>. However, if I fix it to ignore all the exceptions it gives me a
parse I can use for my purposes, so I shall continue to use it instead of
the standard parser; thanks again for the suggestion.

I am not sure whether this is likely to be of interest to anyone else. It
would now take a bit of effort to tidy it up and document it, so I shall
just keep it for my own use unless anyone signals an interest.

Peter Kenny

Avi Bryant

Re: Squeak HTML Parser - Partial port to Dolphin

"Peter Kenny" <[hidden email]> wrote in message news:<[hidden email]>...

> I have now got the SqueakMap parser working correctly under Dolphin (at
> least, all the unit tests work correctly). It certainly parses more of my
> test pages in a "reasonable" way than the standard Squeak parser did, but it
> is far from perfect. Every test page I tried produced a string of
> "DanglingCloseTag" exceptions, and most of these, when investigated, were
> due to the parser disallowing an inclusion earlier on in the parse. It will
> allow a <table> to be embedded in a <td>, which the standard parser will
> not, but it objects to anything other than a <tr> being embedded in a
> <table>.

Well, presumably it would also allow <thead> and <tbody> - AFAIK those
three tags are the only ones that are legal to be directly inside a
table.

It sounds like you're dealing with some somewhat broken HTML - have
you considered running it through something like HTMLtidy before
parsing it?

At any rate, if you ignore the exceptions, I believe that parser
should at least have the property you were looking for, that the
output string will always be the same as the input string, no matter
how convoluted a tree it has to construct to follow the rules of the
spec.

Peter Kenny-2

Re: Squeak HTML Parser - Partial port to Dolphin

"Avi Bryant" <[hidden email]> wrote in message
news:[hidden email]...

> Well, presumably it would also allow <thead> and <tbody> - AFAIK those
> three tags are the only ones that are legal to be directly inside a
> table.
>
> It sounds like you're dealing with some somewhat broken HTML - have
> you considered running it through something like HTMLtidy before
> parsing it?
>
> At any rate, if you ignore the exceptions, I believe that parser
> should at least have the property you were looking for, that the
> output string will always be the same as the input string, no matter
> how convoluted a tree it has to construct to follow the rules of the
> spec.

Avi

I have gone carefully through the parse output from one page (actually a
story from the newspaper 'Die Welt') to see where the exceptions came from.
There were three inclusions that caused most, or maybe all, of the dozen or
so exceptions (in each case 'inside' means 'immediately inside'):
a. <img> inside <noscript>;
b. <form> inside <table> (with <tr> inside <form> and <td> inside <tr>);
c. <meta> inside <td>.
These are all probably violations of the HTML rules, but browsers seem to
interpret them as valid inclusions. The SqueakMap parser interprets each as
a case of a missing end tag for the outer construction, which it therefore
closes. So if I print out the parse tree, these end tags appear in the wrong
place, and the structure is changed. In case c. above, for example, what
followed the <meta> inside the <td> was another <table>; the parser
generated end tags for the <td> and its containing <tr> and <table>, and
then put the included table as the next element in the outer level construct
(actually the <body>).
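
The mechanism described above, where a disallowed inclusion is read as a missing end tag and the real end tag later dangles, can be sketched like this. It is a Python illustration using the standard library `html.parser`, not the SqueakMap parser's code. The `ALLOWED` table is a made-up toy content model rather than the HTML 4 DTD, `StrictBuilder` and `parse` are hypothetical names, and leaving a tag in place when no open element accepts it is an assumption made to keep the sketch short.

```python
from html.parser import HTMLParser

# Toy content model: which children each element accepts directly.
# A made-up subset for illustration, not the HTML 4 DTD.
ALLOWED = {
    "body": {"table", "form", "p"},
    "table": {"tr"},
    "tr": {"td"},
    "td": {"table", "form", "p"},
    "form": {"table", "p"},
    "p": set(),
}

class StrictBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open = ["body"]    # stack of open elements; <body> is assumed
        self.out = []           # normalised markup as it is emitted
        self.dangling = []      # close tags that matched nothing open

    def handle_starttag(self, tag, attrs):
        # Find the nearest open element that accepts this tag.
        for depth in range(len(self.open) - 1, -1, -1):
            if tag in ALLOWED.get(self.open[depth], set()):
                # Everything above it is treated as having a missing end tag.
                while len(self.open) > depth + 1:
                    self.out.append(f"</{self.open.pop()}>")
                break
        # Assumption: if nothing accepts the tag, it stays where it was found.
        self.open.append(tag)
        self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag not in self.open[1:]:
            self.dangling.append(tag)   # its element was already closed early
            return
        while self.open[-1] != tag:
            self.out.append(f"</{self.open.pop()}>")
        self.out.append(f"</{self.open.pop()}>")

    def handle_data(self, data):
        self.out.append(data)

def parse(source):
    builder = StrictBuilder()
    builder.feed(source)
    builder.close()
    return "".join(builder.out), builder.dangling
```

Feeding it case b., `<table><form><tr><td>story</td></tr></form></table>`, returns the markup `<table></table><form><tr><td>story</td></tr></form>` together with the dangling list `['table']`: the table is closed as soon as the `<form>` appears, the form moves up to the body, and the original `</table>` no longer matches anything.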

However, none of this matters too much to me as long as it operates
consistently (which I suppose means that the page originators use their
peculiar constructs consistently :-) ). I just want to be able to find the
text of web newspaper stories and handle it inside my own programs, so as
long as I know where to look in the parse tree I am OK. As I said some way
back, this parser is adequate for my needs, and is certainly much easier to
use than the MSHTML stuff. So thanks again for the suggestion.

Peter Kenny

Yar Hwee Boon-3

Re: Squeak HTML Parser - Partial port to Dolphin

In reply to this post by Peter Kenny-2

Peter (or Bill?), may I know if the port is available online? I am interested
in having a look at it, if possible. Thanks.

--
Regards
HweeBoon
MotionObj

Peter Kenny-2

Re: Squeak HTML Parser - Partial port to Dolphin

"Yar Hwee Boon" <[hidden email]> wrote in message
news:[hidden email]...
> Peter (or Bill?), may I know if the port is available online? I am interested
> in having a look at it, if possible. Thanks.
>
> --
> Regards
> HweeBoon
> MotionObj

HweeBoon

Things have got a bit messy since my original post in this thread, because
Avi Bryant pointed out that there is another Squeak parser available for
download on SqueakMap. I downloaded that and the necessary support files,
and the work I have done recently all uses that. However, the SqueakMap
parser also needs the same tokenizer as the standard Squeak parser, which in
turn needs some loose methods I imported from Squeak into the Dolphin base
classes. To avoid a bit of sorting out, I put everything into the one
Dolphin package, which now contains three parsers:

a. HtmlParser, the standard parser from the Squeak distribution, but with
some parse rules modified by me to avoid rejection of common inclusions; I
decided this needs too much more work, and have largely given up on it.

b. HTMLParser (note the upper case!), the SqueakMap parser, which I have
ported without changing the parse rules, though this also rejects quite a
few common inclusions (see my exchange with Avi in this thread).

c. XMLParser, also from SqueakMap, which is included because HTMLParser
depends on it in various ways - I have not used it directly.

Because the situation was still evolving, and I was not yet certain what
else I would do with it, I have not yet taken up Bill's offer to mount it on
his web site. I am hesitant about letting it out on the world, because it is
really work in progress and I am unhappy with its messy state. However, if
you would be willing to play with it on that understanding, I could e-mail a
copy direct to you. Let me know (either direct or through this group) if you
want it. The package is about 250KB, but could of course be zipped.

Best wishes

Peter