How should XMLHTMLParser handle strange HTML?


Peter Kenny

Hello


I have come across a strange problem in using XMLHTMLParser to parse some HTML files which use strange constructions. The input files have been generated by using MS Outlook to translate incoming messages, stored in .msg files, into HTML. The translated files display normally in Firefox, and the XMLHTMLParser appears to generate a normal parse, but examination of the parse output shows that the structure is distorted, and about half the input text has been put into one string node.


Hunting around, I am convinced that the trouble lies in the presence in the HTML source of pairs of comment-like tags, with this form:

<![if !supportLists]>

<![endif]>

since the distorted parse starts at the first occurrence of one of these tags.


I don’t know whether these are meant to be a structure in some programming language – there is no reference to supportLists anywhere in the source code. When it is displayed in Firefox, use of the ‘Inspect Element’ option shows that the browser has treated them as comments, displaying them with the necessary dashes as e.g. <!--[if !supportLists]-->. I edited the source code by inserting the dashes, and XMLHTMLParser parsed everything correctly.


I have a workaround, therefore: either edit in the dashes to make them into legitimate comments, or equivalently edit these tags out completely. The only question of general interest is whether XMLHTMLParser should be expected to handle them in some other way, rather than produce a distorted parse without comment. The Firefox approach, turning them into comments, seems sensible. It would also be interesting if anyone has any idea what is going on in the source code.


Thanks for any help


Peter Kenny


Re: How should XMLHTMLParser handle strange HTML?

Esteban A. Maringolo
Hi Peter,


Just in case it helps you parse the files...

I had to parse HTML with an XMLParser (not XMLHTMLParser), so what I did
was to pass it first through HTML Tidy [1], converting it to XHTML,
which is compatible with XML parsers (it is XML, after all).

Regards,

[1] http://www.html-tidy.org/

Esteban A. Maringolo


Re: How should XMLHTMLParser handle strange HTML?

Peter Kenny
Hi Esteban

Thanks for the suggestion. I have skimmed through the description of Tidy. I think the things it puts right (mismatched tags etc.) are exactly the things that XMLHTMLParser looks for and fixes. For example, in my distorted parses, the final </body> and </html> tags had been absorbed into the massive string node that contains most of the input text; the parser detected this and inserted them at the right point to close the parse correctly.
Since my workaround, of editing out the specific features that cause the parse to go wrong, seems to fix the problem, I shall probably continue with it.

Thanks for your help

Peter Kenny



Re: How should XMLHTMLParser handle strange HTML?

Michal Balda

Hello Peter,

Those are called conditional comments. They come from MS Word, which is used as the HTML rendering engine for MS Outlook. There is not much documentation available online specifically for MS Word, but they were also implemented in older versions of MS Internet Explorer and were commonly used by web designers to fix bugs and quirks in IE's rendering. See Wikipedia:

https://en.wikipedia.org/wiki/Conditional_comment

Or just search for "internet explorer conditional comments"; you will find plenty of resources.

The ones in your example are the "downlevel-revealed" sort of conditional comment, meaning that the content between the "if" and "endif" is visible to all browsers. The "if" and "endif" themselves are recognized by MS Word (and MS Internet Explorer) and evaluated as conditions, while they are ignored by other web browsers.

The syntax is based on the original SGML syntax, the precursor to HTML. In this form it is invalid in HTML, but standard browsers can handle it and do the meaningful thing. There is an alternative form (also described on the Wikipedia page) which is valid HTML and still works as a conditional comment:

<!--[if !supportLists]><!-->
<!--<![endif]-->

Just converting it to "<!--[if !supportLists]-->" causes it to lose its meaning: it won't be recognized any more, but if you don't need to open the file in MS Word again, that doesn't matter.

To answer your question: What should an HTML parser do? I think it depends on the use case. What XMLHTMLParser does now is wrong. To be correct, it could signal an error since it's invalid HTML (like an HTML validator would), or it could ignore the syntax error in an unknown element and continue parsing (like a browser would). Standard HTML processors choose the second approach and try to fix what they can to produce what they think is most meaningful. In this case they are smart enough to realize that it's probably meant to be a comment. To me, something like a resumable exception would be acceptable: one could make two wrappers, a strict one and a loose one, and choose the one that better fits the situation.
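The strict/loose idea above could be sketched in Pharo roughly as follows. Everything here is hypothetical: InvalidMarkupNotification and the way the parser would signal it are invented names for illustration, not part of XMLParser's actual API.

```smalltalk
"Hypothetical sketch only: InvalidMarkupNotification is an invented
name; XMLHTMLParser does not currently signal any such notification."
| strictParse looseParse |
strictParse := [:html |
    [XMLHTMLParser parse: html]
        on: InvalidMarkupNotification
        do: [:ex | ex pass]].    "strict: let invalid markup fail, validator-style"
looseParse := [:html |
    [XMLHTMLParser parse: html]
        on: InvalidMarkupNotification
        do: [:ex | ex resume]].  "loose: keep parsing, browser-style"
```

The point of a resumable notification is that the same parser core serves both use cases; the caller, not the parser, decides how forgiving to be.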

(An XML parser, on the other hand, must always signal an exception and abort parsing in case of a syntax error, as per the specification.)


Michal






Re: How should XMLHTMLParser handle strange HTML?

Peter Kenny

Hello Michal


Many thanks for your comprehensive explanation. I am using these translations just as a vehicle to get the text of my incoming e-mails into Pharo, so the conditional comments have no relevance to my use and are just clutter. I shall therefore continue preprocessing the received HTML to turn them into legal comments. I have found that they come with various forms of 'if', so the rewriting just turns '<![' into '<!--[' and ']>' into ']-->'. I do this with String>>replaceAll:with:, but it might be interesting to write a PetitParser job to eliminate them completely. We can only hope that MS will sometime realise the meaning of the word 'deprecated'; I am using the latest Outlook in Office 365, and these forms are only relevant to long-dead versions of IE.
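A minimal sketch of that preprocessing step in Pharo could look as follows. Note that the substring-for-substring rewrite is copyReplaceAll:with: (replaceAll:with: substitutes individual elements); the variable names and the sample HTML are illustrative only.

```smalltalk
"Sketch of the preprocessing described above: turn the comment-like
tags into legitimate HTML comments before handing them to the parser.
The first rewrite creates '<!--[if ...]>'; the second then closes it
as '<!--[if ...]-->', and likewise for the endif tag."
| html fixed document |
html := '<p><![if !supportLists]>1.<![endif]>Some text</p>'.
fixed := (html copyReplaceAll: '<![' with: '<!--[')
    copyReplaceAll: ']>' with: ']-->'.
"fixed is now '<p><!--[if !supportLists]-->1.<!--[endif]-->Some text</p>'"
document := XMLHTMLParser parse: fixed.
```

As noted in the thread, the ']>' rewrite is safe only because these files contain no other occurrences of that character pair; a PetitParser pass would be the more robust way to strip the tags entirely.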


Probably you are right that XMLHTMLParser should handle them tidily, but whether it is worth the effort depends on how often they turn up in the wild. It appears I may be the only person so far to have hit this, just because of my combination of Outlook with Pharo. I shan’t be clamouring for any quick change.


Thanks again for your help.


Peter Kenny
