Problem with input to XML Parser - 'Invalid UTF8 encoding'

Problem with input to XML Parser - 'Invalid UTF8 encoding'

Peter Kenny

In another thread (on SVG Icons) Sven referred to ways of getting input from a URL for XMLDOMParser. I have recently had some problems doing this. I have found a workaround, so it is not urgent, but I thought I should put it on record in case anyone else is bitten by it, and so maybe Monty can look at it.

 

I am using the subclass XMLHTMLParser, and my usual way of invoking it was:

  1. XMLHTMLParser parseURL: <urlstring>.

This works in most cases, but with one particular site - http://www.corriere.it/....., an Italian newspaper - I had frequent failures, with the error message 'Invalid UTF8 encoding'. The parser also has the option of parsing a string obtained by other means, so I tried reading the page in with Zinc:

  2. XMLHTMLParser parse: <urlstring> asZnUrl retrieveContents.

And this worked, so clearly the encoding on the site is OK. I realised that the XML-Parser package has its own methods, which reproduce a lot of the functionality of Zinc, so I tried the equivalent:

  3. XMLHTMLParser parse: <urlstring> asXMLURI get.

To my surprise, this worked equally well. I had expected problems, because presumably forms (1) and (3) use the same UTF8 decoding.

 

For now, I am using form (3) for all my work, and have not had any problems since. So the message to anyone who is using form (1) and getting problems is: try (2) or (3) instead.
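For anyone who wants to try this, here are the three forms side by side as a sketch (using the failing Corriere article URL given below; behaviour may vary with your XML-Parser and Zinc versions):

```smalltalk
| url |
url := 'http://www.corriere.it/esteri/17_ottobre_03/felipe-spagna-catalogna-discorso-8f7ac0d6-a86d-11e7-a090-96160224e787.shtml'.

"Form (1): the parser fetches and decodes the URL itself.
 On some Corriere pages this fails with 'Invalid UTF8 encoding'."
XMLHTMLParser parseURL: url.

"Form (2): fetch and decode with Zinc first, then parse the resulting String."
XMLHTMLParser parse: url asZnUrl retrieveContents.

"Form (3): fetch with the XML-Parser package's own URI support, then parse the String."
XMLHTMLParser parse: url asXMLURI get.
```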

 

I am using Moose 6.1 (Pharo 6.0 Latest update: #60486) on Windows 10. I think most articles on the Corriere web site will generate the error, but one which has always failed for me is:

http://www.corriere.it/esteri/17_ottobre_03/felipe-spagna-catalogna-discorso-8f7ac0d6-a86d-11e7-a090-96160224e787.shtml

I tried to trace through the error using the debugger, but it got too confusing. However, I did establish that the failure occurred early in decoding the HTML <head>, in the line beginning <meta name="description"… The only unusual thing at this point is the French-style open-quote (guillemet) '«'. Whether this could explain the problem I don't know.

 

Any suggestions gratefully received.

 

Peter Kenny

 


Re: Problem with input to XML Parser - 'Invalid UTF8 encoding'

Paul DeBruicker
In the <head> tag of that page with the article they declare it is ISO-8859-1 and not UTF-8. Yet in the page they have a

C'è

The little apostrophe next to the C is Unicode code point 8217 (U+2019), which in UTF-8 is a multi-byte character
(http://www.codetable.net/decimal/8217).


So their encoding is messed up, and maybe the XMLHTMLParser should throw a
warning or something if there is a mismatch.  


Glad you found a workaround.

--
Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html


Re: Problem with input to XML Parser - 'Invalid UTF8 encoding'

Peter Kenny
Paul

Good to have found the charset discrepancy - that may have something to do with it. But I don't think it has to do with the C'è in the body of the page. I have just parsed another page published today, with the same error, and again it fails in parsing the <head> node, so it has not even reached the body.

The <head> contains a meta which describes the article - a sort of paraphrase of the article headline - and it fails in the middle of decoding that. The character at which it fails is again $«, so that is definitely the cause. Maybe the wrong charset is the explanation of why it messes that up - but I don't know enough about the different charsets to know. Does ISO-8859-1 even contain $«?

Peter




Re: Problem with input to XML Parser - 'Invalid UTF8 encoding'

Peter Kenny
Note: This was sent on Sunday at 19.45 but seems to have disappeared on its way to Pharo-users. Re-sent just to complete the story.
Addendum: I have looked a bit further, and the charset problem lies behind
it all. The ISO-8859-1 charset *does* include $«, at decimal 171 or hex AB.
At the point where it fails, the parser is reading the string '«B', which is
hex AB 42 in ISO-8859-1, and the debugger shows that the parser is trying to
decode hex AB 42 as a multibyte UTF8 character. So there are two questions
remaining: (a) why does the parser try to decode it as UTF8? (b) why does
reading the string in before calling the parser get round the problem?
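To see why those two bytes upset a UTF-8 decoder, the situation can be reproduced directly with Zinc's encoders. This is a sketch; the encoder accessors and the exception class name are from my reading of Zinc and may differ between versions:

```smalltalk
"hex AB 42 is '«B' in ISO-8859-1 (Latin-1): 16rAB = «, 16r42 = B."
ZnCharacterEncoder latin1 decodeBytes: #[16rAB 16r42].

"In UTF-8, however, 16rAB = 2r10101011 starts with the bits 10, which
 marks a *continuation* byte - it can never begin a character. So a
 UTF-8 decoder signals an error as soon as it sees it."
[ ZnCharacterEncoder utf8 decodeBytes: #[16rAB 16r42] ]
    on: ZnCharacterEncodingError
    do: [ :e | e messageText ].
```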





Re: Problem with input to XML Parser - 'Invalid UTF8 encoding'

Henrik Sperre Johansen
XML expects a prolog in the document itself defining the encoding; if it is absent, the standard specifies UTF-8.
So when you use an XML parser to parse an HTML page, it will disregard any HTTP encoding headers, interpret the contents as an XML document with a missing prolog, and try to parse it as UTF-8.

When you use ZnUrl>>retrieveContents, however, it respects the HTTP charset header field, which correctly identifies the contents as ISO-8859-1, and lets you read them correctly into an internal string.
When you subsequently parse that internal string, the XML parser won't try to do any byte decoding, and therefore it works.
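That HTTP header is easy to check from the image. A sketch using ZnClient (the response/contentType accessors are from my reading of Zinc - inspect the response object if they differ in your version):

```smalltalk
"Fetch the article and look at the Content-Type the server declares.
 For the Corriere pages it names ISO-8859-1, not UTF-8 - and this is
 what retrieveContents uses to decode the body into a String."
| client |
client := ZnClient new.
client url: 'http://www.corriere.it/esteri/17_ottobre_03/felipe-spagna-catalogna-discorso-8f7ac0d6-a86d-11e7-a090-96160224e787.shtml'.
client get.
client response contentType.
```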

Cheers,
Henry




Re: Problem with input to XML Parser - 'Invalid UTF8 encoding'

Henrik Sperre Johansen
In a class named XMLHTMLParser, you may expect that logic to be expanded a bit beyond the basic XML spec, though.
But since there are multiple potentially correct definitions, there will always be failure cases.
Not to mention that, in addition to XML and HTTP, HTML4/HTML5 also define (different) meta tags for specifying the encoding (and have different default encodings), so really, the number of potentially correct encodings is even higher.





Re: Problem with input to XML Parser - 'Invalid UTF8 encoding'

Peter Kenny
In reply to this post by Henrik Sperre Johansen
Henry

Thanks for the explanations. It's a bit clearer now. I'm still not sure how ZnUrl>>retrieveContents manages to decode correctly in this case; I'm sure I recall Sven saying it didn't (and in his view shouldn't) look at the HTTP declarations in the header. There is also the mystery of how the string reader in the XML-Parser package (XMLURI>>get) does the same trick, when it is presumably what XMLHTMLParser>>parseURL: itself uses, and that fails.

However, all these are second order problems. It all begins because the
Corriere web site does strange things with encoding, including using a UTF8
character in a page coded with 8859-1, as Paul pointed out. In any case,
reading the page as a string and then parsing it solves my problem, so I
shall stick to that as a standard procedure. Most importantly, I don't think
there is any indication of a problem in the XML package for Monty to worry
about.

Thanks again

Peter




Re: Problem with input to XML Parser - 'Invalid UTF8 encoding'

Peter Kenny
Correction - I am misrepresenting Sven. What he said was that Zinc would not look inside the HTML <head> node to find out about the encoding. It would of course use information in the HTTP headers, if any.



Re: Problem with input to XML Parser - 'Invalid UTF8 encoding'

monty-3
I know what the problem is and will have it fixed shortly. Thanks for the report.


Re: Problem with input to XML Parser - 'Invalid UTF8 encoding'

Stephane Ducasse-3
Tx
