XMLParser Claims U+00A0 is “Invalid UTF-8”

Sean P. DeNigris
Administrator
Posted to StackOverflow (https://stackoverflow.com/questions/38645553/xmlparser-in-pharo-claims-u00a0-is-invalid-utf-8):



Given the input:

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<sms body=". what" />

Where the character after the "." in the body attribute of the sms tag is U+00A0;

I get the error:

    XMLEncodingException: Invalid UTF-8 character encoding (line 2) (column 13)

IIUC, the UTF-8 representation of that character is 0xC2 0xA0 per Wikipedia. Sure enough, bytes 72 and 73 of the input are 194 and 160 respectively.

This seems like a bug in XMLParser, or am I missing something?
Cheers,
Sean

Re: XMLParser Claims U+00A0 is “Invalid UTF-8”

monty-3
Just to be sure, I manually recreated your file (with the great Bless hex editor) and parsed it with no issue.

Please post your code and attach the actual source as a file separately.

> Sent: Thursday, July 28, 2016 at 3:12 PM
> From: "Sean P. DeNigris" <[hidden email]>
> To: [hidden email]
> Subject: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”


Re: XMLParser Claims U+00A0 is “Invalid UTF-8”

Sean P. DeNigris
Administrator
monty-3 wrote
> Just to be sure, I manually recreated your file (with the great Bless hex editor) and parsed it with no issue.

Thanks!

monty-3 wrote
> Please post your code and attach the actual source as a file separately.

The code is merely:
  messageLog := FileLocator home / 'illegal-UTF-sms.xml'.
  doc := XMLDOMParser parse: messageLog.

File: illegal-UTF-sms.xml <http://forum.world.st/file/n4908531/illegal-UTF-sms.xml>
Cheers,
Sean

Re: XMLParser Claims U+00A0 is “Invalid UTF-8”

Sven Van Caekenberghe-2
Sean,

Your XML file is not UTF-8 encoded; it is plain Unicode, at least the way it is served from the URL you gave.

(('http://forum.world.st/file/n4908531/illegal-UTF-sms.xml' asUrl retrieveContents) at: 72 ) = 160 asCharacter.

  "true"

Like you said,

160 asCharacter asString utf8Encoded.

  "#[194 160]"

But

#[ 160 ] utf8Decoded.

  Boom!

You specify UTF-8 encoding inside your XML, so I assume the parser then switches to that encoding, but your pure Unicode content is not UTF-8 encoded, which results in an exception. You see?

Sven

> On 28 Jul 2016, at 22:05, Sean P. DeNigris <[hidden email]> wrote:



Re: XMLParser Claims U+00A0 is “Invalid UTF-8”

Sean P. DeNigris
Administrator
Sven Van Caekenberghe-2 wrote
> Your XML file is not UTF-8 encoded, it is plain Unicode. At least the way it is served from the URL you gave.
> ..
> You see ?

Unfortunately, no! ha ha. I didn't generate the file and I took its assertion that it was UTF-8 at face value. How do I properly feed the file into XMLParser?
Cheers,
Sean

Re: XMLParser Claims U+00A0 is “Invalid UTF-8”

Sven Van Caekenberghe-2
In my older work image, the following just works:

XMLDOMParser parse:
('http://forum.world.st/file/n4908531/illegal-UTF-sms.xml' asUrl retrieveContents).

But I guess that is because my (older) XML parser version ignores the encoding, or is more lenient.

You could try to edit the incoming file, or have a look at #decodesCharacters:

(XMLDOMParser on:
('http://forum.world.st/file/n4908531/illegal-UTF-sms.xml' asUrl retrieveContents) readStream) decodesCharacters: false; parseDocument.

But I am no expert in the deeper aspects of XML Support.

> On 28 Jul 2016, at 22:29, Sean P. DeNigris <[hidden email]> wrote:



Re: XMLParser Claims U+00A0 is “Invalid UTF-8”

monty-3
In reply to this post by Sean P. DeNigris
You're double decoding. Use #onFileNamed:/#parseFileNamed: instead (and the DOM #printToFileNamed: family of messages when writing) and let XMLParser take care of this for you, or disable XMLParser's decoding before parsing with #decodesCharacters:.
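
As a rough sketch of the two options above: the file path in the first option is a hypothetical placeholder, and passing the file stream to #on: in the second follows the pattern used elsewhere in this thread rather than a documented recipe.

```smalltalk
"Option 1: hand XMLParser the file name and let it open and decode the file itself.
 The path below is a hypothetical placeholder."
doc := XMLDOMParser parseFileNamed: '/path/to/illegal-UTF-sms.xml'.

"Option 2: keep the already-decoding file stream, but switch off XMLParser's own decoding."
messageLog := FileLocator home / 'illegal-UTF-sms.xml'.
doc := (XMLDOMParser on: messageLog readStream)
	decodesCharacters: false;
	parseDocument.
```

Either way, exactly one layer ends up responsible for turning bytes into characters.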

Longer explanation:

The class-side #on:/#parse: messages take either a string or a stream (read the definitions). You gave it a FileReference, but because the argument is tested with isString and sent #readStream otherwise, it didn't blow up then.

File refs sent #readStream return file streams that do automatic decoding. But XMLParser automatically attempts its own decoding too, if:

 The input starts with a BOM, or the encoding can be inferred from null bytes before or after the first non-null byte.

 There is an encoding declaration with a non-UTF-8 encoding.

 There is a UTF-8 encoding declaration but the stream is not a normal ReadStream (your case).

So it gets decoded twice, and the decoded value of the char causes the error. I'll consider changing the heuristic to make it less eager to decode.
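
The double decode can be reproduced in a playground without XMLParser at all. This is only an illustration of the effect, not the parser's actual code path, and it assumes that #asByteArray on the decoded string yields the character values as bytes (which should hold for a byte string in Pharo).

```smalltalk
| utf8Bytes decodedOnce |
"The bytes as stored in the file: $. then U+00A0 (194 160) then $w."
utf8Bytes := #[46 194 160 119].

"First decode (what the file stream already does): three characters,
 the middle one being U+00A0."
decodedOnce := utf8Bytes utf8Decoded.

"Second decode (what the parser then attempts): the character values are read
 as bytes again. A lone 160 is a continuation byte with no lead byte, so this
 should signal an error, just as #[ 160 ] utf8Decoded does."
decodedOnce asByteArray utf8Decoded.
```

The file itself is valid UTF-8; the error only appears because the second pass sees 160 where the original bytes were 194 160.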

> Sent: Thursday, July 28, 2016 at 4:05 PM
> From: "Sean P. DeNigris" <[hidden email]>
> To: [hidden email]
> Subject: Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”


Re: XMLParser Claims U+00A0 is “Invalid UTF-8”

Sean P. DeNigris
Administrator
monty-3 wrote
> You're double decoding

And in public, no less! Thanks. It works now with #parseFileNamed:. Minus side - half a day wasted; plus side - I wrote a compatibility layer for Magritte-XMLBinding to accept SoupTags to #fromXmlNode:
Cheers,
Sean

Re: XMLParser Claims U+00A0 is “Invalid UTF-8”

monty-3
In reply to this post by Sven Van Caekenberghe-2
Good for finding one of the fixes, but please use #parseURL:/#onURL: instead of #asUrl/#asZnUrl with #retrieveContents, because that will result in Zinc eagerly decoding the response without looking at the <?xml ?> declaration as the XML spec requires.

#parseURL:/#onURL: use Zinc correctly, doing their own XML-aware encoding on top of it.
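
For the file attached earlier in this thread, that would look roughly like the following (a sketch; it assumes both messages accept the URL as a plain string):

```smalltalk
"Let XMLParser fetch the document and work out the encoding itself."
doc := XMLDOMParser parseURL: 'http://forum.world.st/file/n4908531/illegal-UTF-sms.xml'.

"Or create the parser first, in case it needs configuring before parsing."
doc := (XMLDOMParser onURL: 'http://forum.world.st/file/n4908531/illegal-UTF-sms.xml')
	parseDocument.
```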

> Sent: Thursday, July 28, 2016 at 5:29 PM
> From: "Sven Van Caekenberghe" <[hidden email]>
> To: "Any question about pharo is welcome" <[hidden email]>
> Subject: Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”


Re: XMLParser Claims U+00A0 is “Invalid UTF-8”

monty-3
Also, #parseURL:/#onURL: will use WebClient on Squeak (unless Zinc is present, of course).

> Sent: Thursday, July 28, 2016 at 6:15 PM
> From: monty <[hidden email]>
> To: [hidden email]
> Subject: Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”


Re: XMLParser Claims U+00A0 is “Invalid UTF-8”

Sven Van Caekenberghe-2
In reply to this post by monty-3

> On 29 Jul 2016, at 00:15, monty <[hidden email]> wrote:
>
> Good for finding one of the fixes, but please use #parseURL:/#onURL: instead of #asUrl/#asZnUrl with #retrieveContents, because that will result in Zinc eagerly decoding the response without looking at the <?xml ?> declaration as the XML spec requires.
>
> #parseURL:/#onURL: use Zinc correctly, doing their own XML-aware encoding on top of it.

Yes, you are right. Thanks for implementing all this logic, I know it is quite complicated and tricky.
