
XMLParser Claims U+00A0 is “Invalid UTF-8”


Sean P. DeNigris

Posted to StackOverflow (https://stackoverflow.com/questions/38645553/xmlparser-in-pharo-claims-u00a0-is-invalid-utf-8):

Given the input:

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<sms body=". what" />

Where the character after the "." in the body attribute of the sms tag is U+00A0;

I get the error:

XMLEncodingException: Invalid UTF-8 character encoding (line 2) (column 13)

IIUC, the UTF-8 representation of that character is 0xC2 0xA0 per Wikipedia. Sure enough, bytes 72 and 73 of the input are 194 and 160 respectively.

This seems like a bug in XMLParser, or am I missing something?

Cheers,
Sean

monty-3

Just to be sure, I manually recreated your file (with the great Bless hex editor) and parsed it with no issue.

Please post your code and attach the actual source as a file separately.


Sean P. DeNigris

monty-3 wrote
> Just to be sure, I manually recreated your file (with the great Bless hex editor) and parsed it with no issue.

Thanks!

monty-3 wrote
> Please post your code and attach the actual source as a file separately.

The code is merely:

messageLog := FileLocator home / 'illegal-UTF-sms.xml'.
doc := XMLDOMParser parse: messageLog.

File: illegal-UTF-sms.xml <http://forum.world.st/file/n4908531/illegal-UTF-sms.xml>

Cheers,
Sean

Sven Van Caekenberghe-2

Sean,

Your XML file is not UTF-8 encoded; it is plain Unicode, at least the way it is served from the URL you gave.

(('http://forum.world.st/file/n4908531/illegal-UTF-sms.xml' asUrl retrieveContents) at: 72 ) = 160 asCharacter.

"true"

Like you said,

160 asCharacter asString utf8Encoded.

"#[194 160]"

But

#[ 160 ] utf8Decoded.

Boom!

You specify UTF-8 encoding inside your XML, so I assume the parser then switches to that encoding. But your pure Unicode content is not UTF-8 encoded, which results in an exception. You see?

Sven


Sean P. DeNigris

Sven Van Caekenberghe-2 wrote
> Your XML file is not UTF-8 encoded, it is plain Unicode. At least the way it is served from the URL you gave.
> ..
> You see ?

Unfortunately, no! Ha ha. I didn't generate the file, and I took its assertion that it was UTF-8 at face value. How do I properly feed the file into XMLParser?

Cheers,
Sean

Sven Van Caekenberghe-2

In my older work image, the following just works:

XMLDOMParser parse:
('http://forum.world.st/file/n4908531/illegal-UTF-sms.xml' asUrl retrieveContents).

But I guess that is because my (older) XML parser version ignores the encoding, or is more lenient.

You could try to edit the incoming file, or have a look at #decodesCharacters:

(XMLDOMParser on:
('http://forum.world.st/file/n4908531/illegal-UTF-sms.xml' asUrl retrieveContents) readStream) decodesCharacters: false; parseDocument.

But I am no expert in the deeper aspects of XML Support.


monty-3

In reply to this post by Sean P. DeNigris

You're double decoding. Use onFileNamed:/parseFileNamed: instead (and the DOM printToFileNamed: family of messages when writing) and let XMLParser take care of this for you, or disable XMLParser decoding before parsing with #decodesCharacters:.
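A minimal sketch of both options, applied to the snippet posted earlier in the thread. It assumes that the class-side parseFileNamed: accepts a path string and that #fullName on the FileLocator yields one; the variable names are carried over from the earlier post.

```smalltalk
| messageLog doc |
messageLog := FileLocator home / 'illegal-UTF-sms.xml'.

"Option 1: hand XMLParser the file name so it opens the file itself
 and is the only layer that decodes the bytes."
doc := XMLDOMParser parseFileNamed: messageLog fullName.

"Option 2: keep the auto-decoding file stream, but tell XMLParser
 not to decode a second time."
doc := (XMLDOMParser on: messageLog readStream)
	decodesCharacters: false;
	parseDocument.
```

Either option alone should be enough; they are shown together only for comparison.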

Longer explanation:

The class-side #on:/#parse: messages take either a string or a stream (read the definitions). You gave it a FileReference, but because the argument is tested with isString and sent #readStream otherwise, it didn't blow up then.

File refs sent #readStream return file streams that do automatic decoding. But XMLParser automatically attempts its own decoding too, if:

- The input starts with a BOM, or the encoding can be inferred from null bytes before or after the first non-null byte.
- There is an encoding declaration with a non-UTF-8 encoding.
- There is a UTF-8 encoding declaration but the stream is not a normal ReadStream (your case).

So it gets decoded twice, and the decoded value of the char causes the error. I'll consider changing the heuristic to make it less eager to decode.
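The double decoding can be mimicked in a playground without XMLParser at all. This is only an illustration of the mechanism, not XMLParser's actual code path: the byte values are a hand-written stand-in for the attribute value, and #asByteArray is used here as an assumed way of treating already-decoded characters as raw bytes again.

```smalltalk
| bytes decodedOnce |
"UTF-8 bytes for a '.', U+00A0 (194 160), a space, and 'what'."
bytes := #[46 194 160 32 119 104 97 116].

"First decode (what the file stream does): 194 160 collapses
 into a single character with code point 160."
decodedOnce := bytes utf8Decoded.
(decodedOnce at: 2) asInteger.

"Second decode (what the parser then attempts): the lone 160 is a
 continuation byte with no lead byte, so this signals an error."
decodedOnce asByteArray utf8Decoded.
```

The last line fails for the same reason as the `#[ 160 ] utf8Decoded` example earlier in the thread.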


Sean P. DeNigris

monty-3 wrote
> You're double decoding

And in public, no less! Thanks. It works now with #parseFileNamed:. Minus side - half a day wasted; plus side - I wrote a compatibility layer for Magritte-XMLBinding to accept SoupTags to #fromXmlNode:

Cheers,
Sean

monty-3

In reply to this post by Sven Van Caekenberghe-2

Good for finding one of the fixes, but please use #parseURL:/#onURL: instead of #asUrl/#asZnUrl with #retrieveContents, because that will result in Zinc eagerly decoding the response without looking at the <?xml ?> declaration as the XML spec requires.

#parseURL:/#onURL: use Zinc correctly, doing their own XML-aware encoding on top of it.
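For the file attached to this thread, that would look roughly like the following sketch. It assumes the class-side parseURL: accepts the URL as a string.

```smalltalk
| doc |
"Let XMLParser fetch the resource itself, so it can look at the
 <?xml ?> declaration before deciding how to decode the response."
doc := XMLDOMParser parseURL: 'http://forum.world.st/file/n4908531/illegal-UTF-sms.xml'.
```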


monty-3

Also, #parseURL:/#onURL: will use WebClient on Squeak (unless Zinc is present, of course).


Sven Van Caekenberghe-2

In reply to this post by monty-3

> On 29 Jul 2016, at 00:15, monty <[hidden email]> wrote:
>
> Good for finding one of the fixes, but please use #parseURL:/#onURL: instead of #asUrl/#asZnUrl with #retrieveContents, because that will result in Zinc eagerly decoding the response without looking at the <?xml ?> declaration as the XML spec requires.
>
> #parseURL:/#onURL: use Zinc correctly, doing their own XML-aware encoding on top of it.

Yes, you are right. Thanks for implementing all this logic; I know it is quite complicated and tricky.
