Administrator
|
Posted to StackOverflow (https://stackoverflow.com/questions/38645553/xmlparser-in-pharo-claims-u00a0-is-invalid-utf-8):
Given the input:

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<sms body=". what" />

where the character after the "." in the body attribute of the sms tag is U+00A0, I get the error:

XMLEncodingException: Invalid UTF-8 character encoding (line 2) (column 13)

IIUC, the UTF-8 representation of that character is 0xC2 0xA0 per Wikipedia. Sure enough, bytes 72 and 73 of the input are 194 and 160 respectively.

This seems like a bug in XMLParser, or am I missing something?
Cheers,
Sean |
Just to be sure, I manually recreated your file (with the great Bless hex editor) and parsed it with no issue.
Please post your code and attach the actual source as a file separately.

> Sent: Thursday, July 28, 2016 at 3:12 PM
> From: "Sean P. DeNigris" <[hidden email]>
> To: [hidden email]
> Subject: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8”
>
> View this message in context: http://forum.world.st/XMLParser-Claims-U-00A0-is-Invalid-UTF-8-tp4908525.html
> Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com. |
Administrator
|
Thanks! The code is merely:

messageLog := FileLocator home / 'illegal-UTF-sms.xml'.
doc := XMLDOMParser parse: messageLog.

File: illegal-UTF-sms.xml <http://forum.world.st/file/n4908531/illegal-UTF-sms.xml>
Cheers,
Sean |
Sean,
Your XML file is not UTF-8 encoded, it is plain Unicode. At least the way it is served from the URL you gave.

(('http://forum.world.st/file/n4908531/illegal-UTF-sms.xml' asUrl retrieveContents) at: 72) = 160 asCharacter. "true"

Like you said,

160 asCharacter asString utf8Encoded. "#[194 160]"

But

#[ 160 ] utf8Decoded.

Boom!

You specify UTF-8 encoding inside your XML. I assume the parser then switches to that encoding, but your pure Unicode content is not UTF-8 encoded, which results in an exception.

You see?

Sven

> On 28 Jul 2016, at 22:05, Sean P. DeNigris <[hidden email]> wrote:
> View this message in context: http://forum.world.st/XMLParser-Claims-U-00A0-is-Invalid-UTF-8-tp4908525p4908531.html |
Administrator
|
Unfortunately, no! ha ha. I didn't generate the file and I took its assertion that it was UTF-8 at face value. How do I properly feed the file into XMLParser?
Cheers,
Sean |
In my older work image, the following just works:
XMLDOMParser parse: ('http://forum.world.st/file/n4908531/illegal-UTF-sms.xml' asUrl retrieveContents).

But I guess that is because my (older) XML parser version ignores the encoding, or is more lenient.

You could try to edit the incoming file, or have a look at #decodesCharacters:

(XMLDOMParser on: ('http://forum.world.st/file/n4908531/illegal-UTF-sms.xml' asUrl retrieveContents) readStream)
	decodesCharacters: false;
	parseDocument.

But I am no expert in the deeper aspects of XML Support.

> On 28 Jul 2016, at 22:29, Sean P. DeNigris <[hidden email]> wrote:
> View this message in context: http://forum.world.st/XMLParser-Claims-U-00A0-is-Invalid-UTF-8-tp4908525p4908539.html |
In reply to this post by Sean P. DeNigris
You're double decoding. Use onFileNamed:/parseFileNamed: instead (and the DOM printToFileNamed: family of messages when writing) and let XMLParser take care of this for you, or disable XMLParser decoding before parsing with #decodesCharacters:.
Longer explanation: The class-side #on:/#parse: messages take either a string or a stream (read the definitions). You gave it a FileReference, but because the argument is tested with isString and sent #readStream otherwise, it didn't blow up then. File refs sent #readStream return file streams that do automatic decoding. But XMLParser automatically attempts its own decoding too, if:

- The input starts with a BOM or its encoding can be inferred from null bytes before or after the first non-null byte.
- There is an encoding declaration with a non-UTF-8 encoding.
- There is a UTF-8 encoding declaration but the stream is not a normal ReadStream (your case).

So it gets decoded twice, and the decoded value of the char causes the error. I'll consider changing the heuristic to make it less eager to decode.

> Sent: Thursday, July 28, 2016 at 4:05 PM
> From: "Sean P. DeNigris" <[hidden email]>
> To: [hidden email]
> Subject: Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8” |
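For readers who want to see the double decoding in isolation, here is a minimal playground sketch. It assumes a Pharo image in which #utf8Decoded is available on ByteArray (it is used earlier in this thread); the variable names are made up for illustration, and the exact error class and message may differ between versions.

```smalltalk
| utf8Bytes decodedOnce misreadAsBytes |
"UTF-8 bytes for U+00A0, as found at positions 72 and 73 of the file."
utf8Bytes := #[194 160].

"First decode (what the file stream already does): one character, code 160."
decodedOnce := utf8Bytes utf8Decoded.
decodedOnce first asInteger. "160"

"Second decode (what the parser then attempts): the decoded character
 is treated as if it were still a raw byte, and a lone 160 (16rA0)
 is not a valid UTF-8 sequence."
misreadAsBytes := ByteArray with: decodedOnce first asInteger.
[ misreadAsBytes utf8Decoded ]
	on: Error
	do: [ :error | error messageText ]
```

The last expression should answer the decoder's error text rather than a string, which is roughly the situation XMLParser reports as "Invalid UTF-8 character encoding".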
Administrator
|
And in public, no less! Thanks. It works now with #parseFileNamed:. Minus side - half a day wasted; plus side - I wrote a compatibility layer for Magritte-XMLBinding to accept SoupTags to #fromXmlNode:
Cheers,
Sean |
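For the record, the working variant probably looks something like the sketch below. It is an assumption that #fullName is how the FileLocator from the original snippet gets turned into the path string that #parseFileNamed: expects; the second expression is the #decodesCharacters: alternative suggested earlier in the thread, shown here with the file stream instead of the URL contents.

```smalltalk
| messageLog doc |
messageLog := FileLocator home / 'illegal-UTF-sms.xml'.

"Hand XMLParser a path string so that it opens and decodes the file itself."
doc := XMLDOMParser parseFileNamed: messageLog fullName.

"Alternative mentioned above: keep the decoding file stream and
 switch off XMLParser's own decoding before parsing."
doc := (XMLDOMParser on: messageLog readStream)
	decodesCharacters: false;
	parseDocument.
```

Either way the content is decoded exactly once, which is the point of both suggestions.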
In reply to this post by Sven Van Caekenberghe-2
Good for finding one of the fixes, but please use #parseURL:/#onURL: instead of #asUrl/#asZnUrl with #retrieveContents, because that will result in Zinc eagerly decoding the response without looking at the <?xml ?> declaration as the XML spec requires.
#parseURL:/#onURL: use Zinc correctly, doing their own XML-aware encoding on top of it.

> Sent: Thursday, July 28, 2016 at 5:29 PM
> From: "Sven Van Caekenberghe" <[hidden email]>
> To: "Any question about pharo is welcome" <[hidden email]>
> Subject: Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8” |
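A sketch of the suggested call, reusing the attachment URL from earlier in the thread. It assumes that #parseURL: accepts the URL as a plain string; check the method comment in your XMLParser version.

```smalltalk
| doc |
"Let XMLParser fetch the resource itself, so that decoding follows the
 XML declaration instead of happening eagerly in #retrieveContents."
doc := XMLDOMParser parseURL: 'http://forum.world.st/file/n4908531/illegal-UTF-sms.xml'.
```

This replaces the #asUrl / #retrieveContents combination, so nothing decodes the response before the parser has seen the <?xml ?> declaration.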
Also #parseURL:/#onURL: will use WebClient on Squeak (unless Zinc is present of course)
> Sent: Thursday, July 28, 2016 at 6:15 PM
> From: monty <[hidden email]>
> To: [hidden email]
> Subject: Re: [Pharo-users] XMLParser Claims U+00A0 is “Invalid UTF-8” |
In reply to this post by monty-3
> On 29 Jul 2016, at 00:15, monty <[hidden email]> wrote:

Yes, you are right. Thanks for implementing all this logic, I know it is quite complicated and tricky. |