Create a file "test.xml" with the following
contents (German umlaut):

  <?xml version="1.0" encoding="iso-8859-1"?>
  <test><![CDATA[Zaunkönig]]></test>

After loading ConfigurationOfXML try to parse it:

  |fs|
  fs := FileStream fileNamed: 'test.xml'.
  XMLDOMParser parseDocumentFrom: fs.

=> gives an error: 'Invalid utf8 input detected'
=> it works if you remove the CDATA section

Looks like UTF8TextConverter is used independently
of the encoding declared in the XML...

Bye
T.
--
NEW: FreePhone - free mobile calls and surfing!
Find out more now: http://www.gmx.net/de/go/freephone
Torsten,

On 10.01.2011, at 13:38, Torsten Bergmann wrote:

> |fs|
> fs := FileStream fileNamed: 'test.xml'.
> XMLDOMParser parseDocumentFrom: fs.
>
> => gives an error: 'Invalid utf8 input detected'
> => it works if you remove the CDATA section
>
> Looks like UTF8TextConverter is used independent
> from the encoding of the XML...

The problem does not seem to be the XML parser. If you use FileStream class>>fileNamed:, the call is delegated to FileStream class>>concreteStream, which is MultiByteFileStream. This stream initializes itself with the UTF-8 converter unless one is set explicitly.

Besides that, I'm not sure whether the XML parser works correctly even when the stream is properly set up for Latin-1 encoding.

Norbert
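For reference, setting the stream up for Latin-1 explicitly could look like the workspace sketch below. It assumes the "test.xml" file from the original post exists in the image directory; FileStream class>>readOnlyFileNamed:, #converter: and Latin1TextConverter are standard Pharo, but whether XMLDOMParser keeps this converter or replaces it after reading the encoding declaration is not verified here.

```smalltalk
"A sketch only: override the default UTF-8 converter before parsing."
| fs doc |
fs := FileStream readOnlyFileNamed: 'test.xml'.
"A freshly opened file stream decodes as UTF-8 unless told otherwise"
fs converter: Latin1TextConverter new.
"Close the file even if parsing signals an error"
doc := [XMLDOMParser parseDocumentFrom: fs] ensure: [fs close].
```

If the parser still reports invalid UTF-8 with this setup, the converter is being replaced somewhere after the stream is handed over, which would point back at the encoding lookup rather than at the file stream defaults.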
> The problem does not seem to be the XML parser. If you use
> FileStream class>>fileNamed:, the call is delegated to
> FileStream class>>concreteStream, which is MultiByteFileStream.
> This stream initializes itself with the UTF-8 converter unless
> one is set explicitly.
>
> Besides that, I'm not sure whether the XML parser works correctly
> even when the stream is properly set up for Latin-1 encoding.
>
> Norbert

Yes, I think the same, as the following works without problem (note that I have the latest SqueakSource version of the XML-related packages):

string := '<?xml version="1.0" encoding="iso-8859-1"?>
<test><![CDATA[Zaunkönig]]></test>'.

XMLDOMParser parseDocumentFrom: fs contents.

hth,

Cédrick
On 11.01.2011 16:58, Cédrick Béler wrote:
> Yes, I think the same, as the following works without problem
> (note that I have the latest SqueakSource version of the
> XML-related packages):
>
> string := '<?xml version="1.0" encoding="iso-8859-1"?>
> <test><![CDATA[Zaunkönig]]></test>'.
>
> XMLDOMParser parseDocumentFrom: fs contents.
>
> hth,
>
> Cédrick

http://www.w3.org/TR/REC-xml/#charencoding

The XMLSupport package is oblivious to this, however, and only works on internal streams.

Cheers,
Henry
Actually, I tried from a file, and it works too.

Pharo 1.1, Cog, OS X, recent version of XML support (from SqueakSource)

Cheers,

Cédrick
On 12.01.2011 02:04, Cédrick Béler wrote:
Then the file you created was not actually ISO-8859-1 encoded, but rather UTF-8.
Try with the attached.

Cheers,
Henry

Attachment: test.xml (114 bytes)
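One way to tell which encoding a given "test.xml" really has is to look at its raw bytes. The following is a workspace sketch under the assumption that the file contains the single umlaut from the original post and that the stream answers a ByteArray in binary mode; the variable names are made up for illustration.

```smalltalk
"A sketch only: inspect the raw bytes of the file from the original post."
| fs bytes hasUtf8 hasLatin1 |
fs := FileStream readOnlyFileNamed: 'test.xml'.
"Binary mode reads raw bytes and bypasses any TextConverter"
fs binary.
bytes := [fs contents] ensure: [fs close].

"In UTF-8 the letter ö is the two-byte sequence 195 182 (16rC3 16rB6)"
hasUtf8 := (bytes indexOfSubCollection: #[195 182] startingAt: 1) > 0.
"In ISO-8859-1 the letter ö is the single byte 246 (16rF6)"
hasLatin1 := hasUtf8 not and: [bytes includes: 246].
```

If hasUtf8 is true, the file was saved as UTF-8 despite its encoding declaration, which would explain why parsing it with the default converter succeeds.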
In reply to this post by Torsten Bergmann
I was using Soup to scrape some URLs. Soup worked well with the data I got from a stream over the URL, but it complained once I had saved the file to disk.
I have to write a good Metacello configuration for Soup and for my code to determine whether this is a bug or not.

Stef

On Jan 10, 2011, at 1:38 PM, Torsten Bergmann wrote:

> => gives an error: 'Invalid utf8 input detected'
> => it works if you remove the CDATA section
>
> Looks like UTF8TextConverter is used independent
> from the encoding of the XML...
In reply to this post by Torsten Bergmann
Sorry, I accidentally hit reply before typing anything.

Torsten, what you are trying to do is not incorrect and should work as you expected it to. The reason why it didn't has less to do with XMLSupport per se and more to do with its reliance on Pharo's TextConverter system. The problem is faulty matching of the "encoding" attribute value to the appropriate subclass of TextConverter. The code responsible for this in XMLSupport is:

    converterClass :=
        (Smalltalk
            at: #TextConverter
            ifAbsent: [^ self])
                defaultConverterClassForEncoding: anEncodingName asLowercase.

But as you can see, the matching is actually done by TextConverter and its class-side #defaultConverterClassForEncoding: method, which works by sending #encodingNames to all subclasses and testing the array returned to see if it includes the specified encoding name. If you browse Latin1TextConverter, the right class for the encoding you specified, and look at its #encodingNames message, you will see the array it returns does not include "ISO-8859-1":

    ^ #('latin-1' 'latin1') copy.

Change it to this (note the lowercase):

    ^ #('latin-1' 'latin1' 'iso-8859-1') copy.

and everything now works.

So this is really a bug in TextConverter and its Latin1TextConverter subclass, not XMLSupport. Also, the #allSubclassesDo: test in #defaultConverterClassForEncoding: should probably be augmented with a Dictionary cache to speed up lookups for known encoding-converter pairs. Can someone forward this message to whoever maintains TextConverter?

---- On Mon, 10 Jan 2011 04:38:23 -0800 Torsten Bergmann wrote ----

> |fs|
> fs := FileStream fileNamed: 'test.xml'.
> XMLDOMParser parseDocumentFrom: fs.
>
> => gives an error: 'Invalid utf8 input detected'
> => it works if you remove the CDATA section
>
> Looks like UTF8TextConverter is used independent
> from the encoding of the XML...
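The Dictionary cache suggested for #defaultConverterClassForEncoding: could be sketched roughly as below. This is a workspace illustration, not a patch to TextConverter: the names cache and lookup are made up, and the sketch assumes that #defaultConverterClassForEncoding: answers nil for an unknown encoding name.

```smalltalk
"A sketch only: memoize encoding-name lookups in a Dictionary."
| cache lookup |
cache := Dictionary new.
lookup := [:encodingName |
    | key |
    key := encodingName asLowercase.
    cache
        at: key
        ifAbsent: [
            | found |
            "Fall back to the existing scan over all subclasses"
            found := TextConverter defaultConverterClassForEncoding: key.
            "Only remember successful lookups, so unknown names are retried"
            found ifNotNil: [cache at: key put: found].
            found]].

lookup value: 'latin1'.   "scans the subclasses once"
lookup value: 'LATIN1'.   "answered from the cache"
```

A real implementation would also need to flush the cache whenever a converter class is added, removed, or has its #encodingNames changed, otherwise stale entries would mask such changes.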
Hi jaayer
> Sorry, I accidentally hit reply before typing anything.

No problem :)

> So this is really a bug in TextConverter and its Latin1TextConverter
> subclass, not XMLSupport. Also, the #allSubclassesDo: test in
> #defaultConverterClassForEncoding: should probably be augmented with
> a Dictionary cache to speed-up lookups for known encoding-converter
> pairs. Can someone forward this message to whoever maintains
> TextConverter?

Nobody maintains TextConverter. If you have some fixes, I would be really glad to integrate them.
In reply to this post by jaayer
I created
http://code.google.com/p/pharo/issues/detail?id=3541
so that we don't forget.

Stef

On Jan 16, 2011, at 3:09 AM, jaayer wrote:

> If you browse Latin1TextConverter, the right class for the encoding
> you specified, and look at its #encodingNames message, you will see
> the array it returns does not include "ISO-8859-1":
> ^ #('latin-1' 'latin1') copy.
>
> Change it to this (note the lowercase):
> ^ #('latin-1' 'latin1' 'iso-8859-1') copy.
>
> and everything now works.
>
> So this is really a bug in TextConverter and its Latin1TextConverter
> subclass, not XMLSupport.
In reply to this post by jaayer
On 16 Jan 2011, at 03:09, jaayer wrote:

> But as you can see, the matching is actually done by TextConverter
> and its class-side #defaultConverterClassForEncoding: method, which
> works by sending #encodingNames to all subclasses and testing the
> array returned to see if it includes the specified encoding name.
> If you browse Latin1TextConverter, the right class for the encoding
> you specified, and look at its #encodingNames message, you will see
> the array it returns does not include "ISO-8859-1":
> ^ #('latin-1' 'latin1') copy.
>
> Change it to this (note the lowercase):
> ^ #('latin-1' 'latin1' 'iso-8859-1') copy.
>
> and everything now works.

I had seen this as well in the past: this should be changed.
I think this is a safe change; it just adds an alternative name to an existing encoding.

Sven
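The effect of adding the alias can be checked from a workspace with the same lookup that XMLSupport uses. This is a sketch; it assumes that the lookup answers nil when no converter class lists the given name.

```smalltalk
"A sketch only: evaluate each expression with 'print it'."

"Should answer nil while 'iso-8859-1' is missing from
 Latin1TextConverter class>>encodingNames, and Latin1TextConverter
 once the alias has been added."
TextConverter defaultConverterClassForEncoding: 'iso-8859-1'.

"The existing aliases resolve either way."
TextConverter defaultConverterClassForEncoding: 'latin1'.
```

Because the lookup compares strings exactly and XMLSupport lowercases the declared encoding before asking, the new alias has to be entered in lowercase, as noted in the quoted message.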