XML packages, CDATA and encoding


Torsten Bergmann
Create a file "test.xml" with the following
contents (note the German umlaut):


   <?xml version="1.0" encoding="iso-8859-1"?>
   <test><![CDATA[Zaunkönig]]></test>

After loading ConfigurationOfXML try to parse it:

 |fs|
 fs := FileStream fileNamed: 'test.xml'.
 XMLDOMParser parseDocumentFrom: fs.


=> gives an error: 'Invalid utf8 input detected'
=> it works if you remove the CDATA section

Looks like UTF8TextConverter is used regardless
of the encoding declared in the XML...

Bye
T.


Re: XML packages, CDATA and encoding

NorbertHartl
Torsten,
On 10.01.2011, at 13:38, Torsten Bergmann wrote:

> Create a file "test.xml" with the following
> contents (german umlaut):
>
>
>   <?xml version="1.0" encoding="iso-8859-1"?>
>   <test><![CDATA[Zaunkönig]]></test>
>
> After loading ConfigurationOfXML try to parse it:
>
> |fs|
> fs := FileStream fileNamed: 'test.xml'.
> XMLDOMParser parseDocumentFrom: fs.
>
>
> => gives an error: 'Invalid utf8 input detected'
> => it works if you remove the CDATA section
>
> Looks like UTF8TextConverter is used independent
> from the encoding of the XML...
>
The problem does not seem to be the XML parser. FileStream class>>fileNamed: delegates to FileStream class>>concreteStream, which is MultiByteFileStream. That stream initializes itself with the UTF-8 converter unless a converter is set explicitly.

Besides that, I'm not sure whether the XML parser parses correctly even when the stream is properly set up for Latin-1 encoding.
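
A minimal sketch of the workaround this implies, assuming the file really is ISO-8859-1 on disk and that Latin1TextConverter is present in the image. Whether the parser then copes with the content is not verified here:

```smalltalk
| fs doc |
"Assumption: test.xml really is ISO-8859-1 encoded on disk."
fs := FileStream readOnlyFileNamed: 'test.xml'.
"Replace the stream's default UTF-8 converter before anything is read."
fs converter: Latin1TextConverter new.
"Parse, and close the file even if the parser signals an error."
doc := [XMLDOMParser parseDocumentFrom: fs] ensure: [fs close].
doc
```

Setting the converter by hand sidesteps the default, but it also means the caller has to know the file's encoding in advance.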

Norbert
 



Re: XML packages, CDATA and encoding

cedreek
> the problem seems not to be the xml parser. If you use FileStream>>fileNamed: the fileNamed: is delegated to FileStream class>>concreteStream which is MultiByteStream. This stream initializes itself with the utf8 converter if it isn't set intentionally.
>
> Besides that I'm not sure if the parsing of the xml parser works correctly if the setup is properly done for latin1 encoding.
>
> Norbert

Yes, I think the same, since the following works without problems (note that I have the latest SqueakSource version of the XML packages):

string :=  '<?xml version="1.0" encoding="iso-8859-1"?>
  <test><![CDATA[Zaunkönig]]></test>'.

XMLDOMParser parseDocumentFrom: string.

hth,

Cédrick




Re: XML packages, CDATA and encoding

Henrik Sperre Johansen
On 11.01.2011 16:58, Cédrick Béler wrote:

> yes, I think the same as the following works without problem (note I have the last squeaksource version for XML related stuff)
>
> string :=  '<?xml version="1.0" encoding="iso-8859-1"?>
>    <test><![CDATA[Zaunkönig]]></test>'.
>
> XMLDOMParser parseDocumentFrom: fs contents.
>
> hth,
>
> Cédrick
Of course it is the job of the parser:
http://www.w3.org/TR/REC-xml/#charencoding

The XMLSupport package is oblivious to this however, and only works on
internal streams.

Cheers,
Henry


Re: XML packages, CDATA and encoding

cedreek




On 11 Jan 2011, at 20:20, Henrik Sperre Johansen <[hidden email]> wrote:

> On 11.01.2011 16:58, Cédrick Béler wrote:
>
> Create a file "test.xml" with the following
> contents (german umlaut):
>
>
>  <?xml version="1.0" encoding="iso-8859-1"?>
>  <test><![CDATA[Zaunkönig]]></test>
>
> After loading ConfigurationOfXML try to parse it:
>
> |fs|
> fs := FileStream fileNamed: 'test.xml'.
> XMLDOMParser parseDocumentFrom:

Actually, I tried from a file, and it works too.
Pharo 1.1, Cog, OSX, recent version of XML support (from squeaksource)

Cheers,

Cédrick


Re: XML packages, CDATA and encoding

Henrik Sperre Johansen
On 12.01.2011 02:04, Cédrick Béler wrote:

> Actually, I tried from a file, and it works too.
> Pharo 1.1, Cog, OSX, recent version of XML support (from squeaksource)

Then the file you created was not actually ISO-8859-1 encoded, but UTF-8.
Try with the attached file.
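
A rough way to check which encoding a given test.xml actually has, as a sketch that assumes the only non-ASCII character in the file is the "ö": in ISO-8859-1 it is the single byte 16rF6, while in UTF-8 it is the pair 16rC3 16rB6.

```smalltalk
| fs bytes |
fs := FileStream readOnlyFileNamed: 'test.xml'.
"Read raw bytes so that no text converter is involved."
fs binary.
bytes := [fs contents] ensure: [fs close].
"True if the file contains byte 16rF6. Valid UTF-8 never contains that
 byte, so true here means the file is not UTF-8."
bytes includes: 16rF6
```

If the expression answers false, the editor most likely saved the file as UTF-8 despite the encoding declaration, which would explain why parsing it raised no error.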

Cheers,
Henry

Attachment: test.xml (114 bytes)

Re: XML packages, CDATA and encoding

Stéphane Ducasse
In reply to this post by Torsten Bergmann
I was using Soup to scrape some URLs. Soup worked well with the content I got from a stream over the URL, but it complained once I had saved the file to disk.
I have to write a good Metacello configuration for Soup and for my code to find out whether this is a bug or not.

stef



Re: XML packages, CDATA and encoding

jaayer
In reply to this post by Torsten Bergmann



Re: XML packages, CDATA and encoding

jaayer
In reply to this post by Torsten Bergmann

Sorry, I accidentally hit reply before typing anything.

Torsten, what you are trying to do is not incorrect and should work as you expected it to. The reason why it didn't has less to do with XMLSupport per se and more to do with its reliance on Pharo's TextConverter system. The problem is faulty matching of the "encoding" attribute value to the appropriate subclass of TextConverter. The code responsible for this in XMLSupport is:
        converterClass :=
                (Smalltalk
                        at: #TextConverter
                        ifAbsent: [^ self])
                                defaultConverterClassForEncoding: anEncodingName asLowercase.

But as you can see, the matching is actually done by TextConverter and its class-side #defaultConverterClassForEncoding: method, which works by sending #encodingNames to all subclasses and testing the array returned to see if it includes the specified encoding name. If you browse Latin1TextConverter, the right class for the encoding you specified, and look at its #encodingNames message, you will see the array it returns does not include "ISO-8859-1":
        ^ #('latin-1' 'latin1') copy.

Change it to this  (note the lowercase):
        ^ #('latin-1' 'latin1' 'iso-8859-1') copy.

and everything now works.
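
One way to check the lookup from a workspace, as a sketch that uses only the selectors named above. Evaluate each expression separately, before and after changing #encodingNames; until the alias is added, the first is expected to find no converter class.

```smalltalk
"Which converter class, if any, is registered for the name used in the XML declaration?"
TextConverter defaultConverterClassForEncoding: 'iso-8859-1'.

"Is the alias part of the names Latin1TextConverter advertises?"
Latin1TextConverter encodingNames includes: 'iso-8859-1'.
```

Because XMLSupport lowercases the encoding name before the lookup, the alias only has to be registered in lowercase.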

So this is really a bug in TextConverter and its Latin1TextConverter subclass, not XMLSupport. Also, the #allSubclassesDo: test in #defaultConverterClassForEncoding: should probably be augmented with a Dictionary cache to speed up lookups for known encoding-converter pairs. Can someone forward this message to whoever maintains TextConverter?
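
The cache idea could look roughly like the class-side method below. This is only a sketch: the selector #cachedConverterClassForEncoding: and the class variable ConverterCache are hypothetical names, not part of the existing TextConverter.

```smalltalk
cachedConverterClassForEncoding: encodingName
	"Hypothetical class-side method on TextConverter.
	 ConverterCache is an assumed class variable that would have to be added."
	| key |
	key := encodingName asLowercase.
	ConverterCache ifNil: [ConverterCache := Dictionary new].
	^ ConverterCache
		at: key
		ifAbsent: [
			| found |
			"Fall back to the existing scan over all subclasses."
			found := self defaultConverterClassForEncoding: key.
			"Cache hits only, so converters loaded later are still found."
			found ifNotNil: [ConverterCache at: key put: found].
			found]
```

Misses are deliberately not cached. The cache would still need to be reset when a converter class is removed or its #encodingNames changes.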




Re: XML packages, CDATA and encoding

Stéphane Ducasse
Hi jaayer

> Sorry, I accidentally hit reply before typing anything.

no problem :)

>
> Torsten, what you are trying to do is not incorrect and should work as you expected it to. The reason why it didn't has less to do with XMLSupport per se and more to do with its reliance on Pharo's TextConverter system. The problem is faulty matching of the "encoding" attribute value to the appropriate subclass of TextConverter. The code responsible for this in XMLSupport is:
> converterClass :=
> (Smalltalk
> at: #TextConverter
> ifAbsent: [^ self])
> defaultConverterClassForEncoding: anEncodingName asLowercase.
>
> But as you can see, the matching is actually done by TextConverter and its class-side #defaultConverterClassForEncoding: method, which works by sending #encodingNames to all subclasses and testing the array returned to see if it includes the specified encoding name. If you browse Latin1TextConverter, the right class for the encoding you specified, and look at its #encodingNames message, you will see the array it returns does not include "ISO-8859-1":
> ^ #('latin-1' 'latin1') copy.
>
> Change it to this  (note the lowercase):
> ^ #('latin-1' 'latin1' 'iso-8859-1') copy.
>
> and everything now works.
>
> So this is really a bug in TextConverter and its Latin1TextConverter subclass, not XMLSupport. Also, the #allSubclassesDo: test in #defaultConverterClassForEncoding: should probably be augmented with a Dictionary cache to speed-up lookups for known encoding-converter pairs. Can someone forward this message to whoever maintains TextConverter?


Nobody maintains TextConverter.
If you have some fixes, I would be really glad to integrate them.






Re: XML packages, CDATA and encoding

Stéphane Ducasse
In reply to this post by jaayer
I created
        http://code.google.com/p/pharo/issues/detail?id=3541
so that we don't forget.

Stef






Re: XML packages, CDATA and encoding

Sven Van Caekenberghe
In reply to this post by jaayer

On 16 Jan 2011, at 03:09, jaayer wrote:

> But as you can see, the matching is actually done by TextConverter and its class-side #defaultConverterClassForEncoding: method, which works by sending #encodingNames to all subclasses and testing the array returned to see if it includes the specified encoding name. If you browse Latin1TextConverter, the right class for the encoding you specified, and look at its #encodingNames message, you will see the array it returns does not include "ISO-8859-1":
> ^ #('latin-1' 'latin1') copy.
>
> Change it to this  (note the lowercase):
> ^ #('latin-1' 'latin1' 'iso-8859-1') copy.
>
> and everything now works.

I have seen this as well in the past: this should be changed.
I think this is a safe change; it just adds an alternative name to an existing encoding.

Sven