XML packages, CDATA and encoding

Torsten Bergmann

XML packages, CDATA and encoding

Create a file "test.xml" with the following
contents (German umlaut):

<?xml version="1.0" encoding="iso-8859-1"?>
<test><![CDATA[Zaunkönig]]></test>

After loading ConfigurationOfXML try to parse it:

|fs|
fs := FileStream fileNamed: 'test.xml'.
XMLDOMParser parseDocumentFrom: fs.

=> gives an error: 'Invalid utf8 input detected'
=> it works if you remove the CDATA section

Looks like UTF8TextConverter is used independently
of the encoding declared in the XML...

Bye
T.

NorbertHartl

Re: XML packages, CDATA and encoding

Torsten,
On 10.01.2011, at 13:38, Torsten Bergmann wrote:

> Looks like UTF8TextConverter is used independent
> from the encoding of the XML...
>

The problem does not seem to be the XML parser. FileStream class>>fileNamed: delegates to FileStream class>>concreteStream, which answers MultiByteFileStream. That stream initializes itself with the UTF-8 converter unless a different converter is set explicitly.

Besides that, I'm not sure whether the XML parser handles the document correctly even when the stream is set up properly for Latin-1 encoding.
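
One way to test that diagnosis is to set the converter by hand before parsing. This is only a sketch: it assumes the file really is ISO-8859-1 encoded and that MultiByteFileStream>>converter: and Latin1TextConverter are available in the image (as in Pharo 1.1). It has not been verified against every XMLSupport version.

```smalltalk
| fs doc |
fs := FileStream readOnlyFileNamed: 'test.xml'.
"Assumption: the file is ISO-8859-1, so replace the default UTF-8 converter."
fs converter: Latin1TextConverter new.
"Parse, and close the file even if parsing signals an error."
doc := [XMLDOMParser parseDocumentFrom: fs] ensure: [fs close].
```

If the parser later replaces the stream's converter based on the encoding declaration, this override may not stick.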

Norbert

cedreek

Re: XML packages, CDATA and encoding

Yes, I think the same, since the following works without problems (note that I have the latest SqueakSource version of the XML packages):

string := '<?xml version="1.0" encoding="iso-8859-1"?>
<test><![CDATA[Zaunkönig]]></test>'.

XMLDOMParser parseDocumentFrom: string readStream.

hth,

Cédrick

Henrik Sperre Johansen

Re: XML packages, CDATA and encoding

On 11.01.2011 16:58, Cédrick Béler wrote:

>> the problem seems not to be the xml parser. If you use FileStream>>fileNamed: the fileNamed: is delegated to FileStream class>>concreteStream which is MultiByteStream. This stream initializes itself with the utf8 converter if it isn't set intentionally.

Of course handling the encoding is the parser's job:
http://www.w3.org/TR/REC-xml/#charencoding

The XMLSupport package is oblivious to this, however, and only works on
internal streams.

Cheers,
Henry

cedreek

Re: XML packages, CDATA and encoding

On 11 Jan 2011, at 20:20, Henrik Sperre Johansen <[hidden email]> wrote:

> On 11.01.2011 16:58, Cédrick Béler wrote:
>>>> |fs|
>>>> fs := FileStream fileNamed: 'test.xml'.
>>>> XMLDOMParser parseDocumentFrom: fs.

Actually, I tried from a file, and it works too.

Pharo 1.1, Cog, OSX, recent version of XML support (from squeaksource)

Cheers,

Cédrick

Henrik Sperre Johansen

Re: XML packages, CDATA and encoding

On 12.01.2011 02:04, Cédrick Béler wrote:

> Actually, I tried from a file, and it works too.
>
> Pharo 1.1, Cog, OSX, recent version of XML support (from squeaksource)

Then the file you created was not actually ISO-8859-1 encoded, but UTF-8.
Try with the attached file.
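
A quick way to check which encoding a file really uses is to look at its raw bytes. The sketch below assumes the file name 'test.xml' from the original post. ISO-8859-1 stores ö as the single byte 16rF6, while UTF-8 stores it as the two bytes 16rC3 16rB6, and 16rF6 never occurs in valid UTF-8.

```smalltalk
| fs bytes |
fs := FileStream readOnlyFileNamed: 'test.xml'.
"Read the raw bytes; binary mode bypasses the text converter."
bytes := [fs binary; contents] ensure: [fs close].
"true suggests ISO-8859-1 (or another single-byte encoding); false is consistent with UTF-8."
bytes includes: 16rF6
```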

Cheers,
Henry

Attachment: test.xml (114 bytes)

Stéphane Ducasse

Re: XML packages, CDATA and encoding

In reply to this post by Torsten Bergmann

I was using Soup to scrape some URLs. Soup worked well with the data I got from a stream over the URL, but it complained when I saved the file to disk and read it from there.
I have to write a good Metacello configuration for Soup and for my code to find out whether this is a bug or not.

stef

jaayer

Re: XML packages, CDATA and encoding

In reply to this post by Torsten Bergmann

jaayer

Re: XML packages, CDATA and encoding

In reply to this post by Torsten Bergmann

Sorry, I accidentally hit reply before typing anything.

Torsten, what you are trying to do is not incorrect and should work as you expected it to. The reason why it didn't has less to do with XMLSupport per se and more to do with its reliance on Pharo's TextConverter system. The problem is faulty matching of the "encoding" attribute value to the appropriate subclass of TextConverter. The code responsible for this in XMLSupport is:
    converterClass :=
        (Smalltalk
            at: #TextConverter
            ifAbsent: [^ self])
                defaultConverterClassForEncoding: anEncodingName asLowercase.

But as you can see, the matching is actually done by TextConverter and its class-side #defaultConverterClassForEncoding: method, which works by sending #encodingNames to all subclasses and testing whether the returned array includes the specified encoding name. If you browse Latin1TextConverter, the right class for the encoding you specified, and look at its #encodingNames method, you will see that the array it returns does not include "ISO-8859-1":
    ^ #('latin-1' 'latin1') copy.

Change it to this (note the lowercase):
    ^ #('latin-1' 'latin1' 'iso-8859-1') copy.

and everything now works.

So this is really a bug in TextConverter and its Latin1TextConverter subclass, not in XMLSupport. Also, the #allSubclassesDo: scan in #defaultConverterClassForEncoding: should probably be augmented with a Dictionary cache to speed up lookups for known encoding-converter pairs. Can someone forward this message to whoever maintains TextConverter?
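
The lookup and the proposed cache can be tried in a workspace. This is a rough sketch, not a patch: the local cache dictionary and the lookup block are hypothetical names used only for illustration, and a real fix would live on the class side of TextConverter.

```smalltalk
| cache lookup |
"Expected to answer nil before the #encodingNames fix and Latin1TextConverter after it."
TextConverter defaultConverterClassForEncoding: 'iso-8859-1'.

"Hypothetical cache in front of the subclass scan."
cache := Dictionary new.
lookup := [:name |
	cache
		at: name asLowercase
		ifAbsentPut: [TextConverter defaultConverterClassForEncoding: name asLowercase]].
lookup value: 'latin1'.   "scans the subclasses once and stores the result"
lookup value: 'Latin1'.   "same key after lowercasing, answered from the cache"
```

Note that a failed lookup is cached as nil too, so such a cache would need to be flushed whenever a converter class is added or its #encodingNames changes.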

Stéphane Ducasse

Re: XML packages, CDATA and encoding

Hi jaayer

> Sorry, I accidentally hit reply before typing anything.

no problem :)

> So this is really a bug in TextConverter and its Latin1TextConverter subclass, not XMLSupport. Also, the #allSubclassesDo: test in #defaultConverterClassForEncoding: should probably be augmented with a Dictionary cache to speed-up lookups for known encoding-converter pairs. Can someone forward this message to whoever maintains TextConverter?

Nobody maintains TextConverter.
If you have some fixes, I would be really glad to integrate them.

Stéphane Ducasse

Re: XML packages, CDATA and encoding

In reply to this post by jaayer

I created
http://code.google.com/p/pharo/issues/detail?id=3541
so that we do not forget.

Stef

Sven Van Caekenberghe

Re: XML packages, CDATA and encoding

In reply to this post by jaayer

On 16 Jan 2011, at 03:09, jaayer wrote:

> But as you can see, the matching is actually done by TextConverter and its class-side #defaultConverterClassForEncoding: method, which works by sending #encodingNames to all subclasses and testing the array returned to see if it includes the specified encoding name. If you browse Latin1TextConverter, the right class for the encoding you specified, and look at its #encodingNames message, you will see the array it returns does not include "ISO-8859-1":
> ^ #('latin-1' 'latin1') copy.
>
> Change it to this (note the lowercase):
> ^ #('latin-1' 'latin1' 'iso-8859-1') copy.
>
> and everything now works.

I have seen this as well in the past: this should be changed.
I think this is a safe change: it just adds an alternative name to an existing encoding.

Sven