String input not in UTF-8

Hilaire Fernandes

String input not in UTF-8

It seems that when accented characters are typed in, the resulting string is
not UTF-8 by default.
Is that the case with Pharo 1.3?

Hilaire

--
Education 0.2 -- http://blog.ofset.org/hilaire

Henrik Sperre Johansen

Re: String input not in UTF-8

On Aug 5, 2011, at 1:14:35 PM, Hilaire Fernandes wrote:

> It seems that when accented characters are typed in, the resulting string is
> not UTF-8 by default.
> Is that the case with Pharo 1.3?

I'm not sure what you mean.
Inside the image, all the way from InputEvents to the String representation, you only deal with Unicode code points.

Cheers,
Henry

Hilaire Fernandes

Re: String input not in UTF-8

On 05/08/2011 13:28, Henrik Johansen wrote:

> I'm not sure what you mean.
> Inside the image, all the way from InputEvents to the String representation, you only deal with Unicode code points.

It seems these are 8-bit characters: when exported through XMLParser, the
result is an 8-bit string. I need to investigate further.

Hilaire


Henrik Sperre Johansen

Re: String input not in UTF-8

On Aug 5, 2011, at 3:41:54 PM, Hilaire Fernandes wrote:

> It seems these are 8-bit characters: when exported through XMLParser, the
> result is an 8-bit string. I need to investigate further.

It is an 8-bit character, since the code point fits in one byte (see a).
An accented character like é can be either:
a) One Unicode code point: U+00E9 (decimal 233), small e with acute accent.
b) Two Unicode code points: U+0065 (decimal 101), small e, followed by U+0301 (decimal 769), combining acute accent.

Internally, you'd see strings whose character values correspond to the decimal values listed, i.e. the Unicode code points.
b) would be a WideString, as 769 does not fit in a byte.

However, if correctly converted to UTF-8, their representations should be:
a) 2 bytes: 16r C3A9
b) 3 bytes: 16r 65 CC81

I.e. it seems XMLParser does not encode it properly to UTF-8 when exporting.
Note: this is perfectly legal if the document contains an encoding attribute specifying a one-byte encoding like iso-8859-1 or windows-1252
(i.e. it starts with <?xml version="1.0" encoding="windows-1252" ?> or some such).
Absent such an attribute, or a BOM indicating another Unicode encoding, it is a bug.
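
For anyone who wants to check the byte counts outside the image, here is a minimal sketch in Python (not Pharo, and not what XMLParser does internally). It assumes the two-code-point form is written with the base letter before the combining accent, which is the order Unicode uses. The variable names are only for illustration.

```python
import unicodedata

composed = "\u00e9"      # a) one code point: small e with acute accent
decomposed = "e\u0301"   # b) small e followed by combining acute accent

for label, text in (("a", composed), ("b", decomposed)):
    code_points = [ord(ch) for ch in text]   # the values a String holds internally
    utf8_hex = text.encode("utf-8").hex()    # the bytes a UTF-8 export should write
    print(label, code_points, utf8_hex)
# a [233] c3a9
# b [101, 769] 65cc81

# Form a) fits in one byte under a one-byte encoding such as ISO-8859-1,
# which is what an unconverted 8-bit export would contain.
latin1_hex = composed.encode("iso-8859-1").hex()
print(latin1_hex)  # e9

# NFC normalisation composes form b) into form a).
assert unicodedata.normalize("NFC", decomposed) == composed
```

If an exported file contains the single byte E9 for é and declares no one-byte encoding, that matches the bug described above.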

Cheers,
Henry

Hilaire Fernandes

Re: String input not in UTF-8

I had a look at the latest XMLParser, but the API is different and I am
facing a lot of broken code. Is XMLWriter class>>on: obsolete? It keeps
nagging me about that, yet the class and method are still there. Is this a
Monticello trick I have forgotten about?
I don't even know how to port to the new API. Is there a porting guide?
I guess this is for the better, but it is still frustrating and distracting
from the main task...

On 05/08/2011 16:23, Henrik Johansen wrote:

> I.e. it seems XMLParser does not encode it properly to UTF-8 when exporting.

Stéphane Ducasse

Re: String input not in UTF-8

On Aug 5, 2011, at 4:41 PM, Hilaire Fernandes wrote:

> I had a look at the latest XMLParser, but the API is different and I am
> facing a lot of broken code. Is XMLWriter class>>on: obsolete? It keeps
> nagging me about that, yet the class and method are still there. Is this a
> Monticello trick I have forgotten about?
> I don't even know how to port to the new API. Is there a porting guide?
> I guess this is for the better, but it is still frustrating and distracting
> from the main task...

Indeed.
We should really invest in some of the main packages.
For example, I worked on SOUP to add comments and new tests.
Right now we (the core team) do not have the energy to work on both the core and external packages.
I hope that will change once the core gets fixed.
