Decoding bug with XMLParser ?

Damien Pollet

Decoding bug with XMLParser ?

Hi,

I have a failing test to show the problem, but I can't commit to the
XMLSupport squeaksource, so I attach the MCZ here.
Basically, if I parse a UTF-8 document with an entity like &#8230;
(ellipsis), I don't get a Character with the correct #codePoint.

Cheers,

--
Damien Pollet
type less, do more [ | ] http://people.untyped.org/damien.pollet

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project

Attachment: XML-Parser-DamienPollet.75.mcz (137K)

Alexandre Bergel

Re: Decoding bug with XMLParser ?

Hi Damien,

thanks,

I will check your code today.

cheers,
Alexandre

On 16 May 2010, at 13:35, Damien Pollet wrote:

> Hi,
>
> I have a failing test to show the problem, but I can't commit to the
> XMLSupport squeaksource, so I attach the MCZ here.
> Basically, if I parse a UTF-8 document with an entity like &#8230;
> (ellipsis), I don't get a Character with the correct #codePoint.
>
> Cheers,
>
> --
> Damien Pollet
> type less, do more [ | ] http://people.untyped.org/damien.pollet
> <XML-Parser-DamienPollet.75.mcz>

--
_,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
Alexandre Bergel http://www.bergel.eu
^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.

Damien Pollet

Re: Decoding bug with XMLParser ?

On Sun, May 16, 2010 at 22:57, Alexandre Bergel <[hidden email]> wrote:
> I will check your code today.

Thanks.

Thinking about it, it's not really an encoding problem, but rather a bug
in the entity->character conversion. I guess there should be a similar
test where there is an actual ellipsis character in the XML, instead
of the entity.

For context, I was trying to use the Twitter component in Pier to
display http://twitter.com/rmod_inria (or the XML feed file, which has
the ellipsis character literally as well).
And now I realize our server will not be able to connect outside its
DMZ, so I won't be able to use the fix :D

--
Damien Pollet
type less, do more [ | ] http://people.untyped.org/damien.pollet

Alexandre Bergel-4

Re: Decoding bug with XMLParser ?

To give a bit of context, the problem is:

-=-=-=-=-=-=-=-=-=-=-=-=
exampleEncodedXML
    ^'<?xml version="1.0" encoding="UTF-8"?>
<test-data>&#8230;</test-data>
'

testDecodingCharacters
    | xmlDocument element |
    "XMLTokenizer testDecodingCharacters"

    xmlDocument := XMLDOMParser parseDocumentFrom: self exampleEncodedXML readStream.
    element := xmlDocument firstTagNamed: #'test-data'.

    self assert: element contentString first codePoint = 8230
-=-=-=-=-=-=-=-=-=-=-=-=

#testDecodingCharacters goes yellow

> Thinking of it, it's not really an encoding problem, rather a bug in
> the entity->character conversion. I guess there should be a similar
> test where there is an actual ellipsis character in the xml, instead
> of the entity.

Any idea how your test can go green?

> And now I realize our server will not be able to connect outside its
> DMZ, so I won't be able to use the fix :D

DMZ ?

Cheers,
Alexandre

jaayer

Fwd: Re: Decoding bug with XMLParser ?

============ Forwarded message ============
From : jaayer<[hidden email]>
To : <[hidden email]>
Date : Tue, 18 May 2010 16:30:06 -0700
Subject : Re: Decoding bug with XMLParser ?
============ Forwarded message ============

---- On Tue, 18 May 2010 02:29:18 -0700 Alexandre Bergel <[hidden email]> wrote ----

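>To give a bit of context, the problem is:
>
>-=-=-=-=-=-=-=-=-=-=-=-=
>exampleEncodedXML
>    ^'<?xml version="1.0" encoding="UTF-8"?>
><test-data>&#8230;</test-data>
>'
>
>testDecodingCharacters
>    | xmlDocument element |
>    "XMLTokenizer testDecodingCharacters"
>
>    xmlDocument := XMLDOMParser parseDocumentFrom: self exampleEncodedXML readStream.
>    element := xmlDocument firstTagNamed: #'test-data'.
>
>    self assert: element contentString first codePoint = 8230
>-=-=-=-=-=-=-=-=-=-=-=-=
>
>#testDecodingCharacters goes yellow
>
>Any idea how your test can go green?

Character references like the one above are handled by #nextCharReference. It reads the number after the "&#" or "&#x" prefix and then sends #value: to the class Unicode with that number as the argument. If you evaluate "(Unicode value: 8230) codePoint" in a workspace with cmd-p, you will see that the resulting code point is not what you would expect. For me it was 1069555750. The same behavior results when creating a Unicode character with #charFromUnicode:. Unless Unicode>>value: and Unicode>>charFromUnicode: are being used incorrectly, I am not sure that this is a bug, or at least a bug in XML-Support.

(I am working on adding full DTD support with validation, and on refactoring and re-engineering the parser at the moment, which is why minor releases have slowed to a trickle. I will take a closer look at how character encoding is handled in the process.)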
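
For reference, the decimal and hexadecimal spellings of a character reference name the same number, so either prefix should end up as the same argument to #value:. A tiny workspace check using only Integer literals (the hexadecimal spelling &#x2026; does not appear in the test; it is shown here only as an assumed equivalent of &#8230;):

```smalltalk
"U+2026 HORIZONTAL ELLIPSIS, written in base 16 and in base 10"
16r2026 = 8230.        "true"
```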

Levente Uzonyi-2

Re: Fwd: Re: Decoding bug with XMLParser ?

On Tue, 18 May 2010, jaayer wrote:

> Character references like the one above are handled by #nextCharReference. It reads the number after the "&#" or "&#x" prefix and then sends #value: to the class Unicode with that number as the argument. If you evaluate "(Unicode value: 8230) codePoint" in a workspace with cmd-p, you will see that the resulting code point is not what you would expect. For me it was 1069555750. The same behavior results when creating a Unicode character with #charFromUnicode:. Unless Unicode>>value: and Unicode>>charFromUnicode: are being used incorrectly, I am not sure that this is a bug, or at least a bug in XML-Support.
>
> (I am working on adding full DTD support with validation, and on refactoring and re-engineering the parser at the moment, which is why minor releases have slowed to a trickle. I will take a closer look at how character encoding is handled in the process.)

Another "hard to quote" message, but I hope my answer will be clear.
The "problem" is that in Pharo the leadingChar for unicode characters is
still 255. This was changed in Squeak 4.1 to 0. So in Squeak 4.1:
(Unicode value: 8230) codePoint. "===> 8230"

While in Pharo it's:
(Unicode value: 8230) codePoint. "===> 1069555750"
(Character value: 1069555750) charCode. "===> 8230"
(Character value: 1069555750) leadingChar. "===> 255"

So using #charCode instead of #codePoint is the solution.

Levente

Henrik Sperre Johansen

Re: Fwd: Re: Decoding bug with XMLParser ?

In reply to this post by jaayer

On 19.05.2010 02:17, jaayer wrote:

> Character references like the one above are handled by #nextCharReference. It reads the number after the "&#" or "&#x" prefix and then sends #value: to the class Unicode with that number as the argument. If you evaluate "(Unicode value: 8230) codePoint" in a workspace with cmd-p, you will see that the resulting code point is not what you would expect. For me it was 1069555750. The same behavior results when creating a Unicode character with #charFromUnicode:. Unless Unicode>>value: and Unicode>>charFromUnicode: are being used incorrectly, I am not sure that this is a bug, or at least a bug in XML-Support.
>
> (I am working on adding full DTD support with validation and refactoring and re-engineering the parser at the moment, which is why minor releases have slowed to a trickle. I will take a closer look at how character encoding is handled in the process.)

codePoint returns the raw value, which includes the leadingChar used to
differentiate between different locale interpretations of the same
character.
In 1.0 this was 255 for WideCharacters; in 1.1 it has been changed to 0.
I.e., using codePoint in the test is erroneous. For a method which returns
what you expect in both 1.0 and 1.1, use asUnicode.
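
A minimal sketch of how the test's final assertion could be rewritten along those lines. It reuses the selectors from the test posted earlier in the thread (parseDocumentFrom:, firstTagNamed:, contentString) and assumes, as stated above, that #asUnicode answers 8230 whether the leadingChar is 255 or 0. It has not been run against a particular XML-Parser version.

```smalltalk
testDecodingCharacters
	"Same test as posted earlier in the thread; only the final assertion changes."
	| xmlDocument element |
	xmlDocument := XMLDOMParser parseDocumentFrom: self exampleEncodedXML readStream.
	element := xmlDocument firstTagNamed: #'test-data'.

	"Compare via asUnicode rather than codePoint, so that the
	 leadingChar bits do not leak into the comparison."
	self assert: element contentString first asUnicode = 8230
```

Levente's suggestion of #charCode would slot into the same place in the assertion.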

Cheers,
Henry

Damien Pollet

Re: Fwd: Re: Decoding bug with XMLParser ?

In reply to this post by Levente Uzonyi-2

2010/5/19 Levente Uzonyi <[hidden email]>:

> Another "hard to quote" message, but I hope my answer will be clear.
> The "problem" is that in Pharo the leadingChar for unicode characters is
> still 255. This was changed in Squeak 4.1 to 0. So in Squeak 4.1:
> (Unicode value: 8230) codePoint. "===> 8230"
>
> While in Pharo it's:
> (Unicode value: 8230) codePoint. "===> 1069555750"
> (Character value: 1069555750) charCode. "===> 8230"
> (Character value: 1069555750) leadingChar. "===> 255"
>
> So using #charCode instead of #codePoint is the solution.

What about updating the leadingChar in Pharo to match Squeak? (I know
it's not the correct solution to the present problem, but it's these
kinds of sneaky differences between platforms that make life difficult.)

What's the semantic difference between picking 0 or 255? Is one more
correct than the other?

--
Damien Pollet
type less, do more [ | ] http://people.untyped.org/damien.pollet

Stéphane Ducasse

Re: Fwd: Re: Decoding bug with XMLParser ?

In reply to this post by Levente Uzonyi-2

> Another "hard to quote" message, but I hope my answer will be clear.
> The "problem" is that in Pharo the leadingChar for unicode characters is still 255. This was changed in Squeak 4.1 to 0. So in Squeak 4.1:
> (Unicode value: 8230) codePoint. "===> 8230"
>
> While in Pharo it's:
> (Unicode value: 8230) codePoint. "===> 1069555750"
> (Character value: 1069555750) charCode. "===> 8230"
> (Character value: 1069555750) leadingChar. "===> 255"
>
> So using #charCode instead of #codePoint is the solution.

Thanks, Levente.
In Pharo 1.1 we get the same as in Squeak.
> (Unicode value: 8230) codePoint. "===> 1069555750"
> (Character value: 1069555750) charCode. "===> 8230"
> (Character value: 1069555750) leadingChar. "===> 255"

The locale encoded by 0 is Unicode:

initialize

    self allSubclassesDo: [:each | each initialize].

    EncodedCharSets := Array new: 256.

    EncodedCharSets at: 0+1 put: Unicode "Latin1Environment".
    EncodedCharSets at: 1+1 put: JISX0208.
    EncodedCharSets at: 2+1 put: GB2312.
    EncodedCharSets at: 3+1 put: KSX1001.
    EncodedCharSets at: 4+1 put: JISX0208.
    EncodedCharSets at: 5+1 put: JapaneseEnvironment.
    EncodedCharSets at: 6+1 put: SimplifiedChineseEnvironment.
    EncodedCharSets at: 7+1 put: KoreanEnvironment.
    EncodedCharSets at: 8+1 put: GB2312.
    "EncodedCharSets at: 9+1 put: UnicodeTraditionalChinese."
    "EncodedCharSets at: 10+1 put: UnicodeVietnamese."
    EncodedCharSets at: 12+1 put: KSX1001.
    EncodedCharSets at: 13+1 put: GreekEnvironment.
    EncodedCharSets at: 14+1 put: Latin2Environment.
    EncodedCharSets at: 15+1 put: RussianEnvironment.
    EncodedCharSets at: 16+1 put: NepaleseEnvironment.
    EncodedCharSets at: 256 put: Unicode.

Now I was wondering why, in Squeak and Pharo:
(Character value: 1069555750) leadingChar. "===> 255"

stef

Stéphane Ducasse

Re: Fwd: Re: Decoding bug with XMLParser ?

In reply to this post by jaayer

Excellent! Let us know. It is good to get more support on the XML/DTD part.

Stef

On May 19, 2010, at 2:17 AM, jaayer wrote:

> (I am working on adding full DTD support with validation and refactoring and re-engineering the parser at the moment, which is why minor releases have slowed to a trickle. I will take a closer look at how character encoding is handled in the process.)

Levente Uzonyi-2

Re: Fwd: Re: Decoding bug with XMLParser ?

In reply to this post by Stéphane Ducasse

On Wed, 19 May 2010, Stéphane Ducasse wrote:

> Now I was wondering why in squeak and pharo
> (Character value: 1069555750) leadingChar. "===> 255"

Because the 22 least significant bits represent the #charCode and the
rest (8 bits) are the #leadingChar. So 1069555750 means that #charCode
is 8230 and the #leadingChar is 255.
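
The arithmetic can be checked in a workspace with plain Integer operations. This is only a sketch of the layout described above (22 low bits for the charCode, the bits above them for the leadingChar); the variable names are made up for illustration.

```smalltalk
| raw charCodeMask |
raw := 1069555750.
charCodeMask := (1 bitShift: 22) - 1.    "16r3FFFFF, the 22 least significant bits"

raw bitAnd: charCodeMask.                "8230, the charCode"
raw bitShift: -22.                       "255, the leadingChar"
(255 bitShift: 22) + 8230.               "1069555750, the raw value again"
```

With a leadingChar of 0 the shifted term vanishes, which is why the raw value and the charCode coincide in Squeak 4.1.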

Levente

Alexandre Bergel-4

Re: Fwd: Re: Decoding bug with XMLParser ?

>> Now I was wondering why in squeak and pharo
>> (Character value: 1069555750) leadingChar. "===> 255"
>
> Because the 22 least significant bits represent the #charCode and the rest (8 bits) are the #leadingChar. So 1069555750 means that #charCode is 8230 and the #leadingChar is 255.

Pfff... Reminds me of the old assembly lectures :-)

Thank you all for your comments. I will fix the test.

Alexandre

Stéphane Ducasse

Re: Fwd: Re: Decoding bug with XMLParser ?

I always liked the 68000... when I think that my co-developer implemented our game in about 300000 lines of assembly code overall,
and also used the 6502 with 3 registers and one specific...
The 68000 was a dream :)

Stef

On May 19, 2010, at 12:28 PM, Alexandre Bergel wrote:

> Pfff... Remind me the old assembly lecture :-)

Philippe Marschall-2-3

Re: Fwd: Re: Decoding bug with XMLParser ?

In reply to this post by Damien Pollet

On 19.05.2010 07:58, Damien Pollet wrote:

> What about updating the leadingChar in Pharo to match Squeak? (I know
> it's not the correct solution to the present problem but it's these
> kinds of sneaky differences between platform that make life difficult)
>
> What's the semantic difference between picking 0 or 255? Is one more
> correct than the other?
>

Currently the semantics in Pharo are
0: Latin 1
255: Unicode

which is fun because the first 255 characters are interned and you
therefore can't change their leadingChar. So if you're using Unicode,
you're forced to mix leadingChars.

Cheers
Philippe

Stéphane Ducasse

Re: Fwd: Re: Decoding bug with XMLParser ?

> Currently the semantics in Pharo are
> 0: Latin 1
> 255: Unicode
>
> which is fun because the first 255 characters are interned and you
> therefore can't change their leadingChar. So if you're using Unicode,
> you're forced to mix leadingChars.

So what do you suggest, to have Unicode as zero?

Stef

Philippe Marschall-2-3

Re: Fwd: Re: Decoding bug with XMLParser ?

On 19.05.2010 19:35, Stéphane Ducasse wrote:

> so what do you suggest to have unicode as zero?

Yes, the decision of Squeak 4.1 to make Unicode leadingChar 0 is an
improvement.

Cheers
Philippe

Stéphane Ducasse

Re: Fwd: Re: Decoding bug with XMLParser ?

OK,
maybe you should have said that before :)
Now how can this be done?
By changing the EncodedCharSets?

> initialize
>
> self allSubclassesDo: [:each | each initialize].
>
> EncodedCharSets := Array new: 256.
>
> EncodedCharSets at: 0+1 put: Unicode "Latin1Environment".
> EncodedCharSets at: 1+1 put: JISX0208.
> EncodedCharSets at: 2+1 put: GB2312.
> EncodedCharSets at: 3+1 put: KSX1001.
> EncodedCharSets at: 4+1 put: JISX0208.

But then I do not understand, because
EncodedCharSets at: 0+1 put: Unicode
suggests to me that this is already the case, but maybe I'm not looking at the right place.

What are the implications?
What will we break?

Stef

On May 19, 2010, at 9:06 PM, Philippe Marschall wrote:

> Yes, the decision of Squeak 4.1 to make Unicode leadingChar 0 is an
> improvement.

Henrik Sperre Johansen

Re: Fwd: Re: Decoding bug with XMLParser ?

On 19.05.2010 21:44, Stéphane Ducasse wrote:

> ok
> may be you should have said that before :)
> Now how this can be done?
> changing the EncodedCharSets?
>
>> initialize
>>
>> self allSubclassesDo: [:each | each initialize].
>>
>> EncodedCharSets := Array new: 256.
>>
>> EncodedCharSets at: 0+1 put: Unicode "Latin1Environment".
>> EncodedCharSets at: 1+1 put: JISX0208.
>> EncodedCharSets at: 2+1 put: GB2312.
>> EncodedCharSets at: 3+1 put: KSX1001.
>> EncodedCharSets at: 4+1 put: JISX0208.
> Then I do not understand because
> EncodedCharSets at: 0+1 put: Unicode
> seems to me that this is already the case but may be I'm not looking at the right place.
>
> What are the implications?
> What will we break.
>
> Stef

As you've noted, we've already done it.
It has broken the display of WideStrings with StrikeFonts, since that
relied on wide characters being stops, which did happen when leadingChar
was 255, but not when it's 0.
This manifests as the rest of the string after a wide character being
rendered as ?'s.
There's also an error causing the width of the chars to be wrong, and
the first char on the line to be displayed on the previous line instead,
but IIRC that's not directly related to the stops issue.

Cheers,
Henry

Stéphane Ducasse

Re: Fwd: Re: Decoding bug with XMLParser ?

Thanks for helping me there because I was really like "what?" but not totally confused yet.

On May 19, 2010, at 11:52 PM, Henrik Sperre Johansen wrote:

> As you've noted, we've already done it.
> It has broken the display of WideStrings with StrikeFonts, since that relied on wide characters being stops, which did happen when leadingChar was 255, but not when it's 0.
> This manifests as the rest of the string after a wide character being rendered as ?'s.
> There's also an error causing the width of the chars to be wrong, and the first char on the line to be displayed on the previous line instead, but iirc that's not directly related to the stops issue.

Stéphane Ducasse

Re: Fwd: Re: Decoding bug with XMLParser ?

http://code.google.com/p/pharo/issues/detail?id=2448

>> As you've noted, we've already done it.
>> It has broken the display of WideStrings with StrikeFonts, since that relied on wide characters being stops, which did happen when leadingChar was 255, but not when it's 0.
>> This manifests as the rest of the string after a wide character being rendered as ?'s.
>> There's also an error causing the width of the chars to be wrong, and the first char on the line to be displayed on the previous line instead, but iirc that's not directly related to the stops issue.
>>
>> Cheers,
>> Henry
