Hi,
I have a failing test to show the problem, but I can't commit to the XMLSupport squeaksource, so I attach the MCZ here.

Basically, if I parse a UTF-8 document with an entity like … (ellipsis), I don't get a Character with the correct #codePoint.

Cheers,

--
Damien Pollet
type less, do more [ | ] http://people.untyped.org/damien.pollet

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project

Attachment: XML-Parser-DamienPollet.75.mcz (137K)

Hi Damien,
thanks, I will check your code today.

cheers,
Alexandre

On 16 May 2010, at 13:35, Damien Pollet wrote:
> I have a failing test to show the problem, but I can't commit to the
> XMLSupport squeaksource, so I attach the MCZ here.
> Basically, if I parse an UTF-8 document with an entity like …
> (ellipsis), I don't get a Character with the correct #codePoint.

--
_,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
Alexandre Bergel http://www.bergel.eu
^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.

On Sun, May 16, 2010 at 22:57, Alexandre Bergel <[hidden email]> wrote:
> I will check your code today.

Thanks.

Thinking of it, it's not really an encoding problem, rather a bug in the entity->character conversion. I guess there should be a similar test where there is an actual ellipsis character in the XML, instead of the entity.

For context, I was trying to use the Twitter component in Pier to display http://twitter.com/rmod_inria (or the XML feed file, which has the ellipsis character literally as well).

And now I realize our server will not be able to connect outside its DMZ, so I won't be able to use the fix :D

--
Damien Pollet

To give a bit of context, the problem is:
-=-=-=-=-=-=-=-=-=-=-=-=
exampleEncodedXML
    ^'<?xml version="1.0" encoding="UTF-8"?>
<test-data>…</test-data>
'

testDecodingCharacters
    | xmlDocument element |
    "XMLTokenizer testDecodingCharacters"

    xmlDocument := XMLDOMParser parseDocumentFrom: self exampleEncodedXML readStream.
    element := xmlDocument firstTagNamed: #'test-data'.

    self assert: element contentString first codePoint = 8230
-=-=-=-=-=-=-=-=-=-=-=-=

#testDecodingCharacters goes yellow.

> Thinking of it, it's not really an encoding problem, rather a bug in
> the entity->character conversion. I guess there should be a similar
> test where there is an actual ellipsis character in the xml, instead
> of the entity.

Any idea how your test can go green?

> And now I realize our server will not be able to connect outside its
> DMZ, so I won't be able to use the fix :D

DMZ?

Cheers,
Alexandre

============ Forwarded message ============
From : jaayer <[hidden email]>
To : <[hidden email]>
Date : Tue, 18 May 2010 16:30:06 -0700
Subject : Re: Decoding bug with XMLParser ?
============ Forwarded message ============

---- On Tue, 18 May 2010 02:29:18 -0700 Alexandre Bergel <[hidden email]> wrote ----

> To give a bit of context, the problem is:
> [...]
> self assert: element contentString first codePoint = 8230
> [...]
> #testDecodingCharacters goes yellow

Character references like the one above are handled using #nextCharReference. It does so by reading the number after the "&#" or "&#x" prefix and then sending #value: to the class Unicode with that as the argument.

If you evaluate the following code in a workspace with cmd-p: "(Unicode value: 8230) codePoint", you will see that the resulting code point is not what you would expect. For me it was "1069555750". The same behavior results when creating a Unicode character with #charFromUnicode:.

Unless Unicode>>value: and Unicode>>charFromUnicode: are being used incorrectly, I am not sure that this is a bug, or at least a bug in XML-Support.

(I am working on adding full DTD support with validation and refactoring and re-engineering the parser at the moment, which is why minor releases have slowed to a trickle. I will take a closer look at how character encoding is handled in the process.)

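As a small illustration of the first step described above: a numeric character reference carries the code point in decimal after "&#" or in hexadecimal after "&#x", so the ellipsis could be written either as &#8230; or as &#x2026;. Both spellings are given here only as examples, since the archive shows the rendered character rather than the original reference. The workspace lines below check nothing but the arithmetic; they do not call the parser or #nextCharReference.

```smalltalk
"Workspace sketch: plain Integer arithmetic, no XML classes involved."
16r2026 = 8230.            "hexadecimal 2026 and decimal 8230 are the same number"
'8230' asInteger = 8230.   "the digits of a decimal reference, read as an Integer"
```

Whichever spelling is used, the parser ends up with the integer 8230, which is the value the test compares against.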
On Tue, 18 May 2010, jaayer wrote:
> If you evaluate the following code in a workspace with cmd-p:
> "(Unicode value: 8230) codePoint", you will see that the resulting
> code point is not what you would expect. For me it was "1069555750".

Another "hard to quote" message, but I hope my answer will be clear.

The "problem" is that in Pharo the leadingChar for Unicode characters is still 255. This was changed to 0 in Squeak 4.1. So in Squeak 4.1:

    (Unicode value: 8230) codePoint. "===> 8230"

While in Pharo it's:

    (Unicode value: 8230) codePoint. "===> 1069555750"
    (Character value: 1069555750) charCode. "===> 8230"
    (Character value: 1069555750) leadingChar. "===> 255"

So using #charCode instead of #codePoint is the solution.

Levente

In reply to this post by jaayer
On 19.05.2010 02:17, jaayer wrote:
> If you evaluate the following code in a workspace with cmd-p:
> "(Unicode value: 8230) codePoint", you will see that the resulting
> code point is not what you would expect. For me it was "1069555750".

The leadingChar is used to differentiate between different locale interpretations of the same character. In 1.0 this was 255 for WideCharacters; in 1.1 it has been changed to 0.

I.e., using codePoint in the test is erroneous. For a method which returns what you expect in both 1.0 and 1.1, use asUnicode.

Cheers,
Henry

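A sketch of how the failing assertion from the test quoted earlier in the thread could be rewritten along the lines Levente and Henry suggest. Only the last line differs from the original test. #asUnicode and #charCode are the selectors named in this thread; whether the fix that was eventually committed used either of them is not stated here, so treat this as an assumption.

```smalltalk
testDecodingCharacters
    | xmlDocument element |
    xmlDocument := XMLDOMParser parseDocumentFrom: self exampleEncodedXML readStream.
    element := xmlDocument firstTagNamed: #'test-data'.

    "Compare the Unicode value rather than the raw #codePoint, which in
     Pharo 1.0 also carries the leadingChar (255) in its upper bits."
    self assert: element contentString first asUnicode = 8230
```

Replacing asUnicode with charCode in the last line gives Levente's variant of the same idea.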
In reply to this post by Levente Uzonyi-2
2010/5/19 Levente Uzonyi <[hidden email]>:
> The "problem" is that in Pharo the leadingChar for unicode characters is
> still 255. This was changed in Squeak 4.1 to 0.
> [...]
> So using #charCode instead of #codePoint is the solution.

What about updating the leadingChar in Pharo to match Squeak? (I know it's not the correct solution to the present problem, but it's these kinds of sneaky differences between platforms that make life difficult.)

What's the semantic difference between picking 0 or 255? Is one more correct than the other?

--
Damien Pollet

In reply to this post by Levente Uzonyi-2
> So using #charCode instead of #codePoint is the solution.

Thanks levente.
In Pharo 1.1 we get the same as in squeak.

> (Unicode value: 8230) codePoint. "===> 1069555750"
> (Character value: 1069555750) charCode. "===> 8230"
> (Character value: 1069555750) leadingChar. "===> 255"

0 locale encoded by 0 is Unicode

initialize

    self allSubclassesDo: [:each | each initialize].

    EncodedCharSets := Array new: 256.

    EncodedCharSets at: 0+1 put: Unicode "Latin1Environment".
    EncodedCharSets at: 1+1 put: JISX0208.
    EncodedCharSets at: 2+1 put: GB2312.
    EncodedCharSets at: 3+1 put: KSX1001.
    EncodedCharSets at: 4+1 put: JISX0208.
    EncodedCharSets at: 5+1 put: JapaneseEnvironment.
    EncodedCharSets at: 6+1 put: SimplifiedChineseEnvironment.
    EncodedCharSets at: 7+1 put: KoreanEnvironment.
    EncodedCharSets at: 8+1 put: GB2312.
    "EncodedCharSets at: 9+1 put: UnicodeTraditionalChinese."
    "EncodedCharSets at: 10+1 put: UnicodeVietnamese."
    EncodedCharSets at: 12+1 put: KSX1001.
    EncodedCharSets at: 13+1 put: GreekEnvironment.
    EncodedCharSets at: 14+1 put: Latin2Environment.
    EncodedCharSets at: 15+1 put: RussianEnvironment.
    EncodedCharSets at: 16+1 put: NepaleseEnvironment.
    EncodedCharSets at: 256 put: Unicode.

Now I was wondering why in squeak and pharo

    (Character value: 1069555750) leadingChar. "===> 255"

stef

In reply to this post by jaayer
Excellent! Let us know; it is good to get more support on the XML/DTD part.

Stef

On May 19, 2010, at 2:17 AM, jaayer wrote:
> (I am working on adding full DTD support with validation and refactoring and re-engineering the parser at the moment, which is why minor releases have slowed to a trickle. I will take a closer look at how character encoding is handled in the process.)

In reply to this post by Stéphane Ducasse
On Wed, 19 May 2010, Stéphane Ducasse wrote:
> Now I was wondering why in squeak and pharo
> (Character value: 1069555750) leadingChar. "===> 255"

Because the 22 least significant bits represent the #charCode and the rest (8 bits) are the #leadingChar. So 1069555750 means that #charCode is 8230 and the #leadingChar is 255.

Levente

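The decomposition Levente describes can be checked with plain integer arithmetic in a workspace. This is only a sketch of the bit layout as stated in his message (22 low bits for #charCode, the remaining bits for #leadingChar); the variable names are illustrative, and the lines do not touch the Character class itself.

```smalltalk
"Workspace sketch: plain Integer arithmetic, independent of the Character class."
| raw charCodeMask |
raw := 1069555750.
charCodeMask := (1 bitShift: 22) - 1.   "22 one-bits, i.e. 16r3FFFFF"

raw bitAnd: charCodeMask.               "===> 8230  (the #charCode part)"
raw bitShift: -22.                      "===> 255   (the #leadingChar part)"
(255 bitShift: 22) + 8230.              "===> 1069555750"
```

With a leadingChar of 0 the upper bits vanish and the packed value equals the charCode, which matches the 8230 that Squeak 4.1 answers for #codePoint.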
> Because the 22 least significant bits represent the #charCode and the rest (8 bits) are the #leadingChar. So 1069555750 means that #charCode is 8230 and the #leadingChar is 255.

Pfff... Reminds me of the old assembly lectures :-)

Thank you all for your comments. I will fix the test.

Alexandre

I always liked the 68000... when I think that my co-developer implemented our game in about 300000 lines of assembly code overall, and also using the 6502 with 3 registers and one specific... the 68000 was a dream :)

Stef

On May 19, 2010, at 12:28 PM, Alexandre Bergel wrote:
> Pfff... Remind me the old assembly lecture :-)

In reply to this post by Damien Pollet
On 19.05.2010 07:58, Damien Pollet wrote:
> What about updating the leadingChar in Pharo to match Squeak? [...]
>
> What's the semantic difference between picking 0 or 255? Is one more
> correct than the other?

Currently the semantics in Pharo are

    0: Latin 1
    255: Unicode

which is fun because the first 255 characters are interned and you therefore can't change their leadingChar. So if you're using Unicode, you're forced to mix leadingChars.

Cheers
Philippe

> Currently the semantics in Pharo are
> 0: Latin 1
> 255: Unicode
>
> which is fun because the first 255 characters are interned and you
> therefore can't change their leadingChar. So if you're using Unicode,
> you're forced to mix leadingChars.

So what do you suggest: to have Unicode as zero?

Stef

On 19.05.2010 19:35, Stéphane Ducasse wrote:
> so what do you suggest to have unicode as zero?

Yes, the decision of Squeak 4.1 to make Unicode leadingChar 0 is an improvement.

Cheers
Philippe

OK,
maybe you should have said that before :)
Now how can this be done? By changing the EncodedCharSets?

> initialize
>
>     self allSubclassesDo: [:each | each initialize].
>
>     EncodedCharSets := Array new: 256.
>
>     EncodedCharSets at: 0+1 put: Unicode "Latin1Environment".
>     EncodedCharSets at: 1+1 put: JISX0208.
>     EncodedCharSets at: 2+1 put: GB2312.
>     EncodedCharSets at: 3+1 put: KSX1001.
>     EncodedCharSets at: 4+1 put: JISX0208.

Then I do not understand, because

    EncodedCharSets at: 0+1 put: Unicode

seems to me to mean that this is already the case, but maybe I'm not looking at the right place.

What are the implications? What will we break?

Stef

On May 19, 2010, at 9:06 PM, Philippe Marschall wrote:
> Yes, the decision of Squeak 4.1 to make Unicode leadingChar 0 is an
> improvement.

On 19.05.2010 21:44, Stéphane Ducasse wrote:
> Then I do not understand because
> EncodedCharSets at: 0+1 put: Unicode
> seems to me that this is already the case but may be I'm not looking at the right place.
>
> What are the implications?
> What will we break.

As you've noted, we've already done it.
It has broken the display of WideStrings with StrikeFonts, since that relied on wide characters being stops, which did happen when leadingChar was 255, but not when it's 0.
This manifests itself in the rest of the string after a WideChar being rendered as ?'s.
There's also an error causing the width of the chars to be wrong, and the first char on the line to be displayed on the previous line instead, but IIRC that's not directly related to the stops issue.

Cheers,
Henry

Thanks for helping me there because I was really like "what?" but not totally confused yet.

On May 19, 2010, at 11:52 PM, Henrik Sperre Johansen wrote:
> As you've noted, we've already done it.
> It's broken then display of WideStrings with StrikeFonts, since that relied on Wide Characters being stops, which did happen when leadingChar was 255, but not when it's 0.

http://code.google.com/p/pharo/issues/detail?id=2448

>> As you've noted, we've already done it.
>> It's broken then display of WideStrings with StrikeFonts, since that relied on Wide Characters being stops, which did happen when leadingChar was 255, but not when it's 0.
>> That is manifested in that the rest of the string after a WideChar is rendered as ?'s.