For Character, what is the difference between #asciiValue and #charCode (= #asciiValue bitAnd: 16r3FFFFF)?
Thanks. Sean
Sean P. DeNigris <sean <at> clipperadams.com> writes:
> For Character, what is the difference between #asciiValue and #charCode (=
> #asciiValue bitAnd: 16r3FFFFF)?
>
> Thanks.
> Sean

#asciiValue suggests the character is encoded in ASCII. But hey, that's not general! What is the ASCII code of é?

It can be used by legacy code dating from ages ago, when Smalltalk characters were all in the ASCII set. All? Well, all but the left and up arrows maybe ;)

The modern replacement for #asciiValue is #charCode.

So, what does this bitAnd: 16r3FFFFF mean? In Squeak's Character encoding, the bits above 16r3FFFFF don't encode the character itself but the so-called #leadingChar. This leadingChar holds information about the environment and the encoding which should be used to interpret the charCode.

In fact, charCode will most likely return a Unicode code point (http://en.wikipedia.org/wiki/ISO/CEI_10646), except if leadingChar ~= 0, which can be the case for some East Asian language environments.

Note that a previous replacement, #codePoint, appears to be unsent. This codePoint does not deal with leadingChar, so I'm not sure it's correct.

Hope it helps.

Nicolas

_______________________________________________
Beginners mailing list
[hidden email]
http://lists.squeakfoundation.org/mailman/listinfo/beginners
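To make the bit layout concrete, here is a minimal Python model of the split Nicolas describes. It is a sketch, not Squeak code: the 22-bit mask 16r3FFFFF comes from the thread, while the function names, the `compose` helper, and the assumption that leadingChar is simply everything above the mask are illustrative.

```python
# Illustrative model only: mirrors the masking described in the post,
# not Squeak's actual Character implementation.
CHAR_CODE_MASK = 0x3FFFFF   # 16r3FFFFF in Smalltalk notation: the low 22 bits
CHAR_CODE_BITS = 22         # number of bits covered by the mask


def char_code(value: int) -> int:
    """Low 22 bits: what #charCode is described as returning."""
    return value & CHAR_CODE_MASK


def leading_char(value: int) -> int:
    """Bits above the mask (assumed here to be the whole remainder)."""
    return value >> CHAR_CODE_BITS


def compose(leading: int, code: int) -> int:
    """Hypothetical helper: pack a leadingChar and a charCode into one value."""
    return (leading << CHAR_CODE_BITS) | (code & CHAR_CODE_MASK)


plain = ord("é")             # a character value with no leadingChar
tagged = compose(5, 0x4E00)  # 5 is an arbitrary, made-up leadingChar
```

In this model `char_code(plain)` equals the full value because nothing sits above the mask, whereas `tagged` only yields its code point after masking.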
On 1/7/2011 9:51 PM, nicolas cellier wrote:
> So, what does this bitAnd: 16r3FFFFF mean? In Squeak's Character encoding,
> the bits above 16r3FFFFF don't encode the character itself but the
> so-called #leadingChar. This leadingChar holds information about the
> environment and the encoding which should be used to interpret the
> charCode.

The background of which is Han unification (http://en.wikipedia.org/wiki/Han_unification). The language environment (encoded in the upper bits) disambiguates the character if necessary.

Cheers,
  - Andreas
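The disambiguation Andreas mentions can be sketched in Python as follows. This is only a model under stated assumptions: the two environment tags are made-up numbers (the thread does not give the real leadingChar values), and `tag` is a hypothetical helper.

```python
# Illustrative model only; not Squeak's implementation.
CHAR_CODE_BITS = 22
CHAR_CODE_MASK = (1 << CHAR_CODE_BITS) - 1   # equals 16r3FFFFF

# Hypothetical language-environment tags; the real numbering is not in the thread.
ENV_A = 1
ENV_B = 2


def tag(env: int, code_point: int) -> int:
    """Store a language-environment tag in the bits above the code point."""
    return (env << CHAR_CODE_BITS) | code_point


ideograph = ord("直")          # one CJK code point shared by several languages
a = tag(ENV_A, ideograph)
b = tag(ENV_B, ideograph)

same_code_point = (a & CHAR_CODE_MASK) == (b & CHAR_CODE_MASK)
distinct_values = a != b
```

Both values mask down to the same code point, yet they remain distinguishable as characters because the upper bits differ.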
In reply to this post by Sean P. DeNigris
#asciiValue - could there be an ascii character with a leadingChar, or will this always be 0 for non-eastern characters? Should there be any error checking - what is the meaning of ascii value for a non-ascii char?
#leadingChar
"In Squeak Character encoding, bits above 16r3FFFFF don't encode the character, but hold information about the language environment and the encoding which should be used to interpret the charCode. The background of which is Han unification (http://en.wikipedia.org/wiki/Han_unification)."

How's that as a method comment? Is it really "In Squeak... encoding..." or does this apply to unicode in general?

Sean
On 1/8/2011 2:16 AM, Sean P. DeNigris wrote:
> #leadingChar
> "In Squeak Character encoding, bits above 16r3FFFFF don't encode the
> character, but hold information about the language environment and the
> encoding which should be used to interpret the charCode. The background of
> which is Han unification (http://en.wikipedia.org/wiki/Han_unification)."
>
> How's that as a method comment? Is it really "In Squeak... encoding..." or
> does this apply to unicode in general?

It is Squeak specific. Unicode does not have a leading char.

Cheers,
  - Andreas
In reply to this post by Sean P. DeNigris
Sean P. DeNigris <sean <at> clipperadams.com> writes:
> #asciiValue - could there be an ascii character with a leadingChar, or will
> this always be 0 for non-eastern characters? Should there be any error
> checking - what is the meaning of ascii value for a non-ascii char?

I would simply leave asciiValue as is. In the method comment, I would:
1) encourage restricting its use to legacy code,
2) warn of undefined behavior if the character is not in the ASCII set.

I don't know whether there can be ASCII characters with a leadingChar ~= 0. But we had better not worry too much about it... Legacy code should only deal with ByteString, and a ByteString can't have any leadingChar ~= 0 anyway.

> #leadingChar
> "In Squeak Character encoding, bits above 16r3FFFFF don't encode the
> character, but hold information about the language environment and the
> encoding which should be used to interpret the charCode. The background of
> which is Han unification (http://en.wikipedia.org/wiki/Han_unification)."
>
> How's that as a method comment? Is it really "In Squeak... encoding..." or
> does this apply to unicode in general?
>
> Sean

Sure, IMO the whole thing deserves a good class comment too. Maybe method comments should refer to the class comment. Very few people understand the issue... unless exposed to Asian typographic problems.

Nicolas
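On the error-checking question, the following Python sketch shows what a strict variant could look like and why byte-sized values cannot carry a leadingChar in this model. It is not a proposal for Squeak itself (the advice above is to leave asciiValue alone); `checked_ascii_value` is a hypothetical name, and the 22-bit shift is taken from the 16r3FFFFF mask discussed in the thread.

```python
# Illustrative model only; not Squeak's implementation.
CHAR_CODE_BITS = 22  # width of the 16r3FFFFF mask


def leading_char(value: int) -> int:
    """Bits above the 22-bit charCode (assumed to be the whole remainder)."""
    return value >> CHAR_CODE_BITS


def checked_ascii_value(value: int) -> int:
    """Hypothetical strict variant: reject anything outside 7-bit ASCII."""
    if not 0 <= value <= 0x7F:
        raise ValueError(f"{value:#x} is not an ASCII character")
    return value


# Every value that fits in one byte has leadingChar 0 in this model,
# which is the sense in which a byte string cannot carry a leadingChar.
all_bytes_untagged = all(leading_char(b) == 0 for b in range(256))
```

A caller that wants the legacy behavior would skip the check and accept that the result is undefined for non-ASCII characters.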