Character #asciiValue vs #charCode

Sean P. DeNigris
For Character, what is the difference between #asciiValue and #charCode (= #asciiValue bitAnd: 16r3FFFFF)?

Thanks.
Sean

Re: Character #asciiValue vs #charCode

Nicolas Cellier
Sean P. DeNigris <sean <at> clipperadams.com> writes:

>
>
> For Character, what is the difference between #asciiValue and #charCode (=
> #asciiValue bitAnd: 16r3FFFFF)?
>
> Thanks.
> Sean

#asciiValue suggests that the character is encoded in ASCII.
But that is not general: what is the ASCII code of é?
The selector survives for legacy code dating from the days
when all Smalltalk characters were in the ASCII set.
All? Well, except perhaps the left and up arrows ;)

The modern replacement of #asciiValue is #charCode.

So what does this bitAnd: 16r3FFFFF mean?
In Squeak's Character encoding, the bits above that mask do not encode the
character itself but the so-called #leadingChar. The leadingChar holds
information about the language environment and the encoding that should be
used to interpret the charCode.
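
For readers who want to see the arithmetic, here is a minimal sketch of that split, written in Python rather than Smalltalk so it can be run outside an image. It only models the mask from the question (16r3FFFFF, i.e. the 22 low bits); the names char_code and leading_char are illustrative stand-ins and not the Squeak implementation.

```python
# Model of the value split described above. This is not Squeak code.
CODE_BITS = 22                     # 16r3FFFFF has 22 one-bits
CODE_MASK = (1 << CODE_BITS) - 1   # 0x3FFFFF


def char_code(value):
    """Low 22 bits: what #charCode is described as returning."""
    return value & CODE_MASK


def leading_char(value):
    """Everything above the mask: the language/encoding tag."""
    return value >> CODE_BITS


e_acute = 0xE9                     # é with no tag bits set
print(char_code(e_acute), leading_char(e_acute))   # 233 0

tagged = (3 << CODE_BITS) | 0x4E00  # tag 3 is an arbitrary example value
print(char_code(tagged), leading_char(tagged))     # 19968 3
```

With no tag bits set, the raw value and the char code are the same number, which is why the two selectors agree for ordinary Latin text.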

In fact, #charCode will most likely return a Unicode code point
(http://en.wikipedia.org/wiki/ISO/CEI_10646), unless leadingChar ~= 0, which
can be the case in some East Asian language environments.

Note that an earlier replacement, #codePoint, appears to have no senders...
It does not deal with leadingChar, so I'm not sure it is correct.

Hope it helps.

Nicolas

_______________________________________________
Beginners mailing list
[hidden email]
http://lists.squeakfoundation.org/mailman/listinfo/beginners

Re: Character #asciiValue vs #charCode

Andreas.Raab
On 1/7/2011 9:51 PM, nicolas cellier wrote:
> So, what the hell means this bitAnd: 16r3FFFFF ?
> Well, because in Squeak Character encoding, bits above don't encode the
> character by itself but the so called #leadingChar. This leadingChar holds
> information about the environment and the encoding which should be used to
> interpret the charCode.

The background to this is Han unification
(http://en.wikipedia.org/wiki/Han_unification). The language environment
(encoded in the upper bits) disambiguates the character if necessary.
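
A small Python model of that disambiguation, under stated assumptions: the tag numbers 1 and 2 are made up for illustration (they are not Squeak's actual language environment numbers), tagged_value is a hypothetical helper, and the 22-bit code field is taken from the 16r3FFFFF mask quoted above.

```python
# Model only: two characters that share a unified code point but carry
# different language tags in the upper bits. This is not Squeak code.
CODE_BITS = 22
CODE_MASK = (1 << CODE_BITS) - 1


def tagged_value(leading_char, code):
    """Combine a language tag and a code point into one integer value."""
    return (leading_char << CODE_BITS) | code


TAG_A, TAG_B = 1, 2        # hypothetical tags for two language environments
han = 0x4E00               # first code point of the CJK Unified Ideographs block

a = tagged_value(TAG_A, han)
b = tagged_value(TAG_B, han)

print(a == b)                              # False: the full values differ
print((a & CODE_MASK) == (b & CODE_MASK))  # True: same code point underneath
```

The point of the sketch is only that the tag rides along with the code point, so a renderer can pick the right regional glyph even though both characters share one unified code point.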

Cheers,
   - Andreas



Re: Character #asciiValue vs #charCode

Sean P. DeNigris
In reply to this post by Sean P. DeNigris
#asciiValue: could there be an ASCII character with a leadingChar, or will this always be 0 for non-Eastern characters? Should there be any error checking? What does an ASCII value even mean for a non-ASCII character?

#leadingChar
"In Squeak Character encoding, bits above 16r3FFFFF don't encode the character, but hold information about the language environment and the encoding which should be used to interpret the charCode. The background of which is Han unification (http://en.wikipedia.org/wiki/Han_unification)."

How's that as a method comment? Is it really "In Squeak... encoding...", or does this apply to Unicode in general?

Sean

Re: Character #asciiValue vs #charCode

Andreas.Raab
On 1/8/2011 2:16 AM, Sean P. DeNigris wrote:
> #leadingChar
> "In Squeak Character encoding, bits above 16r3FFFFF don't encode the
> character, but hold information about the language environment and the
> encoding which should be used to interpret the charCode. The background of
> which is Han unification (http://en.wikipedia.org/wiki/Han_unification)."
>
> How's that as a method comment?  Is it really "In Squeak... encoding..." or
> does this apply to unicode in general?

It is Squeak-specific. Unicode does not have a leading char.

Cheers,
   - Andreas

Re: Character #asciiValue vs #charCode

Nicolas Cellier
In reply to this post by Sean P. DeNigris
Sean P. DeNigris <sean <at> clipperadams.com> writes:

>
>
> #asciiValue - could there be an ascii character with a leadingChar, or will
> this always be 0 for non-eastern characters?  Should there be any error
> checking - what is the meaning of ascii value for a non-ascii char?
>

I would simply leave #asciiValue as it is.
In the method comment, I would
1) encourage restricting its use to legacy code, and
2) warn of undefined behavior if the character is not in the ASCII set.

I don't know whether there can be ASCII characters with a leadingChar ~= 0.
But we had better not worry too much about it...
Legacy code should only deal with ByteString,
and a ByteString can't have any leadingChar ~= 0 anyway, since each of its
elements is a single byte.
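
If one did want the error checking Sean asked about, it might look like the following Python sketch. This is only a model of the idea and not a proposal to change #asciiValue; the names is_plain_ascii and ascii_value_checked are hypothetical.

```python
# Model only: a guarded variant of "give me the ASCII value".
# This is not Squeak code and not the behavior of #asciiValue.

def is_plain_ascii(value):
    """True when the whole character value fits in 7-bit ASCII.

    Any value below 128 necessarily has no bits set above the
    16r3FFFFF mask, so no separate leadingChar test is needed.
    """
    return 0 <= value < 128


def ascii_value_checked(value):
    """Return the value unchanged if it is ASCII, otherwise raise."""
    if not is_plain_ascii(value):
        raise ValueError(f"character value {value:#x} is not ASCII")
    return value


print(ascii_value_checked(ord("A")))   # 65
print(is_plain_ascii(0xE9))            # False: é is outside ASCII
```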

> #leadingChar
> "In Squeak Character encoding, bits above 16r3FFFFF don't encode the
> character, but hold information about the language environment and the
> encoding which should be used to interpret the charCode. The background of
> which is Han unification (http://en.wikipedia.org/wiki/Han_unification)."
>
> How's that as a method comment?  Is it really "In Squeak... encoding..." or
> does this apply to unicode in general?
>
> Sean


Sure. IMO the whole thing deserves a good class comment too.
Maybe the method comments should refer to the class comment.
Very few people understand the issue
unless they have been exposed to Asian typographic problems.

Nicolas
