Character variants / leadingChar / Han unification

Bert Freudenberg
Thanks for the historic account, Chris!

So we didn't replace the leadingChar mechanism, we just redefined "leadingChar = 0" to mean "unicode" rather than "latin1".

The mechanism itself is still in place. It's a hack, admittedly, but as long as we're passing plain strings around we have no other way of retaining language information.

A better way may be to support Unicode variation selectors. Then again, I don't know too much about that. Any native speaker to help us out?

- Bert -
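
On the variation-selector idea raised above, here is a minimal Python sketch (not Squeak code) of what "supporting" them could mean at the string level. The code point ranges for VS1-VS16 and VS17-VS256 come from the Unicode standard. The function names and the sample sequence are made up for illustration, and the sketch does not check whether the sample pair is actually registered in the Ideographic Variation Database. Note that a selector picks a glyph variant for one character; it does not record a language, so whether this really covers what leadingChar is used for remains the open question.

```python
from typing import List, Optional, Tuple


def is_variation_selector(ch: str) -> bool:
    """True for VS1-VS16 (U+FE00..U+FE0F) and VS17-VS256 (U+E0100..U+E01EF)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def split_variation_sequences(text: str) -> List[Tuple[str, Optional[str]]]:
    """Pair each character with the variation selector that directly follows it, if any."""
    pairs: List[Tuple[str, Optional[str]]] = []
    for ch in text:
        if is_variation_selector(ch) and pairs and pairs[-1][1] is None:
            # Attach the selector to the preceding character.
            pairs[-1] = (pairs[-1][0], ch)
        else:
            pairs.append((ch, None))
    return pairs


def strip_variation_selectors(text: str) -> str:
    """Drop all selectors, leaving only the unified code points."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


# Illustrative sample: a Han character followed by VS17, then a plain letter.
SAMPLE = "\u845B\U000E0100x"
```

The point of the sketch is that the variant information travels inside the plain string as an extra code point, so nothing has to be packed into the character value itself.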

On Thu, Jan 26, 2017 at 11:36 PM, Chris Cunningham <[hidden email]> wrote:
So, back in 2009, Andreas proposed:

---------------------------
What I would propose to do here is to define that "leadingChar = 0" which currently means "Latin1 encoding, language neutral" is being redefined to "Unicode encoding, language neutral". What this does is that "Character value: 353" and "Unicode value: 353" become the same, if the environment is considered language neutral which by default it would be.
---------------------
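
To make the 353 example in the proposal concrete, here is a small Python illustration (Python, not Squeak; the helper name is made up). It shows that 353 lies outside the Latin-1 range, so it only has a language-neutral meaning once value 353 is read as a Unicode code point, and that the first 256 Unicode code points coincide with Latin-1, so values that were already valid under the old reading keep their meaning.

```python
import unicodedata


def fits_latin1(code_point: int) -> bool:
    """Latin-1 (ISO 8859-1) covers code points 0..255 only."""
    return 0 <= code_point <= 0xFF


# 353 is U+0161 when read as a Unicode code point.
S_CARON = chr(353)
S_CARON_NAME = unicodedata.name(S_CARON)

# Every Latin-1 byte decodes to the Unicode code point with the same number.
LATIN1_MATCHES_UNICODE = all(
    bytes([b]).decode("latin-1") == chr(b) for b in range(256)
)
```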

In 2010, he pushed this into Squeak Trunk.

Then, in 2011, there was a conversation where Andreas stated:

-------------------
On 1/8/2011 2:16 AM, Sean P. DeNigris wrote:
#leadingChar
"In Squeak Character encoding, bits above 16r3FFFFF don't encode the
character, but hold information about the language environment and the
encoding which should be used to interpret the charCode. The background of

How's that as a method comment?  Is it really "In Squeak... encoding..." or
does this apply to unicode in general?

It is Squeak specific. Unicode does not have a leading char.

Cheers,
  - Andreas
---------------------
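
The bit layout described in that method comment can be sketched as follows. This is a Python model, not the Squeak implementation: the mask 16r3FFFFF is taken from the quoted comment, the shift of 22 simply follows from the mask being 22 bits wide, and the function names and the tag value 5 used below are arbitrary choices for illustration.

```python
CHAR_CODE_MASK = 0x3FFFFF                           # 16r3FFFFF in Smalltalk notation
LEADING_CHAR_SHIFT = CHAR_CODE_MASK.bit_length()    # 22


def pack(leading_char: int, char_code: int) -> int:
    """Combine a language/encoding tag and a character code into one value."""
    if not 0 <= char_code <= CHAR_CODE_MASK:
        raise ValueError("char_code does not fit below the leadingChar bits")
    if leading_char < 0:
        raise ValueError("leading_char must be non-negative")
    return (leading_char << LEADING_CHAR_SHIFT) | char_code


def leading_char(value: int) -> int:
    """Bits above 16r3FFFFF: the language environment / encoding tag."""
    return value >> LEADING_CHAR_SHIFT


def char_code(value: int) -> int:
    """Bits up to 16r3FFFFF: the character code itself."""
    return value & CHAR_CODE_MASK


# With leadingChar = 0 the packed value is just the character code,
# which is what makes "leadingChar = 0 means Unicode" a cheap redefinition.
NEUTRAL = pack(0, 353)

# A hypothetical non-zero tag (5 is arbitrary here) on the same kind of value.
TAGGED = pack(5, 0x845B)
```

Because the tag lives in the high bits of each character value, a plain string can carry it along without any extra structure, which is the property described at the top of the thread.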

Maybe this later email was the one that you were interested in?

I can't find any mention in the commit list or other discussions where the leadingChar was dropped, but I'm not an expert in this space (just interested).

Thanks,
cbc



Re: Character variants / leadingChar / Han unification

Nicolas Cellier
All these ideas had been floating around for at least two years before that, driven, I think, by the web guys (Seaside, etc.).
Promoting Unicode and using leadingChar = 0 for Unicode were suggested several times.
It's just that Andreas's analysis and synthesis were brilliant!
Since he had committed a bunch of improvements in this area, I think he knew exactly what he was talking about.

The actual replacement happened a bit later, in Multilingual-nice.91 on 28 February 2010.

2017-01-27 17:05 GMT+01:00 Bert Freudenberg <[hidden email]>:
[...]