[squeak-dev] leadingChar proposal

Andreas.Raab

[squeak-dev] leadingChar proposal

Folks -

I think it's time to do something about the leadingChar in Characters
that has been on the TODO list for a while. I have been looking over
this stuff for some time now, fixing things here and there and laying
some of the ground work for the things to come.

Here is the good news: Squeak doesn't need the leadingChar any longer.
If you are running an updated trunk image you can run entirely without
the leadingChar being used, and I've done this for about a week now with
no ill side effects (disclaimer: I haven't been using very much of m17n
support stuff so there may still be breakage but it means it won't
explode in your face straightaway). If you would like to try yourself,
all you need to do is to hack Character>>setValue: to say, e.g.,

value := newValue bitClear: 16r3FC00000.

and you're good (and won't ever see a leadingChar). However, the removal
of the leading char could be used to do a couple of other things that I
would like to discuss and solicit feedback in particular from the folks
who care about the leadingChar.
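To make the effect of that mask concrete, here is a small workspace sketch. It uses plain Integer arithmetic only; the 8-bit tag sitting above a 22-bit code field is read off the mask 16r3FC00000 (bits 22..29), and the tag value 5 is an arbitrary illustration, not a real language tag.

```smalltalk
"Sketch only: shows what clearing 16r3FC00000 does to a tagged value."
| mask tagged leadingChar stripped |
mask := 16r3FC00000.                                  "bits 22..29"
tagged := (5 bitShift: 22) bitOr: 353.                "hypothetical: tag 5, code 353"
leadingChar := (tagged bitAnd: mask) bitShift: -22.   "5"
stripped := tagged bitClear: mask.                    "353"
```

Clearing the mask leaves only the low 22 bits, so the same code is stored regardless of which tag the caller supplied.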

The main insight is that although we *can* run without the leadingChar,
it doesn't mean we *have* to. As it stands, the leading char is used for
two purposes: Character set selection (EncodedCharSet) and (parts of)
language support. There is a significant amount of confusion between the
two, with Latin1/Latin2Environment being subclasses of LanguageEnvironment
(although these are character encodings, not languages).

What I would propose to do here is to define that "leadingChar = 0"
which currently means "Latin1 encoding, language neutral" is being
redefined to "Unicode encoding, language neutral". What this does is
that "Character value: 353" and "Unicode value: 353" become the same, if
the environment is considered language neutral which by default it would be.
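The two expressions named above can be compared directly in a workspace. This is only a sketch: whether they come out equal depends on the image and on the active language environment, so no result is claimed here.

```smalltalk
"Sketch only: inspect the two values from the paragraph above."
| plain viaUnicode |
plain := Character value: 353.
viaUnicode := Unicode value: 353.
plain asInteger = viaUnicode asInteger.                     "equal under the proposal, in a neutral environment"
(viaUnicode asInteger bitAnd: 16r3FC00000) bitShift: -22.   "the tag bits, if any"
```

If the second character carries a non-zero tag, the last expression shows it; with leadingChar = 0 both values are simply 353.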

All environments except those that care about the connotations of the
language tag should be able to work with this definition without any
change whatsoever. The only thing that changes is that the default
LanguageEnvironment is Unicode based, using leadingChar=0; most of the
subclasses go away (being replaced by the default LanguageEnvironment),
and for those that we care about, or that need a transition plan (i.e.,
the CJK languages), we keep using the language tag for the time being.

That means that *if* you set your language environment to be one of the
CJK languages you get a language tag in your strings, but by default the
language neutral environment will produce "plain Unicode". Which should
make the server/seaside/aida people a lot more happy when dealing with
this stuff.

For the CJK languages (or other languages requiring support that has
been so far expressed via the language tag) we can use this opportunity
and phase the use of the language tag out in favor of using text
attributes (which would have to be written first).

The main advantage of the proposal is that the people who would like to
use plain Unicode get to use it, and the people who care about the
language tag and its consequences can still use that as well.

How does that sound?

Cheers,
- Andreas

Yoshiki Ohshima-2

Re: [squeak-dev] leadingChar proposal

At Thu, 27 Aug 2009 21:09:48 -0700,
Andreas Raab wrote:
>
> What I would propose to do here is to define that "leadingChar = 0"
> which currently means "Latin1 encoding, language neutral" is being
> redefined to "Unicode encoding, language neutral". What this does is
> that "Character value: 353" and "Unicode value: 353" become the same, if
> the environment is considered language neutral which by default it would be.

Yes, if this is the basis, many things just follow. For Pharo
people I once suggested something similar (merging Unicode, so that
its EncodedCharSet is 0), thinking that they are less concerned with
backward compatibility. There will be backward compatibility issues
(like even loading old Etoys projects, if the Etoys packaging work is
ever done) but I think that they are mostly solvable, and it is probably
good for the bigger Squeak community.

> For the CJK languages (or other languages requiring support that has
> been so far expressed via the language tag) we can use this opportunity
> and phase the use of the language tag out in favor of using text
> attributes (which would have to be written first).

Right.

> The main advantage of the proposal is that the people who would like to
> use plain Unicode get to use it, and the people who care about the
> language tag and its consequences can still use that as well.
>
> How does that sound?

Pretty good.

-- Yoshiki

Yoshiki Ohshima-2

Re: [squeak-dev] leadingChar proposal

At Thu, 27 Aug 2009 21:29:53 -0700,
Yoshiki Ohshima wrote:

>
> At Thu, 27 Aug 2009 21:09:48 -0700,
> Andreas Raab wrote:
> >
> > What I would propose to do here is to define that "leadingChar = 0"
> > which currently means "Latin1 encoding, language neutral" is being
> > redefined to "Unicode encoding, language neutral". What this does is
> > that "Character value: 353" and "Unicode value: 353" become the same, if
> > the environment is considered language neutral which by default it would be.
>
> Yes, if this is the basis, many things just follow. For Pharo
> people I once suggested something similar (merging Unicode, so that
> its EncodedCharSet is 0), thinking that they are less concerned with
> backward compatibility. There will be backward compatibility issues
> (like even loading old Etoys projects, if the Etoys packaging work is
> ever done) but I think that they are mostly solvable, and it is probably
> good for the bigger Squeak community.

One question is the roadmap; I would think ByteStrings will be
retained for a while (or forever) but may be also phased out. And
also it would be nice to tag ByteStrings. The natural order may be to
try to move on to text attribute approach earlier so that the bare
representation doesn't matter much. How do you think about these
things?

-- Yoshiki

Andreas.Raab

[squeak-dev] Re: leadingChar proposal

Yoshiki Ohshima wrote:
> One question is the roadmap; I would think ByteStrings will be
> retained for a while (or forever) but may be also phased out. And
> also it would be nice to tag ByteStrings. The natural order may be to
> try to move on to text attribute approach earlier so that the bare
> representation doesn't matter much. How do you think about these
> things?

Interesting questions. I'm not sure what you mean by "tagging
ByteStrings" - generally my opinion is that String/ByteString/WideString
have the same relationship that Integer/SmallInteger/LargeInteger have.
In other words, a nice optimization if you can afford staying within
bytes but it doesn't really matter.
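The analogy can be tried out in a workspace. A minimal sketch: the integer half shows the class changing silently once a value outgrows the compact representation, and the string half builds one string from a character inside the byte range and one from a character outside it, so their classes can be compared in the image at hand.

```smalltalk
"Integer side: the representation follows the value."
SmallInteger maxVal class.          "SmallInteger"
(SmallInteger maxVal + 1) class.    "LargePositiveInteger"

"String side: print these and compare the classes in your own image."
(String with: (Character value: 97)) class.
(String with: (Character value: 353)) class.
```

In both halves the caller writes the same code either way; only the storage differs.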

I would think the real next step in this area should be to remove the
MultiScanner classes and fold all of them into one hierarchy. This whole
area is currently very complex for no real benefit since there is no
measurable performance penalty when folding these classes.

Cheers,
- Andreas

Yoshiki Ohshima-2

Re: [squeak-dev] Re: leadingChar proposal

At Thu, 27 Aug 2009 22:19:49 -0700,
Andreas Raab wrote:

>
> Yoshiki Ohshima wrote:
> > One question is the roadmap; I would think ByteStrings will be
> > retained for a while (or forever) but may be also phased out. And
> > also it would be nice to tag ByteStrings. The natural order may be to
> > try to move on to text attribute approach earlier so that the bare
> > representation doesn't matter much. How do you think about these
> > things?
>
> Interesting questions. I'm not sure what you mean by "tagging
> ByteStrings" - generally my opinion is that String/ByteString/WideString
> have the same relationship that Integer/SmallInteger/LargeInteger have.

With characters in the 0..255 range, somebody may want to define
language tags and attach them. It would be nice if we could do that
transparently.

-- Yoshiki

Philippe Marschall

Re: [squeak-dev] leadingChar proposal

In reply to this post by Andreas.Raab

2009/8/28 Andreas Raab <[hidden email]>:

> Folks -
>
> I think it's time to do something about the leadingChar in Characters that
> has been on the TODO list for a while. I have been looking over this stuff
> for some time now, fixing things here and there and laying some of the
> ground work for the things to come.
>
> Here is the good news: Squeak doesn't need the leadingChar any longer. If
> you are running an updated trunk image you can run entirely without the
> leadingChar being used, and I've done this for about a week now with no ill
> side effects (disclaimer: I haven't been using very much of m17n support
> stuff so there may still be breakage but it means it won't explode in your
> face straightaway). If you would like to try yourself, all you need to do is
> to hack Character>>setValue: to say, e.g.,
>
> value := newValue bitClear: 16r3FC00000.
>
> and you're good (and won't ever see a leadingChar). However, the removal of
> the leading char could be used to do a couple of other things that I would
> like to discuss and solicit feedback in particular from the folks who care
> about the leadingChar.
>
> The main insight is that although we *can* run without the leadingChar, it
> doesn't mean we *have* to. As it stands, the leading char is used for two
> purposes: Character set selection (EncodedCharSet) and (parts of) language
> support. There is a significant amount of confusion between the two, with
> Latin1/Latin2Environment being subclasses of LanguageEnvironment (although these
> are character encodings, not languages).
>
> What I would propose to do here is to define that "leadingChar = 0" which
> currently means "Latin1 encoding, language neutral" is being redefined to
> "Unicode encoding, language neutral". What this does is that "Character
> value: 353" and "Unicode value: 353" become the same, if the environment is
> considered language neutral which by default it would be.
>
> All environments except those that care about the connotations of the language
> tag should be able to work with this definition without any change
> whatsoever. The only thing that changes is that the default
> LanguageEnvironment is Unicode based, using leadingChar=0; most of the
> subclasses go away (being replaced by the default LanguageEnvironment), and
> for those that we care about, or that need a transition plan (i.e., the CJK
> languages), we keep using the language tag for the time being.
>
> That means that *if* you set your language environment to be one of the CJK
> languages you get a language tag in your strings, but by default the
> language neutral environment will produce "plain Unicode". Which should make
> the server/seaside/aida people a lot more happy when dealing with this
> stuff.
>
> For the CJK languages (or other languages requiring support that has been so
> far expressed via the language tag) we can use this opportunity and phase the
> use of the language tag out in favor of using text attributes (which would
> have to be written first).
>
> The main advantage of the proposal is that the people who would like to use
> plain Unicode get to use it, and the people who care about the language tag
> and its consequences can still use that as well.
>
> How does that sound?

Like good news.

Cheers
Philippe

Philippe Marschall

Re: [squeak-dev] leadingChar proposal

In reply to this post by Yoshiki Ohshima-2

2009/8/28 Yoshiki Ohshima <[hidden email]>:
> ...
>
> One question is the roadmap; I would think ByteStrings will be
> retained for a while (or forever) but may be also phased out.

I would hope that ByteStrings are retained. I don't see
WideStrings as a general replacement for ByteStrings.

> And
> also it would be nice to tag ByteStrings. The natural order may be to
> try to move on to text attribute approach earlier so that the bare
> representation doesn't matter much.

Can you elaborate a bit?

Cheers
Philippe

Bert Freudenberg

Re: [squeak-dev] leadingChar proposal

On 28.08.2009, at 08:19, Philippe Marschall wrote:

> 2009/8/28 Yoshiki Ohshima <[hidden email]>:
>> ...
>>
>> One question is the roadmap; I would think ByteStrings will be
>> retained for a while (or forever) but may be also phased out.
>
> I would hope that ByteStrings are retained. I don't see
> WideStrings as a general replacement for ByteStrings.

Wouldn't ByteArrays be a better way to efficiently store arrays of
bytes? Strings are conceptually made of Characters, and there are more
than 256 of them. E.g. a la Python 3:

http://www.devx.com/opensource/Article/41398/1763/page/5

>> And
>> also it would be nice to tag ByteStrings. The natural order may be
>> to
>> try to move on to text attribute approach earlier so that the bare
>> representation doesn't matter much.
>
> Can you elaborate a bit?

A Text defines attributes for Character runs in a String. Instead of
storing the tag in each Character, it could be stored in an attribute
of the Text. Instead of passing around bare Strings you would pass
around Text objects (if you need to preserve language tags).
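A rough sketch of the idea, using plain collections as a stand-in: since the language attribute itself would still have to be written, nothing here uses the real Text or TextAttribute classes, and the symbols #en and #ja are purely illustrative.

```smalltalk
"Sketch only: language tags kept in runs next to the string, not in the characters."
| string runs languageAt |
string := 'hello world'.
runs := OrderedCollection new.
runs add: (1 to: 5) -> #en.       "hypothetical tag for characters 1..5"
runs add: (7 to: 11) -> #ja.      "hypothetical tag for characters 7..11"
languageAt := [:index |
	| run |
	run := runs detect: [:each | each key includes: index] ifNone: [nil].
	run isNil ifTrue: [nil] ifFalse: [run value]].
languageAt value: 3.              "#en"
languageAt value: 6.              "nil, no run covers the space"
languageAt value: 8.              "#ja"
(string at: 8) asInteger.         "111, the bare character carries no tag"
```

The string stays plain, and code that does not care about languages never sees the runs.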

- Bert -

Bert Freudenberg

Re: [squeak-dev] leadingChar proposal

In reply to this post by Yoshiki Ohshima-2

On 28.08.2009, at 06:29, Yoshiki Ohshima wrote:

> At Thu, 27 Aug 2009 21:09:48 -0700,
> Andreas Raab wrote:
>>
>> What I would propose to do here is to define that "leadingChar = 0"
>> which currently means "Latin1 encoding, language neutral" is being
>> redefined to "Unicode encoding, language neutral". What this does is
>> that "Character value: 353" and "Unicode value: 353" become the
>> same, if
>> the environment is considered language neutral which by default it
>> would be.
>
> Yes, if this is the basis, many things just follow. For Pharo
> people I once suggested something similar (merging Unicode, so that
> its EncodedCharSet is 0), thinking that they are less concerned with
> backward compatibility. There will be backward compatibility issues
> (like even loading old Etoys projects, if the Etoys packaging work is
> ever done) but I think that they are mostly solvable, and it is probably
> good for the bigger Squeak community.
>
>> For the CJK languages (or other languages requiring support that has
>> been so far expressed via the language tag) we can use this
>> opportunity
>> and phase the use of the language tag out in favor of using text
>> attributes (which would have to be written first).
>
> Right.
>
>> The main advantage of the proposal is that the people who would
>> like to
>> use plain Unicode get to use it, and the people who care about the
>> language tag and its consequences can still use that as well.
>>
>> How does that sound?
>
> Pretty good.
>
> -- Yoshiki

Hehe, if Yoshiki agrees: +1

- Bert -

Philippe Marschall

Re: [squeak-dev] leadingChar proposal

In reply to this post by Bert Freudenberg

2009/8/28 Bert Freudenberg <[hidden email]>:
>...
> Wouldn't ByteArrays be a better way to efficiently store arrays of bytes?

For arrays of bytes yes, for Latin-1 strings no.

> Strings are conceptually made of Characters, and there are more than 256 of
> them. E.g. a la Python 3:

Sure, there are also Integers bigger than 2^30 - 1, that doesn't mean
that SmallInteger is a stupid idea and should be dropped. Especially
considering that WideStrings still have performance issues and bugs.

> http://www.devx.com/opensource/Article/41398/1763/page/5

3.1 reimplemented a lot of the IO stuff from 3.0 in C for pure speed reasons.

>>> And
>>> also it would be nice to tag ByteStrings. The natural order may be to
>>> try to move on to text attribute approach earlier so that the bare
>>> representation doesn't matter much.
>>
>> Can you elaborate a bit?
>
>
> A Text defines attributes for Character runs in a String. Instead of storing
> the tag in each Character, it could be stored in an attribute of the Text.
> Instead of passing around bare Strings you would pass around Text objects
> (if you need to preserve language tags).

Yeah, storing that in Text objects instead of Strings seems like the
better way to go.

Cheers
Philippe

Bert Freudenberg

[squeak-dev] ByteString vs EncodedString vs ByteArray (was Re: leadingChar proposal)

On 28.08.2009, at 14:29, Philippe Marschall wrote:

> 2009/8/28 Bert Freudenberg <[hidden email]>:
>> ...
>> Wouldn't ByteArrays be a better way to efficiently store arrays of
>> bytes?
>
> For arrays of bytes yes, for Latin-1 strings no.

But ByteStrings are not really Latin1. We just pretend they are, for
display purposes.

>> Strings are conceptually made of Characters, and there are more
>> than 256 of
>> them. E.g. a la Python 3:
>
> Sure, there are also Integers bigger than 2^30 - 1, that doesn't mean
> that SmallInteger is a stupid idea and should be dropped. Especially
> considering that WideStrings still have performance issues and bugs.

We're not talking about doing this in the immediate future I believe.
But talking about it is valid.

- Bert -

Colin Putney

[squeak-dev] ByteString vs EncodedString vs ByteArray (was Re: leadingChar proposal)

In reply to this post by Bert Freudenberg

On 28-Aug-09, at 1:09 AM, Bert Freudenberg wrote:

> Wouldn't ByteArrays be a better way to efficiently store arrays of
> bytes? Strings are conceptually made of Characters, and there are
> more than 256 of them. E.g. a la Python 3:

So you're proposing that WideString, once it no longer has language
tags, use its 4 bytes per character to point to Character objects
rather than encoding the string at all? That would certainly be an
interesting implementation. It would trade space for speed (of certain
operations) in the case of CJK and other writing systems that involve
large numbers of characters, as you'd have a bunch of Character
objects persisting in the image, rather than just ephemerally. For
some applications, that's exactly the right design choice, no doubt.

On the other hand EncodedString (and subclasses like Utf8String or
Latin1String) would make a different trade-off, speed (of certain
operations) for space. Any #variableByteSubclass can efficiently
store bytes. The reason to use say, Utf8String rather than ByteArray
is precisely *because* Strings are conceptually made of Characters.
Encapsulation and all that.
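To illustrate the trade-off, here is a sketch of what a Utf8String would have to do to answer its size in characters. No such class is assumed to exist; a ByteArray literal stands in for its storage, holding the UTF-8 encoding of $H followed by code point 353.

```smalltalk
"Sketch only: counting characters in UTF-8 bytes requires a scan."
| utf8 characterCount |
utf8 := #[72 197 161].            "'H' then code point 353 (16r161) as two bytes"
characterCount := utf8 inject: 0 into: [:count :byte |
	"continuation bytes look like 2r10xxxxxx and do not start a character"
	(byte bitAnd: 16rC0) = 16r80
		ifTrue: [count]
		ifFalse: [count + 1]].
utf8 size.                        "3 bytes stored"
characterCount.                   "2 characters"
```

The bytes are compact, but the character count and indexed access are no longer constant-time, which is the speed being given up for space.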

> A Text defines attributes for Character runs in a String. Instead of
> storing the tag in each Character, it could be stored in an
> attribute of the Text. Instead of passing around bare Strings you
> would pass around Text objects (if you need to preserve language
> tags).

Sounds good.

Colin

Bert Freudenberg

Re: [squeak-dev] ByteString vs EncodedString vs ByteArray (was Re: leadingChar proposal)

> At Thu, 27 Aug 2009 22:19:49 -0700,
> Andreas Raab wrote:
>>
>> Yoshiki Ohshima wrote:
>>> One question is the roadmap; I would think ByteStrings will be
>>> retained for a while (or forever) but may be also phased out. And
>>> also it would be nice to tag ByteStrings. The natural order may
>>> be to
>>> try to move on to text attribute approach earlier so that the bare
>>> representation doesn't matter much. How do you think about these
>>> things?
>>
>> Interesting questions. I'm not sure what you mean by "tagging
>> ByteStrings" - generally my opinion is that String/ByteString/
>> WideString
>> have the same relationship that Integer/SmallInteger/LargeInteger
>> have.
>
> With characters in the 0..255 range, somebody may want to define
> language tags and attach them. It would be nice if we could do that
> transparently.
>
> -- Yoshiki

On 28.08.2009, at 15:28, Colin Putney wrote:

> On 28-Aug-09, at 1:09 AM, Bert Freudenberg wrote:
>
>> Wouldn't ByteArrays be a better way to efficiently store arrays of
>> bytes? Strings are conceptually made of Characters, and there are
>> more than 256 of them. E.g. a la Python 3:
>
> So you're proposing that WideString, once it no longer has language
> tags, use its 4 bytes per character to point to Character objects
> rather than encoding the string at all? That would certainly be an
> interesting implementation. It would trade space for speed (of
> certain operations) in the case of CJK and other writing systems
> that involve large numbers of characters, as you'd have a bunch of
> Character objects persisting in the image, rather than just
> ephemerally. For some applications, that's exactly the right design
> choice, no doubt.

I'm not really proposing anything at this point, just widening the
discussion Yoshiki started (cited above for reference).

> On the other hand EncodedString (and subclasses like Utf8String or
> Latin1String) would make a different trade-off, speed (of certain
> operations) for space. Any #variableByteSubclass can efficiently
> store bytes. The reason to use say, Utf8String rather than ByteArray
> is precisely *because* Strings are conceptually made of Characters.
> Encapsulation and all that.

I guess having encoded strings would be nice. OTOH I value simplicity.
Does anybody have experience with the tradeoffs?

- Bert -

Philippe Marschall

Re: [squeak-dev] ByteString vs EncodedString vs ByteArray (was Re: leadingChar proposal)

In reply to this post by Bert Freudenberg

2009/8/28 Bert Freudenberg <[hidden email]>:

> On 28.08.2009, at 14:29, Philippe Marschall wrote:
>
>> 2009/8/28 Bert Freudenberg <[hidden email]>:
>>>
>>> ...
>>> Wouldn't ByteArrays be a better way to efficiently store arrays of bytes?
>>
>> For arrays of bytes yes, for Latin-1 strings no.
>
> But ByteStrings are not really Latin1.

Yes they are. All characters in the Latin1 range are interned and
ByteString is for exactly those 8bit characters.

Cheers
Philippe