Folks -
I think it's time to do something about the leadingChar in Characters, which has been on the TODO list for a while. I have been looking over this stuff for some time now, fixing things here and there and laying some of the groundwork for the things to come.

Here is the good news: Squeak doesn't need the leadingChar any longer. If you are running an updated trunk image you can run entirely without the leadingChar being used, and I've done this for about a week now with no ill side effects (disclaimer: I haven't been using much of the m17n support, so there may still be breakage, but it won't explode in your face straightaway). If you would like to try it yourself, all you need to do is hack Character>>setValue: to say, e.g.,

    value := newValue bitClear: 16r3FC00000.

and you're good (and won't ever see a leadingChar). However, the removal of the leading char could be used to do a couple of other things that I would like to discuss, and I would like to solicit feedback in particular from the folks who care about the leadingChar.

The main insight is that although we *can* run without the leadingChar, it doesn't mean we *have* to. As it stands, the leading char is used for two purposes: character set selection (EncodedCharSet) and (parts of) language support. There is a significant amount of confusion between the two, with Latin1Environment and Latin2Environment being subclasses of LanguageEnvironment (although these are character encodings, not languages).

What I would propose is to redefine "leadingChar = 0", which currently means "Latin1 encoding, language neutral", to mean "Unicode encoding, language neutral". The effect is that "Character value: 353" and "Unicode value: 353" become the same if the environment is considered language neutral, which by default it would be.

All environments except those that care about the connotations of the language tag should be able to work with this definition without any change whatsoever. The only thing that changes is that the default LanguageEnvironment is Unicode based, using leadingChar = 0. Most of the subclasses go away (being replaced by the default LanguageEnvironment), and for those that we care about, or that need a transition plan (i.e., the CJK languages), we keep using the language tag for the time being.

That means that *if* you set your language environment to be one of the CJK languages you get a language tag in your strings, but by default the language neutral environment will produce "plain Unicode". That should make the server/seaside/aida people a lot happier when dealing with this stuff.

For the CJK languages (or other languages requiring support that has so far been expressed via the language tag) we can use this opportunity to phase out the language tag in favor of text attributes (which would have to be written first).

The main advantage of the proposal is that the people who would like to use plain Unicode get to use it, and the people who care about the language tag and its consequences can still use that as well.

How does that sound?

Cheers,
  - Andreas
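For readers following along, the mask in the one-line hack above hints at how the leadingChar is packed into a character's value. The workspace sketch below uses only Integer arithmetic. It assumes, based solely on the mask 16r3FC00000, that the tag occupies the 8 bits above a 22-bit character code; the tag value 5 and the code point 16r3042 are made up for illustration and do not refer to any particular language environment.

```smalltalk
"Workspace sketch; select everything and 'print it'.
 Assumption: layout inferred from the mask 16r3FC00000 (bits 22..29 hold the leadingChar)."
| tagMask codeMask tagged leadingChar charCode untagged |
tagMask := 16r3FC00000.
codeMask := 16r3FFFFF.

"Hypothetical tagged value: leadingChar 5 combined with code point 16r3042."
tagged := (5 bitShift: 22) bitOr: 16r3042.

"Split the value into its two parts."
leadingChar := (tagged bitAnd: tagMask) bitShift: -22.
charCode := tagged bitAnd: codeMask.

"What the setValue: hack does: drop the tag, keep the code point."
untagged := tagged bitClear: tagMask.

"A value whose leadingChar is already 0 is left unchanged by the hack."
{ leadingChar. charCode. untagged = 16r3042. (353 bitClear: tagMask) = 353 }
```

Evaluating this should answer an Array holding 5, 12354 (that is, 16r3042), true and true. The last element is the point of the proposal: with leadingChar = 0 the stored value and the Unicode code point coincide.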
At Thu, 27 Aug 2009 21:09:48 -0700,
Andreas Raab wrote:
>
> What I would propose to do here is to define that "leadingChar = 0"
> which currently means "Latin1 encoding, language neutral" is being
> redefined to "Unicode encoding, language neutral". What this does is
> that "Character value: 353" and "Unicode value: 353" become the same, if
> the environment is considered language neutral which by default it would be.

Yes, if this is the basis, many things just follow. I once suggested something similar to the Pharo people (merging Unicode (EncodedCharSet be =0)), thinking that they are less concerned with backward compatibility. There will be backward compatibility issues (such as loading old Etoys projects, if the Etoys packaging work is ever done), but I think they are mostly solvable, and it is probably good for the bigger Squeak community.

> For the CJK languages (or other languages requiring support that has
> been so far expressed via the language tag) we can use this opportunity
> and phase the use of the language tag out in favor of using text
> attributes (which would have to be written first).

Right.

> The main advantage of the proposal is that the people who would like to
> use plain Unicode get to use it, and the people who care about the
> language tag and its consequences can still use that as well.
>
> How does that sound?

Pretty good.

-- Yoshiki
At Thu, 27 Aug 2009 21:29:53 -0700,
Yoshiki Ohshima wrote:
>
> At Thu, 27 Aug 2009 21:09:48 -0700,
> Andreas Raab wrote:
> >
> > What I would propose to do here is to define that "leadingChar = 0"
> > which currently means "Latin1 encoding, language neutral" is being
> > redefined to "Unicode encoding, language neutral". What this does is
> > that "Character value: 353" and "Unicode value: 353" become the same, if
> > the environment is considered language neutral which by default it would be.
>
> Yes, if this is the basis, many things just follow. For Pharo
> people I once suggested something similar (merging Unicode
> (EncodedCharSet be =0)), thinking that they are less concerned with the
> backward compatibility. There will be backward compatibility issue
> (like even loading old Etoys projects, if the Etoys packaging work is
> ever done) but I think that it is mostly solvable, and probably for
> bigger Squeak community it is good.

One question is the roadmap. I would think ByteStrings will be retained for a while (or forever), but they may also be phased out. It would also be nice to tag ByteStrings. The natural order may be to move to the text attribute approach earlier, so that the bare representation doesn't matter much. What do you think about these things?

-- Yoshiki
Yoshiki Ohshima wrote:
> One question is the roadmap; I would think ByteStrings will be
> retained for a while (or forever) but may be also phased out. And
> also it would be nice to tag ByteStrings. The natural order may be to
> try to move on to text attribute approach earlier so that the bare
> representation doesn't matter much. How do you think about these
> things?

Interesting questions. I'm not sure what you mean by "tagging ByteStrings" - generally my opinion is that String/ByteString/WideString have the same relationship that Integer/SmallInteger/LargeInteger have. In other words, it is a nice optimization if you can afford to stay within bytes, but it doesn't really matter.

I would think the real next step in this area should be to remove the MultiScanner classes and fold all of them into one hierarchy. This whole area is currently very complex for no real benefit, since there is no measurable performance penalty when folding these classes.

Cheers,
  - Andreas
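The analogy between the string and integer hierarchies can be tried in a workspace. This is a sketch assuming a Squeak image with WideString support, as discussed in this thread; the comments describe what one would expect to see, not guaranteed output for every image version.

```smalltalk
"Workspace sketch; evaluate each expression separately with 'print it'."

"Integers switch representation transparently once a value no longer fits."
SmallInteger maxVal class.
(SmallInteger maxVal + 1) class.

"Strings are expected to behave analogously: a byte-sized character fits
 a ByteString, while a character above 255 needs a WideString."
(String with: $a) class.
(String with: (Character value: 353)) class.
```

In both pairs the caller writes the same code either way; which concrete class comes back is a storage detail chosen by the system.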
At Thu, 27 Aug 2009 22:19:49 -0700,
Andreas Raab wrote:
>
> Yoshiki Ohshima wrote:
> > One question is the roadmap; I would think ByteStrings will be
> > retained for a while (or forever) but may be also phased out. And
> > also it would be nice to tag ByteStrings. The natural order may be to
> > try to move on to text attribute approach earlier so that the bare
> > representation doesn't matter much. How do you think about these
> > things?
>
> Interesting questions. I'm not sure what you mean by "tagging
> ByteStrings" - generally my opinion is that String/ByteString/WideString
> have the same relationship that Integer/SmallInteger/LargeInteger have.

Even for characters in the 0..255 range, somebody may want to define language tags and attach them. It would be nice if we could do that transparently.

-- Yoshiki
In reply to this post by Andreas.Raab
2009/8/28 Andreas Raab <[hidden email]>:
> Folks -
>
> ...
>
> How does that sound?

Like good news.

Cheers
Philippe
In reply to this post by Yoshiki Ohshima-2
2009/8/28 Yoshiki Ohshima <[hidden email]>:
> ...
>
> One question is the roadmap; I would think ByteStrings will be
> retained for a while (or forever) but may be also phased out.

I would hope that ByteStrings are retained. I don't see WideStrings as a general replacement for ByteStrings.

> And
> also it would be nice to tag ByteStrings. The natural order may be to
> try to move on to text attribute approach earlier so that the bare
> representation doesn't matter much.

Can you elaborate a bit?

Cheers
Philippe
On 28.08.2009, at 08:19, Philippe Marschall wrote:

> 2009/8/28 Yoshiki Ohshima <[hidden email]>:
>> ...
>>
>> One question is the roadmap; I would think ByteStrings will be
>> retained for a while (or forever) but may be also phased out.
>
> I would hope that ByteStrings are retained. I don't feel that
> WideStrings as a general replacement for ByteStrings.

Wouldn't ByteArrays be a better way to efficiently store arrays of bytes? Strings are conceptually made of Characters, and there are more than 256 of them. E.g. a la Python 3:

http://www.devx.com/opensource/Article/41398/1763/page/5

>> And
>> also it would be nice to tag ByteStrings. The natural order may be to
>> try to move on to text attribute approach earlier so that the bare
>> representation doesn't matter much.
>
> Can you elaborate a bit?

A Text defines attributes for Character runs in a String. Instead of storing the tag in each Character, it could be stored in an attribute of the Text. Instead of passing around bare Strings you would pass around Text objects (if you need to preserve language tags).

- Bert -
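The run mechanism described above can be sketched in a workspace. A language attribute does not exist yet (the thread notes it "would have to be written first"), so TextEmphasis stands in for it here purely as an assumption; only the way attributes attach to character runs is the point.

```smalltalk
"Workspace sketch. TextEmphasis is a stand-in: a language tag attribute
 is hypothetical and would have to be written as its own TextAttribute."
| text |
text := Text fromString: 'hello world'.

"Attach an attribute to the run covering characters 1..5 only."
text addAttribute: TextEmphasis bold from: 1 to: 5.

"The underlying String is untouched; the attribute lives in the Text's runs.
 Evaluate each of the following with 'print it' to compare."
text string.
text attributesAt: 1.
text attributesAt: 7.
```

With this design the Characters themselves carry no tag. Code that only needs the plain String keeps working, and code that cares about the language asks the Text for the attributes at a given index.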
In reply to this post by Yoshiki Ohshima-2
On 28.08.2009, at 06:29, Yoshiki Ohshima wrote:

> At Thu, 27 Aug 2009 21:09:48 -0700,
> Andreas Raab wrote:
>> ...
>> How does that sound?
>
> Pretty good.
>
> -- Yoshiki

Hehe, if Yoshiki agrees: +1

- Bert -
In reply to this post by Bert Freudenberg
2009/8/28 Bert Freudenberg <[hidden email]>:
> ...
> Wouldn't ByteArrays be a better way to efficiently store arrays of bytes?

For arrays of bytes yes, for Latin-1 strings no.

> Strings are conceptually made of Characters, and there are more than 256 of
> them. E.g. a la Python 3:

Sure, there are also Integers bigger than 2^30 - 1, but that doesn't mean that SmallInteger is a stupid idea and should be dropped. Especially considering that WideStrings still have performance issues and bugs.

> http://www.devx.com/opensource/Article/41398/1763/page/5

Python 3.1 reimplemented a lot of the IO stuff from 3.0 in C purely for speed reasons.

>>> And
>>> also it would be nice to tag ByteStrings. The natural order may be to
>>> try to move on to text attribute approach earlier so that the bare
>>> representation doesn't matter much.
>>
>> Can you elaborate a bit?
>
> A Text defines attributes for Character runs in a String. Instead of storing
> the tag in each Character, it could be stored in an attribute of the Text.
> Instead of passing around bare Strings you would pass around Text objects
> (if you need to preserve language tags).

Yeah, storing that in Text objects instead of Strings seems like the better way to go.

Cheers
Philippe
On 28.08.2009, at 14:29, Philippe Marschall wrote:
> 2009/8/28 Bert Freudenberg <[hidden email]>:
>> ...
>> Wouldn't ByteArrays be a better way to efficiently store arrays of
>> bytes?
>
> For arrays of bytes yes, for Latin-1 strings no.

But ByteStrings are not really Latin1. We just pretend they are, for display purposes.

>> Strings are conceptually made of Characters, and there are more
>> than 256 of them. E.g. a la Python 3:
>
> Sure, there are also Integers bigger than 2^30 - 1, that doesn't mean
> that SmallInteger is a stupid idea and should be dropped. Especially
> considering that WideStrings still have performance issues and bugs.

We're not talking about doing this in the immediate future, I believe. But talking about it is valid.

- Bert -
In reply to this post by Bert Freudenberg
On 28-Aug-09, at 1:09 AM, Bert Freudenberg wrote:

> Wouldn't ByteArrays be a better way to efficiently store arrays of
> bytes? Strings are conceptually made of Characters, and there are
> more than 256 of them. E.g. a la Python 3:

So you're proposing that WideString, once it no longer has language tags, use its 4 bytes per character to point to Character objects rather than encoding the string at all? That would certainly be an interesting implementation. It would trade space for speed (of certain operations) in the case of CJK and other writing systems that involve large numbers of characters, as you'd have a bunch of Character objects persisting in the image, rather than just ephemerally. For some applications, that's exactly the right design choice, no doubt.

On the other hand, EncodedString (and subclasses like Utf8String or Latin1String) would make a different trade-off: speed (of certain operations) for space. Any #variableByteSubclass can efficiently store bytes. The reason to use, say, Utf8String rather than ByteArray is precisely *because* Strings are conceptually made of Characters. Encapsulation and all that.

> A Text defines attributes for Character runs in a String. Instead of
> storing the tag in each Character, it could be stored in an
> attribute of the Text. Instead of passing around bare Strings you
> would pass around Text objects (if you need to preserve language
> tags).

Sounds good.

Colin
> At Thu, 27 Aug 2009 22:19:49 -0700,
> Andreas Raab wrote:
>>
>> Yoshiki Ohshima wrote:
>>> One question is the roadmap; I would think ByteStrings will be
>>> retained for a while (or forever) but may be also phased out. And
>>> also it would be nice to tag ByteStrings. The natural order may be to
>>> try to move on to text attribute approach earlier so that the bare
>>> representation doesn't matter much. How do you think about these
>>> things?
>>
>> Interesting questions. I'm not sure what you mean by "tagging
>> ByteStrings" - generally my opinion is that String/ByteString/WideString
>> have the same relationship that Integer/SmallInteger/LargeInteger
>> have.
>
> With characters in 0..255 range, somebody may want to define
> language tags and put them. It would be nice if we can do that to be
> transparent.
>
> -- Yoshiki

On 28.08.2009, at 15:28, Colin Putney wrote:

> On 28-Aug-09, at 1:09 AM, Bert Freudenberg wrote:
>
>> Wouldn't ByteArrays be a better way to efficiently store arrays of
>> bytes? Strings are conceptually made of Characters, and there are
>> more than 256 of them. E.g. a la Python 3:
>
> So you're proposing that WideString, once it no longer has language
> tags, use its 4 bytes per character to point to Character objects
> rather than encoding the string at all? That would certainly be an
> interesting implementation. It would trade space for speed (of
> certain operations) in the case of CJK and other writing systems
> that involve large numbers of characters, as you'd have a bunch of
> Character objects persisting in the image, rather than just
> ephemerally. For some applications, that's exactly the right design
> choice, no doubt.

I'm not really proposing anything at this point, just widening the discussion Yoshiki started (cited above for reference).

> On the other hand EncodedString (and subclasses like Utf8String or
> Latin1String) would make a different trade-off, speed (of certain
> operations) for space. Any #variableByteSubclass can efficiently
> store bytes. The reason to use say, Utf8String rather than ByteArray
> is precisely *because* Strings are conceptually made of Characters.
> Encapsulation and all that.

I guess having encoded strings would be nice. OTOH I value simplicity. Does anybody have experience with the tradeoffs?

- Bert -
In reply to this post by Bert Freudenberg
2009/8/28 Bert Freudenberg <[hidden email]>:
> On 28.08.2009, at 14:29, Philippe Marschall wrote:
>
>> 2009/8/28 Bert Freudenberg <[hidden email]>:
>>>
>>> ...
>>> Wouldn't ByteArrays be a better way to efficiently store arrays of bytes?
>>
>> For arrays of bytes yes, for Latin-1 strings no.
>
> But ByteStrings are not really Latin1.

Yes they are. All characters in the Latin1 range are interned, and ByteString is for exactly those 8-bit characters.

Cheers
Philippe