Hi, Michael,
> > - Suppose you would like to use different line wrapping algorithms
> > for different languages, how would you keep that information?
>
> The question is which, if any, language dependent (text layout?!)
> attributes should be encoded into the String rather than kept as text
> attributes.

Exactly. I laid out my proposal for "a new system" several emails ago. In that scheme, raw data cannot really be displayed without the text attributes.

The question was whether it was practical to retrofit that idea into the Squeak system, where gross assumptions were made around Strings (including being able to be a symbol, being able to be displayed, etc.). If you would like to display any given String in a reasonable way in Squeak, you need to keep that information with the String.

-- Yoshiki
In reply to this post by Yoshiki Ohshima
On 11/06/2007, at 13:27, Yoshiki Ohshima wrote:
> Janko,
>
>> It seems that this was already a Yoshiki idea with WideString, so I'm
>> just extending that idea with a TwoByteString to cover 16 bits too.
>>
>> Yoshiki, am I right?
>
> For storing the bare Unicode code points, I think so. I'm not
> convinced that adding a 16-bit variation solves any real problems. But
> there may be something.
>
> My first few questions are:
>
> - While the vast majority of strings for, say, Japanese can be
> represented with characters in the BMP, you would use
> FourByteString for Chinese/Japanese/Korean and some others. Does
> this mean that you would *always* use FourByteString for these
> "languages" (and not scripts)?
>
> - Suppose you would like to use different line wrapping algorithms
> for different languages, how would you keep that information?
>
> -- Yoshiki

About 2 months ago the OpenMCL mailing list had this UTF16 vs. UTF32 discussion:

> how many angels can dance on a unicode character?
> http://thread.gmane.org/gmane.lisp.openmcl.devel/1756/focus=1763

Gary Byers (the OpenMCL developer) finishes with this conclusion:

> If these numbers are roughly accurate and if the sketch of what
> a displaced SIMPLE-STRING object would look like is realistic,
> then I'd say that using UTF-16 to represent arbitrary Unicode
> characters in a realistic way costs about as much memory-wise
> as using UTF-32 does, is somewhat slower in the simplest cases
> and much slower in general, has very complex boundary
> cases once we step outside the BMP, and just generally doesn't
> seem to have many socially-redeeming qualities that I can see.

Perhaps in Squeak it is different (no alignment?), but if I doIt:

  (ByteString allInstances collect: [:s | s size]) sum asFloat

in a 3.8.1 basic image, I obtain: 1.943098e6 (63672 strings at 30.5 bytes average).

So, all of this talk is for about 4 MB extra (in that image Squeak takes 26.8 MB at startup)?
On Mon, 11 Jun 2007 21:29:54 +0200, Javier Diaz-Reinoso wrote:
...
> perhaps in Squeak is different (no alignment?), but if I doIt:
> (ByteString allInstances collect: [:s | s size]) sum asFloat (in a 3.8.1
> basic image), I obtain:
>
> 1.943098e6, (63672 strings at 30.5 bytes average)

You may want to add ByteSymbols: 648278.0 / 38156 = 16.99 (characters / instances = average size), from a squeak-dev image.

/Klaus

> so, all of this talk is for about 4 MB extra (in that image squeak takes
> 26.8 MB at startup)?
In reply to this post by Colin Putney
Hi Colin,
Colin Putney wrote:
>
> On Jun 11, 2007, at 4:04 AM, Janko Mivšek wrote:
>
>> Anyone can definitively stay with UTF8 encoded strings in plain
>> ByteString or subclass to UTF8String by himself. But I don't know why
>> we need to have UTF8String as part of the string framework. Just because
>> of meaning? Then we also need to introduce an ASCIIString :)
>
>> I think that preserving simplicity is also an important goal. We need
>> to find a general yet simple solution for Unicode Strings, which will
>> be good enough for most uses, as is the case for numbers for instance.
>> We deal with more special cases separately. I claim that pure Unicode
>> strings in Byte, TwoByte or FourByteString is such a general support.
>> UTF8String is already a specific one.
>
> Ok, so what you're saying is this: ByteString, TwoByteString and
> FourByteString are good enough for most purposes. Web developers and
> anyone else that needs to work with other encodings should roll their
> own solutions, so as not to burden the rest of the community with
> clutter caused by support for other encodings, or even hooks to make
> such things easy to integrate with the base string code.
>
> Is that a fair characterization of your position?

Yes, or just a bit better said: my position is a separation of internal string representation from encodings. Internal strings should be in pure Unicode while conversions to other encodings should be done separately, probably best with the already existing TextEncoders. Those text encoders can be extended to meet wider requirements, but strings shall stay strings - they shall contain characters only.

By the way, I'm a web developer too and porting Aida to Squeak actually started my interest in Unicode support here :)

Best regards
Janko

--
Janko Mivšek
AIDA/Web Smalltalk Web Application Server
http://www.aidaweb.si
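To make that boundary conversion concrete, here is a small workspace-style sketch (not Squeak's actual converter code): incoming UTF-8 bytes are decoded once, at the edge, into a plain String of Unicode characters, so the String itself carries no encoding information. It handles only 1- to 3-byte sequences, ignores invalid input, and assumes that storing a character with a code point above 255 widens the string automatically, as discussed elsewhere in this thread.

  | utf8 in out |
  utf8 := #(74 97 110 107 111 32 77 105 118 197 161 101 107).  "'Janko Mivšek' as UTF-8 bytes"
  in := ReadStream on: utf8.
  out := WriteStream on: String new.
  [in atEnd] whileFalse: [
      | b code |
      b := in next.
      code := b < 16r80
          ifTrue: [b]
          ifFalse: [b < 16rE0
              ifTrue: [((b bitAnd: 16r1F) bitShift: 6) + (in next bitAnd: 16r3F)]
              ifFalse: [((b bitAnd: 16r0F) bitShift: 12)
                  + ((in next bitAnd: 16r3F) bitShift: 6)
                  + (in next bitAnd: 16r3F)]].
      out nextPut: (Character value: code)].
  out contents  "a plain (Byte- or Wide-) String containing 'Janko Mivšek'"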
In reply to this post by Yoshiki Ohshima
Hi Yoshiki,
Yoshiki Ohshima wrote:
> Janko,
>
>> It seems that this was already a Yoshiki idea with WideString, so I'm
>> just extending that idea with a TwoByteString to cover 16 bits too.
>>
>> Yoshiki, am I right?
>
> For storing the bare Unicode code points, I think so. I'm not
> convinced that adding a 16-bit variation solves any real problems. But
> there may be something.

For the Slovenian language with Latin 2 script we need to have TwoByteStrings, and the same goes for all of Eastern Europe, Greek, and Cyrillic. And because I'm using an image as a database, I just cannot afford 4-byte strings... And for shorter Slovenian strings even ByteStrings suffice.

> - While the vast majority of strings for, say, Japanese can be
> represented with characters in the BMP, you would use
> FourByteString for Chinese/Japanese/Korean and some others. Does
> this mean that you would *always* use FourByteString for these
> "languages" (and not scripts)?

My proposal allows strings to "scale" to support wider characters by widening themselves, from Byte to TwoByte and then FourByteString. Determination of the width of a string is automatic (as it already is for WideString): you start with a ByteString and when you put the first character with a code point above 255, the ByteString is automatically converted to a TwoByteString or even a FourByteString. The same goes for a TwoByteString when you add a character >= 2**16. Strings therefore don't need to be aware at all of the languages they support.

> - Suppose you would like to use different line wrapping algorithms
> for different languages, how would you keep that information?

Line ends internally should be Character cr only (how is that in Squeak anyway?). Different line ends are again a responsibility of streams to the external world. What about a separate Locale object for all that language-specific information?

Best regards
Janko

--
Janko Mivšek
AIDA/Web Smalltalk Web Application Server
http://www.aidaweb.si
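For readers following along in Squeak, a tiny workspace example of the widening Janko describes; it assumes Squeak's current behaviour, where storing a character with a code point above 255 into a ByteString converts it in place to a WideString (the proposed TwoByteString would simply slot in between):

  | s |
  s := 'Zalec' copy.
  s class.                              "ByteString"
  s at: 1 put: (Character value: 381).  "Ž - code point 381, does not fit in one byte"
  s class.                              "WideString - the string widened automatically"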
In reply to this post by Javier Diaz-Reinoso
Hi Javier,
Javier Diaz-Reinoso wrote:
> About 2 months ago the OpenMCL mailing list had this UTF16 vs. UTF32
> discussion:
>> how many angels can dance on a unicode character?
>> http://thread.gmane.org/gmane.lisp.openmcl.devel/1756/focus=1763
>
> Gary Byers (the OpenMCL developer) finishes with this conclusion:
>> If these numbers are roughly accurate and if the sketch of what
>> a displaced SIMPLE-STRING object would look like is realistic,
>> then I'd say that using UTF-16 to represent arbitrary Unicode
>> characters in a realistic way costs about as much memory-wise
>> as using UTF-32 does, is somewhat slower in the simplest cases
>> and much slower in general, has very complex boundary
>> cases once we step outside the BMP, and just generally doesn't
>> seem to have many socially-redeeming qualities that I can see.
>
> Perhaps in Squeak it is different (no alignment?), but if I doIt:
> (ByteString allInstances collect: [:s | s size]) sum asFloat (in a 3.8.1
> basic image), I obtain:
>
> 1.943098e6, (63672 strings at 30.5 bytes average)
>
> so, all of this talk is for about 4 MB extra (in that image Squeak takes
> 26.8 MB at startup)?

Consider the image as a database where you store strings from your application. In that case space-efficient but still manipulable strings really matter. For instance, I run one 380MB VW image full of TwoByteStrings, and this image would probably take 760MB with only FourByteStrings...

Best regards
Janko

--
Janko Mivšek
AIDA/Web Smalltalk Web Application Server
http://www.aidaweb.si
Janko,
> Consider image as a database where you store strings from your
> application. In that case space efficient but still manipulable strings
> really matter. For instance, I run one 380MB VW image full of
> TwoByteStrings and this image would probably have 760M with only
> FourByteStrings ...

Just a thought, but if the space efficiency in "the image as database" is the biggest reason for you to add a 16-bit variation, how about you just write an optimized version of UTF16TextConverter that works well for WideString (one that converts WideString from/to ByteArray), and define #hibernate and #unhibernate methods (or equivalents) somewhere to convert strings to/from that form at image shutdown and startup time? This way, only the strings you "touch" (to display their contents on screen, etc.) get unhibernated to WideString, and the rest of the (presumably majority of) strings can stay in the 16-bit representation...

-- Yoshiki
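A rough sketch of the packing Yoshiki suggests, as workspace code rather than his proposed UTF16TextConverter; it assumes every code point in the string fits in 16 bits and ignores byte order and surrogate issues:

  | wide hibernated unhibernated |
  wide := WideString with: (Character value: 352) with: (Character value: 269).  "Š and č"
  "hibernate: pack each code point into two bytes"
  hibernated := ByteArray new: wide size * 2.
  1 to: wide size do: [:i | | code |
      code := (wide at: i) asInteger.
      hibernated at: 2 * i - 1 put: (code bitShift: -8).
      hibernated at: 2 * i put: (code bitAnd: 16rFF)].
  "unhibernate: unpack the bytes back into a WideString on demand"
  unhibernated := WideString new: hibernated size // 2.
  1 to: unhibernated size do: [:i |
      unhibernated at: i put: (Character value:
          ((hibernated at: 2 * i - 1) bitShift: 8) + (hibernated at: 2 * i))].
  unhibernated = wide  "true"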
In reply to this post by Janko Mivšek
Janko Mivšek wrote:
>> so, all of this talk is for about 4 MB extra (in that image squeak
>> take 26.8 MB at startup)?
>
> Consider image as a database where you store strings from your
> application. In that case space efficient but still manipulable strings
> really matter. For instance, I run one 380MB VW image full of
> TwoByteStrings and this image would probably have 760M with only
> FourByteStrings ...

Actually, I would be very interested in a more accurate answer than "probably", since the 2x answer assumes that the whole image consists of 2-byte strings and that there is zero overhead for headers etc., both of which are obviously not the case. If you wouldn't mind, could you run a little script that computes the number of characters that are actually stored as 2 bytes? Something like:

  TwoByteString allInstances inject: 0 into: [:sum :str | sum + str size].

This strictly counts the number of characters that "matter", i.e., that are affected by an encoding change, and I'd be interested in getting some data point about how that looks in a real application (e.g., whether that is in the 10%, 25%, or 50% range). In particular, considering that VW probably uses the most compact form by default, that there is probably quite a bit of application code running, and that there is probably more than just strings to keep in the data, I'm really curious how much of that ends up being relevant for the 2-byte encoding.

Thanks,
  - Andreas
In reply to this post by Janko Mivšek
On Jun 11, 2007, at 1:31 PM, Janko Mivšek wrote:

>> Is that a fair characterization of your position?

> Yes, or just a bit better said: my position is a separation of
> internal string representation from encodings. Internal strings
> should be in pure Unicode while conversions to other encodings
> should be done separately, probably best with already existing
> TextEncoders. Those text encoders can be extended to meet wider
> requirements, but strings shall stay strings - they shall contain
> characters only.

Well, this is progress, of a sort. What you write above would imply that Strings should be arrays of pointers to Character objects. Your proposal is actually to have strings encoded as ISO 8859-1, UCS-2 or UCS-4. That's a reasonable optimization to save space, so long as the semantics of strings are preserved - other objects can't tell what the internal representation is, because all they see are characters. But if encapsulation works for fixed length encodings, why not for UTF-8 or UTF-16?

> By the way, I'm a web developer too and porting Aida to Squeak
> actually started my interest in Unicode support here :)

Yeah, I was wondering about that. Does Aida do a whole lot of work on string buffers or something? Doesn't it use streams? Why are you so dead set against variable length encodings?

One other thing: you seem to be advocating that Squeak just adopt the same design that VisualWorks uses. VisualWorks is great, but it does have immediate Characters, which Squeak does not. That changes the design constraints a bit.

Colin
On Jun 12, 2007, at 8:29 , Colin Putney wrote:
> Your proposal is actually to have strings encoded as ISO 8859-1,
> UCS-2 or UCS-4.

Actually, the idea is that a String has Unicode throughout, with no encoding. A string is simply a flat array of Unicode code points.

To optimize space usage we choose the lowest number of bytes per character that can encompass all code points in a String. This is implemented as specialized subclasses of String. So for code points below 256 we use ByteString (8 bits per char), for all others WideString (32 bits per char). This is purely space optimization, not a change in encoding.

Now, the proposal is to use an intermediate 2-byte representation for code points below 65536. Nobody has demonstrated the general usefulness of this optimization yet. In particular, the Squeak VM does not support 16-bit arrays directly; they have to be emulated using 8-bit or 32-bit words. For the latter, prims 144 and 145 might help, but the problem of non-even length would have to be addressed.

Also, the "purity" of Unicode strings does not translate directly into the implementation, which reserves the most significant byte in a WideString word for a "language code". That byte is otherwise unused (code points range from 0 to 16r10FFFF) and is supposed to help choosing glyph shapes that share a code point but differ in appearance depending on the language. I suppose this was to restrict changes to the String hierarchy; a better place for language info would be text attributes - but then potentially a lot of code would have to be adapted to pass Texts rather than Strings. It might be worth revising that design.

For dealing with encodings perhaps it would be useful to wrap a ByteArray with a codec into an EncodedString - that way encoded data could be passed from a webserver and back unmodified. #asString would use the codec to convert to a proper String, which might also be used for displaying that EncodedString. I'd not actually make it a String subclass, so perhaps a name other than EncodedString would be better.

My €/50 ...

- Bert -
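A rough sketch of that wrapper idea - all names here are hypothetical, not existing Squeak classes, and the codec is assumed to be any object that answers #decode: with a ByteArray and returns a String (e.g. a thin wrapper around one of the existing text converters):

  Object subclass: #EncodedBytes
      instanceVariableNames: 'bytes codec'
      classVariableNames: ''
      poolDictionaries: ''
      category: 'Experimental-Encodings'

  EncodedBytes class>>bytes: aByteArray codec: aCodec
      ^ self new setBytes: aByteArray codec: aCodec

  EncodedBytes>>setBytes: aByteArray codec: aCodec
      bytes := aByteArray.
      codec := aCodec

  EncodedBytes>>byteArray
      "The raw, still-encoded bytes, e.g. to hand back to a webserver unmodified."
      ^ bytes

  EncodedBytes>>asString
      "Decode lazily, only when characters are actually needed (e.g. for display)."
      ^ codec decode: bytes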
Bert Freudenberg wrote:
> Also, the "purity" of Unicode strings does not translate directly into

Speaking of which...

In Sophie we discovered that Squeak still uses a not-quite-Unicode mapping in the text converters, so we had to roll our own translation tables (see the Sophie-RTF package for the complete list). As an example, the MacRoman conversion below.

Probably to keep the font machinery happy?

Comments? Explanations? Corrections?

Michael


Squeak
#(196 197 199 201 209 214 220 225 224 226 228 227 229 231 233 232 234 235 237 236 238 239 241 243 242 244 246 245 250 249 251 252 134 176 162 163 167 149 182 223 174 169 153 180 168 128 198 216 129 177 138 141 165 181 142 143 144 154 157 170 186 158 230 248 191 161 172 166 131 173 178 171 187 133 160 192 195 213 140 156 150 151 147 148 145 146 247 179 255 159 185 164 139 155 188 189 135 183 130 132 137 194 202 193 203 200 205 206 207 204 211 212 190 210 218 219 217 208 136 152 175 215 221 222 184 240 253 254)

Sophie
#(196 197 199 201 209 214 220 225 224 226 228 227 229 231 233 232 234 235 237 236 238 239 241 243 242 244 246 245 250 249 251 252 8224 176 162 163 167 8226 182 223 174 169 8482 180 168 8800 198 216 8734 177 8804 8805 165 181 8706 8721 8719 960 8747 170 186 937 230 248 191 161 172 8730 402 8776 8710 171 187 8230 160 192 195 213 338 339 8211 8212 8220 8221 8216 8217 247 9674 255 376 8260 8364 8249 8250 64257 64258 8225 183 8218 8222 8240 194 202 193 203 200 205 206 207 204 211 212 63743 210 218 219 217 305 710 732 175 728 729 730 184 733 731 711 256)
Michael,
> Speaking of which...
> In Sophie we discovered that Squeak still uses a not quite unicode
> mapping in the text converters so we had to roll our own translation
> tables (see Sophie-RTF package for the complete list).
> As an example the MacRoman conversion below.
>
> Probably to keep the font machinery happy?
>
> Comments? Explanations? Corrections?

That was to keep the image on the Windows VM happy. The table is the "reverse" of the one in the Windows VM. It was wrong, but it was byte-to-byte and lossless.

What should happen is to have two (or more) tables: one to compensate for the table in the Windows VM, and another that is the correct one, like yours.

-- Yoshiki
In reply to this post by Michael Rueger-4
On Jun 12, 2007, at 18:32 , Michael Rueger wrote:

> Bert Freudenberg wrote:
>
>> Also, the "purity" of Unicode strings does not translate directly
>> into
>
> Speaking of which...
> In Sophie we discovered that Squeak still uses a not quite unicode
> mapping in the text converters so we had to roll our own
> translation tables (see Sophie-RTF package for the complete list).
> As an example the MacRoman conversion below.
>
> Probably to keep the font machinery happy?
>
> Comments? Explanations? Corrections?

That's just historical ... about 1999. It is a reversible mapping between MacRoman and Latin1 which maps the characters common to both to their counterparts, but the rest is just filled up to preserve different codes while fitting into a byte. Guess we should use a table like Sophie's nowadays.

> Michael
>
> Squeak
> #(196 197 199 201 209 214 220 225 224 226 228 227 229 231 233 232
> 234 235 237 236 238 239 241 243 242 244 246 245 250 249 251 252 134
> 176 162 163 167 149 182 223 174 169 153 180 168 128 198 216 129 177
> 138 141 165 181 142 143 144 154 157 170 186 158 230 248 191 161 172
> 166 131 173 178 171 187 133 160 192 195 213 140 156 150 151 147 148
> 145 146 247 179 255 159 185 164 139 155 188 189 135 183 130 132 137
> 194 202 193 203 200 205 206 207 204 211 212 190 210 218 219 217 208
> 136 152 175 215 221 222 184 240 253 254 )
>
> Sophie
> #(
> 196 197 199 201 209 214 220 225 224 226 228 227 229 231 233 232
> 234 235 237 236 238 239 241 243 242 244 246 245 250 249 251 252
> 8224 176 162 163 167 8226 182 223 174 169 8482 180 168 8800 198 216
> 8734 177 8804 8805 165 181 8706 8721 8719 960 8747 170 186 937 230 248
> 191 161 172 8730 402 8776 8710 171 187 8230 160 192 195 213 338 339
> 8211 8212 8220 8221 8216 8217 247 9674 255 376 8260 8364 8249 8250
> 64257 64258
> 8225 183 8218 8222 8240 194 202 193 203 200 205 206 207 204 211 212
> 63743 210 218 219 217 305 710 732 175 728 729 730 184 733 731 711 256)

- Bert -
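For what it's worth, applying such a table is straightforward; here is a hypothetical sketch (the selector and the class variable MacRomanTable, assumed to hold the 128-entry Sophie array above, are made up for illustration; characters 0-127 are the same in MacRoman and Unicode, so only codes 128-255 need the lookup):

  macRomanToUnicode: aByteString
      "Answer a Unicode string for the MacRoman-encoded argument."
      | result |
      result := WideString new: aByteString size.
      1 to: aByteString size do: [:i | | code |
          code := (aByteString at: i) asInteger.
          result at: i put: (Character value:
              (code < 128 ifTrue: [code] ifFalse: [MacRomanTable at: code - 127]))].
      ^ result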
In reply to this post by Yoshiki Ohshima
Yoshiki Ohshima wrote:
> That was to keep the image on Windows VM happy. The table is the
> "reverse" of the one in the Windows VM. It was wrong, but it was a
> byte to byte and lossless.
>
> What should happen is to have two (or more) table. One is compensate
> the table in Windows VM, and another is the correct one like you have.

Hmm, or use the new Unicode VM instead of compensating one symptom fix with another? The above doesn't yield correct input behavior for some characters, and with the newest Unicode VM we should be able to get rid of it?

Michael
Michael Rueger wrote:
> Yoshiki Ohshima wrote:
>
>> That was to keep the image on Windows VM happy. The table is the
>> "reverse" of the one in the Windows VM. It was wrong, but it was a
>> byte to byte and lossless.
>>
>> What should happen is to have two (or more) table. One is compensate
>> the table in Windows VM, and another is the correct one like you have.
>
> Hmm, or use the new Unicode VM instead of compensating one symptom fix
> with another? The above doesn't yield correct input behavior for some
> characters and with the newest Unicode VM we should be able to get rid
> of it?
>
> Michael

Both Bert and Yoshiki are right, but to be more precise, this is the Windows Latin-1 code page (CP1252), see http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx

So macToSqueak and squeakToMac should rather be named macRomanToCP1252 and cp1252ToMacRoman...

So yes, I vote like Michael: fix it!

Otherwise, Squeak is not Unicode internally for some of the characters from 128 to 255, and this contradicts the basic assumptions made in this thread and by naive readers like me...

Nicolas
nicolas cellier wrote:
>
> Otherwise, Squeak is not Unicode internally for some of the characters
> from 128 to 255, and this contradicts the basic assumptions made in this
> thread and by naive readers like me...
>
> Nicolas

Spoke too fast: in fact, only characters 16r80 to 16r9F differ from Unicode. And those codes are free in Unicode/ISO 10646...

I'm still in favour of true Unicode internally, but there is also the pragmatic point of view... Up to the gurus to decide...

Nicolas
In reply to this post by Nicolas Cellier-3
Nicolas,
> Both Bert and Yoshiki are right, but to be more precise, this is the
> Windows Latin-1 code page (CP1252), see
> http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx
>
> So macToSqueak and squeakToMac should rather be named macRomanToCP1252
> and cp1252ToMacRoman...

Yes. That is more sensible.

> So yes, I vote like Michael: fix it!
>
> Otherwise, Squeak is not Unicode internally for some of the characters
> from 128 to 255, and this contradicts the basic assumptions made in this
> thread and by naive readers like me...

Otherwise? I'm not sure that statement is true. I would phrase it this way: Squeak is Unicode internally, but some of the conversions that happen at the image boundary have bugs. If you create a character within the range of 160 to 255, you get the right/acceptable glyph for the character.

-- Yoshiki
In reply to this post by Andreas.Raab
Hi Andreas,
Here is an analysis of one VW "image as a database" which runs a document portal for the quality management system of one of our pharmaceutical distributors.

image size: 113MB

                 instances   size total   avg size (bytes)    byte size
ByteString         355.068    8.562.097                 24    9.982.369
TwoByteString       19.848    5.372.602                541   10.824.596

If I remember correctly, byte-indexed objects have a 4-byte header in VW, therefore:

  byte size = 4 bytes per header + size (2*size for TwoByteString)

This should also be rounded up to 4 bytes, which I ignored for now.

Strings therefore make up approx. 20% of the whole image. If 4-byte strings were used instead of 2-byte ones, the string space increase would be:

  byte size with 2BString: 20.806.965
  byte size with 4BString: 31.552.169
  increase: 52%

So, "2x bigger image" was really an exaggerated statement, but you can see from these results that the image would grow quite extensively if 4-byte strings were used instead of 2-byte ones.

Best regards
Janko

Andreas Raab wrote:
> Janko Mivšek wrote:
>>> so, all of this talk is for about 4 MB extra (in that image squeak
>>> take 26.8 MB at startup)?
>>
>> Consider image as a database where you store strings from your
>> application. In that case space efficient but still manipulable
>> strings really matter. For instance, I run one 380MB VW image full of
>> TwoByteStrings and this image would probably have 760M with only
>> FourByteStrings ...
>
> Actually, I would be very interested in a more accurate answer than
> "probably" since the 2x answer assumes that the whole image consists of
> 2-byte strings and that there is zero overhead for headers etc. both of
> which is obviously not the case. If you wouldn't mind, could you run a
> little script that computes the number of characters that are actually
> stored as 2 bytes? Something like:
>
>   TwoByteString allInstances inject: 0 into: [:sum :str | sum + str size].
>
> This strictly counts the number of characters that "matter", i.e., that
> are affected by an encoding change and I'd be interested in getting some
> data point about how that looks in a real application (e.g., whether
> that is in the 10%, 25%, or 50% range). In particular considering that
> VW probably uses the most compact form by default and that there is
> probably quite a bit of application code running and that there is
> probably more than just strings to keep in the data, I'm really curious
> how much of that ends up to be relevant for the 2-byte encoding.
>
> Thanks,
> - Andreas

--
Janko Mivšek
AIDA/Web Smalltalk Web Application Server
http://www.aidaweb.si
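For comparison, a rough Squeak-side doIt in the same spirit, with WideString standing in for TwoByteString and an assumed flat 8-byte per-object header; real Squeak header sizes vary with the object, so the byte totals are only approximate:

  | headerBytes report |
  headerBytes := 8.
  report := WriteStream on: String new.
  {ByteString. WideString} do: [:cls | | instances chars bytes |
      instances := cls allInstances.
      chars := instances inject: 0 into: [:sum :s | sum + s size].
      bytes := (chars * (cls == WideString ifTrue: [4] ifFalse: [1]))
          + (instances size * headerBytes).
      report
          nextPutAll: cls name; nextPutAll: ': ';
          print: instances size; nextPutAll: ' instances, ';
          print: chars; nextPutAll: ' characters, ~';
          print: bytes; nextPutAll: ' bytes'; cr].
  report contents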
In reply to this post by Bert Freudenberg
On Jun 12, 2007, at 4:28 AM, Bert Freudenberg wrote:
> On Jun 12, 2007, at 8:29 , Colin Putney wrote:
>
>> Your proposal is actually to have strings encoded as ISO 8859-1,
>> UCS-2 or UCS-4.
>
> Actually, the idea is that a String has Unicode throughout, with no
> encoding. A string is simply a flat array of Unicode code points.
>
> To optimize space usage we choose the lowest number of bytes per
> character that can encompass all code points in a String. This is
> implemented as specialized subclasses of String. So for code points
> below 256 we use ByteString (8 bits per char), for all others
> WideString (32 bits per char). This is purely space optimization,
> not a change in encoding.

Yes, I understand how m17n was implemented in Squeak. I'm trying to challenge one of the ideas that underlies Janko's proposal, which you lay out beautifully above: "String has Unicode throughout, with no encoding." And again at the end: "This is purely space optimization, not a change in encoding."

If a String were a flat array of Unicode code points, it would be implemented in Smalltalk as an array of Characters, wouldn't it? The fact that you've chosen to hide the internal representation of the string and use a "variable byte" or "variable word" subclass to store bytes, rather than objects, is an indication that the strings *are* encoded. In fact, the encodings have names: ISO 8859-1 and UCS-4. Janko is proposing to add a string class that internally stores strings encoded in UCS-2 to the mix.

So what's so holy about these particular encodings, besides the fact that they're especially efficient on the VisualWorks VM?

Colin
In reply to this post by Janko Mivšek
Hi Janko,
> >> 1. internally everything is in 16bit Unicode, without any additional
> >> encoding info attached to strings
> >
> > If they use 16 bits per char, how do they deal with surrogate pairs?
>
> I looked once again and there is actually a FourByteString too. This
> probably answers your question. VW also supports the Japanese locale well.

Just for correction: VW does not support surrogate pairs well. A Character whose value is greater than 65535 would easily crash the image.

This is a quote of the Character class comment:

--
For character codes between 0 and 65535 (16rFFFF), the Unicode Character Code Standard is used. Characters with codes between 0 and 255 also coincide with the ISO 8859-1 standard. At present, mappings for Characters greater than 65535 are undefined, and such characters are not fully supported. In time, these will probably be defined to conform to the ISO 10646 superset of Unicode.
---

In VW, a Japanese string is represented as a TwoByteString, so it cannot handle some Japanese characters. (But practically, in most cases, it is enough. And it is also good for reducing memory consumption.)

Regards,
--
[:masashi | ^umezawa]