New Win32 VM [m17n testers needed]

Re: UTF8 Squeak

Yoshiki Ohshima
  Hi, Michael,

> >   - Suppose you would like to use different line wrapping algorithms
> >     for different languages, how would you keep that information?
>
> The question is which, if any, language dependent (text layout?!)
> attributes should be encoded into the String rather than kept as text
> attributes.

  Exactly.  I described my proposal for "a new system" several emails
ago.  In that design, raw data cannot really be displayed without the
text attributes.

  The question was whether it was practical to retrofit that idea into
the Squeak system, where gross assumptions were made around Strings
(including being able to become a Symbol, being able to be displayed,
etc.).  If you wanted to display any given String in a reasonable way
in Squeak, you needed to have that information there with the String.

-- Yoshiki

Re: UTF8 Squeak

Javier Diaz-Reinoso
In reply to this post by Yoshiki Ohshima
On 11/06/2007, at 13:27, Yoshiki Ohshima wrote:

>   Janko,
>
>> It seems that this was already a Yoshiki idea with WideString, so I'm
>> just extending that idea with a TwoByteString to cover 16 bits too.
>>
>> Yoshiki, am I right?
>
>   For storing the bare Unicode code points, I think so.  I'm not
> convinced that adding 16-bit variation solves any real problems.  But
> there may be something.
>
>   My first a few questions are:
>
>   - While vast majority of strings for, say, Japanese can be
>     represented with in the characters in BMP, you would use
>     FourByteString for Chinese/Japanese/Korean and some others.  Does
>     this mean that you would *always* use FourByteString for these
>     "languages" (and not scripts?)
>
>   - Suppose you would like to use different line wrapping algorithms
>     for different languages, how would you keep that information?
>
> -- Yoshiki
>
About 2 months ago the OpenMCL mailing list had this UTF-16 vs.
UTF-32 discussion:
> how many angels can dance on a unicode character?
> http://thread.gmane.org/gmane.lisp.openmcl.devel/1756/focus=1763
>

Gary Byers (the OpenMCL developer) finished with this conclusion:
> If these numbers are roughly accurate and if the sketch of what
> a displaced SIMPLE-STRING object would look like is realistic,
> then I'd say that using UTF-16 to represent arbitrary Unicode
> characters in a realistic way costs about as much memory-wise
> as using UTF-32 does, is somewhat slower in the simplest cases
> and much slower in general, has very complex boundary
> cases once we step outside the BMP, and just generally doesn't
> seem to have many socially-redeeming qualities that I can see.

Perhaps Squeak is different (no alignment?), but if I doIt:
(ByteString allInstances collect: [:s | s size]) sum asFloat
(in a 3.8.1 basic image), I obtain:

1.943098e6 (63672 strings at 30.5 bytes average)

So, is all of this talk about roughly 4 MB extra (in that image Squeak
takes 26.8 MB at startup)?
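
The arithmetic behind that question can be spelled out. The following is a minimal sketch in Python (not Squeak code) that uses only the two figures reported above and assumes that object headers and alignment can be ignored:

```python
# Figures reported above for a Squeak 3.8.1 basic image.
total_chars = 1_943_098   # sum of the sizes of all ByteString instances
string_count = 63_672     # number of ByteString instances

average = total_chars / string_count


def payload_bytes(bytes_per_char):
    """Character payload only; headers and alignment are ignored (assumption)."""
    return total_chars * bytes_per_char


extra_16 = payload_bytes(2) - payload_bytes(1)   # going from 8-bit to 16-bit
extra_32 = payload_bytes(4) - payload_bytes(1)   # going from 8-bit to 32-bit
gap = payload_bytes(4) - payload_bytes(2)        # 16-bit versus 32-bit storage

print(f"average length:    {average:.1f} characters")
print(f"extra for 16-bit:  {extra_16 / 1e6:.2f} MB")
print(f"extra for 32-bit:  {extra_32 / 1e6:.2f} MB")
print(f"16-bit vs 32-bit:  {gap / 1e6:.2f} MB")
```

Under these assumptions the gap between 16-bit and 32-bit storage is a little under 4 million bytes, which is in line with the "about 4 MB" estimate.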





Re: UTF8 Squeak

Klaus D. Witzel
On Mon, 11 Jun 2007 21:29:54 +0200, Javier Diaz-Reinoso wrote:
...
> perhaps in Squeak is different (no alignment?), but if I doIt:  
> (ByteString allInstances collect:[:s | s size] ) sum asFloat (in a 3.8.1  
> basic image), I obtain:
>
> 1.943098e6, (63672 strings at 30.5 bytes average)

You may want to add ByteSymbols: 648278.0 / 38156 = 16.99 bytes average,
from a squeak-dev image.

/Klaus

> so, all of this talk is for about 4 MB extra (in that image squeak take  
> 26.8 MB at startup)?.



Re: UTF8 Squeak

Janko Mivšek
In reply to this post by Colin Putney
Hi Colin,

Colin Putney wrote:

>
> On Jun 11, 2007, at 4:04 AM, Janko Mivšek wrote:
>
>> Anyone can definitely stay with UTF8 encoded strings in plain
>> ByteString or subclass to UTF8String by himself. But I don't know why
>> we need to have UTF8String as part of string framework. Just because
>> of meaning? Then we also need to introduce an ASCIIString :)
>
>> I think that preserving simplicity is also an important goal. We need
>> to find a general yet simple solution for Unicode Strings, which will
>> be good enough for most uses, as is the case for numbers for instance.
>> We deal with more special cases separately. I claim that pure Unicode
>> strings in Byte, TwoByte or FourByteString is such a general support.
>> UTF8String is already a specific one.
>
> Ok, so what you're saying is this: ByteString, TwoByteString and
> FourByteString are good enough for the most purposes. Web developers and
> anyone else that needs to work with other encodings should roll their
> own solutions, so as not to burden the rest of the community with
> clutter caused by support for other encodings, or even hooks to make
> such things easy to integrate with the base string code.
>
> Is that a fair characterization of your position?
>
Yes, or to put it a bit better: my position is a separation of internal
string representation from encodings. Internal strings should be in pure
Unicode while conversions to other encodings should be done separately,
probably best with already existing TextEncoders. Those text encoders
can be extended to meet wider requirements, but strings shall stay
strings - they shall contain characters only.
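
The separation described above can be illustrated with a small sketch. It is written in Python rather than Squeak, and the sample name and the two target encodings are chosen only for illustration; Python's built-in codecs stand in for the TextEncoders mentioned above:

```python
# Internal representation: a plain sequence of Unicode code points.
text = "Miv\u0161ek"
code_points = [ord(ch) for ch in text]

# Encodings only appear at the boundary, through a separate conversion step.
as_utf8 = text.encode("utf-8")
as_latin2 = text.encode("iso8859_2")

# Decoding at the boundary gives back the same internal string.
from_utf8 = as_utf8.decode("utf-8")
from_latin2 = as_latin2.decode("iso8859_2")

print(code_points)
print(len(text), len(as_utf8), len(as_latin2))
```

The string itself has the same length and content no matter which external encoding is used; only the byte sequences at the boundary differ.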

By the way, I'm a web developer too, and porting Aida to Squeak is
actually what started my interest in Unicode support here :)

Best regards
JAnko


--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si

Re: UTF8 Squeak

Janko Mivšek
In reply to this post by Yoshiki Ohshima
Hi Yoshiki,

Yoshiki Ohshima wrote:

>   Janko,
>
>> It seems that this was already a Yoshiki idea with WideString, so I'm
>> just extending that idea with a TwoByteString to cover 16 bits too.
>>
>> Yoshiki, am I right?
>
>   For storing the bare Unicode code points, I think so.  I'm not
> convinced that adding 16-bit variation solves any real problems.  But
> there may be something.

For the Slovenian language, with its Latin-2 script, we need
TwoByteStrings; the same goes for all of Eastern Europe, Greek, and
Cyrillic. And because I'm using an image as a database, I just cannot
afford 4-byte strings... And for shorter Slovenian strings even
ByteStrings suffice.

>   - While vast majority of strings for, say, Japanese can be
>     represented with in the characters in BMP, you would use
>     FourByteString for Chinese/Japanese/Korean and some others.  Does
>     this mean that you would *always* use FourByteString for these
>     "languages" (and not scripts?)

My proposal allows strings to "scale" to support wider characters by
widening themselves from ByteString to TwoByteString and then to
FourByteString.

Determination of the width of a string is automatic (as it already is
for WideString): you start with a ByteString, and when you put in the
first character with a code point above 255, the ByteString is
automatically converted to a TwoByteString or even a FourByteString.
The same goes for a TwoByteString when you add a character with a code
point of 2**16 or above.

Strings therefore don't need to be aware at all of the languages they
support.
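
A minimal sketch of the automatic widening described above, written in Python rather than Squeak. The class name `WideningString` and its methods are hypothetical and exist only for this illustration; the storage is a byte buffer holding fixed-width, big-endian units:

```python
class WideningString:
    """Hypothetical sketch of a string that widens itself on demand."""

    def __init__(self):
        self.width = 1            # bytes per character: 1, 2 or 4
        self.data = bytearray()   # fixed-width storage, big-endian

    @staticmethod
    def width_for(code_point):
        if code_point <= 0xFF:
            return 1
        if code_point <= 0xFFFF:
            return 2
        return 4

    def __len__(self):
        return len(self.data) // self.width

    def __getitem__(self, index):
        # Non-negative indices only, to keep the sketch short.
        start = index * self.width
        return chr(int.from_bytes(self.data[start:start + self.width], "big"))

    def _widen(self, new_width):
        # Re-store every existing character using the wider unit size.
        code_points = [ord(self[i]) for i in range(len(self))]
        self.width = new_width
        self.data = bytearray()
        for cp in code_points:
            self.data += cp.to_bytes(new_width, "big")

    def append(self, char):
        needed = self.width_for(ord(char))
        if needed > self.width:
            self._widen(needed)
        self.data += ord(char).to_bytes(self.width, "big")


s = WideningString()
for ch in "abc":
    s.append(ch)
widths = [s.width]
s.append("\u0161")        # code point above 255, needs 2 bytes
widths.append(s.width)
s.append("\U00020BB7")    # code point above 16rFFFF, needs 4 bytes
widths.append(s.width)
print(widths, len(s))
```

Callers only ever see characters and a length; the unit width is an internal detail that changes as wider characters arrive.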

>   - Suppose you would like to use different line wrapping algorithms
>     for different languages, how would you keep that information?

Line ends internally should be Character cr only (how is that in Squeak
anyway?). Different line ends are again the responsibility of the
streams to the external world.

What about a separate Locale object for all that language specific
information?

Best regards
JAnko


--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si

Re: UTF8 Squeak

Janko Mivšek
In reply to this post by Javier Diaz-Reinoso
Hi Javier,

Javier Diaz-Reinoso wrote:

> About 2 months ago in the OpenMCL mailing list have this UTF16 vs. UTF32
> discussion:
>> how many angels can dance on a unicode character?
>> http://thread.gmane.org/gmane.lisp.openmcl.devel/1756/focus=1763
>>
>
> Gary Byers (the OpenMCL's developer) finish with this conclusion:
>> If these numbers are roughly accurate and if the sketch of what
>> a displaced SIMPLE-STRING object would look like is realistic,
>> then I'd say that using UTF-16 to represent arbitrary Unicode
>> characters in a realistic way costs about as much memory-wise
>> as using UTF-32 does, is somewhat slower in the simplest cases
>> and much slower in general, has very complex boundary
>> cases once we step outside the BMP, and just generally doesn't
>> seem to have many socially-redeeming qualities that I can see.

> perhaps in Squeak is different (no alignment?), but if I doIt:
> (ByteString allInstances collect:[:s | s size] ) sum asFloat (in a 3.8.1
> basic image), I obtain:
>
> 1.943098e6, (63672 strings at 30.5 bytes average)
>
> so, all of this talk is for about 4 MB extra (in that image squeak take
> 26.8 MB at startup)?.

Consider the image as a database where you store strings from your
application. In that case, space-efficient but still manipulable
strings really matter. For instance, I run one 380 MB VW image full of
TwoByteStrings, and this image would probably be 760 MB with only
FourByteStrings ...

Best regards
JAnko

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si

Re: UTF8 Squeak

Yoshiki Ohshima
  Janko,

> Consider image as a database where you store strings from your
> application. In that case space efficient but still manipulable strings
> really matter. For instance, I run one 380MB VW image full of
> TwoByteStrings and this image would probably have 760M with only
> FourByteStrings ...

  Just a thought, but if space efficiency in "the image as database"
is the biggest reason for you to add a 16-bit variation, how about you
just write an optimized version of UTF16TextConverter that works well
for WideString (that would convert WideString from/to ByteArray), and
define #hibernate and #unhibernate methods (or equivalents) somewhere
to convert to/from it at image shutdown and startup time?  This way,
only the strings you "touch" (to display the content on screen, etc.)
get unhibernated to WideString, and the rest of the (presumably the
majority of) strings can stay in the 16-bit representation...
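
The hibernate idea above can be sketched as follows. This is Python, not Squeak, and the class `HibernatingString` is hypothetical; the Python `str` stands in for WideString, `bytes` for ByteArray, and the built-in UTF-16 codec for an optimized UTF16TextConverter:

```python
class HibernatingString:
    """Hypothetical holder: compact UTF-16 bytes at rest, full string when touched."""

    def __init__(self, text):
        self._text = text      # "awake" form (stands in for WideString)
        self._packed = None    # "hibernated" form (stands in for ByteArray)

    def hibernate(self):
        if self._text is not None:
            self._packed = self._text.encode("utf-16-be")
            self._text = None

    def unhibernate(self):
        if self._text is None:
            self._text = self._packed.decode("utf-16-be")
            self._packed = None

    @property
    def is_hibernated(self):
        return self._text is None

    def value(self):
        # Touching the content wakes the string up.
        self.unhibernate()
        return self._text


strings = [
    HibernatingString(t)
    for t in ("Ljubljana", "\u010de\u0161nja", "\u65e5\u672c\u8a9e")
]
for s in strings:
    s.hibernate()                 # e.g. at image shutdown

touched = strings[1].value()      # only this one is converted back
print(touched, [s.is_hibernated for s in strings])
```

Only the string that is actually read returns to the wide form; the others stay packed at two bytes per character for as long as nobody touches them.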

-- Yoshiki

Re: UTF8 Squeak

Andreas.Raab
In reply to this post by Janko Mivšek
Janko Mivšek wrote:
>> so, all of this talk is for about 4 MB extra (in that image squeak
>> take 26.8 MB at startup)?.
>
> Consider image as a database where you store strings from your
> application. In that case space efficient but still manipulable strings
> really matter. For instance, I run one 380MB VW image full of
> TwoByteStrings and this image would probably have 760M with only
> FourByteStrings ...

Actually, I would be very interested in a more accurate answer than
"probably", since the 2x answer assumes that the whole image consists of
2-byte strings and that there is zero overhead for headers etc., both of
which are obviously not the case. If you wouldn't mind, could you run a
little script that computes the number of characters that are actually
stored as 2 bytes? Something like:

   TwoByteString allInstances inject: 0 into:[:sum :str| sum + str size].

This strictly counts the number of characters that "matter", i.e., that
are affected by an encoding change and I'd be interested in getting some
data point about how that looks in a real application (e.g., whether
that is in the 10%, 25%, or 50% range). In particular considering that
VW probably uses the most compact form by default and that there is
probably quite a bit of application code running and that there is
probably more than just strings to keep in the data, I'm really curious
how much of that ends up being relevant for the 2-byte encoding.

Thanks,
   - Andreas

Re: UTF8 Squeak

Colin Putney
In reply to this post by Janko Mivšek

On Jun 11, 2007, at 1:31 PM, Janko Mivšek wrote:

>> Is that a fair characterization of your position?
> Yes, or just a bit better said: my position is a separation of  
> internal string representation from encodings. Internal strings  
> should be in pure Unicode while conversions to other encodings  
> should be done separately, probably best with already existing  
> TextEncoders. Those text encoders can be extended to meet wider  
> requirements, but strings shall stay strings - they shall contain  
> characters only.

Well, this is progress, of a sort. What you write above would imply  
that Strings should be arrays of pointers to Character objects. Your  
proposal is actually to have strings encoded as ISO 8859-1, UCS-2 or  
UCS-4. That's a reasonable optimization to save space, so long as the  
semantics of strings are preserved - other objects can't tell what  
the internal representation is, because all they see are characters.

But if encapsulation works for fixed length encodings, why not for  
UTF-8 or UTF-16?
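
To make the question concrete, here is a sketch of a string that stores UTF-8 internally but only ever hands out characters. It is Python, not Squeak, and the class `Utf8BackedString` is hypothetical. Indexing is a linear scan in this version, which is the usual trade-off of variable-length storage:

```python
class Utf8BackedString:
    """Hypothetical string that stores UTF-8 but only ever hands out characters."""

    def __init__(self, text):
        self._bytes = text.encode("utf-8")

    def _starts(self):
        # Offsets of bytes that begin a character, i.e. everything except
        # continuation bytes of the form 0b10xxxxxx.
        return [i for i, b in enumerate(self._bytes) if (b & 0xC0) != 0x80]

    def __len__(self):
        return len(self._starts())

    def __getitem__(self, index):
        # Non-negative indices only, to keep the sketch short.
        starts = self._starts() + [len(self._bytes)]
        return self._bytes[starts[index]:starts[index + 1]].decode("utf-8")


s = Utf8BackedString("Miv\u0161ek")
chars = [s[i] for i in range(len(s))]
print(len(s), chars)
```

From the outside, the object answers its size in characters and returns characters by index, so clients cannot tell that the storage is variable-length.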

> By the way, I'm a web developer too and porting Aida to Squeak  
> actually started my interest on Unicode support here :)

Yeah, I was wondering about that. Does Aida do a whole lot of work on  
string buffers or something? Doesn't it use streams? Why are you so  
dead set against variable length encodings?

One other thing: you seem to be advocating that Squeak just adopt the  
same design that VisualWorks uses. VisualWorks is great, but it does  
have immediate Characters, which Squeak does not. That changes the  
design constraints a bit.

Colin

Re: UTF8 Squeak

Bert Freudenberg
On Jun 12, 2007, at 8:29 , Colin Putney wrote:

> Your proposal is actually to have strings encoded as ISO 8859-1,  
> UCS-2 or UCS-4.

Actually, the idea is that a String has Unicode throughout, with no  
encoding. A string is simply a flat array of Unicode code points.

To optimize space usage we choose the lowest number of bytes per  
character that can encompass all code points in a String. This is  
implemented as specialized subclasses of String. So for code points  
below 256 we use ByteString (8 bit per char), for all others  
WideString (32 bits per char). This is purely space optimization, not  
a change in encoding.

Now, the proposal is to use an intermediate 2-byte representation for  
code points below 65536. Nobody has demonstrated the general  
usefulness of this optimization yet, in particular since the Squeak  
VM does not support 16-bit arrays directly; they have to be  
emulated using 8-bit words or 32-bit words. For the latter, prims 144  
and 145 might help, but the problem of non-even length would have to  
be addressed.

Also, the "purity" of Unicode strings does not translate directly  
into the implementation, which reserves the most significant byte in  
a WideString word for a "language code". That byte is otherwise  
unused (code points range from 0 to 16r10FFFF) and is supposed to  
help choose glyph shapes that share a code point but differ in  
appearance depending on the language. I suppose this was to restrict  
changes to the String hierarchy; a better place for language info  
would be text attributes, but then potentially a lot of code would  
have to be adapted to pass Texts rather than Strings. It might be  
worth revising that design.
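
A sketch of the packing described above, in Python rather than Squeak. It follows the description literally and assumes the language code occupies the top 8 bits of a 32-bit word; the exact bit layout in Squeak may differ, and the language code values used here are made up for illustration:

```python
LANGUAGE_SHIFT = 24            # assumption: language code in the top 8 bits
CODE_POINT_MASK = 0x00FFFFFF   # the remaining bits hold the code point


def pack(language_code, code_point):
    """Combine a language code and a Unicode code point into one 32-bit word."""
    assert 0 <= language_code <= 0xFF
    assert 0 <= code_point <= 0x10FFFF
    return (language_code << LANGUAGE_SHIFT) | code_point


def language_of(word):
    return word >> LANGUAGE_SHIFT


def code_point_of(word):
    return word & CODE_POINT_MASK


JAPANESE, CHINESE = 5, 6       # hypothetical language codes, illustration only
word_ja = pack(JAPANESE, 0x76F4)   # a unified Han code point
word_zh = pack(CHINESE, 0x76F4)

print(hex(word_ja), hex(word_zh))
print(code_point_of(word_ja) == code_point_of(word_zh))
```

The two words carry the same code point but are not equal as integers, which is one reason the "pure Unicode" view and the stored words do not coincide.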

For dealing with encodings perhaps it would be useful to wrap a  
ByteArray with a codec into an EncodedString - that way encoded data  
could be passed from a webserver and back unmodified. #asString would  
use the codec to convert to a proper String, which might also be used  
for displaying that EncodedString. I'd not actually make it a String  
subclass so perhaps a name other than EncodedString would be better.
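
The wrapper idea above could look roughly like this. It is a Python sketch, not Squeak code; the class name `EncodedText` and its method names are hypothetical, and a codec name passed to Python's built-in decoder stands in for the codec object:

```python
class EncodedText:
    """Hypothetical wrapper: raw bytes plus the name of the codec that explains them."""

    def __init__(self, raw, codec):
        self.raw = bytes(raw)
        self.codec = codec

    def as_string(self):
        # Decode only when a proper string is actually needed.
        return self.raw.decode(self.codec)

    def __repr__(self):
        # Displaying the object goes through the decoded string as well.
        return f"EncodedText({self.as_string()!r}, codec={self.codec!r})"


incoming = EncodedText(b"Miv\xc5\xa1ek", "utf-8")   # e.g. bytes from a web request
outgoing = incoming.raw                             # passed back unmodified
decoded = incoming.as_string()
print(repr(incoming))
```

The wrapper is deliberately not a string subclass here, matching the reservation expressed above: it holds encoded data and can produce a string, but it is not itself one.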

My €/50 ...

- Bert -




Re: UTF8 Squeak

Michael Rueger-4
Bert Freudenberg wrote:

> Also, the "purity" of Unicode strings does not translate directly into

Speaking of which...
In Sophie we discovered that Squeak still uses a not quite unicode
mapping in the text converters so we had to roll our own translation
tables (see Sophie-RTF package for the complete list).
As an example the MacRoman conversion below.

Probably to keep the font machinery happy?

Comments? Explanations? Corrections?

Michael


Squeak
#(196 197 199 201 209 214 220 225 224 226 228 227 229 231 233 232 234
235 237 236 238 239 241 243 242 244 246 245 250 249 251 252 134 176 162
163 167 149 182 223 174 169 153 180 168 128 198 216 129 177 138 141 165
181 142 143 144 154 157 170 186 158 230 248 191 161 172 166 131 173 178
171 187 133 160 192 195 213 140 156 150 151 147 148 145 146 247 179 255
159 185 164 139 155 188 189 135 183 130 132 137 194 202 193 203 200 205
206 207 204 211 212 190 210 218 219 217 208 136 152 175 215 221 222 184
240 253 254 )

Sophie
#(
196 197 199 201 209 214 220 225 224 226 228 227 229 231 233 232
234 235 237 236 238 239 241 243 242 244 246 245 250 249 251 252
8224 176 162 163 167 8226 182 223 174 169 8482 180 168 8800 198 216
8734 177 8804 8805 165 181 8706 8721 8719 960 8747 170 186 937 230 248
191 161 172 8730 402 8776 8710 171 187 8230 160 192 195 213 338 339
8211 8212 8220 8221 8216 8217 247 9674 255 376 8260 8364 8249 8250 64257 64258
8225 183 8218 8222 8240 194 202 193 203 200 205 206 207 204 211 212
63743 210 218 219 217 305 710 732 175 728 729 730 184 733 731 711 256)

Re: UTF8 Squeak

Yoshiki Ohshima
  Michael,

> Speaking of which...
> In Sophie we discovered that Squeak still uses a not quite unicode
> mapping in the text converters so we had to roll our own translation
> tables (see Sophie-RTF package for the complete list).
> As an example the MacRoman conversion below.
>
> Probably to keep the font machinery happy?
>
> Comments? Explanations? Corrections?

  That was to keep the image on the Windows VM happy.  The table is the
"reverse" of the one in the Windows VM.  It was wrong, but it was
byte-to-byte and lossless.

  What should happen is to have two (or more) tables: one to compensate
for the table in the Windows VM, and another that is the correct one,
like the one you have.

-- Yoshiki

Re: UTF8 Squeak

Bert Freudenberg
In reply to this post by Michael Rueger-4

On Jun 12, 2007, at 18:32 , Michael Rueger wrote:

> Bert Freudenberg wrote:
>
>> Also, the "purity" of Unicode strings does not translate directly  
>> into
>
> Speaking of which...
> In Sophie we discovered that Squeak still uses a not quite unicode  
> mapping in the text converters so we had to roll our own  
> translation tables (see Sophie-RTF package for the complete list).
> As an example the MacRoman conversion below.
>
> Probably to keep the font machinery happy?
>
> Comments? Explanations? Corrections?

That's just historical ... about 1999. It is a reversible mapping  
between MacRoman and Latin1 which maps the characters common to both  
to their counterparts, but the rest is just filled up to preserve  
different codes while fitting into a byte. Guess we should use a  
table like Sophie's nowadays.
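
The reversibility claim can be checked mechanically. The sketch below is Python, not Squeak; it copies the "Squeak" table quoted in this message and assumes that the first entry corresponds to MacRoman byte 128:

```python
# The "Squeak" table as quoted in this thread, 16 entries per row.
squeak_table = [
    196, 197, 199, 201, 209, 214, 220, 225, 224, 226, 228, 227, 229, 231, 233, 232,
    234, 235, 237, 236, 238, 239, 241, 243, 242, 244, 246, 245, 250, 249, 251, 252,
    134, 176, 162, 163, 167, 149, 182, 223, 174, 169, 153, 180, 168, 128, 198, 216,
    129, 177, 138, 141, 165, 181, 142, 143, 144, 154, 157, 170, 186, 158, 230, 248,
    191, 161, 172, 166, 131, 173, 178, 171, 187, 133, 160, 192, 195, 213, 140, 156,
    150, 151, 147, 148, 145, 146, 247, 179, 255, 159, 185, 164, 139, 155, 188, 189,
    135, 183, 130, 132, 137, 194, 202, 193, 203, 200, 205, 206, 207, 204, 211, 212,
    190, 210, 218, 219, 217, 208, 136, 152, 175, 215, 221, 222, 184, 240, 253, 254,
]

# A reversible byte-to-byte mapping must hit every value 128..255 exactly once.
is_permutation = sorted(squeak_table) == list(range(128, 256))

# Build the inverse table (assumption: index 0 corresponds to byte 128).
inverse = {target: 128 + i for i, target in enumerate(squeak_table)}
round_trip_ok = all(inverse[squeak_table[b - 128]] == b for b in range(128, 256))

print(len(squeak_table), is_permutation, round_trip_ok)
```

The table is a permutation of the byte values 128 to 255, so it loses no information. The Sophie table cannot have that property within a byte, because it contains values above 255.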


> Michael
>
>
> Squeak
> #(196 197 199 201 209 214 220 225 224 226 228 227 229 231 233 232  
> 234 235 237 236 238 239 241 243 242 244 246 245 250 249 251 252 134  
> 176 162 163 167 149 182 223 174 169 153 180 168 128 198 216 129 177  
> 138 141 165 181 142 143 144 154 157 170 186 158 230 248 191 161 172  
> 166 131 173 178 171 187 133 160 192 195 213 140 156 150 151 147 148  
> 145 146 247 179 255 159 185 164 139 155 188 189 135 183 130 132 137  
> 194 202 193 203 200 205 206 207 204 211 212 190 210 218 219 217 208  
> 136 152 175 215 221 222 184 240 253 254 )
>
> Sophie
> #(
> 196 197 199 201 209 214 220 225 224 226 228 227 229 231 233 232
> 234 235 237 236 238 239 241 243 242 244 246 245 250 249 251 252
> 8224 176 162 163 167 8226 182 223 174 169 8482 180 168 8800 198 216
> 8734 177 8804 8805 165 181 8706 8721 8719 960 8747 170 186 937 230 248
> 191 161 172 8730 402 8776 8710 171 187 8230 160 192 195 213 338 339
> 8211 8212 8220 8221 8216 8217 247 9674 255 376 8260 8364 8249 8250  
> 64257 64258
> 8225 183 8218 8222 8240 194 202 193 203 200 205 206 207 204 211 212
> 63743 210 218 219 217 305 710 732 175 728 729 730 184 733 731 711 256)
>

- Bert -



Re: UTF8 Squeak

Michael Rueger-4
In reply to this post by Yoshiki Ohshima
Yoshiki Ohshima wrote:

>   That was to keep the image on Windows VM happy.  The table is the
> "reverse" of the one in the Windows VM.  It was wrong, but it was a
> byte to byte and lossless.
>
>   What should happen is to have two (or more) table. One is compensate
> the table in Windows VM, and another is the correct one like you have.

Hmm, or use the new Unicode VM instead of compensating one symptom fix
with another? The above doesn't yield correct input behavior for some
characters and with the newest Unicode VM we should be able to get rid
of it?

Michael

Re: UTF8 Squeak

Nicolas Cellier-3
Michael Rueger a écrit :

> Yoshiki Ohshima wrote:
>
>>   That was to keep the image on Windows VM happy.  The table is the
>> "reverse" of the one in the Windows VM.  It was wrong, but it was a
>> byte to byte and lossless.
>>
>>   What should happen is to have two (or more) table. One is compensate
>> the table in Windows VM, and another is the correct one like you have.
>
> Hmm, or use the new Unicode VM instead of compensating one symptom fix
> with another? The above doesn't yield correct input behavior for some
> characters and with the newest Unicode VM we should be able to get rid
> of it?
>
> Michael
>
>

Both Bert and Yoshiki are right, but to be more precise, this is the
Latin-1 code page from Windows (CP1252), see
http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx

So macToSqueak and squeakToMac should rather be named macRomanToCP1252
and cp1252toMacRoman...

So yes, I vote like Michael: fix it!

Otherwise, Squeak is not Unicode internally for some of the characters
from 128 to 255, and this contradicts the basic assumptions made in this
thread and by naive readers like me...

Nicolas


Re: UTF8 Squeak

Nicolas Cellier-3
nicolas cellier a écrit :
>
> Otherwise, Squeak is not unicode internally for some of the characters
> from 128 to 255, and this contradict this basic assumptions made in this
> thread and by naive readers like me...
>
> Nicolas
>

Spoke too fast: in fact, only characters 16r80 to 16r9F are different
from Unicode. And these codes are free in Unicode/ISO 10646...
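
The range in which CP1252 and the first 256 Unicode code points disagree can be checked with a few lines. This is a Python sketch that relies on Python's built-in `cp1252` and `latin-1` codecs; it says nothing about the Squeak tables themselves:

```python
differing = []
for b in range(256):
    latin1 = bytes([b]).decode("latin-1")   # same as Unicode code point b
    try:
        cp1252 = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        cp1252 = None                       # byte not assigned in this codec
    if cp1252 != latin1:
        differing.append(b)

print(f"{len(differing)} differing bytes, "
      f"from 16r{min(differing):02X} to 16r{max(differing):02X}")
```

Every byte outside 16r80 to 16r9F decodes to the same code point in both, which matches the correction above.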

I'm still in favour of true unicode internally, but there is also a
pragmatic point of view... Up to gurus to decide...

Nicolas


Re: UTF8 Squeak

Yoshiki Ohshima
In reply to this post by Nicolas Cellier-3
  Nicolas,

> Both Bert and Yoshiki are right, but to be more precise, this is the
> latin1 code page from dos (CP1252) see
> http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx
>
> So macToSqueak and squeakToMac should rather be named macRomanToCP1252
> and cp1252toMacRoman...

  Yes.  That is more sensible.

> So yes, i vote like Michael, fix it!
>
> Otherwise, Squeak is not unicode internally for some of the characters
> from 128 to 255, and this contradict this basic assumptions made in this
> thread and by naive readers like me...

  Otherwise?  I'm not sure if this statement is true.  I would phrase
it this way: Squeak is Unicode internally, but some conversions that
happen at the image boundary have bugs.  If you create a character
within the range of 160 to 255, you get the right/acceptable glyph for
the character.

-- Yoshiki

Re: UTF8 Squeak

Janko Mivšek
In reply to this post by Andreas.Raab
Hi Andreas,

Here is an analysis of one VW "image as a database", which runs a
document portal for the quality management system of one of our
pharmaceutical distributors.

image size: 113MB

                instances   size total   avg size    byte size
ByteString        355.068    8.562.097         24    9.982.369
TwoByteString      19.848    5.372.602        541   10.824.596

If I remember correctly, byte-indexed objects have a 4-byte header in VW,
therefore:

byte size = 4 bytes per header + size (2*size for TwoByteString)

This should also be rounded up to a multiple of 4 bytes, which I ignored
for now.

Strings therefore account for approx. 20% of the whole image. If 4B
strings were used instead of 2B ones, the increase in string space would
be:

        byte size with 2BString: 20.806.965
        byte size with 4BString: 31.552.169
                     increase %:        52%

So, a 2x bigger image was really an exaggerated statement, but you can
see from those results that the image would grow quite considerably if
4B strings were used instead of 2B ones.
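
The figures above can be reproduced with a short calculation. This is a Python sketch that uses only the instance counts, character totals and the 4-byte header recalled in this message; rounding to 4-byte boundaries is ignored, as it is above:

```python
HEADER = 4  # bytes per object header, as recalled in the message above

byte_strings = {"instances": 355_068, "chars": 8_562_097}
two_byte_strings = {"instances": 19_848, "chars": 5_372_602}


def byte_size(stats, bytes_per_char):
    """Header plus character payload; alignment padding is ignored."""
    return HEADER * stats["instances"] + bytes_per_char * stats["chars"]


with_2b = byte_size(byte_strings, 1) + byte_size(two_byte_strings, 2)
with_4b = byte_size(byte_strings, 1) + byte_size(two_byte_strings, 4)
growth = with_4b - with_2b
increase = growth / with_2b

print(f"with 2B strings: {with_2b:,} bytes")
print(f"with 4B strings: {with_4b:,} bytes")
print(f"increase:        {increase:.0%} of string space ({growth:,} bytes)")
```

The growth is about 10.7 million bytes. Measured against the 113 MB image rather than against string space alone, that is on the order of 10%.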

Best regards
Janko


Andreas Raab wrote:

> Janko Mivšek wrote:
>>> so, all of this talk is for about 4 MB extra (in that image squeak
>>> take 26.8 MB at startup)?.
>>
>> Consider image as a database where you store strings from your
>> application. In that case space efficient but still manipulable
>> strings really matter. For instance, I run one 380MB VW image full of
>> TwoByteStrings and this image would probably have 760M with only
>> FourByteStrings ...
>
> Actually, I would be very interested in a more accurate answer than
> "probably" since the 2x answer assumes that the whole image consists of
> 2-byte strings and that there is zero overhead for headers etc. both of
> which is obviously not the case. If you wouldn't mind, could you run a
> little script that computes the number of characters that are actually
> stored as 2 bytes? Something like:
>
>   TwoByteString allInstances inject: 0 into:[:sum :str| sum + str size].
>
> This strictly counts the number of characters that "matter", i.e., that
> are affected by an encoding change and I'd be interested in getting some
> data point about how that looks in a real application (e.g., whether
> that is in the 10%, 25%, or 50% range). In particular considering that
> VW probably uses the most compact form by default and that there is
> probably quite a bit of application code running and that there is
> probably more than just strings to keep in the data, I'm really curious
> how much of that ends up to be relevant for the 2-byte encoding.
>
> Thanks,
>   - Andreas
>
>

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si

Re: UTF8 Squeak

Colin Putney
In reply to this post by Bert Freudenberg
On Jun 12, 2007, at 4:28 AM, Bert Freudenberg wrote:


> On Jun 12, 2007, at 8:29 , Colin Putney wrote:
>
>
>> Your proposal is actually to have strings encoded as ISO 8859-1,  
>> UCS-2 or UCS-4.
>>
>
> Actually, the idea is that a String has Unicode throughout, with no  
> encoding. A string is simply a flat array of Unicode code points.
>
> To optimize space usage we choose the lowest number of bytes per  
> character that can encompass all code points in a String. This is  
> implemented as specialized subclasses of String. So for code points  
> below 256 we use ByteString (8 bit per char), for all others  
> WideString (32 bits per char). This is purely space optimization,  
> not a change in encoding.
>

Yes, I understand how m17n was implemented in Squeak. I'm trying to  
challenge one of the ideas that underlies Janko's proposal, which you  
lay out beautifully above: "String has Unicode throughout, with no  
encoding." And again at the end: "This is purely space optimization,  
not a change in encoding."

If a String were a flat array of Unicode code points, it would be  
implemented in Smalltalk as an array of Characters wouldn't it? The  
fact that you've chosen to hide the internal representation of the  
string and use a "variable byte" or "variable word" subclass to store  
bytes, rather than objects, is an indication that the strings *are*  
encoded. In fact, the encodings have names: ISO 8859-1 and UCS-4.  
Janko is proposing to add a string class that internally stores  
strings encoded in UCS-2 to the mix.

So what's so holy about these particular encodings, besides the fact  
that they're especially efficient on the VisualWorks VM?

Colin


Re: UTF8 Squeak

Masashi UMEZAWA-2
In reply to this post by Janko Mivšek
Hi Janko,

> >> 1. internally everything is in 16bit Unicode, without any additionally
> >>     encoding info attached to strings
> >
> >   If they use 16-bit per char, how do they deal with surrogated pairs?
>
> I looked once again and there is actually a FourByteString too. This
> probably answer your question. VW also support Japanese locale well.
>

Just a correction: VW does not support "surrogate pairs" well. A
Character whose value is greater than 65535 would easily crash the
image. Here is a quote from the Character class comment.
--
For character codes between 0 and 65535 (16rFFFF), the Unicode
Character Code Standard is used.  Characters with codes between 0 and
255 also coincide with the ISO 8859-1 standard. At present, mappings
for Characters greater than 65535 are undefined, and such characters
are not fully supported. In time, these will probably be defined to
conform to the ISO 10646 superset of Unicode.
 ---

In VW, a Japanese string is represented as a TwoByteString, so it
cannot handle some Japanese characters. (But in practice, in most cases,
that is enough. And it is also good for reducing memory consumption.)
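
For readers unfamiliar with the term, the following Python sketch shows what a surrogate pair is: a code point above 65535 does not fit in one 16-bit unit, so UTF-16 splits it into two. The sample code point is simply one CJK ideograph outside the BMP, chosen for illustration:

```python
import struct

ch = "\U00020BB7"                       # a CJK ideograph outside the BMP
encoded = ch.encode("utf-16-be")        # two 16-bit units, four bytes
high, low = struct.unpack(">HH", encoded)

# The same pair computed by hand from the code point.
offset = ord(ch) - 0x10000
manual_high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
manual_low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset

print(f"U+{ord(ch):05X} -> {high:04X} {low:04X}")
```

A string class with exactly 16 bits per character has to either reject such characters or treat the two units as one character, which is where the trouble described above comes from.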

Regards,
--
[:masashi | ^umezawa]
