Hi, Michael,
> > - Suppose you would like to use different line wrapping algorithms
> > for different languages, how would you keep that information?
>
> The question is which, if any, language dependent (text layout?!)
> attributes should be encoded into the String rather than kept as text
> attributes.

Exactly. I laid out my proposal for "a new system" several emails ago. In that scheme, raw data cannot really be displayed without the text attributes.

The question was whether it was practical to retrofit that idea into the Squeak system, where gross assumptions were made around Strings (including being able to be a symbol, being able to be displayed, etc.). If you would like to display any given String in a reasonable way in Squeak, you need to keep that information with the String.

-- Yoshiki
In reply to this post by Yoshiki Ohshima
On 11/06/2007, at 13:27, Yoshiki Ohshima wrote:
> Janko,
>
>> It seems that this was already a Yoshiki idea with WideString, so I'm
>> just extending that idea with a TwoByteString to cover 16 bits too.
>>
>> Yoshiki, am I right?
>
> For storing the bare Unicode code points, I think so. I'm not
> convinced that adding a 16-bit variation solves any real problems. But
> there may be something.
>
> My first few questions are:
>
> - While the vast majority of strings for, say, Japanese can be
> represented with characters in the BMP, you would use
> FourByteString for Chinese/Japanese/Korean and some others. Does
> this mean that you would *always* use FourByteString for these
> "languages" (and not scripts)?
>
> - Suppose you would like to use different line wrapping algorithms
> for different languages, how would you keep that information?
>
> -- Yoshiki

About 2 months ago the OpenMCL mailing list had this UTF16 vs. UTF32 discussion:

> how many angels can dance on a unicode character?
> http://thread.gmane.org/gmane.lisp.openmcl.devel/1756/focus=1763

Gary Byers (the OpenMCL developer) finishes with this conclusion:

> If these numbers are roughly accurate and if the sketch of what
> a displaced SIMPLE-STRING object would look like is realistic,
> then I'd say that using UTF-16 to represent arbitrary Unicode
> characters in a realistic way costs about as much memory-wise
> as using UTF-32 does, is somewhat slower in the simplest cases
> and much slower in general, has very complex boundary
> cases once we step outside the BMP, and just generally doesn't
> seem to have many socially-redeeming qualities that I can see.

Perhaps in Squeak it is different (no alignment?), but if I doIt:

  (ByteString allInstances collect: [:s | s size]) sum asFloat

in a 3.8.1 basic image, I obtain: 1.943098e6 (63672 strings at 30.5 bytes average).

So, all of this talk is for about 4 MB extra (in that image Squeak takes 26.8 MB at startup)?
On Mon, 11 Jun 2007 21:29:54 +0200, Javier Diaz-Reinoso wrote:
...
> perhaps in Squeak is different (no alignment?), but if I doIt:
> (ByteString allInstances collect: [:s | s size]) sum asFloat (in a 3.8.1
> basic image), I obtain:
>
> 1.943098e6, (63672 strings at 30.5 bytes average)

You may want to add ByteSymbols: 648278.0 / 38156 = 16.99 (characters / instances = average size), from a squeak-dev image.

/Klaus

> so, all of this talk is for about 4 MB extra (in that image squeak takes
> 26.8 MB at startup)?
In reply to this post by Colin Putney
Hi Colin,
Colin Putney wrote:
>
> On Jun 11, 2007, at 4:04 AM, Janko Mivšek wrote:
>
>> Anyone can definitively stay with UTF8 encoded strings in plain
>> ByteString or subclass to UTF8String by himself. But I don't know why
>> we need to have UTF8String as part of the string framework. Just because
>> of meaning? Then we also need to introduce an ASCIIString :)
>
>> I think that preserving simplicity is also an important goal. We need
>> to find a general yet simple solution for Unicode Strings, which will
>> be good enough for most uses, as is the case for numbers for instance.
>> We deal with more special cases separately. I claim that pure Unicode
>> strings in Byte, TwoByte or FourByteString is such a general support.
>> UTF8String is already a specific one.
>
> Ok, so what you're saying is this: ByteString, TwoByteString and
> FourByteString are good enough for most purposes. Web developers and
> anyone else that needs to work with other encodings should roll their
> own solutions, so as not to burden the rest of the community with
> clutter caused by support for other encodings, or even hooks to make
> such things easy to integrate with the base string code.
>
> Is that a fair characterization of your position?

Yes, or just a bit better said: my position is a separation of internal string representation from encodings. Internal strings should be in pure Unicode while conversions to other encodings should be done separately, probably best with the already existing TextEncoders. Those text encoders can be extended to meet wider requirements, but strings shall stay strings - they shall contain characters only.

By the way, I'm a web developer too and porting Aida to Squeak actually started my interest in Unicode support here :)

Best regards
Janko

--
Janko Mivšek
AIDA/Web Smalltalk Web Application Server
http://www.aidaweb.si
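To make that boundary conversion concrete, here is a small workspace-style sketch (not Squeak's actual converter code): incoming UTF-8 bytes are decoded once, at the edge, into a plain String of Unicode characters, so the String itself carries no encoding information. It handles only 1- to 3-byte sequences, ignores invalid input, and assumes that storing a character with a code point above 255 widens the string automatically, as discussed elsewhere in this thread.

  | utf8 in out |
  utf8 := #(74 97 110 107 111 32 77 105 118 197 161 101 107).  "'Janko Mivšek' as UTF-8 bytes"
  in := ReadStream on: utf8.
  out := WriteStream on: String new.
  [in atEnd] whileFalse: [
      | b code |
      b := in next.
      code := b < 16r80
          ifTrue: [b]
          ifFalse: [b < 16rE0
              ifTrue: [((b bitAnd: 16r1F) bitShift: 6) + (in next bitAnd: 16r3F)]
              ifFalse: [((b bitAnd: 16r0F) bitShift: 12)
                  + ((in next bitAnd: 16r3F) bitShift: 6)
                  + (in next bitAnd: 16r3F)]].
      out nextPut: (Character value: code)].
  out contents  "a plain (Byte- or Wide-) String containing 'Janko Mivšek'"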
In reply to this post by Yoshiki Ohshima
Hi Yoshiki,
Yoshiki Ohshima wrote:
> Janko,
>
>> It seems that this was already a Yoshiki idea with WideString, so I'm
>> just extending that idea with a TwoByteString to cover 16 bits too.
>>
>> Yoshiki, am I right?
>
> For storing the bare Unicode code points, I think so. I'm not
> convinced that adding a 16-bit variation solves any real problems. But
> there may be something.

For the Slovenian language with Latin 2 script we need to have TwoByteStrings, and the same goes for all of Eastern Europe, Greek, and Cyrillic. And because I'm using an image as a database, I just cannot afford 4-byte strings... And for shorter Slovenian strings even ByteStrings suffice.

> - While the vast majority of strings for, say, Japanese can be
> represented with characters in the BMP, you would use
> FourByteString for Chinese/Japanese/Korean and some others. Does
> this mean that you would *always* use FourByteString for these
> "languages" (and not scripts)?

My proposal allows strings to "scale" to support wider characters by widening themselves, from Byte to TwoByte and then FourByteString. Determination of the width of a string is automatic (as it already is for WideString): you start with a ByteString and when you put the first character with a code point above 255, the ByteString is automatically converted to a TwoByteString or even a FourByteString. The same goes for a TwoByteString when you add a character >= 2**16. Strings therefore don't need to be aware at all of the languages they support.

> - Suppose you would like to use different line wrapping algorithms
> for different languages, how would you keep that information?

Line ends internally should be Character cr only (how is that in Squeak anyway?). Different line ends are again a responsibility of streams to the external world. What about a separate Locale object for all that language-specific information?

Best regards
Janko

--
Janko Mivšek
AIDA/Web Smalltalk Web Application Server
http://www.aidaweb.si
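For readers following along in Squeak, a tiny workspace example of the widening Janko describes; it assumes Squeak's current behaviour, where storing a character with a code point above 255 into a ByteString converts it in place to a WideString (the proposed TwoByteString would simply slot in between):

  | s |
  s := 'Zalec' copy.
  s class.                              "ByteString"
  s at: 1 put: (Character value: 381).  "Ž - code point 381, does not fit in one byte"
  s class.                              "WideString - the string widened automatically"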
In reply to this post by Javier Diaz-Reinoso
Hi Javier,
Javier Diaz-Reinoso wrote:
> About 2 months ago the OpenMCL mailing list had this UTF16 vs. UTF32
> discussion:
>> how many angels can dance on a unicode character?
>> http://thread.gmane.org/gmane.lisp.openmcl.devel/1756/focus=1763
>
> Gary Byers (the OpenMCL developer) finishes with this conclusion:
>> If these numbers are roughly accurate and if the sketch of what
>> a displaced SIMPLE-STRING object would look like is realistic,
>> then I'd say that using UTF-16 to represent arbitrary Unicode
>> characters in a realistic way costs about as much memory-wise
>> as using UTF-32 does, is somewhat slower in the simplest cases
>> and much slower in general, has very complex boundary
>> cases once we step outside the BMP, and just generally doesn't
>> seem to have many socially-redeeming qualities that I can see.
>
> Perhaps in Squeak it is different (no alignment?), but if I doIt:
> (ByteString allInstances collect: [:s | s size]) sum asFloat (in a 3.8.1
> basic image), I obtain:
>
> 1.943098e6, (63672 strings at 30.5 bytes average)
>
> so, all of this talk is for about 4 MB extra (in that image Squeak takes
> 26.8 MB at startup)?

Consider the image as a database where you store strings from your application. In that case space-efficient but still manipulable strings really matter. For instance, I run one 380MB VW image full of TwoByteStrings, and this image would probably take 760MB with only FourByteStrings...

Best regards
Janko

--
Janko Mivšek
AIDA/Web Smalltalk Web Application Server
http://www.aidaweb.si
Janko,
> Consider image as a database where you store strings from your
> application. In that case space efficient but still manipulable strings
> really matter. For instance, I run one 380MB VW image full of
> TwoByteStrings and this image would probably have 760M with only
> FourByteStrings ...

Just a thought, but if the space efficiency in "the image as database" is the biggest reason for you to add a 16-bit variation, how about you just write an optimized version of UTF16TextConverter that works well for WideString (one that converts WideString from/to ByteArray), and define #hibernate and #unhibernate methods (or equivalents) somewhere to convert strings to/from that form at image shutdown and startup time? This way, only the strings you "touch" (to display their contents on screen, etc.) get unhibernated to WideString, and the rest of the (presumably majority of) strings can stay in the 16-bit representation...

-- Yoshiki
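A rough sketch of the packing Yoshiki suggests, as workspace code rather than his proposed UTF16TextConverter; it assumes every code point in the string fits in 16 bits and ignores byte order and surrogate issues:

  | wide hibernated unhibernated |
  wide := WideString with: (Character value: 352) with: (Character value: 269).  "Š and č"
  "hibernate: pack each code point into two bytes"
  hibernated := ByteArray new: wide size * 2.
  1 to: wide size do: [:i | | code |
      code := (wide at: i) asInteger.
      hibernated at: 2 * i - 1 put: (code bitShift: -8).
      hibernated at: 2 * i put: (code bitAnd: 16rFF)].
  "unhibernate: unpack the bytes back into a WideString on demand"
  unhibernated := WideString new: hibernated size // 2.
  1 to: unhibernated size do: [:i |
      unhibernated at: i put: (Character value:
          ((hibernated at: 2 * i - 1) bitShift: 8) + (hibernated at: 2 * i))].
  unhibernated = wide  "true"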
In reply to this post by Janko Mivšek
Janko Mivšek wrote:
>> so, all of this talk is for about 4 MB extra (in that image squeak
>> take 26.8 MB at startup)?
>
> Consider image as a database where you store strings from your
> application. In that case space efficient but still manipulable strings
> really matter. For instance, I run one 380MB VW image full of
> TwoByteStrings and this image would probably have 760M with only
> FourByteStrings ...

Actually, I would be very interested in a more accurate answer than "probably", since the 2x answer assumes that the whole image consists of 2-byte strings and that there is zero overhead for headers etc., both of which are obviously not the case. If you wouldn't mind, could you run a little script that computes the number of characters that are actually stored as 2 bytes? Something like:

  TwoByteString allInstances inject: 0 into: [:sum :str | sum + str size].

This strictly counts the number of characters that "matter", i.e., that are affected by an encoding change, and I'd be interested in getting some data point about how that looks in a real application (e.g., whether that is in the 10%, 25%, or 50% range). In particular, considering that VW probably uses the most compact form by default, that there is probably quite a bit of application code running, and that there is probably more than just strings to keep in the data, I'm really curious how much of that ends up being relevant for the 2-byte encoding.

Thanks,
  - Andreas
In reply to this post by Janko Mivšek
On Jun 11, 2007, at 1:31 PM, Janko Mivšek wrote:

>> Is that a fair characterization of your position?

> Yes, or just a bit better said: my position is a separation of
> internal string representation from encodings. Internal strings
> should be in pure Unicode while conversions to other encodings
> should be done separately, probably best with already existing
> TextEncoders. Those text encoders can be extended to meet wider
> requirements, but strings shall stay strings - they shall contain
> characters only.

Well, this is progress, of a sort. What you write above would imply that Strings should be arrays of pointers to Character objects. Your proposal is actually to have strings encoded as ISO 8859-1, UCS-2 or UCS-4. That's a reasonable optimization to save space, so long as the semantics of strings are preserved - other objects can't tell what the internal representation is, because all they see are characters. But if encapsulation works for fixed length encodings, why not for UTF-8 or UTF-16?

> By the way, I'm a web developer too and porting Aida to Squeak
> actually started my interest in Unicode support here :)

Yeah, I was wondering about that. Does Aida do a whole lot of work on string buffers or something? Doesn't it use streams? Why are you so dead set against variable length encodings?

One other thing: you seem to be advocating that Squeak just adopt the same design that VisualWorks uses. VisualWorks is great, but it does have immediate Characters, which Squeak does not. That changes the design constraints a bit.

Colin
On Jun 12, 2007, at 8:29 , Colin Putney wrote:
> Your proposal is actually to have strings encoded as ISO 8859-1,
> UCS-2 or UCS-4.

Actually, the idea is that a String has Unicode throughout, with no encoding. A string is simply a flat array of Unicode code points.

To optimize space usage we choose the lowest number of bytes per character that can encompass all code points in a String. This is implemented as specialized subclasses of String. So for code points below 256 we use ByteString (8 bits per char), for all others WideString (32 bits per char). This is purely space optimization, not a change in encoding.

Now, the proposal is to use an intermediate 2-byte representation for code points below 65536. Nobody has demonstrated the general usefulness of this optimization yet. In particular, the Squeak VM does not support 16-bit arrays directly; they have to be emulated using 8-bit or 32-bit words. For the latter, prims 144 and 145 might help, but the problem of non-even length would have to be addressed.

Also, the "purity" of Unicode strings does not translate directly into the implementation, which reserves the most significant byte in a WideString word for a "language code". That byte is otherwise unused (code points range from 0 to 16r10FFFF) and is supposed to help choosing glyph shapes that share a code point but differ in appearance depending on the language. I suppose this was to restrict changes to the String hierarchy; a better place for language info would be text attributes - but then potentially a lot of code would have to be adapted to pass Texts rather than Strings. It might be worth revising that design.

For dealing with encodings perhaps it would be useful to wrap a ByteArray with a codec into an EncodedString - that way encoded data could be passed from a webserver and back unmodified. #asString would use the codec to convert to a proper String, which might also be used for displaying that EncodedString. I'd not actually make it a String subclass, so perhaps a name other than EncodedString would be better.

My €/50 ...

- Bert -
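A rough sketch of that wrapper idea - all names here are hypothetical, not existing Squeak classes, and the codec is assumed to be any object that answers #decode: with a ByteArray and returns a String (e.g. a thin wrapper around one of the existing text converters):

  Object subclass: #EncodedBytes
      instanceVariableNames: 'bytes codec'
      classVariableNames: ''
      poolDictionaries: ''
      category: 'Experimental-Encodings'

  EncodedBytes class>>bytes: aByteArray codec: aCodec
      ^ self new setBytes: aByteArray codec: aCodec

  EncodedBytes>>setBytes: aByteArray codec: aCodec
      bytes := aByteArray.
      codec := aCodec

  EncodedBytes>>byteArray
      "The raw, still-encoded bytes, e.g. to hand back to a webserver unmodified."
      ^ bytes

  EncodedBytes>>asString
      "Decode lazily, only when characters are actually needed (e.g. for display)."
      ^ codec decode: bytes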
Bert Freudenberg wrote:
> Also, the "purity" of Unicode strings does not translate directly into

Speaking of which...

In Sophie we discovered that Squeak still uses a not-quite-Unicode mapping in the text converters, so we had to roll our own translation tables (see the Sophie-RTF package for the complete list). As an example, the MacRoman conversion below.

Probably to keep the font machinery happy?

Comments? Explanations? Corrections?

Michael


Squeak
#(196 197 199 201 209 214 220 225 224 226 228 227 229 231 233 232 234 235 237 236 238 239 241 243 242 244 246 245 250 249 251 252 134 176 162 163 167 149 182 223 174 169 153 180 168 128 198 216 129 177 138 141 165 181 142 143 144 154 157 170 186 158 230 248 191 161 172 166 131 173 178 171 187 133 160 192 195 213 140 156 150 151 147 148 145 146 247 179 255 159 185 164 139 155 188 189 135 183 130 132 137 194 202 193 203 200 205 206 207 204 211 212 190 210 218 219 217 208 136 152 175 215 221 222 184 240 253 254)

Sophie
#(196 197 199 201 209 214 220 225 224 226 228 227 229 231 233 232 234 235 237 236 238 239 241 243 242 244 246 245 250 249 251 252 8224 176 162 163 167 8226 182 223 174 169 8482 180 168 8800 198 216 8734 177 8804 8805 165 181 8706 8721 8719 960 8747 170 186 937 230 248 191 161 172 8730 402 8776 8710 171 187 8230 160 192 195 213 338 339 8211 8212 8220 8221 8216 8217 247 9674 255 376 8260 8364 8249 8250 64257 64258 8225 183 8218 8222 8240 194 202 193 203 200 205 206 207 204 211 212 63743 210 218 219 217 305 710 732 175 728 729 730 184 733 731 711 256)
Michael,
> Speaking of which...
> In Sophie we discovered that Squeak still uses a not quite unicode
> mapping in the text converters so we had to roll our own translation
> tables (see Sophie-RTF package for the complete list).
> As an example the MacRoman conversion below.
>
> Probably to keep the font machinery happy?
>
> Comments? Explanations? Corrections?

That was to keep the image on the Windows VM happy. The table is the "reverse" of the one in the Windows VM. It was wrong, but it was byte-to-byte and lossless.

What should happen is to have two (or more) tables: one to compensate for the table in the Windows VM, and another that is the correct one, like yours.

-- Yoshiki
In reply to this post by Michael Rueger-4
On Jun 12, 2007, at 18:32 , Michael Rueger wrote:

> Bert Freudenberg wrote:
>
>> Also, the "purity" of Unicode strings does not translate directly
>> into
>
> Speaking of which...
> In Sophie we discovered that Squeak still uses a not quite unicode
> mapping in the text converters so we had to roll our own
> translation tables (see Sophie-RTF package for the complete list).
> As an example the MacRoman conversion below.
>
> Probably to keep the font machinery happy?
>
> Comments? Explanations? Corrections?

That's just historical ... about 1999. It is a reversible mapping between MacRoman and Latin1 which maps the characters common to both to their counterparts, but the rest is just filled up to preserve different codes while fitting into a byte. Guess we should use a table like Sophie's nowadays.

> Michael
>
> Squeak
> #(196 197 199 201 209 214 220 225 224 226 228 227 229 231 233 232
> 234 235 237 236 238 239 241 243 242 244 246 245 250 249 251 252 134
> 176 162 163 167 149 182 223 174 169 153 180 168 128 198 216 129 177
> 138 141 165 181 142 143 144 154 157 170 186 158 230 248 191 161 172
> 166 131 173 178 171 187 133 160 192 195 213 140 156 150 151 147 148
> 145 146 247 179 255 159 185 164 139 155 188 189 135 183 130 132 137
> 194 202 193 203 200 205 206 207 204 211 212 190 210 218 219 217 208
> 136 152 175 215 221 222 184 240 253 254 )
>
> Sophie
> #(
> 196 197 199 201 209 214 220 225 224 226 228 227 229 231 233 232
> 234 235 237 236 238 239 241 243 242 244 246 245 250 249 251 252
> 8224 176 162 163 167 8226 182 223 174 169 8482 180 168 8800 198 216
> 8734 177 8804 8805 165 181 8706 8721 8719 960 8747 170 186 937 230 248
> 191 161 172 8730 402 8776 8710 171 187 8230 160 192 195 213 338 339
> 8211 8212 8220 8221 8216 8217 247 9674 255 376 8260 8364 8249 8250
> 64257 64258
> 8225 183 8218 8222 8240 194 202 193 203 200 205 206 207 204 211 212
> 63743 210 218 219 217 305 710 732 175 728 729 730 184 733 731 711 256)

- Bert -
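For what it's worth, applying such a table is straightforward; here is a hypothetical sketch (the selector and the class variable MacRomanTable, assumed to hold the 128-entry Sophie array above, are made up for illustration; characters 0-127 are the same in MacRoman and Unicode, so only codes 128-255 need the lookup):

  macRomanToUnicode: aByteString
      "Answer a Unicode string for the MacRoman-encoded argument."
      | result |
      result := WideString new: aByteString size.
      1 to: aByteString size do: [:i | | code |
          code := (aByteString at: i) asInteger.
          result at: i put: (Character value:
              (code < 128 ifTrue: [code] ifFalse: [MacRomanTable at: code - 127]))].
      ^ result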
In reply to this post by Yoshiki Ohshima
Yoshiki Ohshima wrote:
> That was to keep the image on Windows VM happy. The table is the
> "reverse" of the one in the Windows VM. It was wrong, but it was a
> byte to byte and lossless.
>
> What should happen is to have two (or more) table. One is compensate
> the table in Windows VM, and another is the correct one like you have.

Hmm, or use the new Unicode VM instead of compensating one symptom fix with another? The above doesn't yield correct input behavior for some characters, and with the newest Unicode VM we should be able to get rid of it?

Michael
Michael Rueger wrote:
> Yoshiki Ohshima wrote:
>
>> That was to keep the image on Windows VM happy. The table is the
>> "reverse" of the one in the Windows VM. It was wrong, but it was a
>> byte to byte and lossless.
>>
>> What should happen is to have two (or more) table. One is compensate
>> the table in Windows VM, and another is the correct one like you have.
>
> Hmm, or use the new Unicode VM instead of compensating one symptom fix
> with another? The above doesn't yield correct input behavior for some
> characters and with the newest Unicode VM we should be able to get rid
> of it?
>
> Michael

Both Bert and Yoshiki are right, but to be more precise, this is the Windows Latin-1 code page (CP1252), see http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx

So macToSqueak and squeakToMac should rather be named macRomanToCP1252 and cp1252ToMacRoman...

So yes, I vote like Michael: fix it!

Otherwise, Squeak is not Unicode internally for some of the characters from 128 to 255, and this contradicts the basic assumptions made in this thread and by naive readers like me...

Nicolas
nicolas cellier wrote:
>
> Otherwise, Squeak is not Unicode internally for some of the characters
> from 128 to 255, and this contradicts the basic assumptions made in this
> thread and by naive readers like me...
>
> Nicolas

Spoke too fast: in fact, only characters 16r80 to 16r9F differ from Unicode. And those codes are free in Unicode/ISO 10646...

I'm still in favour of true Unicode internally, but there is also the pragmatic point of view... Up to the gurus to decide...

Nicolas
In reply to this post by Nicolas Cellier-3
Nicolas,
> Both Bert and Yoshiki are right, but to be more precise, this is the
> Windows Latin-1 code page (CP1252), see
> http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx
>
> So macToSqueak and squeakToMac should rather be named macRomanToCP1252
> and cp1252ToMacRoman...

Yes. That is more sensible.

> So yes, I vote like Michael: fix it!
>
> Otherwise, Squeak is not Unicode internally for some of the characters
> from 128 to 255, and this contradicts the basic assumptions made in this
> thread and by naive readers like me...

Otherwise? I'm not sure that statement is true. I would phrase it this way: Squeak is Unicode internally, but some of the conversions that happen at the image boundary have bugs. If you create a character within the range of 160 to 255, you get the right/acceptable glyph for the character.

-- Yoshiki
In reply to this post by Andreas.Raab
Hi Andreas,
Here is an analysis of one VW "image as a database" which runs a document portal for the quality management system of one of our pharmaceutical distributors.

image size: 113MB

                 instances   size total   avg size (bytes)    byte size
ByteString         355.068    8.562.097                 24    9.982.369
TwoByteString       19.848    5.372.602                541   10.824.596

If I remember correctly, byte-indexed objects have a 4-byte header in VW, therefore:

  byte size = 4 bytes per header + size (2*size for TwoByteString)

This should also be rounded up to 4 bytes, which I ignored for now.

Strings therefore make up approx. 20% of the whole image. If 4-byte strings were used instead of 2-byte ones, the string space increase would be:

  byte size with 2BString: 20.806.965
  byte size with 4BString: 31.552.169
  increase: 52%

So, "2x bigger image" was really an exaggerated statement, but you can see from these results that the image would grow quite extensively if 4-byte strings were used instead of 2-byte ones.

Best regards
Janko

Andreas Raab wrote:
> Janko Mivšek wrote:
>>> so, all of this talk is for about 4 MB extra (in that image squeak
>>> take 26.8 MB at startup)?
>>
>> Consider image as a database where you store strings from your
>> application. In that case space efficient but still manipulable
>> strings really matter. For instance, I run one 380MB VW image full of
>> TwoByteStrings and this image would probably have 760M with only
>> FourByteStrings ...
>
> Actually, I would be very interested in a more accurate answer than
> "probably" since the 2x answer assumes that the whole image consists of
> 2-byte strings and that there is zero overhead for headers etc. both of
> which is obviously not the case. If you wouldn't mind, could you run a
> little script that computes the number of characters that are actually
> stored as 2 bytes? Something like:
>
>   TwoByteString allInstances inject: 0 into: [:sum :str | sum + str size].
>
> This strictly counts the number of characters that "matter", i.e., that
> are affected by an encoding change and I'd be interested in getting some
> data point about how that looks in a real application (e.g., whether
> that is in the 10%, 25%, or 50% range). In particular considering that
> VW probably uses the most compact form by default and that there is
> probably quite a bit of application code running and that there is
> probably more than just strings to keep in the data, I'm really curious
> how much of that ends up to be relevant for the 2-byte encoding.
>
> Thanks,
> - Andreas

--
Janko Mivšek
AIDA/Web Smalltalk Web Application Server
http://www.aidaweb.si
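For comparison, a rough Squeak-side doIt in the same spirit, with WideString standing in for TwoByteString and an assumed flat 8-byte per-object header; real Squeak header sizes vary with the object, so the byte totals are only approximate:

  | headerBytes report |
  headerBytes := 8.
  report := WriteStream on: String new.
  {ByteString. WideString} do: [:cls | | instances chars bytes |
      instances := cls allInstances.
      chars := instances inject: 0 into: [:sum :s | sum + s size].
      bytes := (chars * (cls == WideString ifTrue: [4] ifFalse: [1]))
          + (instances size * headerBytes).
      report
          nextPutAll: cls name; nextPutAll: ': ';
          print: instances size; nextPutAll: ' instances, ';
          print: chars; nextPutAll: ' characters, ~';
          print: bytes; nextPutAll: ' bytes'; cr].
  report contents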
In reply to this post by Bert Freudenberg
On Jun 12, 2007, at 4:28 AM, Bert Freudenberg wrote:
> On Jun 12, 2007, at 8:29 , Colin Putney wrote:
>
>> Your proposal is actually to have strings encoded as ISO 8859-1,
>> UCS-2 or UCS-4.
>
> Actually, the idea is that a String has Unicode throughout, with no
> encoding. A string is simply a flat array of Unicode code points.
>
> To optimize space usage we choose the lowest number of bytes per
> character that can encompass all code points in a String. This is
> implemented as specialized subclasses of String. So for code points
> below 256 we use ByteString (8 bits per char), for all others
> WideString (32 bits per char). This is purely space optimization,
> not a change in encoding.

Yes, I understand how m17n was implemented in Squeak. I'm trying to challenge one of the ideas that underlies Janko's proposal, which you lay out beautifully above: "String has Unicode throughout, with no encoding." And again at the end: "This is purely space optimization, not a change in encoding."

If a String were a flat array of Unicode code points, it would be implemented in Smalltalk as an array of Characters, wouldn't it? The fact that you've chosen to hide the internal representation of the string and use a "variable byte" or "variable word" subclass to store bytes, rather than objects, is an indication that the strings *are* encoded. In fact, the encodings have names: ISO 8859-1 and UCS-4. Janko is proposing to add a string class that internally stores strings encoded in UCS-2 to the mix.

So what's so holy about these particular encodings, besides the fact that they're especially efficient on the VisualWorks VM?

Colin
In reply to this post by Janko Mivšek
Hi Janko,
> >> 1. internally everything is in 16bit Unicode, without any additional
> >> encoding info attached to strings
> >
> > If they use 16 bits per char, how do they deal with surrogate pairs?
>
> I looked once again and there is actually a FourByteString too. This
> probably answers your question. VW also supports the Japanese locale well.

Just for correction: VW does not support surrogate pairs well. A Character whose value is greater than 65535 would easily crash the image.

This is a quote of the Character class comment:

--
For character codes between 0 and 65535 (16rFFFF), the Unicode Character Code Standard is used. Characters with codes between 0 and 255 also coincide with the ISO 8859-1 standard. At present, mappings for Characters greater than 65535 are undefined, and such characters are not fully supported. In time, these will probably be defined to conform to the ISO 10646 superset of Unicode.
---

In VW, a Japanese string is represented as a TwoByteString, so it cannot handle some Japanese characters. (But practically, in most cases, it is enough. And it is also good for reducing memory consumption.)

Regards,
--
[:masashi | ^umezawa]