Janko,
> > So, the question to you is that if you have a system with 8-bit
> > ByteString and 32-bit WideString in year 2007, would you add a class
> > to represent 16-bit string to that system?
>
> I would say yes, because for most countries 16-bit is enough and 32-bit
> is then just a waste of memory. And I just noticed that WideString is
> actually fixed to 4 bytes. I would therefore think about renaming it to
> FourByteString and add TwoByteString (or similar names). For users these
> are always Strings anyway, just as SmallIntegers and LargeIntegers are
> always Integers.

Similar deal in Squeak, too. The system does the automatic coercion
between WideString and ByteString, and the user doesn't have to deal
with them most of the time.

Adding 16-bit is surely an option. At the same time, there is a similar
but different point of view: "because for most users 8-bit is enough,
the 32-bit version is not used that frequently anyway". There is no
"right" answer, just different trade-offs. (That is why this problem is
interesting^^;)

Actually, adding a more general character object that doesn't rely on a
particular bit-representation (and therefore can go beyond 32 bits), and
making strings be arrays of such characters, would be better eventually.

-- Yoshiki
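A quick workspace illustration of the coercion described above; class
names are from Squeak 3.x, and exactly which operations widen varies
between versions, so treat this as a sketch rather than a specification:

    | wide |
    wide := WideString with: (Character value: 955).  "GREEK SMALL LETTER LAMDA"
    wide class.              "WideString"
    'abc' class.             "ByteString"
    ('abc' , wide) class.    "WideString -- concatenation promotes the result"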
In reply to this post by Yoshiki Ohshima
<Alan L>
Each String object should specify its encoding scheme. UTF-8 should be
the default, but all commonly-encountered encodings should be supported,
and should all be usable at once (in different String instances.) When a
Character is reified from a String, it should use the Unicode code point
values (full 32-bit value.) Ideally, the encoding of a String should be
a function of an associated Strategy object, and not be based on having
different subclasses of String.
</Alan L>

<Yoshiki>
Is this better than using UTF-32 throughout the image for all Strings?
One reason would be that for some chars in domestic encodings, the
round-trip conversion is not exactly guaranteed; so you can avoid that
problem in this way. But other than that, encodings only matter when
the system is interfacing with the outside world. So, the internal
representation can be uniform, I think. Would you write all comparison
methods for all combinations of different encodings?
</Yoshiki>

Well, perhaps UTF-32 would be a better default, now that I think about
it--due to performance issues for accessing characters at an index. But
using 32-bit-wide or 16-bit-wide strings internally as the only option
would be a waste of memory in many situations, especially for the
"Latin-1" languages.

Having String instances that use specified encodings enables one to
avoid doing conversions unless and until it's needed. It also makes it
easy to deal with the data as it will actually exist when persisted, or
when transported over the network. And it makes it easier to handle the
host platform's native character encodings (there may be more than one,)
or the character encodings used by external libraries or applications
that either offer callpoints to, or consume callpoints from, a Squeak
process. It also documents the encoding used by each String.

If all Strings use UTF-32, and are only converted to other encodings by
the VM, how does one write Smalltalk code to convert text from one
character encoding to another? I'd rather not make character encodings
yet another bit of magic that only the VM can do.

It is already the case that accessing individual characters from a
String results in the reification of a Character object. So, leveraging
what is already the case, conversion to/from the internal encoding to
the canonical (Unicode) encoding should occur when a Character object is
reified from an encoded character in a String (or in a Stream.)
Character objects that are "put:" into a String would be converted from
the Unicode code point to the encoding native to that String. Using
Character reification to/from Unicode as the unification mechanism
provides the illusion that all Strings use the same code points for
their characters, even though they in fact do not.

Of course, for some encodings (such as UTF-8) there would probably be a
performance penalty for accessing characters at an arbitrary index
("aString at: n.") But there may be good ways to mitigate that, using
clever implementation tricks (caveat: I haven't actually tried it.)
However, with my proposal, one is free to use UTF-16 for all Strings, or
UTF-32 for all Strings, or ASCII for all Strings--based on one's space
and performance constraints, and based on the character repertoire one
needs for one's user base. And the conversion to UTF-16 or UTF-32 (or
whatever) can be done when the String is read from an external Stream
(using the VW stream decorator approach, for example.)
The ASCII encoding would be good for the multitude of legacy
applications that are English-only. ISO 8859-1 would be best for
post-1980s/pre-UTFx legacy applications that have to deal with
non-English languages, or have to deal with either HTML or pre-Vista
Windows. UTF-x would be best for most other situations.

--Alan
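A minimal sketch of the design Alan outlines. The class EncodedString,
its instance variables (encoding, bytes) and the strategy protocol are
all hypothetical names invented here for illustration:

EncodedString>>at: index
    "Reify a Character carrying a canonical Unicode code point,
    whatever this string's internal encoding happens to be."
    ^Character value: (encoding codePointAt: index in: bytes)

EncodedString>>at: index put: aCharacter
    "Translate the Unicode code point back into this string's own
    encoding on the way in."
    encoding atIndex: index in: bytes putCodePoint: aCharacter asInteger.
    ^aCharacter

EncodedString>>asEncoding: anEncodingStrategy
    "Re-encode by round-tripping each character through Unicode."
    | copy |
    copy := EncodedString new: self size encoding: anEncodingStrategy.
    1 to: self size do: [:i | copy at: i put: (self at: i)].
    ^copy

Comparison and collation could then be written once against code points,
instead of once per pair of encodings, which is one answer to Yoshiki's
combinatorial worry above.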
In reply to this post by J J-6
<Alan L>UTF-8 should be the default</Alan L>
<J J (Jason)>
Wouldn't that be a pretty big speed impact given how much strings are used?
</J J (Jason)>

Now that I think about it, that could very well be the case. There might
be clever ways to make the impact much less than one might otherwise
expect (for example, RunArrays were a clever way to make Text objects
reasonably efficient)--but I haven't actually implemented it, so there's
no guarantee.

So, perhaps the default internal String encoding should be UTF-32,
instead of UTF-8 or UTF-16, in order to avoid the performance issue. But
that raises a memory usage issue--which is the primary reason I don't
think a "one size fits all" approach is sufficient.

--Alan
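One of those clever ways might be caching the byte offset of the most
recently accessed character, so the common sequential access pattern
stays cheap. A sketch only -- the selectors byteOffsetAfter:,
byteOffsetOfIndex: and decodeCharacterAt: are invented, and nothing like
this exists in the image:

at: index
    "Answer the character at index. Resume from the cached
    (lastIndex, lastByteOffset) pair when access is sequential;
    otherwise fall back to a scan from the start of the bytes."
    index = (lastIndex + 1)
        ifTrue: [lastByteOffset := self byteOffsetAfter: lastByteOffset]
        ifFalse: [lastByteOffset := self byteOffsetOfIndex: index].
    lastIndex := index.
    ^self decodeCharacterAt: lastByteOffset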
In reply to this post by Yoshiki Ohshima
On Thu, Jun 07, 2007 at 08:16:21PM -0700, Alan Lovejoy wrote:
> It is already the case that accessing individual characters from a
> String results in the reification of a Character object. So,
> leveraging what is already the case, conversion to/from the internal
> encoding to the canonical (Unicode) encoding should occur when a
> Character object is reified from an encoded character in a String (or
> in a Stream.) Character objects that are "put:" into a String would be
> converted from the Unicode code point to the encoding native to that
> String. Using Character reification to/from Unicode as the unification
> mechanism provides the illusion that all Strings use the same code
> points for their characters, even though they in fact do not.

Someone already mentioned the way Plan 9 did this, and provided a link,
which I read, and it sounded pretty logical. What follows is my
assessment of what I read.

The key realization that Plan 9 made is that random-access string access
is the exception, rather than the rule. Stream access is much more
common, and much more in need of optimization. This seems logical to me.
UTF-8 is a stream-oriented encoding of Unicode that Plan 9 invented to
solve this optimization issue. UTF-8 is self-synchronizing and
byte-oriented, which allows a reader to be nearly stateless, and still
consume much less memory than UTF-32.

Plan 9 also described that, contrary to what some expect, very few
programs do better with UTF-32, because very few programs really need to
process the string in a non-linear way. Regular expressions and sorting
are the two main exceptions. UTF-8 also allows the transition to be made
slightly more smoothly, since many ASCII programs will already work with
UTF-8.

This is a synopsis of what I read. I am not familiar with this issue as
much as you are.

-- Matthew Fulmer -- http://mtfulmer.wordpress.com/
Help improve Squeak Documentation: http://wiki.squeak.org/squeak/808
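The self-synchronization property is easy to see in code: continuation
bytes all match the bit pattern 10xxxxxx, so a reader dropped at an
arbitrary byte offset regains character alignment after at most three
bytes. A sketch, assuming a byte-indexed string with byteAt: and
basicSize behaving as on Squeak's ByteString:

nextCharacterBoundaryAfter: byteOffset
    "Answer the offset of the next byte that starts a character,
    skipping over continuation bytes of the form 2r10xxxxxx."
    | i |
    i := byteOffset + 1.
    [i <= self basicSize
        and: [((self byteAt: i) bitAnd: 2r11000000) = 2r10000000]]
            whileTrue: [i := i + 1].
    ^i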
In reply to this post by Yoshiki Ohshima
Alan,
> Well, perhaps UTF-32 would be a better default, now that I think about
> it--due to performance issues for accessing characters at an index.
> But using 32-bit-wide or 16-bit-wide strings internally as the only
> option would be a waste of memory in many situations, especially for
> the "Latin-1" languages.

We do switch in Squeak between different bit-width representations (8
and 32) whenever necessary or favorable.

> Having String instances that use specified encodings enables one to
> avoid doing conversions unless and until it's needed. It also makes it
> easy to deal with the data as it will actually exist when persisted,
> or when transported over the network. And it makes it easier to handle
> the host platform's native character encodings (there may be more than
> one,) or the character encodings used by external libraries or
> applications that either offer callpoints to, or consume callpoints
> from, a Squeak process. It also documents the encoding used by each
> String.

Nothing prevents you from using a String as if it is, say, a ByteArray.
For example, you can pass a String or ByteArray to a socket primitive to
fill it, and you can keep the bits in it as you like. However, Smalltalk
is not just about holding data; once it comes to displaying a String,
concatenating them, comparing them, etc., etc., you do have to have a
canonical form.

> If all Strings use UTF-32, and are only converted to other encodings
> by the VM, how does one write Smalltalk code to convert text from one
> character encoding to another? I'd rather not make character encodings
> yet another bit of magic that only the VM can do.

Hmm. Of course you can convert encodings in memory. In Squeak, there are
a bunch of subclasses of TextConverter. Did anybody mention/suggest that
the conversion has to be VM magic?

> It is already the case that accessing individual characters from a
> String results in the reification of a Character object. So,
> leveraging what is already the case, conversion to/from the internal
> encoding to the canonical (Unicode) encoding should occur when a
> Character object is reified from an encoded character in a String (or
> in a Stream.) Character objects that are "put:" into a String would be
> converted from the Unicode code point to the encoding native to that
> String. Using Character reification to/from Unicode as the unification
> mechanism provides the illusion that all Strings use the same code
> points for their characters, even though they in fact do not.

You criticized an approach nobody advocated as "magic" above, but what
you wrote here really is magic. I've got a feeling that this system
would be very hard to debug. BTW, what would you do with Symbols?

> Of course, for some encodings (such as UTF-8) there would probably be
> a performance penalty for accessing characters at an arbitrary index
> ("aString at: n.") But there may be good ways to mitigate that, using
> clever implementation tricks (caveat: I haven't actually tried it.)
> However, with my proposal, one is free to use UTF-16 for all Strings,
> or UTF-32 for all Strings, or ASCII for all Strings--based on one's
> space and performance constraints, and based on the character
> repertoire one needs for one's user base. And the conversion to UTF-16
> or UTF-32 (or whatever) can be done when the String is read from an
> external Stream (using the VW stream decorator approach, for example.)
I *do* see some upsides of this approach, actually, but the downsides
are overwhelmingly bigger, if you think of Smalltalk as a self-contained
system. Handling keyboard input alone would make the system really
complex. IIUC, Matsumoto-san's (Matz) m17n idea for Ruby is sort of
along this line. I don't think that is a good approach, but it is
slightly more acceptable in Ruby, because Ruby is not a whole system.

BTW, current Squeak allows you to do this. Within the 32-bit quantity,
the first several bits denote the "language"; you can make up a special
language and store the code point in different encodings.

> The ASCII encoding would be good for the multitude of legacy
> applications that are English-only. ISO 8859-1 would be best for
> post-1980s/pre-UTFx legacy applications that have to deal with
> non-English languages, or have to deal with either HTML or pre-Vista
> Windows. UTF-x would be best for most other situations.

Is this your observation? Where do legacy applications in Japanese fit?
Why is HTML associated with Latin-1? What is special about Vista
Windows? This doesn't make good sense.

One approach I might try in a "new system" would be:

- The bits of the raw string representation are in UTF-8, but they are
  not really displayable.
- You always do stuff through an equivalent of Text, which carries
  enough attributes for the bits.
- Maybe remove the character object. A "character" is just a short
  Text. For the ASCII part, it could be a special case; i.e., a naked
  byte can have implicit text attributes by default.

-- Yoshiki
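For reference, the in-image conversion Yoshiki points to looks roughly
like this, assuming the Squeak 3.x TextConverter protocol
(nextFromStream: and nextPut:toStream:); selector details may differ
between versions, and utf8Input here stands for a ByteString of raw
UTF-8 bytes read off a socket, say:

    | converter in decoded out |
    converter := UTF8TextConverter new.
    in := ReadStream on: utf8Input.
    decoded := WriteStream on: String new.
    [in atEnd] whileFalse:
        [decoded nextPut: (converter nextFromStream: in)].
    decoded contents.        "a decoded (possibly Wide) String"

    out := WriteStream on: String new.
    decoded contents do:
        [:ch | converter nextPut: ch toStream: out].
    out contents.            "re-encoded back to UTF-8 bytes"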
Just as a side-note: In Seaside the encoding and decoding turns out to
be very complicated and expensive. In fact so expensive, that almost
nobody is willing to pay for it. What most people do is to work with
(Squeak 2.7 or) ByteStrings that they treat like ByteArrays. The data is
received, stored, and sent exactly the way it comes from the socket.
Byte-identical strings are sent back as they were received. There are
many caveats:

1. Most string operations don't work (except concatenation), e.g. asking
a string for its #size might return a wrong number.

2. All literal strings have to be encoded manually to the right format.
This clutters the code and is ugly.

3. Data in inspectors is sometimes not readable without a manual
conversion.

I am no expert with encodings, so I have no idea how this could be
cleanly solved. There is definitely the need for improvement.

Another issue I observed is that Characters in Squeak have an
inconsistent behavior for #==. For characters with codePoint > 256 the
identity is not preserved. This gives problems with code that uses #==
to compare characters, legacy code and code ported from VisualWorks
(SmaCC for example). In VisualWorks Characters are unique, just like
Symbols are.

Lukas

--
Lukas Renggli
http://www.lukas-renggli.ch
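The identity pitfall is easy to reproduce in a workspace (results as in
the Squeak versions discussed here; the exact cutoff depends on how many
Characters the image pre-allocates):

    (Character value: 65) == (Character value: 65).    "true  -- low code points come from a shared table"
    (Character value: 955) == (Character value: 955).  "false -- a fresh object each time"
    (Character value: 955) = (Character value: 955).   "true  -- so compare characters with #=, not #=="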
Lukas Renggli wrote:
> Just as a side-note: In Seaside the encoding and decoding turns out to
> be very complicated and expensive. In fact so expensive, that almost
> nobody is willing to pay for it.

But is that a property of 1) Seaside or 2) Squeak or 3) UTF-8? If the
first, just fix it ;-) If the second, what conversions are slow? If the
third, why not speed it up by a primitive? (UTF-8 translation isn't that
hard)

> What most people do is to work with
> (Squeak 2.7 or) ByteStrings that they treat like ByteArrays. The data
> is received, stored, and sent exactly the way it comes from the
> socket. Byte-identical strings are sent back as they were received.

I assume you mean Seaside 2.7 above, not Squeak 2.7.

> I am no expert with encodings, so I have no idea how this could be
> cleanly solved. There is definitely the need for improvement.

How about trying to improve the speed of conversions? You seem to imply
that this is the major issue here, so if the conversions were blindingly
fast (which I think they easily could be by writing one or two
primitives) this should improve matters.

> Another issue I observed is that Characters in Squeak have an
> inconsistent behavior for #==. For characters with codePoint > 256 the
> identity is not preserved. This gives problems with code that uses #==
> to compare characters, legacy code and code ported from VisualWorks
> (SmaCC for example). In VisualWorks Characters are unique, just like
> Symbols are.

Yeah, but there isn't really an easy workaround unless you have
immediate characters. Which Squeak doesn't, so fixing those comparisons
to use equality is really your only option (FWIW, given that VW has a
good JIT I would expect that they can inline this trivially, so there
shouldn't be a speed difference for VW).

Cheers,
  - Andreas
> > Just as a side-note: In Seaside the encoding and decoding turns out
> > to be very complicated and expensive. In fact so expensive, that
> > almost nobody is willing to pay for it.
>
> But is that a property of 1) Seaside or 2) Squeak or 3) UTF-8? If the
> first, just fix it ;-) If the second, what conversions are slow? If
> the third, why not speed it up by a primitive? (UTF-8 translation
> isn't that hard)

I would if I knew how to do it.

> > What most people do is to work with
> > (Squeak 2.7 or) ByteStrings that they treat like ByteArrays. The
> > data is received, stored, and sent exactly the way it comes from the
> > socket. Byte-identical strings are sent back as they were received.
>
> I assume you mean Seaside 2.7 above, not Squeak 2.7.

I am talking about Squeak 3.7. There are many Seaside users that will
stick with Squeak 3.7 forever.

> How about trying to improve the speed of conversions? You seem to
> imply that this is the major issue here, so if the conversions were
> blindingly fast (which I think they easily could be by writing one or
> two primitives) this should improve matters.

Are you talking about escaping? In Seaside 2.8 the escaping is already 2
times faster than in Seaside 2.7. Character encoding is another story.

Lukas

--
Lukas Renggli
http://www.lukas-renggli.ch
Lukas Renggli wrote:
>> But is that a property of 1) Seaside or 2) Squeak or 3) UTF-8? If the
>> first, just fix it ;-) If the second, what conversions are slow? If
>> the third, why not speed it up by a primitive? (UTF-8 translation
>> isn't that hard)
>
> I would if I knew how to do it.

I'll see if I can find some time on the weekend to look at this.

>> > What most people do is to work with
>> > (Squeak 2.7 or) ByteStrings that they treat like ByteArrays. The
>> > data is received, stored, and sent exactly the way it comes from
>> > the socket. Byte-identical strings are sent back as they were
>> > received.
>>
>> I assume you mean Seaside 2.7 above, not Squeak 2.7.
>
> I am talking about Squeak 3.7. There are many Seaside users that will
> stick with Squeak 3.7 forever.

Yes, using Squeak ->3<-.7 can make good sense for people who don't care
about using m17n internally (definitely more than using Squeak ->2<-.7
as you wrote initially).

>> How about trying to improve the speed of conversions? You seem to
>> imply that this is the major issue here, so if the conversions were
>> blindingly fast (which I think they easily could be by writing one or
>> two primitives) this should improve matters.
>
> Are you talking about escaping? In Seaside 2.8 the escaping is already
> 2 times faster than in Seaside 2.7. Character encoding is another
> story.

I'm talking about UTF-8 conversions. A simple thing to do would be (for
example) to have a lookup table for everything covered by 2-byte
encodings (which is practically everything in the western hemisphere).
Something like here:

nextFromStream: stream
    "Read a UTF-8 encoded character from the stream"
    | value1 value2 |
    value1 := utf8Table at: stream nextByte.
    value1 isCharacter ifTrue: [^value1].
    value1 isArray ifTrue: [
        value2 := value1 at: stream nextByte.
        value2 isCharacter ifTrue: [^value2].
        "... put the slow code here ..."]

(note that the lookup table can include the required language tags etc.
to make any further conversion unnecessary)

Beyond which a primitive would go a very long way here.

Cheers,
  - Andreas
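A sketch of how that two-level table might be populated. This is
hypothetical -- Andreas only outlines the idea -- and the 1-based
indexing (byte value b in slot b + 1) is an assumption his snippet
glosses over:

utf8DecodeTable
    | table |
    table := Array new: 256.
    "single bytes, 2r0xxxxxxx: the byte is the code point"
    0 to: 127 do: [:b | table at: b + 1 put: (Character value: b)].
    "leading bytes of 2-byte sequences, 2r110yyyyy: a second-level
    table keyed by the continuation byte 2r10xxxxxx (16rC0 and 16rC1
    are skipped because they would be overlong encodings)"
    16rC2 to: 16rDF do: [:b | | second |
        second := Array new: 256.
        16r80 to: 16rBF do: [:b2 |
            second at: b2 + 1 put: (Character value:
                ((b bitAnd: 16r1F) bitShift: 6) + (b2 bitAnd: 16r3F))].
        table at: b + 1 put: second].
    "3- and 4-byte leading bytes stay nil and fall to the slow path"
    ^table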
In reply to this post by Yoshiki Ohshima
On Friday 08 June 2007 1:25 am, Yoshiki Ohshima wrote:
..
> > Well, UTF8 is just an encoding of Unicode code points. So, Squeak
> > will have to support Unicode. Its language and tools will need to
> > handle Unicode code points and UTF8 streams. Internally, whether
> > code points or UTF8 encoding is used would depend on the context.
>
> Why do you get the impression that Squeak doesn't support it?

Squeak's Unicode/UTF8 support seemed incomplete. I couldn't get Squeak
on Linux to take in ½ or π. How about:

a) Use Unicode chars in literals and text fields. I should be able to
write math equations in PluggableText.

b) Use Unicode chars in names (object, method, variable, symbols).
Children should be able to name their scripts and variables in their
language in Etoys.

c) See fallback glyphs for Unicode. Like four hex digits laid out 2x2 in
a small box the same height as the current font. It works much better
than the [] box.

d) Have Buttons that generate Unicode. This could be used to build soft
keyboards. (cf. PopUpMenu>>readKeyboard uses asciiValue :-().

e) Use modal input - codes coming in from Sensors could be button
presses (e.g. ESC, hotkeys to switch keyboard layouts) or multilingual
text sequences.

f) See a 'current language' indicator in input fields. Handling
backspace will be language dependent.

> Using UTF-8 internally throughout the system would be a challenge,
> especially thinking about that the overloaded methods like at:,
> at:put: and all of these have to be disambiguated as to what it means.

at:put: is a random access operation and UTF-8 is not meant for such
purposes. UTF-8 works well for streams of characters, and Unicode code
points for random access and lookup. This is what I meant when I said it
would depend on context.

Then there are mixed streams like keyboard input. I could be reading
button presses (like Enter for OK) or reading in a stream of characters
in a text field. We may need in-stream character codes to switch modes
and language.

I am still coming up to speed on Squeak multilingual support and these
observations are based on my explorations so far. It is quite possible
that I may have missed something.

Regards .. Subbu
In reply to this post by Yoshiki Ohshima
On Friday 08 June 2007 2:41 am, Yoshiki Ohshima wrote:
> However, there is a reason to call our stuff m17n, instead of i18n. > It might be still an aspiration to it, but supporting one language at > a time "sort of localed based idea" is not enough for "real" > multilingualization, where you would like to mix strings from > different languages freely. Very true. India has over 28 official languages and multilingual streams are the norm rather than the exception. Children learn three languages in primary school. Math texts make heavy use of 'math symbols'. Regards .. Subbu |
In reply to this post by Yoshiki Ohshima
On Friday 08 June 2007 11:28 am, Yoshiki Ohshima wrote:
> > Of course, for some encodings (such as UTF-8) there would probably
> > be a performance penalty for accessing characters at an arbitrary
> > index ("aString at: n.") But there may be good ways to mitigate
> > that, using clever implementation tricks (caveat: I haven't actually
> > tried it.) However, with my proposal, one is free to use UTF-16 for
> > all Strings, or UTF-32 for all Strings, or ASCII for all
> > Strings--based on one's space and performance constraints, and based
> > on the character repertoire one needs for one's user base. And the
> > conversion to UTF-16 or UTF-32 (or whatever) can be done when the
> > String is read from an external Stream (using the VW stream
> > decorator approach, for example.)
>
> I *do* see some upsides of this approach, actually, but the downsides
> are overwhelmingly bigger, if you think of Smalltalk as a
> self-contained system. Handling keyboard input alone would make the
> system really complex.

A Unicode code point is 16 bits and UTF-8 varies from 8 to 32 bits. Is
there any sound case for other UTFs now (outside of VMs)? The Wikipedia
entry below has a good summary of the pros and cons:

http://en.wikipedia.org/wiki/UTF-8

Rob Pike's note:

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

is a very good reality check on the situation. For children who will be
working in a multilingual environment, Squeak will be spending most of
its time waiting for a button/key push anyway :-).

Regards .. Subbu
In reply to this post by Lukas Renggli
Lukas Renggli <renggli <at> gmail.com> writes:
> Another issue I observed is that Characters in Squeak have an
> inconsistent behavior for #==. For characters with codePoint > 256 the
> identity is not preserved. This gives problems with code that uses #==
> to compare characters, legacy code and code ported from VisualWorks
> (SmaCC for example). In VisualWorks Characters are unique, just like
> Symbols are.
>
> Lukas

Just for the detail, Characters are unique like SmallIntegers are. VW
uses the last-two-bits-of-object-pointer trick so that Characters are
immediate values.

Nicolas
In reply to this post by Andreas.Raab
On Jun 7, 2007, at 11:55 PM, Andreas Raab wrote:

> How about trying to improve the speed of conversions? You seem to
> imply that this is the major issue here, so if the conversions were
> blindingly fast (which I think they easily could be by writing one or
> two primitives) this should improve matters.

The conversions could be made faster, yes. But consider this: the
life-cycle of a string in a web app is very often something like this:

- comes in over HTTP
- lives in the image for a while, maybe persisted in some way
- gets sent back out over HTTP many times

Even if the conversion *is* blindingly fast, it's still better to leave
it as UTF-8 the whole time, not only to remove the overhead of decoding
and reencoding, but also to avoid storing WideStrings in the image for
long periods of time. Also, consider that building html pages mainly
involves writing lots of short strings to streams, which sometimes
include non-ASCII characters. If they can be pre-encoded it's another
space and time win. On the other hand, the traditional drawback to
UTF-8, random access to characters, doesn't come up much with generating
web pages, though of course a web app may do this kind of thing as part
of its domain functionality.

I don't claim that all strings should always be UTF-8, but having a
UTF8String class would be an excellent thing.

Colin
Colin
Could you say the difference between WideString and UTF-8 (would UTF-8
be a specialized WideString?). I got bitten by these encoding problems
and having a nice solution would be good.

Stef

On 9 juin 07, at 00:02, Colin Putney wrote:

> [...]
2007/6/9, stephane ducasse <[hidden email]>:
> Colin
>
> Could you say the difference between WideString and UTF-8 (would UTF-8
> be a specialized WideString?).

The way I understand it, UTF8String would be a subclass of ByteString
and probably have methods like #size, #first:, #last: and #at:
overridden.

> I got bitten by these encoding problems and having a nice solution
> would be good.

Well, there is what the evil language with J does: UCS2 everywhere, no
excuses. This is a bit awkward for characters outside the BMP (which are
more rare than unicorns) but IIRC the astral planes didn't exist when it
was created. So you could argue for UCS4. Yes, it's twice the size, but
who really cares? If you could get rid of all the size hacks in Squeak
that were cool in the 70ies, would you?

Cheers
Philippe

> Stef
>
> On 9 juin 07, at 00:02, Colin Putney wrote:
>
> > [...]
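A sketch of what the #size override might look like for such a
hypothetical UTF8String, assuming ByteString's byteAt: and basicSize
behave as in current Squeak:

UTF8String>>size
    "Count characters, not bytes: exactly the bytes that do not
    match the continuation pattern 2r10xxxxxx start a character."
    | count |
    count := 0.
    1 to: self basicSize do: [:i |
        ((self byteAt: i) bitAnd: 2r11000000) = 2r10000000
            ifFalse: [count := count + 1]].
    ^count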
In reply to this post by stephane ducasse
On Jun 9, 2007, at 12:24 AM, stephane ducasse wrote:

> Could you say the difference between WideString and UTF-8 (would UTF-8
> be a specialized WideString?).

WideString is a fixed-length encoding - each character is 4 bytes long.
UTF-8 is a variable-length encoding - each character could be 1, 2, 3 or
4 bytes. The problem with WideString is that it wastes memory. Most
characters fit into 2 bytes, and everything in the Basic Multilingual
Plane fits into 3 bytes. The problem with UTF-8 is that it makes random
access expensive. UTF8String>>at: would have to do a linear search
through the string to find the character at a given index.

> I got bitten by these encoding problems and having a nice solution
> would be good.

I don't think there's a single solution that's good for all problems.
For the kind of web applications that I work on, UTF-8 is a clear win.
For other kinds of applications, WideString and maybe TwoByteString are
probably better.

Colin
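The linear search Colin mentions, sketched for the same hypothetical
UTF8String; decodeCharacterStartingAt: is an invented helper that would
decode the multi-byte sequence beginning at a byte offset:

UTF8String>>at: index
    "Scan from the front, counting only bytes that start a character,
    until the index-th character is reached -- O(n) per access."
    | seen byteOffset |
    seen := 0.
    byteOffset := 0.
    [seen < index] whileTrue: [
        byteOffset := byteOffset + 1.
        ((self byteAt: byteOffset) bitAnd: 2r11000000) = 2r10000000
            ifFalse: [seen := seen + 1]].
    ^self decodeCharacterStartingAt: byteOffset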
In reply to this post by Philippe Marschall
Philippe Marschall wrote:
>> I got bitten by these encoding problems and having a nice solution
>> would be good.
> Well, there is what the evil language with J does: UCS2 everywhere, no
> excuses. This is a bit awkward for characters outside the BMP (which
> are more rare than unicorns) but IIRC the astral planes didn't exist
> when it was created. So you could argue for UCS4. Yes, it's twice the
> size, but who really cares? If you could get rid of all the size hacks
> in Squeak that were cool in the 70ies, would you?

All of us who use the image as a database care about space efficiency,
but on the other side we want all normal string operations to run on
Unicode strings too. That's why a UTF8-encoded string is not
appropriate, even though it is the most space efficient, because string
operations are not fast enough.

I would propose a hybrid solution: three subclasses of String:

1. ByteString for ASCII (native English speakers)
2. TwoByteString for most of the other languages
3. FourByteString (WideString) for Japanese/Chinese/and others

And even for the 2nd group, plain ASCII satisfies in many cases for
short strings. For Slovenian I would say for 80% of short strings (we
have only čšžČŠŽ as non-ASCII chars). I think most of Latin Europe has a
similar situation.

Conversion between strings should be automatic, as it is with numbers.
You start with an ASCII-only ByteString, and when you first encounter a
character > 256, you convert to TwoByteString, and then to
FourByteString.

Best regards
Janko

> [...]

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si
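A sketch of the automatic promotion Janko proposes. Illustrative only:
current Squeak does not behave this way, and a real implementation would
need care around the #become: semantics:

ByteString>>at: index put: aCharacter
    "Characters that fit stay in one byte; the first wide character
    turns the receiver into a WideString in place (via a one-way
    become), after which the write is retried -- in the same spirit
    as SmallInteger overflowing into LargeInteger."
    aCharacter asInteger < 256
        ifTrue: [^super at: index put: aCharacter].
    self becomeForward: (WideString withAll: self).
    ^self at: index put: aCharacter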
2007/6/9, Janko Mivšek <[hidden email]>:
> Philippe Marschall wrote:
> >> I got bitten by these encoding problems and having a nice solution
> >> would be good.
> > Well, there is what the evil language with J does: UCS2 everywhere,
> > no excuses. This is a bit awkward for characters outside the BMP
> > (which are more rare than unicorns) but IIRC the astral planes
> > didn't exist when it was created. So you could argue for UCS4. Yes,
> > it's twice the size, but who really cares? If you could get rid of
> > all the size hacks in Squeak that were cool in the 70ies, would you?
>
> All of us who use the image as a database care about space efficiency,
> but on the other side we want all normal string operations to run on
> Unicode strings too.

The image is not an efficient database. It stores all kinds of "crap"
like Morphs. And it sucks as a database (ACID transactions anyone?).
Don't even get me started on migration (like the Squeak Chronology
classes).

Philippe

> That's why a UTF8-encoded string is not appropriate, even though it is
> the most space efficient, because string operations are not fast
> enough.
>
> [...]
>
> --
> Janko Mivšek
> AIDA/Web
> Smalltalk Web Application Server
> http://www.aidaweb.si
Hi Philippe,
Philippe Marschall wrote:
> 2007/6/9, Janko Mivšek <[hidden email]>:
>> All of us who use the image as a database care about space
>> efficiency, but on the other side we want all normal string
>> operations to run on Unicode strings too.
>
> The image is not an efficient database. It stores all kinds of "crap"
> like Morphs. And it sucks as a database (ACID transactions anyone?).
> Don't even get me started on migration (like the Squeak Chronology
> classes).

I admit that I came from VW, where I'm running quite a number of web
apps on images which also serve as the sole database, and that just
works, reliably and fast. Now I'm thinking to do the same in Squeak.
That is, to use the Squeak image as a database, fast and reliable. Am I
too naive?

Best regards
Janko

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si