> I admit that I came from VW where I'm running quite a number of web apps
> on images which serve also as the sole database and that just works,
> reliable and fast.
>
> Now I'm thinking to do the same in Squeak. That is, to use the Squeak image
> as a database, fast and reliable. Am I too naive?

A lot of people are doing that with Squeak as well. For example, all my Pier instances are using the image as a database. It is (was) the same for SmallWiki.

It is certainly fast. The reliability depends on the VM you are using. Some VMs are known to crash if they have been started with the wrong parameters or if they grow over a certain amount of memory.

Lukas

--
Lukas Renggli
http://www.lukas-renggli.ch
In reply to this post by Philippe Marschall
<Philippe>Well, there is what the evil language with J does: UCS2 everywhere, no excuses. This is a bit awkward for characters outside the BMP (which are rarer than unicorns) but IIRC the astral planes didn't exist when it was created. So you could argue for UCS4. Yes, it's twice the size, but who really cares? If you could get rid of all the size hacks in Squeak that were cool in the '70s, would you?</Philippe>

Note: UTF-32 and UCS-4 are different names for the same thing. [Reference: http://en.wikipedia.org/wiki/UTF-32]

There is no one solution that is good enough for all use cases. UTF-32 is fast for indexed character reading/writing. It also comprehensively covers the entire Unicode Universal Character Set--not just the characters in the Basic Multilingual Plane. But it is not very space efficient. [Reference: http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters]

Although you could have a different subclass of String for each encoding, that's a poor use of inheritance. It's better to have a single String class that uses an associated Strategy object (stored in one of the instance variables of a String--the other holding a ByteArray containing the characters.) The CharacterEncoding class would have a subclass for each different encoding. The ByteArray would hold the String's data, whose character content would be interpreted by the Strategy object (an instance of CharacterEncoding.)

To achieve semantic unification across any and all character encodings, the rule would be that when a Character object is reified from a String, it always uses the Unicode code point ("integer code value.") And when a Character is "put:" into a String, its canonical (Unicode) code point is translated to be correct for that String's encoding. Both conversions would be the responsibility of the String's Strategy object (an instance of CharacterEncoding.)

This implementation architecture lets each application (or package/module/code-library) choose the encoding that best suits its use case, but prevents character code mapping errors when characters are copied between Strings whose encodings are not the same. In the case of the variable-byte encodings, it might be possible to achieve significant performance improvements by having the CharacterEncoding instance store information that helps to more quickly translate between logical character indices and physical byte indices within the String's ByteArray (the RunArray of a Text is a good analogy for what I have in mind here.)

--Alan
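[Editor's sketch: Alan's Strategy-based design, rendered in Python for illustration. All names here (CharacterEncoding, Latin1Encoding, EncodedString, at/at_put) are hypothetical, not actual Squeak code; the point is the rule that characters are always reified as canonical Unicode code points and translated back on "put:".]

```python
class CharacterEncoding:
    """Strategy: maps between raw bytes and Unicode code points."""
    def code_point_at(self, data: bytearray, index: int) -> int:
        raise NotImplementedError
    def put_code_point(self, data: bytearray, index: int, cp: int) -> None:
        raise NotImplementedError

class Latin1Encoding(CharacterEncoding):
    # ISO-8859-1: byte values ARE the code points U+0000..U+00FF.
    def code_point_at(self, data, index):
        return data[index]
    def put_code_point(self, data, index, cp):
        if cp > 0xFF:
            raise ValueError("code point %#x not representable in Latin-1" % cp)
        data[index] = cp

class Utf32Encoding(CharacterEncoding):
    # Fixed four bytes per character (little-endian here).
    def code_point_at(self, data, index):
        return int.from_bytes(data[4 * index:4 * index + 4], "little")
    def put_code_point(self, data, index, cp):
        data[4 * index:4 * index + 4] = cp.to_bytes(4, "little")

class EncodedString:
    """One String class; the encoding is a pluggable Strategy object."""
    def __init__(self, data: bytearray, encoding: CharacterEncoding):
        self.data = data
        self.encoding = encoding
    def at(self, index) -> str:
        # Characters are always reified as canonical Unicode.
        return chr(self.encoding.code_point_at(self.data, index))
    def at_put(self, index, char: str) -> None:
        # The canonical code point is translated for this String's encoding.
        self.encoding.put_code_point(self.data, index, ord(char))

s = EncodedString(bytearray(b"abc"), Latin1Encoding())
s.at_put(0, "é")      # U+00E9 fits in Latin-1
print(s.at(0))        # é
```

Copying a Character between two such Strings cannot mis-map its code, because the only value that ever crosses the boundary is the Unicode code point.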
In reply to this post by Lukas Renggli
It works well if you run applications that don't need to execute transactions, or that don't care much about losing some of the objects if the application crashes, and if you also find a nice way to share the instances of your object model among different running images (it is very common in Seaside to run several images).

Although I have never used VW, it is probably not a lot different, since Seaside apps can be pretty stable. But if you need to execute "persistent" transactions, then take a look at Magma, Glorp, GOODS or even the free version of GemStone.

r.

On 6/9/07, Lukas Renggli <[hidden email]> wrote:
> A lot of people are doing that with Squeak as well. For example all my
> Pier instances are using the image as a database. It is (was) the same
> for SmallWiki.
> [...]
In reply to this post by K. K. Subramaniam
FWIW, I believe this is how NSString works on the Mac, and it has been Unicode-capable for a very long time.
On Jun 9, 2007, at 1:54 PM, Alan Lovejoy wrote:
In reply to this post by Janko Mivšek
Janko Mivšek wrote:
> I would propose a hybrid solution: three subclasses of String:
>
> 1. ByteString for ASCII (native English speakers)
> 2. TwoByteString for most of the other languages
> 3. FourByteString (WideString) for Japanese/Chinese/and others

Let me be more exact about that proposal:

This is for internal representation only; for interfacing to the external world we need to convert to/from (at least) the UTF-8 representation.

1. ByteString for ASCII (English) and ISO-8859-1 (West Europe).
   ByteString is therefore always regarded as encoded in the ISO-8859-1
   codepage, which is the same as Unicode Basic Latin (1).

2. TwoByteString for East European Latin, Greek, Cyrillic and many more
   (the so-called Basic Multilingual Plane (2)). Encoding of that string
   would correspond to UCS-2, even though it is considered obsolete (3).

3. FourByteString for Chinese/Japanese/Korean and some others. Encoding
   of that string would therefore correspond to UCS-4/UTF-32 (4).

I think that this way we can achieve the most space-efficient yet fast support for all languages in the world. Because of their fixed length those strings are also easy to manipulate, contrary to variable-length UTF-8 ones. Conversion to/from UTF-8 could probably also be made simpler with the help of bit-arithmetic algorithms, tailored differently for each of the three proposed string subclasses above.

(1) Wikipedia Unicode: Storage, transfer, and processing
    http://en.wikipedia.org/wiki/Unicode#Storage.2C_transfer.2C_and_processing
(2) Wikipedia Basic Multilingual Plane
    http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
(3) Wikipedia UTF-16/UCS-2: http://en.wikipedia.org/wiki/UCS-2
(4) Wikipedia UTF-32/UCS-4: http://en.wikipedia.org/wiki/UTF-32/UCS-4

Best regards
Janko

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si
On Jun 10, 2007, at 3:55 AM, Janko Mivšek wrote:

> I think that this way we can achieve most efficient yet fast
> support for all languages in the world. Because of fixed length
> those strings are also easy to manipulate contrary to variable
> length UTF-8 ones.

"Most efficient yet fast" is a matter of perspective. For the apps I work on, UTF-8 is better than your scheme because space efficiency is more important than random access, and time spent encoding and decoding UTF-8 would dwarf time spent scanning for random access.

As soon as you try to support more than 256 characters, there are trade-offs to be made. The "ideal" solution depends on your application. How important is memory efficiency vs. processing speed? How about stream processing vs. random access? What format is your input and output? Which characters do you need to support, and how many of them are there?

A good string library will be flexible enough to allow its users to make those trade-offs according to the needs of the application.

> Conversion to/from UTF-8 could probably also be simpler with help
> of bit arithmetic algorithms, which would be tailored differently
> for each of the proposed three string subclasses above.

Yes, a couple of well designed primitives would help quite a bit.

Colin
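[Editor's sketch: the space trade-off Colin alludes to is easy to quantify. The same text costs very different byte counts per encoding depending on the script; Python's codecs are used here purely for illustration.]

```python
# Byte cost of the same text under different Unicode encoding forms.
# ASCII-heavy text favors UTF-8; CJK text is smaller in UTF-16 than UTF-8;
# UTF-32 is always 4 bytes per character.
samples = {
    "English":  "Hello, world",
    "Slovene":  "Čiščenje žlič",
    "Japanese": "こんにちは世界",
}

for name, text in samples.items():
    sizes = {enc: len(text.encode(enc))
             for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(f"{name:9} ({len(text)} chars): {sizes}")
```

For the English sample, UTF-8 is a quarter of the UTF-32 size; for the Japanese sample, UTF-16 beats UTF-8. There is no universally "most efficient" choice.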
In reply to this post by K. K. Subramaniam
On Saturday 09 June 2007 6:59 am, Yoshiki Ohshima wrote:
> It is incomplete in many ways. Sure. But that wasn't the issue you
> were raising; you were talking about the interface between the image
> and VM but the hard part.

If multilingual support is implemented in platform-specific fashion, then it will be incomplete till all platform variations are taken into account.

>> a) Use Unicode chars in literals and text fields.
> You can do this already.

I didn't know this. Neither typing it nor cut-n-paste would work for me (VM 3.7.7 on Linux). I am able to use U+0Cxx characters fine in other Linux apps (UTF-8 aware). Am I missing something in my Squeak setup?

> We have been doing this many years already. What we can't do is to
> display Indic characters yet (which will be solved very soon).

Is it working on Linux? For Kannada? I have some public schools that are waiting for Kannada support. Is there anything I can do to speed it up?

>> f) See 'current language' indicator in input fields.
> What do you mean by "input fields"?

Essentially, Morphs that accept text input. In X/KDE/GNOME, there is a keyboard layout indicator that shows what keyboard layout is in effect. But how can a user get hints when Squeak runs fullscreen or in console mode?

Regards .. Subbu
In reply to this post by Janko Mivšek
Hi Janko -
Just as a comment from the sidelines, I think that concentrating on the size of the character in an encoding is a mistake. It is really the encoding that matters, and if it weren't impractical I would rename ByteString to Latin1String and WideString to UTF32String or so. This makes it much clearer that we are interested more in the encoding than in the number of bytes per character (although of course some encodings imply a character size), and this "encoding-driven" view of strings makes it perfectly natural to think of a UTF8String which has a variable-sized encoding and can live in perfect harmony with the other "byte-encoded strings".

In your case, I would rather suggest having a class UTF16String instead of TwoByteString. A good starting point (if you are planning to spend any time on this) would be to create a class EncodedString which captures the basics of conversions between differently encoded strings, and start defining a few (trivial) subclasses like those mentioned above. From there, you could extend this to UTF-8, UTF-16 and whatever other encoding you need.

Cheers,
  - Andreas

Janko Mivšek wrote:
> Let me be more exact about that proposal:
>
> This is for internal representation only, for interfacing to external
> world we need to convert to/from (at least) UTF-8 representation.
> [...]
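[Editor's sketch: Andreas's EncodedString idea, in Python for illustration. The class and method names are hypothetical; the design point is that an abstract root captures conversion between differently encoded strings, routing everything through canonical code points, so each trivial subclass only knows its own encoding.]

```python
class EncodedString:
    """Root class: knows how to convert between any two encodings."""
    def __init__(self, raw: bytes):
        self.raw = raw

    def code_points(self):
        """Decode this string's bytes into Unicode code points."""
        raise NotImplementedError

    @classmethod
    def from_code_points(cls, cps):
        raise NotImplementedError

    def convert_to(self, other_class):
        # Conversion always goes through canonical Unicode code points.
        return other_class.from_code_points(self.code_points())

class Latin1String(EncodedString):
    def code_points(self):
        return list(self.raw)              # Latin-1 bytes ARE code points
    @classmethod
    def from_code_points(cls, cps):
        return cls(bytes(cps))

class UTF8String(EncodedString):
    def code_points(self):
        return [ord(c) for c in self.raw.decode("utf-8")]
    @classmethod
    def from_code_points(cls, cps):
        return cls("".join(map(chr, cps)).encode("utf-8"))

s = Latin1String("café".encode("latin-1"))
u = s.convert_to(UTF8String)
print(u.raw)   # b'caf\xc3\xa9'
```

Adding UTF16String or any other encoding then means writing only the two codec methods, with all cross-conversions inherited for free.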
In reply to this post by Janko Mivšek
On 6/10/07, Janko Mivšek <[hidden email]> wrote:

Hi Philippe,

The crap is going to be there anyway if you're running some sort of live image with a database attached. If you don't want the crap then you can remove it from the image or use a minimal image.

> Now I'm thinking to do the same in Squeak. That is, to use the squeak image
> as a database, fast and reliable. Am I too naive?

You have to have plans for when an image becomes corrupted.

I don't think it's too difficult to have the image automatically save itself - say, for example, once every 5 minutes, once an hour, once a day and once a week. That way, if an image becomes corrupted, you can go back to the latest known working image.

If data is very important to your application, you could write some code to log changes to a file (aka squeak.changes) and manually read them back in when disaster occurs.

Michael.
2007/6/11, Michael van der Gulik <[hidden email]>:
> If data is very important to your application, you could write some code to
> log changes to a file (aka squeak.changes) and manually read them back in
> when disaster occurs.
> [...]

Philippe
In reply to this post by Andreas.Raab
On Monday 11 June 2007 2:44 am, Andreas Raab wrote:
> Hi Janko -
> .... this "encoding-driven" view of
> strings makes it perfectly natural to think of an UTF8String which has a
> variable sized encoding and can live in perfect harmony with the other
> "byte encoded strings".

How about UTF8Stream, since UTF-8 works best in a stream? The term String has strong connotations of storage and indexing.

Regards .. Subbu
On Mon, 11 Jun 2007 09:15:16 +0200, subbukk wrote:
> On Monday 11 June 2007 2:44 am, Andreas Raab wrote:
>> .... this "encoding-driven" view of
>> strings makes it perfectly natural to think of an UTF8String which has a
>> variable sized encoding and can live in perfect harmony with the other
>> "byte encoded strings".
> How about UTF8Stream since UTF8 works best in a stream? The term String
> has strong connotations of storage and indexing.

How about the ages-old Smalltalk solution: a decorated string is a Text. The stringArray and the runArray of a Text together determine the face of the character. Then you'd have UTF8Text, the stringArray of which can contain plain ASCII if there are no other characters in a UTF8Text. And Dan's double-dispatch concept can be employed for the match/replace/etc. functions.

Just my CHF 0.05

/Klaus
In reply to this post by Andreas.Raab
Hi Andreas,
Let me start with a statement that Unicode is a generalization of ASCII. ASCII has code points < 128 and therefore always fits in one byte, while Unicode can have 2-, 3- or even 4-byte-wide code points.

No one treats ASCII strings as ASCII "encoded", therefore no one should treat Unicode strings as encoded either. And this is the idea behind my proposal - to have Unicode strings as collections of character code points, with different byte widths.

Unicode actually starts with ASCII, then with Latin 1 (ISO-8859-1), which all fit in one byte. ByteStrings which contain plain ASCII are therefore already Unicode! Same with Latin 1 ones. It is therefore just natural to extend Unicode from byte to two- and four-byte strings to cover all code points. For a user this string is still a string, as it was when it was just ASCII. This approach is therefore also the most consistent one.

When we are talking about Unicode "encodings" we mean UTF (Unicode Transformation Format). There is UTF-8, UTF-16 and UTF-32. The first two are variable-length formats, which means that character size is not the same as byte size and cannot simply be calculated from it. Each character may be 1, 2, 3 or 4 bytes, depending on the width of its code point.

Because of their variable length those encodings are not useful for general string manipulation but just for communication and storage. String manipulation would be very inefficient (just consider the speed of #size, which is used everywhere).

I would therefore use strings with pure Unicode content internally and put all encoding/decoding on the periphery of the image - at the interfaces to the external world. As Subbu already suggested, we could put that in a UTF8Stream?

VW and GemStone also put encodings outside the string, into separate Encoders and EncodedStreams. They are also deprecating usage of EncodedByteStrings like ISO88591String, MACString etc. Why should we then introduce them to Squeak now?

UTF-8 encoding/decoding is very efficient by design, therefore we must make it efficient in Squeak too. It must be almost as fast as a simple copy.

And for those who still want to have UTF-8 encoded strings, they can store them in a plain ByteString anyway...

I hope this clarifies my ideas a bit.

Best regards
Janko

Andreas Raab wrote:
> Just as a comment from the sidelines I think that concentrating on the
> size of the character in an encoding is a mistake. It is really the
> encoding that matters and if it weren't impractical I would rename
> ByteString to Latin1String and WideString to UTF32String or so.
> [...]

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si
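[Editor's sketch: why #size is the problem case for UTF-8. With a variable-length encoding the character count can only be obtained by scanning every lead byte, whereas a fixed-width string answers it with a single division. Python for illustration; assumes well-formed UTF-8 input.]

```python
def utf8_size(data: bytes) -> int:
    """Count characters in well-formed UTF-8 by skipping each code point's
    bytes; this is an O(n) scan, unlike fixed-width #size."""
    count, i = 0, 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            i += 1   # 0xxxxxxx: 1-byte sequence (ASCII)
        elif b < 0xE0:
            i += 2   # 110xxxxx: 2-byte sequence
        elif b < 0xF0:
            i += 3   # 1110xxxx: 3-byte sequence
        else:
            i += 4   # 11110xxx: 4-byte sequence
        count += 1
    return count

print(utf8_size("Žepni nož".encode("utf-8")))   # 9 characters in 11 bytes
```

A TwoByteString or FourByteString answers #size in constant time (byte count divided by the fixed character width), which is the whole argument for fixed-width internal representations.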
In reply to this post by Colin Putney
Hi Colin,
Colin Putney wrote:
> "Most efficient yet fast" is a matter of perspective. For the apps I
> work on, UTF-8 is better than your scheme because space efficiency is
> more important than random access, and time spent encoding and decoding
> UTF-8 would dwarf time spent scanning for random access.

Anyone can definitely stay with UTF-8 encoded strings in a plain ByteString, or subclass to a UTF8String by himself. But I don't know why we need to have UTF8String as part of the string framework. Just because of meaning? Then we also need to introduce an ASCIIString :)

> As soon as you try to support more than 256 characters, there are
> trade-offs to be made. The "ideal" solution depends on your application.
> [...]
> A good string library will be flexible enough to allow its users to
> make those trade-offs according to the needs of the application.

I think that preserving simplicity is also an important goal. We need to find a general yet simple solution for Unicode strings, which will be good enough for most uses, as is the case for numbers, for instance. We can deal with the more special cases separately. I claim that pure Unicode strings in ByteString, TwoByteString or FourByteString are such general support. UTF8String is already a specific one.

> Yes, a couple of well designed primitives would help quite a bit.

I studied UTF-8 conversion and it is designed to be efficient, almost as fast as a plain copy. I already wrote those conversion methods and now I'm preparing benchmarks. If conversion really is as fast as a copy, then there are not many arguments left against always converting to the inner Unicode representation by default.

Best regards
Janko

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si
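[Editor's sketch: the shape of the benchmark Janko proposes - UTF-8 decoding measured against a plain copy of the same buffer. Python-level timings only hint at relative cost; a Squeak primitive would behave differently, so this is a methodology sketch, not a result.]

```python
import timeit

# Mostly-ASCII sample with some two-byte characters mixed in.
data = ("Čas je denar. " * 10_000).encode("utf-8")

copy_t = timeit.timeit(lambda: bytes(data), number=100)
decode_t = timeit.timeit(lambda: data.decode("utf-8"), number=100)

print(f"copy:   {copy_t:.4f}s")
print(f"decode: {decode_t:.4f}s ({decode_t / copy_t:.1f}x the copy)")
```

If the ratio stays close to 1x, decoding at the image border on every I/O operation is cheap enough that keeping pure Unicode inside costs little.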
In reply to this post by Janko Mivšek
So except for the missing 16-bit optimization, this is exactly what we
have now, right? So what is the actual proposal?

- Bert -

On Jun 11, 2007, at 12:35, Janko Mivšek wrote:
> Let me start with a statement that Unicode is a generalization of
> ASCII. ASCII has code points < 128 and therefore always fits in one
> byte while Unicode can have 2, 3 or even 4 bytes wide code points.
> [...]
Bert Freudenberg wrote:
> So except for the missing 16-bit optimization this is exactly as we do > have now, right? So what is the actual proposal? Exactly. There is already a WideString and my proposal is just to introduce a TwoByteString and rename WideString to FourByteString for consistency. That way we'll cover all Unicode strings as efficiently as possible yet manageable with string manipulations like usual strings. But main point of my proposal is to treat internal strings as Unicode and only Unicode and nothing else. All other encodings must be converted to Unicode at the borders of an image. Those conversions could be done with separate Encoders or EncodedStreams. It seems that this was already a Yoshiki idea with WideString, so I'm just extending that idea with a TwoByteString to cover 16 bits too. Yoshiki, am I right? Best regards Janko > > - Bert - > > On Jun 11, 2007, at 12:35 , Janko Mivšek wrote: > >> Hi Andreas, >> >> Let me start with a statement that Unicode is a generalization of >> ASCII. ASCII has code points < 128 and therefore always fits in one >> byte while Unicode can have 2, 3 or even 4 bytes wide code points. >> >> No one treats ASCII strings as ASCII "encoded" therefore no one should >> treat Unicode strings as encoded too. And this is an idea behind my >> proposal - to have Unicode strings as collections of character code >> points, with different byte widths. >> >> Unicode actually starts with ASCII, then with Latin 1 (ISO8859-1) >> which all fit to one byte. ByteStrings which contain plain ASCII are >> therefore already Unicode! Same with Latin 1 ones. It is therefore >> just natural to extend Unicode from byte to two and four byte strings >> to cover all code points. For an user this string is still a string as >> it was when it was just ASCII. This approach is therefore also most >> consistent one. >> >> When we are talking about Unicode "encodings" we mean UTF (Unicode >> Transformation Format). There is UTF-8, UTF-16 and UTF-32. 
First ones >> are both variable length formats, which means that character size is >> not the same as byte size and it cannot be just simply calculated from >> it. Each character character may be 1, 2, 3 or 4 bytes depending of >> the width of its code point. >> >> Because of variable length those encodings are not useful for general >> string manipulation bit just for communication and storage. String >> manipulation would be very inefficient (just consider the speed of >> #size, which is used everywhere). >> >> I would therefore use strings with pure Unicode content internally and >> put all encoding/decoding on the periphery of the image - to >> interfaces to the external world. As Subbukk already suggested we >> could put that to an UTF8Stream? >> >> VW and Gemstone also put encodings out of string, to separate Encoders >> and the EncodedStream. They are also depreciating usage of >> EncodedByteStrings like ISO88591String, MACString etc. Why should then >> introduce them to Squeak now? >> >> UT8 encoding/decoding is very efficient by design, therefore we must >> make it efficient in Squeak too. It must be almost as fast as a simple >> copy. >> >> And for those who still want to have UTF8 encoded string they can >> store them in plain ByteString anyway... >> >> I hope this clarify my ideas a bit. >> >> Best regards >> Janko >> >> Andreas Raab wrote: >>> Hi Janko - >>> Just as a comment from the sidelines I think that concentrating on >>> the size of the character in an encoding is a mistake. It is really >>> the encoding that matters and if it weren't impractical I would >>> rename ByteString to Latin1String and WideString to UTF32String or so. 
>>> This makes it much clearer that we are interested more in the
>>> encoding than in the number of bytes per character (although of
>>> course some encodings imply a character size), and this
>>> "encoding-driven" view of strings makes it perfectly natural to
>>> think of a UTF8String which has a variable-sized encoding and can
>>> live in perfect harmony with the other "byte encoded strings".
>>> In your case, I would rather suggest having a class UTF16String
>>> instead of TwoByteString. A good starting point (if you are
>>> planning to spend any time on this) would be to create a class
>>> EncodedString which captures the basics of conversions between
>>> differently encoded strings, and start defining a few (trivial)
>>> subclasses like those mentioned above. From there, you could extend
>>> this to UTF-8, UTF-16 and whatever other encoding you need.
>>> Cheers,
>>>   - Andreas
>>> Janko Mivšek wrote:
>>>> Janko Mivšek wrote:
>>>>> I would propose a hybrid solution: three subclasses of String:
>>>>>
>>>>> 1. ByteString for ASCII (native English speakers)
>>>>> 2. TwoByteString for most other languages
>>>>> 3. FourByteString (WideString) for Japanese/Chinese/and others
>>>>
>>>> Let me be more exact about that proposal:
>>>>
>>>> This is for internal representation only; for interfacing with the
>>>> external world we need to convert to/from (at least) the UTF-8
>>>> representation.
>>>>
>>>> 1. ByteString for ASCII (English) and ISO-8859-1 (Western Europe).
>>>> ByteString is therefore always regarded as encoded in the
>>>> ISO 8859-1 code page, which is the same as Unicode Basic Latin (1).
>>>>
>>>> 2. TwoByteString for East European Latin, Greek, Cyrillic and many
>>>> more (the so-called Basic Multilingual Plane (2)). The encoding of
>>>> that string would correspond to UCS-2, even though it is
>>>> considered obsolete (3).
>>>>
>>>> 3. FourByteString for Chinese/Japanese/Korean and some others.
>>>> Encoding of that string would therefore correspond to
>>>> UCS-4/UTF-32 (4).
>>>>
>>>> I think that this way we can achieve the most efficient yet fast
>>>> support for all languages in the world. Because of their fixed
>>>> length those strings are also easy to manipulate, contrary to
>>>> variable-length UTF-8 ones.
>>>>
>>>> Conversion to/from UTF-8 could probably also be simpler with the
>>>> help of bit-arithmetic algorithms, tailored differently for each
>>>> of the three proposed string subclasses above.
>>>>
>>>> (1) Wikipedia, Unicode: Storage, transfer, and processing
>>>> http://en.wikipedia.org/wiki/Unicode#Storage.2C_transfer.2C_and_processing
>>>> (2) Wikipedia, Basic Multilingual Plane
>>>> http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
>>>> (3) Wikipedia, UTF-16/UCS-2:
>>>> http://en.wikipedia.org/wiki/UCS-2
>>>> (4) Wikipedia, UTF-32/UCS-4
>>>> http://en.wikipedia.org/wiki/UTF-32/UCS-4
>>>>
>>>> Best regards
>>>> Janko
>>
>> --
>> Janko Mivšek
>> AIDA/Web
>> Smalltalk Web Application Server
>> http://www.aidaweb.si

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si
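As an editorial aside on the proposal above: CPython later adopted essentially this scheme as its internal string representation (PEP 393, "Flexible String Representation", Python 3.3), choosing 1, 2 or 4 bytes per character based on the widest code point present in the string. A minimal Python sketch of that width-selection rule (the function name is illustrative, not an existing API):

```python
def narrowest_bytes_per_char(s):
    """Pick the smallest fixed width that holds every code point in s,
    mirroring the ByteString / TwoByteString / FourByteString split."""
    if not s:
        return 1
    top = max(ord(c) for c in s)
    if top <= 0xFF:        # Latin-1 range: ByteString
        return 1
    if top <= 0xFFFF:      # Basic Multilingual Plane: TwoByteString
        return 2
    return 4               # astral planes: FourByteString

print(narrowest_bytes_per_char("hello"))    # 1
print(narrowest_bytes_per_char("črka"))     # 2 (č is U+010D)
print(narrowest_bytes_per_char("日本語"))    # 2 (everyday CJK sits in the BMP)
print(narrowest_bytes_per_char("𝄞"))        # 4 (U+1D11E, outside the BMP)
```

Note that the decision is made per string, not per language, so Japanese text normally lands in the two-byte representation.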
In reply to this post by Janko Mivšek
On Mon, 2007-06-11 at 12:35 +0200, Janko Mivšek wrote:
> Hi Andreas,
>
> Let me start with a statement that Unicode is a generalization of
> ASCII. ASCII has code points < 128 and therefore always fits in one
> byte, while Unicode can have code points 2, 3 or even 4 bytes wide.
>
> No one treats ASCII strings as ASCII "encoded", therefore no one
> should treat Unicode strings as encoded either. And this is the idea
> behind my proposal - to have Unicode strings as collections of
> character code points, with different byte widths.
>
> Unicode actually starts with ASCII, then Latin 1 (ISO 8859-1), which
> both fit in one byte. ByteStrings which contain plain ASCII are
> therefore already Unicode! The same goes for Latin 1 ones. It is
> therefore just natural to extend Unicode from one-byte to two- and
> four-byte strings to cover all code points. For a user this string is
> still a string, as it was when it was just ASCII. This approach is
> therefore also the most consistent one.
>
> When we are talking about Unicode "encodings" we mean UTF (Unicode
> Transformation Format). There is UTF-8, UTF-16 and UTF-32. The first
> two are both variable-length formats, which means that character size
> is not the same as byte size and cannot simply be calculated from it.
> Each character may be 1, 2, 3 or 4 bytes, depending on the width of
> its code point.
>
> Because of their variable length those encodings are not useful for
> general string manipulation but just for communication and storage.
> String manipulation would be very inefficient (just consider the
> speed of #size, which is used everywhere).
>
> I would therefore use strings with pure Unicode content internally
> and put all encoding/decoding on the periphery of the image - in the
> interfaces to the external world. As Subbukk already suggested, we
> could put that in a UTF8Stream?
>
> VW and Gemstone also keep encodings out of strings, in separate
> Encoders and EncodedStreams. They are also deprecating the use of
> EncodedByteStrings like ISO88591String, MACString etc.
> Why should we then introduce them to Squeak now?
>
> UTF-8 encoding/decoding is very efficient by design, therefore we
> must make it efficient in Squeak too. It must be almost as fast as a
> simple copy.
>
> And for those who still want to have UTF-8 encoded strings, they can
> store them in plain ByteStrings anyway...
>
> I hope this clarifies my ideas a bit.

Norbert
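To make the #size argument in the quoted message concrete: in UTF-8 every character starts with a non-continuation byte, so counting characters requires scanning the whole buffer, whereas a fixed-width string just divides its byte size by the bytes per character. A small Python illustration (utf8_size is a hypothetical helper, not a library function):

```python
def utf8_size(data: bytes) -> int:
    """Count code points in UTF-8 data. Every byte except continuation
    bytes (0b10xxxxxx) starts a character, so the whole buffer must be
    scanned: O(n), unlike a fixed-width string where the answer is
    simply byteSize // bytesPerChar."""
    return sum(1 for b in data if b & 0xC0 != 0x80)

encoded = "Mivšek".encode("utf-8")
print(len(encoded))        # 7 bytes ...
print(utf8_size(encoded))  # ... but only 6 characters (š takes 2 bytes)
```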
In reply to this post by Janko Mivšek
On Jun 11, 2007, at 4:04 AM, Janko Mivšek wrote:

> Anyone can definitely stay with UTF-8 encoded strings in a plain
> ByteString, or subclass to UTF8String himself. But I don't know why
> we need to have UTF8String as part of the string framework. Just
> because of meaning? Then we also need to introduce an ASCIIString :)
> I think that preserving simplicity is also an important goal. We need
> to find a general yet simple solution for Unicode strings, which will
> be good enough for most uses, as is the case for numbers, for
> instance. We deal with more special cases separately. I claim that
> pure Unicode strings in a Byte, TwoByte or FourByteString are such
> general support. UTF8String is already a specific one.

Ok, so what you're saying is this: ByteString, TwoByteString and FourByteString are good enough for most purposes. Web developers and anyone else who needs to work with other encodings should roll their own solutions, so as not to burden the rest of the community with clutter caused by support for other encodings, or even hooks to make such things easy to integrate with the base string code.

Is that a fair characterization of your position?

Colin
In reply to this post by Janko Mivšek
Janko,
> It seems that this was already Yoshiki's idea with WideString, so I'm
> just extending that idea with a TwoByteString to cover 16 bits too.
>
> Yoshiki, am I right?

For storing the bare Unicode code points, I think so. I'm not convinced that adding a 16-bit variant solves any real problems, but there may be something. My first few questions are:

- While the vast majority of strings for, say, Japanese can be represented with the characters in the BMP, you would use FourByteString for Chinese/Japanese/Korean and some others. Does this mean that you would *always* use FourByteString for these "languages" (and not scripts)?

- Suppose you would like to use different line-wrapping algorithms for different languages; how would you keep that information?

-- Yoshiki
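Yoshiki's first question can also be answered per string rather than per language: a single scan of the code points tells you whether a hypothetical TwoByteString would suffice, with promotion to FourByteString only when a character outside the BMP actually appears. A Python sketch (fits_in_bmp is an illustrative name, not an existing API):

```python
def fits_in_bmp(s):
    """True if every code point is <= U+FFFF, i.e. a hypothetical
    TwoByteString could hold the string without promotion."""
    return all(ord(c) <= 0xFFFF for c in s)

print(fits_in_bmp("日本語のテキスト"))  # True: everyday Japanese stays in the BMP
print(fits_in_bmp("𝄞"))               # False: U+1D11E lies outside the BMP
```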
Yoshiki Ohshima wrote:
> Janko,
>
>> It seems that this was already Yoshiki's idea with WideString, so
>> I'm just extending that idea with a TwoByteString to cover 16 bits
>> too.
>>
>> Yoshiki, am I right?
>
> For storing the bare Unicode code points, I think so. I'm not
> convinced that adding a 16-bit variant solves any real problems, but
> there may be something.

A lot of text is basically 8-bit, *except* for the occasional wide dash etc., blowing the text up to 32-bit although 16 bits would be more than enough.

> - Suppose you would like to use different line-wrapping algorithms
> for different languages; how would you keep that information?

The question is which, if any, language-dependent (text layout?!) attributes should be encoded into the String rather than kept as text attributes.

Michael
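Michael's blow-up is directly observable in CPython, whose PEP 393 strings behave like the proposed ByteString/TwoByteString/FourByteString split: appending a single wide dash (U+2014, which is inside the BMP) promotes a 1-byte-per-character string to 2 bytes per character, not 4. A quick, CPython-specific demonstration (exact byte counts vary by interpreter version, so only the rough doubling is shown):

```python
import sys

ascii_text = "a" * 1000
dashed_text = ascii_text + "\u2014"  # one U+2014 (em dash) appended

# The ASCII string stores 1 byte per character; adding one BMP
# character promotes the whole string to 2 bytes per character,
# roughly doubling its memory footprint.
print(sys.getsizeof(ascii_text))
print(sys.getsizeof(dashed_text))
```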