New Win32 VM [m17n testers needed]

Re: Image as a database (was Re: UTF8 Squeak)

Lukas Renggli
> I admit that I came from VW, where I'm running quite a number of web apps
> on images which also serve as the sole database, and that just works,
> reliably and fast.
>
> Now I'm thinking of doing the same in Squeak, that is, using the Squeak
> image as a database, fast and reliable. Am I being too naive?

A lot of people are doing that with Squeak as well. For example, all my
Pier instances use the image as a database. It was the same for
SmallWiki.

It is certainly fast. The reliability depends on the VM you are using.
Some VMs are known to crash if they have been started with the wrong
parameters or if they grow beyond a certain amount of memory.

Lukas

--
Lukas Renggli
http://www.lukas-renggli.ch


RE: UTF8 Squeak

Alan L. Lovejoy
In reply to this post by Philippe Marschall
<Philippe>Well, there is what the evil language with J does: UCS-2
everywhere, no excuses. This is a bit awkward for characters outside the BMP
(which are rarer than unicorns), but IIRC the astral planes didn't exist
when it was created. So you could argue for UCS-4. Yes, it's twice the size,
but who really cares? If you could get rid of all the size hacks in Squeak
that were cool in the '70s, would you?</Philippe>

Note: UTF-32 and UCS-4 are different names for the same thing [Reference:
http://en.wikipedia.org/wiki/UTF-32]

There is no one solution that is good enough for all use cases.

UTF-32 is fast for indexed character reading/writing. It also comprehensively
covers the entire Unicode Universal Character Set--not just the characters in
the Basic Multilingual Plane. But it is not very space efficient.
[Reference: http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters.]

Although you could have a different subclass of String for each encoding,
that's a poor use of inheritance.  It's better to have a single String class
that uses an associated Strategy object, stored in one of the instance
variables of a String--the other holding a ByteArray containing the encoded
characters. The CharacterEncoding class would have a subclass for each
different encoding.  The ByteArray would hold the String's data, whose
character content would be interpreted by the Strategy object (an instance
of CharacterEncoding.)

To achieve semantic unification across any and all character encodings, the
rule would be that when a Character object is reified from a String, it
always uses the Unicode code point ("integer code value.")  And when a
Character is "put:" into a String, its canonical (Unicode) code point is
translated to be correct for that String's encoding.  Both conversions would
be the responsibility of the String's Strategy object (an instance of
CharacterEncoding.)
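
To make this concrete, here is a minimal sketch of the shape I have in
mind (the class and method names below are illustrative only, not an
existing Squeak API):

  Object subclass: #EncodedString
      instanceVariableNames: 'encoding bytes'
      classVariableNames: ''
      poolDictionaries: ''
      category: 'Illustration-Strings'.

  EncodedString >> at: index
      "Reifying a Character always answers its Unicode code point,
       whatever the underlying encoding is."
      ^ Character value: (encoding codePointAt: index in: bytes)

  EncodedString >> at: index put: aCharacter
      "Translate the canonical (Unicode) code point into this string's
       own encoding before storing it."
      encoding at: index in: bytes putCodePoint: aCharacter asInteger.
      ^ aCharacter

A concrete strategy such as Latin1Encoding would answer #codePointAt:in:
by just reading the byte (Latin-1 bytes are their own Unicode code
points), while a UTF8Encoding would decode a multi-byte sequence.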

This implementation architecture lets each application (or
package/module/code-library) choose the encoding that best suits its use
case, but prevents character code mapping errors when characters are copied
between Strings whose encodings are not the same.

In the case of the variable-byte encodings, it might be possible to achieve
significant performance improvements by having the CharacterEncoding
instance store information that helps to more quickly translate between
logical character indices and physical byte indices within the String's
ByteArray (the RunArray of a Text is a good analogy for what I have in mind
here.)
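
As a sketch of that idea (again illustrative, assuming a hypothetical
UTF8Encoding strategy that keeps a 'checkpoints' array alongside the
bytes):

  UTF8Encoding >> byteIndexOf: charIndex in: bytes
      "checkpoints caches the byte offset of every 32nd character, so
       a random access scans forward over at most 31 characters."
      | byteIndex |
      byteIndex := checkpoints at: charIndex - 1 // 32 + 1.
      charIndex - 1 \\ 32 timesRepeat:
          [byteIndex := byteIndex
              + (self sequenceLengthAt: byteIndex in: bytes)].
      ^ byteIndex

  UTF8Encoding >> sequenceLengthAt: byteIndex in: bytes
      "Answer how many bytes the UTF-8 sequence starting here takes,
       read off the high bits of its lead byte."
      | lead |
      lead := bytes at: byteIndex.
      lead < 16r80 ifTrue: [^ 1].
      lead < 16rE0 ifTrue: [^ 2].
      lead < 16rF0 ifTrue: [^ 3].
      ^ 4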

--Alan




Re: Image as a database (was Re: UTF8 Squeak)

Ramiro Diaz Trepat
In reply to this post by Lukas Renggli
It works if you run applications that don't need to execute transactions
and that don't care a lot about losing some of the objects if the
application crashes, and if you also find a nice way to share the
instances of your object model among different running images (it is very
common with Seaside to run several images).  Although I have never used
VW, it is probably not much different, since Seaside apps can be pretty
stable.
But if you need to execute "persistent" transactions, then take a look
at Magma, Glorp, GOODS or even the free version of GemStone.



r.




Re: UTF8 Squeak

tblanchard
In reply to this post by K. K. Subramaniam
FWIW, I believe this is how NSString works on the Mac, and it has been Unicode-capable for a very long time.

On Jun 9, 2007, at 1:54 PM, Alan Lovejoy wrote:

> It's better to have a single String class that uses an associated
> Strategy object, stored in one of the instance variables of a
> String--the other holding a ByteArray containing the encoded
> characters.





Re: UTF8 Squeak

Janko Mivšek
In reply to this post by Janko Mivšek
Janko Mivšek wrote:
> I would propose a hybrid solution: three subclasses of String:
>
> 1. ByteString for ASCII (native English speakers)
> 2. TwoByteString for most other languages
> 3. FourByteString (WideString) for Japanese/Chinese and others

Let me be more exact about that proposal:

This is for internal representation only; for interfacing with the
external world we need to convert to/from (at least) the UTF-8
representation.

1. ByteString for ASCII (English) and ISO-8859-1 (Western Europe).
    ByteString is therefore always regarded as encoded in the ISO-8859-1
    codepage, which coincides with the first 256 Unicode code points (1).

2. TwoByteString for East European Latin, Greek, Cyrillic and many more
    (the so-called Basic Multilingual Plane (2)). The encoding of that
    string would correspond to UCS-2, even though it is considered
    obsolete (3).

3. FourByteString for Chinese/Japanese/Korean and some others. The
    encoding of that string would therefore correspond to UCS-4/UTF-32 (4).


I think that this way we can achieve the most efficient yet fast support
for all languages in the world. Because of their fixed character width,
those strings are also easy to manipulate, in contrast to variable-length
UTF-8 ones.

Conversion to/from UTF-8 could probably also be made simple with the help
of bit-arithmetic algorithms, tailored differently for each of the three
proposed string subclasses above.
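
For illustration, the core of such a decode for one character is short
(a sketch only; the selector is made up):

  decodeUtf8At: i in: bytes
      "Answer the Unicode code point of the sequence starting at byte i,
       assuming well-formed UTF-8."
      | b |
      b := bytes at: i.
      b < 16r80 ifTrue: [^ b].
      b < 16rE0 ifTrue:
          [^ ((b bitAnd: 16r1F) bitShift: 6)
              + ((bytes at: i + 1) bitAnd: 16r3F)].
      b < 16rF0 ifTrue:
          [^ ((b bitAnd: 16r0F) bitShift: 12)
              + (((bytes at: i + 1) bitAnd: 16r3F) bitShift: 6)
              + ((bytes at: i + 2) bitAnd: 16r3F)].
      ^ ((b bitAnd: 16r07) bitShift: 18)
          + (((bytes at: i + 1) bitAnd: 16r3F) bitShift: 12)
          + (((bytes at: i + 2) bitAnd: 16r3F) bitShift: 6)
          + ((bytes at: i + 3) bitAnd: 16r3F)

The width of the decoded code point then tells you which of the three
proposed subclasses the resulting string needs.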


(1) Wikipedia Unicode: Storage, transfer, and processing
http://en.wikipedia.org/wiki/Unicode#Storage.2C_transfer.2C_and_processing
(2) Wikipedia Basic Multilingual Plane
     http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
(3) Wikipedia UTF-16/UCS-2:
     http://en.wikipedia.org/wiki/UCS-2
(4) Wikipedia UTF-32/UCS-4
     http://en.wikipedia.org/wiki/UTF-32/UCS-4

Best regards
Janko

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si


Re: UTF8 Squeak

Colin Putney

On Jun 10, 2007, at 3:55 AM, Janko Mivšek wrote:

> I think that this way we can achieve the most efficient yet fast
> support for all languages in the world. Because of their fixed
> character width, those strings are also easy to manipulate, in
> contrast to variable-length UTF-8 ones.

"Most efficient yet fast" is a matter of perspective. For the apps I  
work on, UTF-8 is better than your scheme because space efficiency is  
more important than random access, and time spent encoding and  
decoding UTF-8 would dwarf time spent scanning for random access.

As soon as you try to support more than 256 characters, there are
trade-offs to be made. The "ideal" solution depends on your
application. How important is memory efficiency vs. time efficiency?
How about stream processing vs. random access? What format is your
input and output? Which characters do you need to support, and how
many of them are there?

A good string library will be flexible enough to allow its users to  
make those trade-offs according to the needs of the application.

> Conversion to/from UTF-8 could probably also be made simple with the
> help of bit-arithmetic algorithms, tailored differently for each of
> the three proposed string subclasses above.

Yes, a couple of well-designed primitives would help quite a bit.

Colin

Re: UTF8 Squeak

K. K. Subramaniam
In reply to this post by K. K. Subramaniam
On Saturday 09 June 2007 6:59 am, Yoshiki Ohshima wrote:
>   It is incomplete in many ways.  Sure.  But that wasn't the issue you
> were raising; you were talking about the interface between the image
> and VM but the hard part.
If multilingual support is implemented in a platform-specific fashion, then it
will be incomplete until all platform variations are taken into account.
>> a) Use Unicode chars in literals and text fields.
> You can do this already.
I didn't know this. Neither typing it nor cut-and-paste works for me (VM
3.7.7 on Linux). I am able to use U+0Cxx characters fine in other Linux apps
(UTF-8 aware). Am I missing something in my Squeak setup?
>   We have been doing this many years already.  What we can't do is to
> display Indic characters yet (which will be solved very soon).
Is it working on Linux, for Kannada? I have some public schools that are
waiting for Kannada support. Is there anything I can do to speed it up?
>> f) See 'current language' indicator in input fields.
> What do you mean by "input fields"?
Essentially, Morphs that accept text input. In X/KDE/GNOME, there is a
keyboard layout indicator that shows what keyboard layout is in effect. But
how can a user get hints when Squeak runs fullscreen or in console mode?

Regards .. Subbu


Re: UTF8 Squeak

Andreas.Raab
In reply to this post by Janko Mivšek
Hi Janko -

Just as a comment from the sidelines, I think that concentrating on the
size of the character in an encoding is a mistake. It is really the
encoding that matters, and if it weren't impractical I would rename
ByteString to Latin1String and WideString to UTF32String or so.

This makes it much clearer that we are interested more in the encoding
than in the number of bytes per character (although of course some
encodings imply a character size), and this "encoding-driven" view of
strings makes it perfectly natural to think of a UTF8String which has a
variable-sized encoding and can live in perfect harmony with the other
"byte-encoded strings".

In your case, I would rather suggest having a class UTF16String instead
of TwoByteString. A good starting point (if you are planning to spend
any time on this) would be to create a class EncodedString which
captures the basics of conversion between differently encoded strings,
and to start defining a few (trivial) subclasses like those mentioned
above. From there, you could extend this to UTF-8, UTF-16 and whatever
other encodings you need.
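
As a sketch of that starting point (illustrative only; #codePoints and
#fromCodePoints: are assumed helper methods, not existing ones):

  ArrayedCollection variableByteSubclass: #EncodedString
      instanceVariableNames: ''
      classVariableNames: ''
      poolDictionaries: ''
      category: 'Illustration-Strings'.

  "Trivial subclasses, declared the same way:
   Latin1String, UTF8String, UTF16String, UTF32String."

  EncodedString >> asEncoding: anEncodedStringClass
      "Generic (slow but always correct) conversion: decode to Unicode
       code points and let the target class re-encode them. Concrete
       pairs of encodings can override this with something faster."
      ^ anEncodedStringClass fromCodePoints: self codePoints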

Cheers,
   - Andreas



Re: Image as a database (was Re: UTF8 Squeak)

Michael van der Gulik-2
In reply to this post by Janko Mivšek


On 6/10/07, Janko Mivšek <[hidden email]> wrote:
> Hi Philippe,
>
> Philippe Marschall wrote:
> > 2007/6/9, Janko Mivšek <[hidden email]>:
> >> Philippe Marschall wrote:
> >> All of us who use the image as a database care about space efficiency,
> >> but on the other side we want all normal string operations to run on
> >> Unicode strings too.
> >
> > The image is not an efficient database. It stores all kinds of "crap"
> > like Morphs.

The crap is going to be there anyway if you're running some sort of live
image with a database attached. If you don't want the crap, then you can
remove it from the image or use a minimal image.

> Now I'm thinking of doing the same in Squeak, that is, using the Squeak
> image as a database, fast and reliable. Am I being too naive?

You have to have plans for when an image becomes corrupted.

I don't think it's too difficult to have the image automatically save
itself - say, for example, every 5 minutes, every hour, every day and
every week. That way, if an image becomes corrupted, you can go back to
the latest known working image.

If data is very important to your application, you could write some code
to log changes to a file (like squeak.changes does) and manually read
them back in when disaster occurs.
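
A sketch of such an autosave (Delay, #forkAt: and Smalltalk's
#snapshot:andQuit: are real Squeak messages; the five-minute policy and
the background process are just an example):

  [[(Delay forSeconds: 300) wait.
    Smalltalk snapshot: true andQuit: false] repeat]
      forkAt: Processor userBackgroundPriority.

Rotating the hourly/daily/weekly copies is then just a matter of copying
the saved .image file, for example from a cron job.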

Michael.



Re: Image as a database (was Re: UTF8 Squeak)

Philippe Marschall
2007/6/11, Michael van der Gulik <[hidden email]>:

> If data is very important to your application, you could write some code
> to log changes to a file (like squeak.changes does) and manually read
> them back in when disaster occurs.
AKA transaction log DIY.

Philippe




Re: UTF8 Squeak

K. K. Subramaniam
In reply to this post by Andreas.Raab
On Monday 11 June 2007 2:44 am, Andreas Raab wrote:
> Hi Janko -
>.... this "encoding-driven" view of
> strings makes it perfectly natural to think of an UTF8String which has a
> variable sized encoding and can live in perfect harmony with the other
> "byte encoded strings".
How about a UTF8Stream, since UTF-8 works best in a stream? The term String
has strong connotations of storage and indexing.
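
A sketch of what the reading side of such a stream could look like
(illustrative; byteStream is any ReadStream over the raw bytes):

  Utf8ReadStream >> next
      "Decode one UTF-8 sequence from byteStream and answer the
       corresponding Character, or nil at end."
      | b cp extra |
      byteStream atEnd ifTrue: [^ nil].
      b := byteStream next.
      b < 16r80 ifTrue: [^ Character value: b].
      b < 16rE0
          ifTrue: [cp := b bitAnd: 16r1F. extra := 1]
          ifFalse: [b < 16rF0
              ifTrue: [cp := b bitAnd: 16r0F. extra := 2]
              ifFalse: [cp := b bitAnd: 16r07. extra := 3]].
      extra timesRepeat:
          [cp := (cp bitShift: 6) + (byteStream next bitAnd: 16r3F)].
      ^ Character value: cp

Indexing never comes up: the consumer only ever asks for the next
character, which is exactly where UTF-8 is comfortable.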

Regards .. Subbu


Re: UTF8 Squeak

Klaus D. Witzel
On Mon, 11 Jun 2007 09:15:16 +0200, subbukk wrote:
> On Monday 11 June 2007 2:44 am, Andreas Raab wrote:
>> Hi Janko -
>> .... this "encoding-driven" view of
>> strings makes it perfectly natural to think of an UTF8String which has a
>> variable sized encoding and can live in perfect harmony with the other
>> "byte encoded strings".
> How about UTF8Stream since UTF8 works best in a stream? The term String  
> has strong connotations of storage and indexing.

How about the ages-old Smalltalk solution: a decorated string is a Text.
The stringArray and the runArray of a Text together determine the face of
each character. Then you'd have UTF8Text, the stringArray of which can
contain plain ASCII if there are no other characters in the UTF8Text. And
Dan's double-dispatch concept can be employed for the match/replace/etc.
functions.

Just my CHF 0.05

/Klaus





Re: UTF8 Squeak

Janko Mivšek
In reply to this post by Andreas.Raab
Hi Andreas,

Let me start with a statement that Unicode is a generalization of ASCII.
ASCII has code points below 128 and therefore always fits in one byte,
while Unicode code points can be 2, 3 or even 4 bytes wide.

No one treats ASCII strings as ASCII-"encoded", therefore no one should
treat Unicode strings as encoded either. And this is the idea behind my
proposal - to have Unicode strings as collections of character code
points, with different byte widths.

Unicode actually starts with ASCII, then Latin 1 (ISO-8859-1), both of
which fit in one byte. ByteStrings which contain plain ASCII are
therefore already Unicode! The same goes for Latin 1 ones. It is
therefore only natural to extend Unicode from byte to two- and four-byte
strings to cover all code points. For a user, such a string is still a
string, just as it was when it was plain ASCII. This approach is
therefore also the most consistent one.

When we are talking about Unicode "encodings" we mean UTF (Unicode
Transformation Format). There are UTF-8, UTF-16 and UTF-32. The first
two are variable-length formats, which means that the character count is
not the same as the byte count and cannot simply be calculated from it.
Each character may take 1, 2, 3 or 4 bytes depending on the width of its
code point.

Because of their variable length, those encodings are not useful for
general string manipulation but just for communication and storage.
String manipulation would be very inefficient (just consider the speed
of #size, which is used everywhere).
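
To see why: with a variable-length encoding, #size has to scan the whole
byte array, for example by counting the bytes that are not UTF-8
continuation bytes (a sketch):

  sizeOfUtf8: bytes
      "A character starts at every byte that does not look like
       10xxxxxx."
      | n |
      n := 0.
      bytes do: [:b | (b bitAnd: 16rC0) = 16r80 ifFalse: [n := n + 1]].
      ^ n

With fixed-width strings, #size stays what it is today: the byte size,
or the byte size divided by 2 or 4.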

I would therefore use strings with pure Unicode content internally and
put all encoding/decoding on the periphery of the image - at the
interfaces to the external world. As Subbukk already suggested, we could
put that into a UTF8Stream.

VW and GemStone also keep encodings out of the string, in separate
Encoders and EncodedStreams. They are also deprecating the use of
encoded byte strings like ISO88591String, MACString etc. Why then should
we introduce them to Squeak now?

UTF-8 encoding/decoding is very efficient by design, therefore we must
make it efficient in Squeak too. It should be almost as fast as a simple
copy.

And those who still want UTF-8 encoded strings can store them in a plain
ByteString anyway...

I hope this clarifies my ideas a bit.

Best regards
Janko



--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si


Re: UTF8 Squeak

Janko Mivšek
In reply to this post by Colin Putney
Hi Colin,

Colin Putney wrote:

> On Jun 10, 2007, at 3:55 AM, Janko Mivšek wrote:
>
>> I think that this way we can achieve the most efficient yet fast
>> support for all languages in the world. Because of their fixed
>> character width, those strings are also easy to manipulate, in
>> contrast to variable-length UTF-8 ones.
>
> "Most efficient yet fast" is a matter of perspective. For the apps I
> work on, UTF-8 is better than your scheme because space efficiency is
> more important than random access, and time spent encoding and decoding
> UTF-8 would dwarf time spent scanning for random access.

Anyone can definitely stay with UTF-8 encoded strings in a plain
ByteString, or subclass to UTF8String by himself. But I don't know why
we need to have UTF8String as part of the string framework. Just because
of its meaning? Then we would also need to introduce an ASCIIString :)

> As soon as you try to support more than 256 characters, there are
> trade-offs to be made. The "ideal" solution depends on your application.
> How important is memory efficiency vs. time efficiency? How about
> stream processing vs. random access? What format is your input and
> output? Which characters do you need to support, and how many of them
> are there?
>
> A good string library will be flexible enough to allow its users to
> make those trade-offs according to the needs of the application.

I think that preserving simplicity is also an important goal. We need to
find a general yet simple solution for Unicode strings which will be
good enough for most uses, as is the case for numbers, for instance. We
can deal with the more special cases separately. I claim that pure
Unicode strings in a ByteString, TwoByteString or FourByteString are
such general support; UTF8String is already a specific one.
>> Conversion to/from UTF-8 could probably also be made simple with the
>> help of bit-arithmetic algorithms, tailored differently for each of
>> the three proposed string subclasses above.
>
> Yes, a couple of well-designed primitives would help quite a bit.

I have studied UTF-8 conversion and it is designed to be efficient,
almost as fast as a plain copy. I have already written those conversion
methods and am now preparing benchmarks. If conversion really is as fast
as a copy, then there aren't many arguments left against always
converting to inner Unicode by default, are there?
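
Such a benchmark can be as simple as (a sketch; #utf8Decoded stands for
whichever conversion method is under test and is not an existing
selector):

  | ascii |
  ascii := String new: 1000000 withAll: $a.
  Transcript show: 'copy:   ',
      (Time millisecondsToRun: [ascii copy]) printString, ' ms'; cr.
  Transcript show: 'decode: ',
      (Time millisecondsToRun: [ascii utf8Decoded]) printString, ' ms'; cr.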

Best regards
Janko


--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si


Re: UTF8 Squeak

Bert Freudenberg
In reply to this post by Janko Mivšek
So except for the missing 16-bit optimization this is exactly what we
have now, right? So what is the actual proposal?

- Bert -




Re: UTF8 Squeak

Janko Mivšek
Bert Freudenberg wrote:
> So except for the missing 16-bit optimization this is exactly what we
> have now, right? So what is the actual proposal?

Exactly. There is already a WideString, and my proposal is just to
introduce a TwoByteString and rename WideString to FourByteString for
consistency.

That way we'll cover all Unicode strings as efficiently as possible,
while keeping them manageable with the usual string manipulations.

But the main point of my proposal is to treat internal strings as
Unicode, only Unicode and nothing else. All other encodings must be
converted to Unicode at the borders of the image. Those conversions
could be done with separate Encoders or EncodedStreams.

It seems that this was already Yoshiki's idea with WideString, so I'm
just extending that idea with a TwoByteString to cover 16 bits too.

Yoshiki, am I right?

Best regards
Janko



--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si


Re: UTF8 Squeak

NorbertHartl
In reply to this post by Janko Mivšek
On Mon, 2007-06-11 at 12:35 +0200, Janko Mivšek wrote:

> I hope this clarifies my ideas a bit.
Yes, absolutely. And this time I'd like to fully agree in public :)

Norbert



Re: UTF8 Squeak

Colin Putney
In reply to this post by Janko Mivšek

On Jun 11, 2007, at 4:04 AM, Janko Mivšek wrote:

> Anyone can definitely stay with UTF-8 encoded strings in a plain
> ByteString, or subclass to UTF8String by himself. But I don't know
> why we need to have UTF8String as part of the string framework. Just
> because of its meaning? Then we would also need to introduce an
> ASCIIString :)

> I think that preserving simplicity is also an important goal. We need
> to find a general yet simple solution for Unicode strings which will
> be good enough for most uses, as is the case for numbers, for
> instance. We can deal with the more special cases separately. I claim
> that pure Unicode strings in a ByteString, TwoByteString or
> FourByteString are such general support; UTF8String is already a
> specific one.

Ok, so what you're saying is this: ByteString, TwoByteString and
FourByteString are good enough for most purposes. Web developers and
anyone else who needs to work with other encodings should roll their
own solutions, so as not to burden the rest of the community with
clutter caused by support for other encodings, or even hooks to make
such things easy to integrate with the base string code.

Is that a fair characterization of your position?

Colin
Reply | Threaded
Open this post in threaded view
|

Re: UTF8 Squeak

Yoshiki Ohshima
In reply to this post by Janko Mivšek
  Janko,

> It seems that this was already Yoshiki's idea with WideString, so I'm
> just extending that idea with a TwoByteString to cover 16 bits too.
>
> Yoshiki, am I right?

  For storing the bare Unicode code points, I think so.  I'm not
convinced that adding a 16-bit variation solves any real problems.  But
there may be something.

  My first few questions are:

  - While the vast majority of strings for, say, Japanese can be
    represented with the characters in the BMP, you would use
    FourByteString for Chinese/Japanese/Korean and some others.  Does
    this mean that you would *always* use FourByteString for these
    "languages" (and not scripts)?

  - Suppose you would like to use different line-wrapping algorithms
    for different languages; how would you keep that information?
-- Yoshiki


Re: UTF8 Squeak

Michael Rueger-4
Yoshiki Ohshima wrote:

>   Janko,
>
>> It seems that this was already Yoshiki's idea with WideString, so I'm
>> just extending that idea with a TwoByteString to cover 16 bits too.
>>
>> Yoshiki, am I right?
>
>   For storing the bare Unicode code points, I think so.  I'm not
> convinced that adding a 16-bit variation solves any real problems.  But
> there may be something.

A lot of text is basically 8-bit, *except* for the occasional wide dash
etc., blowing the text up to 32 bits per character although 16 would be
more than enough.

>   - Suppose you would like to use different line-wrapping algorithms
>     for different languages; how would you keep that information?

The question is which, if any, language-dependent (text layout?!)
attributes should be encoded into the String rather than kept as text
attributes.

Michael
