New Win32 VM [m17n testers needed]

Re: Image as a database (was Re: UTF8 Squeak)

Lukas Renggli
> I admit that I came from VW, where I'm running quite a number of web apps
> on images which also serve as the sole database, and that just works,
> reliably and fast.
>
> Now I'm thinking of doing the same in Squeak, that is, using the Squeak
> image as a database, fast and reliable. Am I being too naive?

A lot of people are doing that with Squeak as well. For example, all my
Pier instances use the image as a database. It was the same for
SmallWiki.

It is certainly fast. The reliability depends on the VM you are using.
Some VMs are known to crash if they have been started with the wrong
parameters or if they grow beyond a certain amount of memory.

Lukas

--
Lukas Renggli
http://www.lukas-renggli.ch


RE: UTF8 Squeak

Alan L. Lovejoy
In reply to this post by Philippe Marschall
<Philippe>Well, there is what the evil language with J does: UCS-2
everywhere, no excuses. This is a bit awkward for characters outside the BMP
(which are rarer than unicorns), but IIRC the astral planes didn't exist
when it was created. So you could argue for UCS-4. Yes, it's twice the size,
but who really cares? If you could get rid of all the size hacks in Squeak
that were cool in the '70s, would you?</Philippe>

Note: UTF-32 and UCS-4 are different names for the same thing [Reference:
http://en.wikipedia.org/wiki/UTF-32]

There is no one solution that is good enough for all use cases.

UTF-32 is fast for indexed character reading/writing. It also comprehensively
covers the entire Unicode Universal Character Set--not just the characters in
the Basic Multilingual Plane. But it is not very space efficient.
[Reference: http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters.]

Although you could have a different subclass of String for each encoding,
that's a poor use of inheritance.  It's better to have a single String class
that uses an associated Strategy object, stored in one of the instance
variables of a String--the other holding a ByteArray containing the encoded
characters. The CharacterEncoding class would have a subclass for each
different encoding.  The ByteArray would hold the String's data, whose
character content would be interpreted by the Strategy object (an instance
of CharacterEncoding.)

To achieve semantic unification across any and all character encodings, the
rule would be that when a Character object is reified from a String, it
always uses the Unicode code point ("integer code value.")  And when a
Character is "put:" into a String, its canonical (Unicode) code point is
translated to be correct for that String's encoding.  Both conversions would
be the responsibility of the String's Strategy object (an instance of
CharacterEncoding.)
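
To make this concrete, here is a minimal sketch of the shape I have in
mind (the class and method names below are illustrative only, not an
existing Squeak API):

  Object subclass: #EncodedString
      instanceVariableNames: 'encoding bytes'
      classVariableNames: ''
      poolDictionaries: ''
      category: 'Illustration-Strings'.

  EncodedString >> at: index
      "Reifying a Character always answers its Unicode code point,
       whatever the underlying encoding is."
      ^ Character value: (encoding codePointAt: index in: bytes)

  EncodedString >> at: index put: aCharacter
      "Translate the canonical (Unicode) code point into this string's
       own encoding before storing it."
      encoding at: index in: bytes putCodePoint: aCharacter asInteger.
      ^ aCharacter

A concrete strategy such as Latin1Encoding would answer #codePointAt:in:
by just reading the byte (Latin-1 bytes are their own Unicode code
points), while a UTF8Encoding would decode a multi-byte sequence.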

This implementation architecture lets each application (or
package/module/code-library) choose the encoding that best suits its use
case, but prevents character code mapping errors when characters are copied
between Strings whose encodings are not the same.

In the case of the variable-byte encodings, it might be possible to achieve
significant performance improvements by having the CharacterEncoding
instance store information that helps to more quickly translate between
logical character indices and physical byte indices within the String's
ByteArray (the RunArray of a Text is a good analogy for what I have in mind
here.)
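
As a sketch of that idea (again illustrative, assuming a hypothetical
UTF8Encoding strategy that keeps a 'checkpoints' array alongside the
bytes):

  UTF8Encoding >> byteIndexOf: charIndex in: bytes
      "checkpoints caches the byte offset of every 32nd character, so
       a random access scans forward over at most 31 characters."
      | byteIndex |
      byteIndex := checkpoints at: charIndex - 1 // 32 + 1.
      charIndex - 1 \\ 32 timesRepeat:
          [byteIndex := byteIndex
              + (self sequenceLengthAt: byteIndex in: bytes)].
      ^ byteIndex

  UTF8Encoding >> sequenceLengthAt: byteIndex in: bytes
      "Answer how many bytes the UTF-8 sequence starting here takes,
       read off the high bits of its lead byte."
      | lead |
      lead := bytes at: byteIndex.
      lead < 16r80 ifTrue: [^ 1].
      lead < 16rE0 ifTrue: [^ 2].
      lead < 16rF0 ifTrue: [^ 3].
      ^ 4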

--Alan




Re: Image as a database (was Re: UTF8 Squeak)

Ramiro Diaz Trepat
In reply to this post by Lukas Renggli
It works if you run applications that don't need to execute transactions
and that don't care a lot about losing some of the objects if the
application crashes, and if you also find a nice way to share the
instances of your object model among different running images (it is very
common with Seaside to run several images).  Although I have never used
VW, it is probably not much different, since Seaside apps can be pretty
stable.
But if you need to execute "persistent" transactions, then take a look
at Magma, Glorp, GOODS or even the free version of GemStone.



r.




Re: UTF8 Squeak

tblanchard
In reply to this post by K. K. Subramaniam
FWIW, I believe this is how NSString works on the Mac, and it has been Unicode-capable for a very long time.

On Jun 9, 2007, at 1:54 PM, Alan Lovejoy wrote:

> It's better to have a single String class that uses an associated
> Strategy object, stored in one of the instance variables of a
> String--the other holding a ByteArray containing the encoded
> characters.





Re: UTF8 Squeak

Janko Mivšek
In reply to this post by Janko Mivšek
Janko Mivšek wrote:
> I would propose a hybrid solution: three subclasses of String:
>
> 1. ByteString for ASCII (native English speakers)
> 2. TwoByteString for most other languages
> 3. FourByteString (WideString) for Japanese/Chinese and others

Let me be more exact about that proposal:

This is for internal representation only; for interfacing with the
external world we need to convert to/from (at least) the UTF-8
representation.

1. ByteString for ASCII (English) and ISO-8859-1 (Western Europe).
    ByteString is therefore always regarded as encoded in the ISO-8859-1
    codepage, which coincides with the first 256 Unicode code points (1).

2. TwoByteString for East European Latin, Greek, Cyrillic and many more
    (the so-called Basic Multilingual Plane (2)). The encoding of that
    string would correspond to UCS-2, even though it is considered
    obsolete (3).

3. FourByteString for Chinese/Japanese/Korean and some others. The
    encoding of that string would therefore correspond to UCS-4/UTF-32 (4).


I think that this way we can achieve the most efficient yet fast support
for all languages in the world. Because of their fixed character width,
those strings are also easy to manipulate, in contrast to variable-length
UTF-8 ones.

Conversion to/from UTF-8 could probably also be made simple with the help
of bit-arithmetic algorithms, tailored differently for each of the three
proposed string subclasses above.
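
For illustration, the core of such a decode for one character is short
(a sketch only; the selector is made up):

  decodeUtf8At: i in: bytes
      "Answer the Unicode code point of the sequence starting at byte i,
       assuming well-formed UTF-8."
      | b |
      b := bytes at: i.
      b < 16r80 ifTrue: [^ b].
      b < 16rE0 ifTrue:
          [^ ((b bitAnd: 16r1F) bitShift: 6)
              + ((bytes at: i + 1) bitAnd: 16r3F)].
      b < 16rF0 ifTrue:
          [^ ((b bitAnd: 16r0F) bitShift: 12)
              + (((bytes at: i + 1) bitAnd: 16r3F) bitShift: 6)
              + ((bytes at: i + 2) bitAnd: 16r3F)].
      ^ ((b bitAnd: 16r07) bitShift: 18)
          + (((bytes at: i + 1) bitAnd: 16r3F) bitShift: 12)
          + (((bytes at: i + 2) bitAnd: 16r3F) bitShift: 6)
          + ((bytes at: i + 3) bitAnd: 16r3F)

The width of the decoded code point then tells you which of the three
proposed subclasses the resulting string needs.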


(1) Wikipedia Unicode: Storage, transfer, and processing
http://en.wikipedia.org/wiki/Unicode#Storage.2C_transfer.2C_and_processing
(2) Wikipedia Basic Multilingual Plane
     http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
(3) Wikipedia UTF-16/UCS-2:
     http://en.wikipedia.org/wiki/UCS-2
(4) Wikipedia UTF-32/UCS-4
     http://en.wikipedia.org/wiki/UTF-32/UCS-4

Best regards
Janko

--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si


Re: UTF8 Squeak

Colin Putney

On Jun 10, 2007, at 3:55 AM, Janko Mivšek wrote:

> I think that this way we can achieve the most efficient yet fast
> support for all languages in the world. Because of their fixed
> character width, those strings are also easy to manipulate, in
> contrast to variable-length UTF-8 ones.

"Most efficient yet fast" is a matter of perspective. For the apps I  
work on, UTF-8 is better than your scheme because space efficiency is  
more important than random access, and time spent encoding and  
decoding UTF-8 would dwarf time spent scanning for random access.

As soon as you try to support more than 256 characters, there are
trade-offs to be made. The "ideal" solution depends on your
application. How important is memory efficiency vs. time efficiency?
How about stream processing vs. random access? What format is your
input and output? Which characters do you need to support, and how
many of them are there?

A good string library will be flexible enough to allow its users to  
make those trade-offs according to the needs of the application.

> Conversion to/from UTF-8 could probably also be made simple with the
> help of bit-arithmetic algorithms, tailored differently for each of
> the three proposed string subclasses above.

Yes, a couple of well-designed primitives would help quite a bit.

Colin

Re: UTF8 Squeak

K. K. Subramaniam
In reply to this post by K. K. Subramaniam
On Saturday 09 June 2007 6:59 am, Yoshiki Ohshima wrote:
>   It is incomplete in many ways.  Sure.  But that wasn't the issue you
> were raising; you were talking about the interface between the image
> and VM but the hard part.
If multilingual support is implemented in a platform-specific fashion, then it
will be incomplete until all platform variations are taken into account.
>> a) Use Unicode chars in literals and text fields.
> You can do this already.
I didn't know this. Neither typing it nor cut-and-paste works for me (VM
3.7.7 on Linux). I am able to use U+0Cxx characters fine in other Linux apps
(UTF-8 aware). Am I missing something in my Squeak setup?
>   We have been doing this many years already.  What we can't do is to
> display Indic characters yet (which will be solved very soon).
Is it working on Linux, for Kannada? I have some public schools that are
waiting for Kannada support. Is there anything I can do to speed it up?
>> f) See 'current language' indicator in input fields.
> What do you mean by "input fields"?
Essentially, Morphs that accept text input. In X/KDE/GNOME, there is a
keyboard layout indicator that shows what keyboard layout is in effect. But
how can a user get hints when Squeak runs fullscreen or in console mode?

Regards .. Subbu


Re: UTF8 Squeak

Andreas.Raab
In reply to this post by Janko Mivšek
Hi Janko -

Just as a comment from the sidelines, I think that concentrating on the
size of the character in an encoding is a mistake. It is really the
encoding that matters, and if it weren't impractical I would rename
ByteString to Latin1String and WideString to UTF32String or so.

This makes it much clearer that we are interested more in the encoding
than in the number of bytes per character (although of course some
encodings imply a character size), and this "encoding-driven" view of
strings makes it perfectly natural to think of a UTF8String which has a
variable-sized encoding and can live in perfect harmony with the other
"byte-encoded strings".

In your case, I would rather suggest having a class UTF16String instead
of TwoByteString. A good starting point (if you are planning to spend
any time on this) would be to create a class EncodedString which
captures the basics of conversion between differently encoded strings,
and to start defining a few (trivial) subclasses like those mentioned
above. From there, you could extend this to UTF-8, UTF-16 and whatever
other encodings you need.
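
As a sketch of that starting point (illustrative only; #codePoints and
#fromCodePoints: are assumed helper methods, not existing ones):

  ArrayedCollection variableByteSubclass: #EncodedString
      instanceVariableNames: ''
      classVariableNames: ''
      poolDictionaries: ''
      category: 'Illustration-Strings'.

  "Trivial subclasses, declared the same way:
   Latin1String, UTF8String, UTF16String, UTF32String."

  EncodedString >> asEncoding: anEncodedStringClass
      "Generic (slow but always correct) conversion: decode to Unicode
       code points and let the target class re-encode them. Concrete
       pairs of encodings can override this with something faster."
      ^ anEncodedStringClass fromCodePoints: self codePoints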

Cheers,
   - Andreas



Re: Image as a database (was Re: UTF8 Squeak)

Michael van der Gulik-2
In reply to this post by Janko Mivšek


On 6/10/07, Janko Mivšek <[hidden email]> wrote:
> Hi Philippe,
>
> Philippe Marschall wrote:
> > 2007/6/9, Janko Mivšek <[hidden email]>:
> >> Philippe Marschall wrote:
> >> All of us who use the image as a database care about space efficiency,
> >> but on the other side we want all normal string operations to run on
> >> Unicode strings too.
> >
> > The image is not an efficient database. It stores all kinds of "crap"
> > like Morphs.

The crap is going to be there anyway if you're running some sort of live
image with a database attached. If you don't want the crap, then you can
remove it from the image or use a minimal image.

> Now I'm thinking of doing the same in Squeak, that is, using the Squeak
> image as a database, fast and reliable. Am I being too naive?

You have to have plans for when an image becomes corrupted.

I don't think it's too difficult to have the image automatically save
itself - say, for example, every 5 minutes, every hour, every day and
every week. That way, if an image becomes corrupted, you can go back to
the latest known working image.

If data is very important to your application, you could write some code
to log changes to a file (like squeak.changes does) and manually read
them back in when disaster occurs.
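
A sketch of such an autosave (Delay, #forkAt: and Smalltalk's
#snapshot:andQuit: are real Squeak messages; the five-minute policy and
the background process are just an example):

  [[(Delay forSeconds: 300) wait.
    Smalltalk snapshot: true andQuit: false] repeat]
      forkAt: Processor userBackgroundPriority.

Rotating the hourly/daily/weekly copies is then just a matter of copying
the saved .image file, for example from a cron job.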

Michael.



Re: Image as a database (was Re: UTF8 Squeak)

Philippe Marschall
2007/6/11, Michael van der Gulik <[hidden email]>:

> If data is very important to your application, you could write some code
> to log changes to a file (like squeak.changes does) and manually read
> them back in when disaster occurs.
AKA transaction log DIY.

Philippe




Re: UTF8 Squeak

K. K. Subramaniam
In reply to this post by Andreas.Raab
On Monday 11 June 2007 2:44 am, Andreas Raab wrote:
> Hi Janko -
>.... this "encoding-driven" view of
> strings makes it perfectly natural to think of an UTF8String which has a
> variable sized encoding and can live in perfect harmony with the other
> "byte encoded strings".
How about a UTF8Stream, since UTF-8 works best in a stream? The term String
has strong connotations of storage and indexing.
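
A sketch of what the reading side of such a stream could look like
(illustrative; byteStream is any ReadStream over the raw bytes):

  Utf8ReadStream >> next
      "Decode one UTF-8 sequence from byteStream and answer the
       corresponding Character, or nil at end."
      | b cp extra |
      byteStream atEnd ifTrue: [^ nil].
      b := byteStream next.
      b < 16r80 ifTrue: [^ Character value: b].
      b < 16rE0
          ifTrue: [cp := b bitAnd: 16r1F. extra := 1]
          ifFalse: [b < 16rF0
              ifTrue: [cp := b bitAnd: 16r0F. extra := 2]
              ifFalse: [cp := b bitAnd: 16r07. extra := 3]].
      extra timesRepeat:
          [cp := (cp bitShift: 6) + (byteStream next bitAnd: 16r3F)].
      ^ Character value: cp

Indexing never comes up: the consumer only ever asks for the next
character, which is exactly where UTF-8 is comfortable.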

Regards .. Subbu


Re: UTF8 Squeak

Klaus D. Witzel
On Mon, 11 Jun 2007 09:15:16 +0200, subbukk wrote:
> On Monday 11 June 2007 2:44 am, Andreas Raab wrote:
>> Hi Janko -
>> .... this "encoding-driven" view of
>> strings makes it perfectly natural to think of an UTF8String which has a
>> variable sized encoding and can live in perfect harmony with the other
>> "byte encoded strings".
> How about UTF8Stream since UTF8 works best in a stream? The term String  
> has strong connotations of storage and indexing.

How about the ages-old Smalltalk solution: a decorated string is a Text.
The stringArray and the runArray of a Text together determine the face of
each character. Then you'd have UTF8Text, the stringArray of which can
contain plain ASCII if there are no other characters in the UTF8Text. And
Dan's double-dispatch concept can be employed for the match/replace/etc.
functions.

Just my CHF 0.05

/Klaus





Re: UTF8 Squeak

Janko Mivšek
In reply to this post by Andreas.Raab
Hi Andreas,

Let me start with a statement that Unicode is a generalization of ASCII.
ASCII has code points below 128 and therefore always fits in one byte,
while Unicode code points can be 2, 3 or even 4 bytes wide.

No one treats ASCII strings as ASCII-"encoded", therefore no one should
treat Unicode strings as encoded either. And this is the idea behind my
proposal - to have Unicode strings as collections of character code
points, with different byte widths.

Unicode actually starts with ASCII, then Latin 1 (ISO-8859-1), both of
which fit in one byte. ByteStrings which contain plain ASCII are
therefore already Unicode! The same goes for Latin 1 ones. It is
therefore only natural to extend Unicode from byte to two- and four-byte
strings to cover all code points. For a user, such a string is still a
string, just as it was when it was plain ASCII. This approach is
therefore also the most consistent one.

When we are talking about Unicode "encodings" we mean UTF (Unicode
Transformation Format). There are UTF-8, UTF-16 and UTF-32. The first
two are variable-length formats, which means that the character count is
not the same as the byte count and cannot simply be calculated from it.
Each character may take 1, 2, 3 or 4 bytes depending on the width of its
code point.

Because of their variable length, those encodings are not useful for
general string manipulation but just for communication and storage.
String manipulation would be very inefficient (just consider the speed
of #size, which is used everywhere).
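
To see why: with a variable-length encoding, #size has to scan the whole
byte array, for example by counting the bytes that are not UTF-8
continuation bytes (a sketch):

  sizeOfUtf8: bytes
      "A character starts at every byte that does not look like
       10xxxxxx."
      | n |
      n := 0.
      bytes do: [:b | (b bitAnd: 16rC0) = 16r80 ifFalse: [n := n + 1]].
      ^ n

With fixed-width strings, #size stays what it is today: the byte size,
or the byte size divided by 2 or 4.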

I would therefore use strings with pure Unicode content internally and
put all encoding/decoding on the periphery of the image - at the
interfaces to the external world. As Subbukk already suggested, we could
put that into a UTF8Stream.

VW and GemStone also keep encodings out of the string, in separate
Encoders and EncodedStreams. They are also deprecating the use of
encoded byte strings like ISO88591String, MACString etc. Why then should
we introduce them to Squeak now?

UTF-8 encoding/decoding is very efficient by design, therefore we must
make it efficient in Squeak too. It should be almost as fast as a simple
copy.

And those who still want UTF-8 encoded strings can store them in a plain
ByteString anyway...

I hope this clarifies my ideas a bit.

Best regards
Janko



--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si


Re: UTF8 Squeak

Janko Mivšek
In reply to this post by Colin Putney
Hi Colin,

Colin Putney wrote:

> On Jun 10, 2007, at 3:55 AM, Janko Mivšek wrote:
>
>> I think that this way we can achieve the most efficient yet fast
>> support for all languages in the world. Because of their fixed
>> character width, those strings are also easy to manipulate, in
>> contrast to variable-length UTF-8 ones.
>
> "Most efficient yet fast" is a matter of perspective. For the apps I
> work on, UTF-8 is better than your scheme because space efficiency is
> more important than random access, and time spent encoding and decoding
> UTF-8 would dwarf time spent scanning for random access.

Anyone can definitely stay with UTF-8 encoded strings in a plain
ByteString, or subclass to UTF8String by himself. But I don't know why
we need to have UTF8String as part of the string framework. Just because
of its meaning? Then we would also need to introduce an ASCIIString :)

> As soon as you try to support more than 256 characters, there are
> trade-offs to be made. The "ideal" solution depends on your application.
> How important is memory efficiency vs. time efficiency? How about
> stream processing vs. random access? What format is your input and
> output? Which characters do you need to support, and how many of them
> are there?
>
> A good string library will be flexible enough to allow its users to
> make those trade-offs according to the needs of the application.

I think that preserving simplicity is also an important goal. We need to
find a general yet simple solution for Unicode strings which will be
good enough for most uses, as is the case for numbers, for instance. We
can deal with the more special cases separately. I claim that pure
Unicode strings in a ByteString, TwoByteString or FourByteString are
such general support; UTF8String is already a specific one.
>> Conversion to/from UTF-8 could probably also be made simple with the
>> help of bit-arithmetic algorithms, tailored differently for each of
>> the three proposed string subclasses above.
>
> Yes, a couple of well-designed primitives would help quite a bit.

I have studied UTF-8 conversion and it is designed to be efficient,
almost as fast as a plain copy. I have already written those conversion
methods and am now preparing benchmarks. If conversion really is as fast
as a copy, then there aren't many arguments left against always
converting to inner Unicode by default, are there?
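
Such a benchmark can be as simple as (a sketch; #utf8Decoded stands for
whichever conversion method is under test and is not an existing
selector):

  | ascii |
  ascii := String new: 1000000 withAll: $a.
  Transcript show: 'copy:   ',
      (Time millisecondsToRun: [ascii copy]) printString, ' ms'; cr.
  Transcript show: 'decode: ',
      (Time millisecondsToRun: [ascii utf8Decoded]) printString, ' ms'; cr.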

Best regards
Janko


--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si


Re: UTF8 Squeak

Bert Freudenberg
In reply to this post by Janko Mivšek
So except for the missing 16-bit optimization this is exactly what we
have now, right? So what is the actual proposal?

- Bert -




Re: UTF8 Squeak

Janko Mivšek
Bert Freudenberg wrote:
> So except for the missing 16-bit optimization this is exactly what we
> have now, right? So what is the actual proposal?

Exactly. There is already a WideString, and my proposal is just to
introduce a TwoByteString and rename WideString to FourByteString for
consistency.

That way we'll cover all Unicode strings as efficiently as possible,
while keeping them manageable with the usual string manipulations.

But the main point of my proposal is to treat internal strings as
Unicode, only Unicode and nothing else. All other encodings must be
converted to Unicode at the borders of the image. Those conversions
could be done with separate Encoders or EncodedStreams.

It seems that this was already Yoshiki's idea with WideString, so I'm
just extending that idea with a TwoByteString to cover 16 bits too.

Yoshiki, am I right?

Best regards
Janko



--
Janko Mivšek
AIDA/Web
Smalltalk Web Application Server
http://www.aidaweb.si


Re: UTF8 Squeak

NorbertHartl
In reply to this post by Janko Mivšek
On Mon, 2007-06-11 at 12:35 +0200, Janko Mivšek wrote:

> I hope this clarifies my ideas a bit.
Yes, absolutely. And this time I'd like to fully agree in public :)

Norbert



Re: UTF8 Squeak

Colin Putney
In reply to this post by Janko Mivšek

On Jun 11, 2007, at 4:04 AM, Janko Mivšek wrote:

> Anyone can definitely stay with UTF-8 encoded strings in a plain
> ByteString, or subclass to UTF8String by himself. But I don't know
> why we need to have UTF8String as part of the string framework. Just
> because of its meaning? Then we would also need to introduce an
> ASCIIString :)

> I think that preserving simplicity is also an important goal. We need
> to find a general yet simple solution for Unicode strings which will
> be good enough for most uses, as is the case for numbers, for
> instance. We can deal with the more special cases separately. I claim
> that pure Unicode strings in a ByteString, TwoByteString or
> FourByteString are such general support; UTF8String is already a
> specific one.

Ok, so what you're saying is this: ByteString, TwoByteString and
FourByteString are good enough for most purposes. Web developers and
anyone else who needs to work with other encodings should roll their
own solutions, so as not to burden the rest of the community with
clutter caused by support for other encodings, or even hooks to make
such things easy to integrate with the base string code.

Is that a fair characterization of your position?

Colin
Reply | Threaded
Open this post in threaded view
|

Re: UTF8 Squeak

Yoshiki Ohshima
In reply to this post by Janko Mivšek
  Janko,

> It seems that this was already Yoshiki's idea with WideString, so I'm
> just extending that idea with a TwoByteString to cover 16 bits too.
>
> Yoshiki, am I right?

  For storing the bare Unicode code points, I think so.  I'm not
convinced that adding a 16-bit variation solves any real problems.  But
there may be something.

  My first few questions are:

  - While the vast majority of strings for, say, Japanese can be
    represented with the characters in the BMP, you would use
    FourByteString for Chinese/Japanese/Korean and some others.  Does
    this mean that you would *always* use FourByteString for these
    "languages" (and not scripts)?

  - Suppose you would like to use different line-wrapping algorithms
    for different languages; how would you keep that information?
-- Yoshiki


Re: UTF8 Squeak

Michael Rueger-4
Yoshiki Ohshima wrote:

>   Janko,
>
>> It seems that this was already Yoshiki's idea with WideString, so I'm
>> just extending that idea with a TwoByteString to cover 16 bits too.
>>
>> Yoshiki, am I right?
>
>   For storing the bare Unicode code points, I think so.  I'm not
> convinced that adding a 16-bit variation solves any real problems.  But
> there may be something.

A lot of text is basically 8-bit, *except* for the occasional wide dash
etc., blowing the text up to 32 bits per character although 16 would be
more than enough.

>   - Suppose you would like to use different line-wrapping algorithms
>     for different languages; how would you keep that information?

The question is which, if any, language-dependent (text layout?!)
attributes should be encoded into the String rather than kept as text
attributes.

Michael
