[squeak-dev] WideString UTF-8, UTF-32, UCS2

Vladimir Pogorelenko

[squeak-dev] WideString UTF-8, UTF-32, UCS2

I'm trying to deal with different string encodings in my image.

I've read some related posts but didn't find direct answers.

For the test I took the Unicode word 'привет'. Entering this string from the keyboard, a Seaside web form, and a file stream, I got 2 different formats:

FIRST FORMAT: comes from Keyboard Input, Seaside with WAKomEncoded
WideString
1: 1087.
2: 1088.
...
What is the format of that String? I guess it's UTF-8.

SECOND FORMAT: comes from FileStream, FileIn, etc.
WideString
1: 1069548607.
2: 1069548608.
...
The same question: what is it? Is it UTF-32 or UCS2?

Both strings are displayed correctly, but I failed to compare them.

So the questions are:
1. How do I load data from files (e.g. FileStream) in the first format (UTF-8?)? I also need this for loading source code that contains Unicode Strings. Maybe I need to subclass UTF8TextConverter and call it UTF8ToUTF8TextConverter.
2. How do I set up WAKomEncoded and chars from the keyboard to come in the second format?
3. Which encoding should I choose as the base? What is the blueprint for it? I guess I just need to learn how to load data in the FIRST FORMAT and all will be OK.
4. How do I convert a WideString in the image from one format to another?

The Unicode problem is still alive here in Squeak :-) I wonder how some great products like CMSBox fight against it. Maybe they don't even need to load data from external streams.

I'm using a squeak-dev 3.9 image with UnicodeSupport installed (http://www.nabble.com/Re%3A--squeak-dev---ANN--3.10-final-is-out-p16182045.html) to input Unicode chars from the keyboard. I'm on a Mac. I don't even know what will happen when I try to run it under Windows.

Andreas.Raab

[squeak-dev] Re: WideString UTF-8, UTF-32, UCS2

Vladimir Pogorelenko wrote:
> FIRST FORMAT: comes from Keyboard Input, Seaside with WAKomEncoded
> WideString
> 1: 1087.
> 2: 1088.
> ...
> What is the format of that String? I guess it's UTF-8.

It's UTF-32/UCS-4.

> SECOND FORMAT: comes from FileStream, FileIn, etc.
> WideString
> 1: 1069548607.
> 2: 1069548608.
> ...
> The same question: what is it? Is it UTF-32 or UCS2?

It's UTF-32/UCS-4, too.

*Except* that it has a particular Squeak-idiosyncratic bit in it, the
"leading char". If you look at the hex values, you can see that:

1069548607 hex
=> '16r3FC0043F'
1087 hex
=> '16r43F'

So the base values are the same, only some high bits are different. Now
let's check this out:

(Character value: 1069548607) asUnicode
=> 1087
(Character value: 1087) asUnicode
=> 1087

(Character value: 1069548607) leadingChar
=> 255
(Character value: 1087) leadingChar
=> 0

So the difference is that one character is created with a different
"leadingChar" than the other one.

And therein lies the problem. Which is that the "leadingChar" is not a
part of the Unicode standard (and not exactly well-defined inside Squeak
either). So all conversions must be aware that they either have to strip
off the leadingChar or substitute it properly.
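
The packing described above can be checked with plain integer arithmetic in a workspace. This is only a minimal sketch, assuming the layout that the hex values suggest (an 8-bit leadingChar stored directly above a 22-bit character code); the variable names are illustrative and not part of any Squeak API.

```smalltalk
| tagged leading code |
tagged := 16r3FC0043F.                             "1069548607, the value that came from the file"
leading := (tagged bitShift: -22) bitAnd: 16rFF.   "assumed leadingChar field => 255"
code := tagged bitAnd: 16r3FFFFF.                  "assumed 22-bit character code => 16r43F = 1087"
(leading bitShift: 22) + code = tagged             "true: the two fields rebuild the tagged value"
```

Under that assumption, stripping the leadingChar is just masking off everything above the low 22 bits, which is why both values map to the same Unicode code point 1087.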

> 1. How do I load data from files (e.g. FileStream) in the first format
> (UTF-8?)? I also need this for loading source code that contains
> Unicode Strings. Maybe I need to subclass UTF8TextConverter and call
> it UTF8ToUTF8TextConverter.

First of all, get your terminology straight. If you think that the first
example was in UTF-8 you're missing something big-time. Your first
example cannot possibly be UTF-8 because it uses characters out of the
byte range. In fact, if we compare your examples:

(Character value: 1069548607) asString squeakToUtf8 asByteArray
=> a ByteArray(208 191)
(Character value: 1087) asString squeakToUtf8 asByteArray
=> a ByteArray(208 191)

You will find that both will end up with the identical encoding (not
surprisingly since there is simply no room to stick the leadingChar
anywhere into UTF-8).
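
To see where the two bytes 208 and 191 come from, the standard two-byte UTF-8 pattern (110xxxxx 10xxxxxx, used for code points from 16r80 to 16r7FF) can be applied by hand. This is a sketch with illustrative variable names; it does not show how #squeakToUtf8 is implemented.

```smalltalk
| cp lead trail |
cp := 16r43F.                                   "1087, Cyrillic small letter pe"
lead := 16rC0 bitOr: (cp bitShift: -6).         "110xxxxx with the top 5 bits => 16rD0 = 208"
trail := 16r80 bitOr: (cp bitAnd: 16r3F).       "10xxxxxx with the low 6 bits => 16rBF = 191"
ByteArray with: lead with: trail                "the bytes 208 and 191"
```

Only the code point's 11 payload bits fit into this pattern, so a leadingChar has nowhere to go and is lost on the way out.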

> I'm using a squeak-dev 3.9 image with UnicodeSupport installed
> (http://www.nabble.com/Re%3A--squeak-dev---ANN--3.10-final-is-out-p16182045.html)
> to input Unicode chars from the keyboard. I'm on a Mac. I don't even know
> what will happen when I try to run it under Windows.

You would need a 3.10.x VM for that and also a few fixes so 3.9 is
probably a no-go.

Cheers,
- Andreas

Philippe Marschall

Re: [squeak-dev] WideString UTF-8, UTF-32, UCS2

In reply to this post by Vladimir Pogorelenko

2008/4/6, Vladimir Pogorelenko <[hidden email]>:

> I'm trying to deal with different string encodings in my image.
>
> I've read some related posts but didn't find direct answers.
>
> For the test I took the Unicode word 'привет'. Entering this string from
> the keyboard, a Seaside web form, and a file stream, I got 2 different formats:
>
>
> FIRST FORMAT: comes from Keyboard Input, Seaside with WAKomEncoded
> WideString
> 1: 1087.
> 2: 1088.
> ...
> What is the format of that String? I guess it's UTF-8.

Nope, UTF-8 would result in a ByteString.

> SECOND FORMAT: comes from FileStream, FileIn, etc.
> WideString
> 1: 1069548607.
> 2: 1069548608.
> ...
> The same question: what is it? Is it UTF-32 or UCS2?

The same thing just with a language tag.

> Both strings are displayed correctly, but I failed to compare them.

Sure, because they're in a different language.

> So the questions are:
> 1. How do I load data from files (e.g. FileStream) in the first format
> (UTF-8?)? I also need this for loading source code that contains Unicode
> Strings. Maybe I need to subclass UTF8TextConverter and call it
> UTF8ToUTF8TextConverter.
> 2. How do I set up WAKomEncoded and chars from the keyboard to come in the
> second format?

WAKomEncoded never sets the language tag because there is no way of
knowing the language of a user. Chars from the keyboard generally do set
a language tag. Because String comparison takes the language tag into
account, the two generally don't compare equal.

Note that when setting up WAKomEncoded:
- make sure your Strings are Squeak encoded (language tags are ignored)
- either use a current version of Kom in Squeak 3.9 or use an old
version of Kom in Squeak 3.8 (the semantics of #unescapePercents have
changed)

> 3. Which encoding should I choose as the base? What is the blueprint for it? I
> guess I just need to learn how to load data in the FIRST FORMAT and all will be OK.
> 4. How do I convert a WideString in the image from one format to another?

#convertToEncoding: / #convertFromEncoding:

> The Unicode problem is still alive here in Squeak :-) I wonder how some great
> products like CMSBox fight against it. Maybe they don't even need to load
> data from external streams.

I was not aware of CMSBox fighting Unicode especially since it uses
utf-8 just like DabbleDB. Maybe you could elaborate a bit.

Cheers
Philippe

> I'm using a squeak-dev 3.9 image with UnicodeSupport installed
> (http://www.nabble.com/Re%3A--squeak-dev---ANN--3.10-final-is-out-p16182045.html)
> to input Unicode chars from the keyboard. I'm on a Mac. I don't even know what
> will happen when I try to run it under Windows.

Vladimir Pogorelenko

Re: [squeak-dev] WideString UTF-8, UTF-32, UCS2

Andreas, Philippe, many thanks for the explanation and clarification,
it helped me a lot.

I think the leadingChar/languageTag is a doubtful idea. Apparently it
could be used to classify languages somewhere in the image. Never mind.

Based on your explanations I decided to trim the languageTag for now.

So I made UTF8PlainUnicodeTextConverter, which trims the languageTag on
input.

UTF8TextConverter subclass: #UTF8PlainUnicodeTextConverter

nextFromStream: aStream
	"Answer the next character with its leadingChar (language tag) stripped."
	| ch |
	ch := super nextFromStream: aStream.
	ch isNil ifTrue: [^ nil].
	^ Character value: ch asUnicode

And I've loaded my domain objects from file with the help of this converter.
Maybe it's not the most correct solution, but at least it works now.

Great!
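
For strings that are already in the image (question 4 above), the same trimming can be sketched without a converter. This is an untested sketch that relies only on the #asUnicode behaviour shown earlier in the thread; stripTag is a hypothetical helper block, not an existing Squeak method, and it assumes that dropping the language tag is acceptable for the application.

```smalltalk
| stripTag tagged plain |
"Hypothetical helper: rebuild each character from its plain Unicode value."
stripTag := [:aString |
	aString collect: [:each | Character value: each asUnicode]].
tagged := WideString with: (Character value: 16r3FC0043F).   "with leadingChar, as read from a file"
plain := WideString with: (Character value: 16r43F).         "without leadingChar, as typed"
"Compare the two strings after both have been normalized."
(stripTag value: tagged) = (stripTag value: plain)
```

If #asUnicode answers 1087 for both characters, as in the examples above, the normalized strings should contain identical character values and therefore compare equal.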