[squeak-dev] WideString UTF-8, UTF-32, UCS2

Vladimir Pogorelenko-2

I'm trying to deal with different string encodings in my image.

I've read some related posts but didn't find direct answers.

As a test I took the Unicode word 'привет'. Entering this string from
the keyboard, a Seaside web form, and a file stream, I got two different
formats:

FIRST FORMAT: comes from Keyboard Input, Seaside with WAKomEncoded
WideString
1: 1087.
2: 1088.
...
What is the format of that String? I guess it's exactly UTF-8.
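
One way to test the UTF-8 guess outside the image is to compare the
Unicode code points of the word with its UTF-8 bytes. The sketch below
uses Python only because the arithmetic is easy to check there; it says
nothing about how Squeak stores the string internally.

```python
# Compare Unicode code points with UTF-8 bytes for the test word.
word = 'привет'

code_points = [ord(ch) for ch in word]   # one integer per character
utf8_bytes = list(word.encode('utf-8'))  # two bytes per Cyrillic letter

print(code_points[:2])  # [1087, 1088]
print(utf8_bytes[:4])   # [208, 191, 209, 128]
```

The values 1087 and 1088 match the code points, not the UTF-8 bytes, so
the first format looks more like plain Unicode code points (as in
UTF-32) than UTF-8.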

SECOND FORMAT: comes from FileStream, FileIn, etc.
WideString
1: 1069548607.
2: 1069548608.
...
The same question: what is it? Is it UTF-32 or UCS2?
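
The large numbers can be inspected by splitting each value into a high
part and a low part. The sketch below is in Python so it can be run
outside the image. The 22-bit split is an assumption, based on the idea
that Squeak packs a "leading char" tag above the code point, and the
helper name split_value is hypothetical.

```python
# Values observed for the first two characters of 'привет'.
first_format = [1087, 1088]
second_format = [1069548607, 1069548608]

CODE_BITS = 22  # assumption: the low 22 bits hold the code point


def split_value(value):
    """Return (high_part, low_part) of a character value."""
    return value >> CODE_BITS, value & ((1 << CODE_BITS) - 1)


for v in first_format + second_format:
    high, low = split_value(v)
    print(v, high, low, chr(low))
```

Under that assumption both formats carry the same code points (1087,
1088); the second format just has 255 in the high part, which would
explain why the strings display the same but do not compare equal.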

Both strings are displayed correctly, but comparing them fails.

So the questions are:
1. How do I load data from files (e.g. FileStream) in the first format
(UTF-8?). I also need to do that for loading source code which
contains Unicode Strings. Maybe I need to subclass UTF8TextConverter
and call it UTF8ToUTF8TextConverter.
2. How do I set up WAKomEncoded and chars from the keyboard to come in
the second format?
3. Which encoding should I choose as the base? What is the blueprint for
it? I guess I just need to learn how to load data in the FIRST FORMAT
and all will be OK.
4. How do I convert a WideString in the image from one format to another?
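
For question 4, the conversion can at least be described on the raw
character values. This is a minimal sketch in Python, not Squeak code:
it assumes the second format is the first format plus a tag of 255
stored above bit 22, and the names to_plain and to_tagged are
hypothetical.

```python
CODE_BITS = 22                     # assumption: code point in low 22 bits
CODE_MASK = (1 << CODE_BITS) - 1
TAG = 255                          # assumption: tag seen in the second format


def to_plain(values):
    """Drop the high tag bits, keeping only the code points."""
    return [v & CODE_MASK for v in values]


def to_tagged(values, tag=TAG):
    """Put the given tag above the code point of each value."""
    return [(tag << CODE_BITS) | (v & CODE_MASK) for v in values]


keyboard = [1087, 1088]
from_file = [1069548607, 1069548608]

print(to_plain(from_file) == keyboard)   # True
print(to_tagged(keyboard) == from_file)  # True
```

Normalizing both strings to one form before comparing them would make
the comparison succeed, whichever form is chosen as the base.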

The Unicode problem is still alive here in Squeak :-) I'm confused about
how some great products like CMSBox deal with it. Maybe they don't even
need to load data from external streams.

I'm using a squeak-dev 3.9 image with UnicodeSupport installed
(http://www.nabble.com/Re%3A--squeak-dev---ANN--3.10-final-is-out-p16182045.html)
to input Unicode chars from the keyboard. I'm on a Mac. I don't even know
what will happen when I try to run it under Windows.