Japanese/Unicode in Dolphin GUI...


Japanese/Unicode in Dolphin GUI...

Christopher J. Demers
Has anyone built a Dolphin GUI application that supports Japanese or any
other Unicode language?  I am investigating translating our application into
Japanese.  It looks like Dolphin uses SetWindowTextA to put text into text
boxes, but I _may_ be able to take advantage of reflection to get a
UnicodeString through to SetWindowTextW instead.  If I can set an
appropriate font,
and get the Unicode text into the controls I think it may be OK.  I would be
interested to hear if anyone has any experience with Unicode in Dolphin GUI
applications.
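
Roughly, the sort of thing I have in mind is sketched below.  It is
completely untested, and the external-call details (the lpwstr type name in
particular, and whether this belongs on UserLibrary at all) are assumptions
I'd have to check against the image:

    "Hypothetical external method wrapping the wide version of
     SetWindowText; the lpwstr type name is an assumption."
    setWindowTextW: aWindowHandle text: aUnicodeString
        <stdcall: bool SetWindowTextW handle lpwstr>
        ^self invalidCall

    "Possible use, where aTextEdit is a text control and japaneseText is a
     UnicodeString, assuming the control's font can display the glyphs:"
    UserLibrary default setWindowTextW: aTextEdit handle text: japaneseText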

Chris



Re: Japanese/Unicode in Dolphin GUI...

Chris Uppal-3
Chris,

> Has anyone built a Dolphin GUI application that supports Japanese or any
> other Unicode language?

I haven't done anything like that, but I do have to interface with systems
using 16-bit char types.  I wanted to add a couple of observations and
questions of my own.

I don't know if this may be of any help to you, Chris, but one thing that has
helped me a bit is that it turns out to be possible to create new instances of
Character that wrap Integer "code points" that are > 255.  They don't have the
"singleton" property (of being pseudo-immediate and compared by #==) but they
do work, sort of (there's a quick workspace sketch after the list below)...
It'd be interesting to know if that hack is:
    - a complete no-no.
    - something that /may/ work, but it's at our own risk.
    - something that OA might consider supporting in the future.
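
For concreteness, the kind of workspace doodle I mean -- relying only on the
ANSI Character>>codePoint protocol, and noting that whether Character
class>>value: is actually meant to accept values above 255 is, of course,
part of the question:

    "Wrap a code point above 255 in a Character -- it 'works', sort of."
    | aiueo |
    aiueo := Character value: 16r3042.      "HIRAGANA LETTER A"
    aiueo codePoint.                        "-> 12354"
    aiueo == (Character value: 16r3042)     "presumably false -- no singleton property"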

It'd help a great deal if UnicodeString were able to accept/answer 16-bit
Integers (or, better, 16-bit Characters as above) rather than inheriting
operations from String that refuse them.  (It may be that this has changed
since I last looked).

It'd help a /great deal/ if UnicodeString were actually a Unicode string --
i.e. able to accept code points in the /full/ Unicode range.  I suspect that
the class is actually intended to represent a UTF-16 encoded string, but I
don't know.

Ideally, I'd like to see (I don't claim this is practical) the String hierarchy
refactored into an abstract String, with concrete subclasses that stored data
in UTF-8, UCS-8, UTF-16, UCS-16, and "plain" (but big) UTF-32.  As it is, we
basically have UCS-8 (String) and a rather crippled UCS-16 (UnicodeString), and
nothing else.

I'm not screaming for any of this, and I'm certainly not asking for it /now/,
but I would like to know where (if anywhere, yet) OA are thinking of taking
this aspect of Dolphin.   Above all I'd like to be sure that we/they aren't
going to go down the Java route and introduce a brain-damaged[*] imitation of
Unicode that will be a major problem for years to come.

    -- chris

[*] "brain-damaged" is an understatement, but if I really pushed to find the
right words to express the immeasurable idiocy of Java's "unicode" strings,
then I'd be banned for NG abuse....



Re: Japanese/Unicode in Dolphin GUI...

Bill Schwab-2
Chris,

> Ideally, I'd like to see (I don't claim this is practical) the String
> hierarchy refactored into an abstract String, with concrete subclasses
> that stored data in UTF-8, UCS-8, UTF-16, UCS-16, and "plain" (but big)
> UTF-32.  As it is, we basically have UCS-8 (String) and a rather crippled
> UCS-16 (UnicodeString), and nothing else.

Interesting.  To date, the only "unicode" that I've seen is in doing serial
communications with an aging physiologic monitor, and with Windows' *U()
functions.  It sure seems as though both treat "unicode" as doubling string
length in the faint hope that it would be useful some day.  Do you agree?
Where does that fit in your list of string types?


> I'm not screaming for any of this, and I'm certainly not asking for it
> /now/, but I would like to know where (if anywhere, yet) OA are thinking
> of taking this aspect of Dolphin.  Above all I'd like to be sure that
> we/they aren't going to go down the Java route and introduce a
> brain-damaged[*] imitation of Unicode that will be a major problem for
> years to come.

Have you looked at the evolving implementation for Squeak?  I was applying
my usual intermittent pressure about Squeak's compiler "hijacking" $_ for
assignment (as an optional editor feature, knock yourself out, letting it
bleed into the compiler and sources - ouch!!), and Unicode was proposed as a
relatively painless and widely agreeable solution to the problem.  To that,
I replied, (paraphrasing) "ok, but are you simply going to double the length
of every string?".  I was fearing the worst and hoping for better.  They
appear to have better in mind.  Take a look.  It was sounding as though
Squeak 3.8 or 4.0 should merge the effort.  That is unlikely to be in time
to help D6, but hopefully it will be in time to help the following version.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]



Pondering Unicode (long) [was: Japanese/Unicode in Dolphin GUI...]

Chris Uppal-3
Bill, and anyone who is interested in a longish rant on Unicode,

> > Ideally, I'd like to see (I don't claim this is practical) the String
> > hierarchy refactored into an abstract String, with concrete subclasses
> > that stored data in UTF-8, UCS-8, UTF-16, UCS-16, and "plain" (but big)
> > UTF-32.  As it is, we basically have UCS-8 (String) and a rather
> > crippled UCS-16 (UnicodeString), and nothing else.
>
> Interesting.  To date, the only "unicode" that I've seen is in doing
> serial communications with an aging physiologic monitor, and with
> Windows' *U() functions.  It sure seems as though both treat "unicode" as
> doubling string length in the faint hope that it would be useful some
> day.  Do you agree? Where does that fit in your list of string types?

Well, nowhere really.  Not as stated, anyway.

In case it helps to clarify, here's a quick rundown on Unicode as I see it.

I'd welcome corrections, expansions, contradictions, etc, from anyone since my
own understanding of these issues is still incomplete.  In a group with readers
all over the world, like this one, I'm sure there are people with much better
knowledge of this stuff than a poor Western European like myself.  Still, FWIW,
here's my take on it:

[Please note that I am making no effort whatever to use the correct Unicode
terminology (which I find pretty baffling) except for the above Uxxx-n names]

The important first thing is to distinguish between the abstraction of a
string, and its concrete storage.

In Smalltalk terms that /could/ be expressed as a protocol <UnicodeString> (and
perhaps an abstract base class UnicodeString) which is an <ArrayedCollection>
of Unicode characters.  "characters" here is something of a misnomer since the
elements of a Unicode string may not be (but often are) in 1-to-1
correspondence with what a speaker of the language would call its written
characters (if it has such things at all), but that's a separate issue and can
(I think) be ignored for these purposes.  The "characters" themselves are
identified by their "code point" (hence the ANSI-standard method
Character>>codePoint) which is a positive integer in an undefined range.  The
standard doesn't define the range, but there is no UTF-* encoding defined that
will handle >24 bits, and the Unicode people have (I believe) stated that they
will never define characters outside that range.  If you read some uninformed
talk about Unicode then you might come away with the impression that Unicode
characters outside the 16-bit range are somehow "different" (there is talk of
"surrogate pairs").  This is incorrect; there is no difference whatever (I
think the fault ultimately lies with the misleading language used by the
Unicode people themselves).

Now a <UnicodeString> holding UnicodeCharacters could be represented in lots of
ways.  E.g. one could use a Dictionary mapping indexes to instances of
UnicodeCharacter.  But the Unicode consortium define a few "standard" ways of
representing them as concrete sequences of bytes (the definitions are mostly
inherited from the ISO equivalent of Unicode).  This is where the language of
Unicode gets really confusing, and I don't have it all straight by any means,
but this is a simplified view which I /hope/ is not misleading (or wrong !).

The "obvious" way to store <UnicodeString> is as a sequence of 32-bit numbers,
that encoding is known as UTF-32.  It has the advantage (which no other
encoding has) that it can represent /all/ Unicode characters /and/ allow
constant time access to each one.  It also has a nice simple implementation.
Unfortunately, it is very space inefficient, and -- for that reason -- tends
not to be used much.
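
Just to show how trivial that is (an Array of plain Integers standing in for
a hypothetical UTF32String class):

    "A UTF-32 string is no more than a sequence of 32-bit code points, so
     #at: stays a constant-time operation."
    | utf32 |
    utf32 := Array with: 16r48 with: 16r20AC with: 16r1D11E.
        "a latin H, the euro sign, and a musical G clef"
    utf32 at: 3     "-> 119070, well outside the 16-bit range"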

The next most "obvious" way to store <UnicodeString>s would be as sequences of
3-byte, 24-bit, numbers.  That would have most of the advantages of UTF-32 and
be rather less wasteful of space.  It would have the disadvantage that access
to individual characters would not be aligned on 32-bit boundaries.  If that
encoding existed, then it'd presumably be called UTF-24, but there is no such
format defined.  I don't know why not; frankly, I find it puzzling.

The next easy option is just to say "To Hell with half the world's population;
we're going to pretend that all Unicode characters are in the 16-bit range".
That leads to an encoding as a sequence of  2-byte numbers called UCS-16.  It
is impossible to represent characters with code points > 65535, and any attempt
to add such a character to a UCS-16 string would cause a runtime error.  This
makes it impossible to represent most written Chinese properly, for instance,
although (IIRC) most Indo-European languages can be represented this way.
IMHO, this particular option is the most brain-dead available, but is the one
chosen by the Java designers (though they hid it skilfully by misusing terms
like Unicode and UTF-8 all over the shop).  That decision will be changed in
the forthcoming Java version; they will by fiat define that Java strings were
in UTF-16 all along.  This will cause endless problems for Java programmers
which I am not going to describe since I'd start to shout and wave my arms
around, maybe also foam at the mouth or even go into fits.   Oddly, the .NET
CLR seems to have the same problem -- at least the character type is defined as
a 16-bit quantity, which makes it useless for representing Unicode characters.
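
In code, the whole of UCS-16 amounts to a range check -- a hypothetical
method, just to make the limitation concrete:

    "Storing into a UCS-16 string: anything above 16rFFFF has nowhere to go,
     so the only honest behaviour is a runtime error."
    ucs16At: index put: codePoint
        codePoint > 16rFFFF
            ifTrue: [^self error: 'code point cannot be represented in UCS-16'].
        ^self basicAt: index put: codePoint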

The /really/ easy approach is to go one further than UCS-16 and ignore code
points higher than 255.  That means you can store a string as a sequence of
8-bit numbers (bytes).  This encoding is called UCS-8.  It is a tempting option
since it minimises space, and is /really/ simple to implement.  The problem is
that it sort of misses the point of Unicode entirely...  In a sense, Dolphin
already has support for this encoding in its normal String class (or
ByteArray).

The UCS-16 and UCS-8 encodings both have the advantage that it is easy to go
from a logical character position to the physical representation of that
character.  The place where a character is stored does not depend on the
previous characters in the string, only on its own position.  (Actually, that
is only superficially true, because of the point about non-1-to-1
correspondence I mentioned above, but that may not matter depending on the
application.)  They both have the disadvantage that they can't represent the
entire defined range of UnicodeCharacters.  The next two encodings reverse
those advantages and disadvantages.

Remember that we are talking about the concrete representation of
<UnicodeString>s in memory.  The same abstract <UnicodeString> could be
represented as UTF-32, or (if it happened to use a limited range of characters)
UCS-16, or UCS-8.

Probably the most common encoding is UTF-8.  This is a variable-width encoding;
characters with code points < 128 are represented by single bytes; other
characters are represented by "escape sequences" of up to five bytes.  (I may
have the details wrong, but that's not important here).   The encoding is such
that any code point up to 2**24 can be represented, and so any (current or
projected) Unicode character can be used in a UTF-8 string.  The disadvantage
is that there is no longer a simple mapping from logical character positions to
the location of the corresponding number(s) in store.  Hence the implementation
of UTF8String>>at: would have to scan the entire beginning part of the
byte-sequence to find the Nth character, so #at: and #at:put: would no longer
be constant-time operations.  (That could be optimised somewhat, but at the
expense of even more complexity.)  (Incidentally, the way that the encoding
works is rather clever, so that if you are looking at the raw bytes somewhere
in the middle of a UTF-8 string then you can always tell whether you are
looking at a directly encoded character or into an "escape sequence", and you
can always find the start of a current character by scanning backwards not more
than 4 bytes -- you don't have to go back to the beginning of the String).
For strings that mostly consist of characters with code points in the 0...127
range (essentially ASCII) UTF-8 can be nicely space-efficient, otherwise it can
be quite expensive.
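
To make the shape of the encoding concrete, here's a from-scratch sketch of
encoding a single code point (bit patterns as I understand them -- check
against the standard before trusting them):

    "UTF-8 encoding of one code point: below 128 a single byte, otherwise a
     lead byte that announces the length followed by continuation bytes of
     the form 2r10xxxxxx."
    utf8BytesOf: cp
        cp < 16r80 ifTrue: [^ByteArray with: cp].
        cp < 16r800 ifTrue: [
            ^ByteArray
                with: 16rC0 + (cp bitShift: -6)
                with: 16r80 + (cp bitAnd: 16r3F)].
        cp < 16r10000 ifTrue: [
            ^ByteArray
                with: 16rE0 + (cp bitShift: -12)
                with: 16r80 + ((cp bitShift: -6) bitAnd: 16r3F)
                with: 16r80 + (cp bitAnd: 16r3F)].
        ^ByteArray
            with: 16rF0 + (cp bitShift: -18)
            with: 16r80 + ((cp bitShift: -12) bitAnd: 16r3F)
            with: 16r80 + ((cp bitShift: -6) bitAnd: 16r3F)
            with: 16r80 + (cp bitAnd: 16r3F)

For example, HIRAGANA LETTER A (16r3042) comes out as the three bytes 16rE3
16r81 16r82; and since a continuation byte can never be mistaken for a lead
byte, you get the "scan backwards a few bytes to resynchronise" property
mentioned above.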

UTF-16 is just like UTF-8 except that it uses 16-bit numbers instead of 8-bit.
As a result the escape sequences are less complicated than in UTF-8, and also
occur less often (and not at all for texts in some languages).
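
The UTF-16 "escape sequence" for a code point above 16rFFFF is a so-called
surrogate pair, and the arithmetic is simple enough to show in a couple of
lines (workspace-style sketch):

    "Splitting code point 16r1D11E (a musical G clef) into a UTF-16
     surrogate pair."
    | cp v |
    cp := 16r1D11E.
    v := cp - 16r10000.
    Array
        with: 16rD800 + (v bitShift: -10)       "-> 16rD834"
        with: 16rDC00 + (v bitAnd: 16r3FF)      "-> 16rDD1E"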

And that relative rarity of escape sequences is where problems can start.
Given an API which works in
terms of "wide" (16-bit) "characters" it can be difficult to tell whether the
sequences of 16-bit numbers are intended to be interpreted as UCS-16 or UTF-16.
The fact that in many cases (especially in the West) there is no obvious
difference makes people sloppy about the matter.   I still have no idea whether
the Windows *W() functions use UTF-16 or UCS-16  (I admit that I haven't made
much effort to find out -- partly because I fear the worst: that it'd turn out
to be a mixture depending on the API/OS/version in use).

Actually, there /is/ a difference between UTF-16 and UCS-16 even for restricted
character ranges (and similarly for UTF-8 and UCS-8).  In UCS-16 any random bit
pattern is a "valid" string in the sense that it is unambiguous (though it may
not contain defined Unicode characters).  This is not true for UTF-16.  One
reason is that some bit-sequences are "broken" escape sequences which cannot be
decoded according to the defined rules.  Another reason is that some
bit-sequences are actually /forbidden/ and the implementation is required to
detect these cases and report an error.  This is because they would otherwise
be interpreted as non-standard encodings of strings with simpler
representations (according to the rules) and that would make security even
harder than it already is (think of detecting ".."s in URLs, for instance; if
that character sequence had more than one encoding in Unicode then you could be
sure that someone would forget to check for all the cases sooner or later).
Despite this, it's still tempting to re-use an existing library (or API) that
is defined in terms of UCS-16 and just re-interpret it as UTF-16.

(BTW, Bill, those last two paragraphs are my best answer to your question that
I started with.)
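
A concrete instance of the "forbidden" case, in UTF-8 terms: the two bytes
below decode to $/ if you follow the bit patterns naively, but they are an
over-long encoding, and a conforming decoder has to reject them rather than
let a second spelling of "/" (or ".") slip past a path check:

    "16rC0 16rAF is an illegal, over-long UTF-8 encoding of code point 16r2F.
     Decoding the bits naively gives:"
    ((16rC0 bitAnd: 16r1F) bitShift: 6) + (16rAF bitAnd: 16r3F)
        "-> 47, i.e. $/ -- but the only legal encoding of 47 is the single
         byte 16r2F, so a correct decoder must signal an error instead"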

One thing that isn't really relevant, but which I think I should add, is
that all the
above is about the representation of <UnicodeString>s in memory where the
concept of a "number" is mostly unambiguous.  In contexts where the bit-level
representation of numbers matters (e.g. when sending Unicode text over a
network, or to file) you also have to deal with whether the representation is
big-endian or little-endian.  Unicode distinguishes between the external
representation and the internal and has UTF-16BE and UTF-16LE for the two
possible external flavours of UTF-16 (and similarly for UTF-32; ordering is not
an issue for UTF-8).  It also uses a "byte order mark" (known as the BOM, or,
more affectionately, as a "non-breaking zero-width space") which can be (but
doesn't have to be) prepended to the external representation to indicate the
byte-order.
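
So, for instance, the first two bytes of an external UTF-16 stream are often
enough to tell you which flavour you're looking at (hypothetical helper):

    "Peek at a possible byte order mark (U+FEFF) at the start of external
     UTF-16 data."
    byteOrderOf: aByteArray
        aByteArray size >= 2 ifFalse: [^#unknown].
        ((aByteArray at: 1) = 16rFE and: [(aByteArray at: 2) = 16rFF])
            ifTrue: [^#bigEndian].
        ((aByteArray at: 1) = 16rFF and: [(aByteArray at: 2) = 16rFE])
            ifTrue: [^#littleEndian].
        ^#unknown   "no BOM present"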

The real Unicode standard isn't very much about how strings should be
represented; the bulk of the standard is about defining characters, diacritic
marks, collating sequences, character classes, and all sorts of things that
matter to people reading the text.  (Incidentally, and to be fair, the Java
designers have done a better job of mapping that stuff into Java than they have
of the storage issues.)  I don't expect Dolphin to provide out-of-the-box
wrappers for all that stuff -- even assuming that it is provided in Windows
somewhere.  But I would like to understand (and be able to manipulate) the
various physical representations that I'm likely to come across in practice,
and to be able to tell /which/ representation is in use in a particular
context.


> Have you looked at the evolving implementation for Squeak?  I was applying
> my usual intermittent pressure about Squeak's compiler "hijacking" $_ for
> assignment (as an optional editor feature, knock yourself out, letting it
> bleed into the compiler and sources - ouch!!), and Unicode was proposed
> as a relatively painless and widely agreeable solution to the problem.

I don't really see how Unicode provides a solution to this, but anyway I
haven't looked at it for a while.
I don't suppose I'll ever have much time for Squeak until:
    a) I can take part in the community without using a blasted mailing list.
    b) They drop $_ completely[*].
(and, of course:
    c) I lose my remaining marbles and come to like the Squeak UI ;-)

[*] I cannot see any /reason/ why they haven't done this ages ago -- it seems
so simple to do, just change the sources on the SqueakMap and update the
compiler.  Done.


> To that, I replied, (paraphrasing) "ok, but are you simply going to
> double the length of every string?".  I was fearing the worst and hoping
> for better.  They appear to have better in mind.  Take a look.  It was
> sounding as though Squeak 3.8 or 4.0 should merge the effort.  That is
> unlikely to be in time to help D6, but hopefully it will be in time to
> help the following version.

Currently I have the 3.6 version (is there a pre-packaged download of anything
with their emerging Unicode support?).  As far as I can see from a quick look,
they'll face the same problem that Dolphin would.  String is a concrete class
(and a variable byte class at that) which makes it difficult to introduce a
separation between abstraction and concrete representation(s).  (Still, at
least they haven't pre-empted the name 'UnicodeString' to mean something
else ;-)

    -- chris



Re: Pondering Unicode (long) [was: Japanese/Unicode in Dolphin GUI...]

Schwab,Wilhelm K
Chris,

>>Have you looked at the evolving implementation for Squeak?  I was applying
>>my usual intermittent pressure about Squeak's compiler "hijacking" $_ for
>>assignment (as an optional editor feature, knock yourself out, letting it
>>bleed into the compiler and sources - ouch!!), and Unicode was proposed
>>as a relatively painless and widely agreeable solution to the problem.
>
> I don't really see how Unicode provides a solution to this, but anyway I
> haven't looked at it for a while.
> I don't suppose I'll ever have much time for Squeak until:
>     a) I can take part in the community without using a blasted mailing list.
>     b) They drop $_ completely[*].
> (and, of course:
>     c) I lose my remaining marbles and come to like the Squeak UI ;-)
>
> [*] I cannot see any /reason/ why they haven't done this ages ago -- it seems
> so simple to do, just change the sources on the SqueakMap and update the
> compiler.  Done.

Thanks for the "rant" - I will read it later when I have time.  Re
Squeak, I really don't care if other people insist on typing '_' instead
of ':=' - just as long as the sources and compiler aren't compromised as
a result.

I agree re Morphic, but only to a point.  It can be fixed if necessary;
look at Zurgle for evidence.  Sadly, Zurgle is a bit heavy, does a
little too much in one "package", and emulates XP vs. a cleaner UI, but
it proves the point and then some.  Squeak needs modal dialogs, but that
can be done too.

Why consider using a system that needs these and other things (re)done
to make it useable?  Smalltalk, open source, portable, no runtime fees -
nuff said.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]



Re: Japanese/Unicode in Dolphin GUI...

Christopher J. Demers
In reply to this post by Chris Uppal-3
"Chris Uppal" <[hidden email]> wrote in message
news:[hidden email]...
> Chris,
>
> > Has anyone built a Dolphin GUI application that supports Japanese or any
> > other Unicode language?
>
> I haven't done anything like that, but I do have to interface with systems
> using 16-bit char types.  I wanted to add a couple of observations and
> questions of my own.
...

Wow.  I thought I knew a little about Unicode.  It sounds rather complex.  I
think I will pass on supporting it for now.  Thanks for the information.  It
will be interesting to see how Unicode support in Dolphin evolves.

Chris



Re: Pondering Unicode (long) [was: Japanese/Unicode in Dolphin GUI...]

Chris Uppal-3
In reply to this post by Chris Uppal-3
I wrote:

> That leads to an encoding as a sequence of  2-byte numbers
> called UCS-16.

Damn.  Sorry. Embarrassing, but it's taken me a week to realise that I was
systematically using "UCS-16" for what is actually called "UCS-2".  Similarly
what I was miscalling "UCS-8" does not exist as a /named/ format, but would
presumably be called UCS-1 if it were.

    -- chris