Has anyone built a Dolphin GUI application that supports Japanese or any
other Unicode language? I am investigating translating our application into
Japanese.

It looks like Dolphin uses SetWindowTextA to put text in text boxes, but I
_may_ be able to take advantage of reflection to get a UnicodeString to use
SetWindowTextW instead. If I can set an appropriate font and get the Unicode
text into the controls, I think it may be OK.

I would be interested to hear if anyone has any experience with Unicode in
Dolphin GUI applications.

Chris
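P.S. For what it's worth, the rough shape of what I have in mind is something
like the following. This is only a sketch under my own assumptions -- the
selector is invented, the argument names are placeholders, and I haven't
verified the external-call declaration against the image:

    setWindowTextW: aWindowHandle lpString: aUnicodeString
        "Hypothetical external method (e.g. added to UserLibrary) calling the
         wide-character SetWindowTextW instead of SetWindowTextA."
        <stdcall: bool SetWindowTextW handle lpwstr>
        ^self invalidCall

    "and then, roughly:"
    UserLibrary default setWindowTextW: aTextEdit handle lpString: japaneseText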
Chris,
> Has anyone built a Dolphin GUI application that supports Japanese or any
> other Unicode language?

I haven't done anything like that, but I do have to interface with systems
using 16-bit char types. I wanted to add a couple of observations and
questions of my own.

I don't know if this will be of any help to you, Chris, but one thing that has
helped me a bit is that it turns out to be possible to create new instances of
Character that wrap Integer "code points" that are > 255. They don't have the
"singleton" property (of being pseudo-immediate and compared by #==) but they
do work, sort of...

It'd be interesting to know if that hack is:
- a complete no-no.
- something that /may/ work, but at our own risk.
- something that OA might consider supporting in the future.

It'd help a great deal if UnicodeString were able to accept/answer 16-bit
Integers (or, better, 16-bit Characters as above) rather than inheriting
operations from String that refuse them. (It may be that this has changed
since I last looked.)

It'd help a /great deal/ if UnicodeString were actually a Unicode string --
i.e. able to accept code points in the /full/ Unicode range. I suspect that
the class is actually intended to represent a UTF-16 encoded string, but I
don't know.

Ideally, I'd like to see (I don't claim this is practical) the String
hierarchy refactored into an abstract String, with concrete subclasses that
store data in UTF-8, UCS-8, UTF-16, UCS-16, and "plain" (but big) UTF-32. As
it is, we basically have UCS-8 (String) and a rather crippled UCS-16
(UnicodeString), and nothing else.

I'm not screaming for any of this, and I'm certainly not asking for it /now/,
but I would like to know where (if anywhere, yet) OA are thinking of taking
this aspect of Dolphin. Above all I'd like to be sure that we/they aren't
going to go down the Java route and introduce a brain-damaged[*] imitation of
Unicode that will be a major problem for years to come.

    -- chris

[*] "brain-damaged" is an understatement, but if I really pushed to find the
right words to express the immeasurable idiocy of Java's "unicode" strings,
then I'd be banned for NG abuse....
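P.S. In case it isn't clear what I mean by the hack, it's just something like
the following. A sketch only -- the creation selector, and whether large
values are accepted at all, may vary with the Dolphin version you have:

    | ch |
    ch := Character value: 16r3042.    "HIRAGANA LETTER A"
    ch codePoint.                      "-> 12354"
    (Character value: 16r3042) == (Character value: 16r3042)
        "may answer false; these are not the usual singleton Characters"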
Chris,
> Ideally, I'd like to see (I don't claim this is practical) the String
> hierarchy refactored into an abstract String, with concrete subclasses that
> store data in UTF-8, UCS-8, UTF-16, UCS-16, and "plain" (but big) UTF-32.
> As it is, we basically have UCS-8 (String) and a rather crippled UCS-16
> (UnicodeString), and nothing else.

Interesting. To date, the only "unicode" that I've seen is in doing serial
communications with an aging physiologic monitor, and with Windows' *U()
functions. It sure seems as though both treat "unicode" as doubling string
length in the faint hope that it would be useful some day. Do you agree?
Where does that fit in your list of string types?

> I'm not screaming for any of this, and I'm certainly not asking for it
> /now/, but I would like to know where (if anywhere, yet) OA are thinking of
> taking this aspect of Dolphin. Above all I'd like to be sure that we/they
> aren't going to go down the Java route and introduce a brain-damaged[*]
> imitation of Unicode that will be a major problem for years to come.

Have you looked at the evolving implementation for Squeak? I was applying my
usual intermittent pressure about Squeak's compiler "hijacking" $_ for
assignment (as an optional editor feature, knock yourself out, letting it
bleed into the compiler and sources - ouch!!), and Unicode was proposed as a
relatively painless and widely agreeable solution to the problem.

To that, I replied, (paraphrasing) "ok, but are you simply going to double the
length of every string?". I was fearing the worst and hoping for better. They
appear to have better in mind. Take a look. It was sounding as though Squeak
3.8 or 4.0 should merge the effort. That is unlikely to be in time to help D6,
but hopefully it will be in time to help the following version.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]
Bill, and anyone who is interested in a longish rant on Unicode,
> > Ideally, I'd like to see (I don't claim this is practical) the String
> > hierarchy refactored into an abstract String, with concrete subclasses
> > that store data in UTF-8, UCS-8, UTF-16, UCS-16, and "plain" (but big)
> > UTF-32. As it is, we basically have UCS-8 (String) and a rather
> > crippled UCS-16 (UnicodeString), and nothing else.
>
> Interesting. To date, the only "unicode" that I've seen is in doing
> serial communications with an aging physiologic monitor, and with
> Windows' *U() functions. It sure seems as though both treat "unicode" as
> doubling string length in the faint hope that it would be useful some
> day. Do you agree? Where does that fit in your list of string types?

Well, nowhere really. Not as stated, anyway.

In case it helps to clarify, here's a quick rundown on Unicode as I see it.
I'd welcome corrections, expansions, contradictions, etc, from anyone, since
my own understanding of these issues is still incomplete. In a group with
readers all over the world, like this one, I'm sure there are people with
much better knowledge of this stuff than a poor Western European like myself.
Still, FWIW, here's my take on it:

[Please note that I am making no effort whatever to use the correct Unicode
terminology (which I find pretty baffling) except for the above Uxxx-n names.]

The important first thing is to distinguish between the abstraction of a
string and its concrete storage. In Smalltalk terms that /could/ be expressed
as a protocol <UnicodeString> (and perhaps an abstract base class
UnicodeString) which is an <ArrayedCollection> of Unicode characters.
"Characters" here is something of a misnomer, since the elements of a Unicode
string may not be (but often are) in 1-to-1 correspondence with what a speaker
of the language would call its written characters (if it has such things at
all), but that's a separate issue and can (I think) be ignored for these
purposes.

The "characters" themselves are identified by their "code point" (hence the
ANSI-standard method Character>>codePoint), which is a positive integer in an
undefined range. The standard doesn't define the range, but there is no UTF-*
encoding defined that will handle >24 bits, and the Unicode people have (I
believe) stated that they will never define characters outside that range.

If you read some uninformed talk about Unicode then you might come away with
the impression that Unicode characters outside the 16-bit range are somehow
"different" (there is talk of "surrogate pairs"). This is incorrect; there is
no difference whatever. (I think the fault ultimately lies with the misleading
language used by the Unicode people themselves.)

Now a <UnicodeString> holding UnicodeCharacters could be represented in lots
of ways. E.g. one could use a Dictionary mapping indexes to instances of
UnicodeCharacter. But the Unicode consortium define a few "standard" ways of
representing them as concrete sequences of bytes (the definitions are mostly
inherited from the ISO equivalent of Unicode). This is where the language of
Unicode gets really confusing, and I don't have it all straight by any means,
but this is a simplified view which I /hope/ is not misleading (or wrong!).

The "obvious" way to store a <UnicodeString> is as a sequence of 32-bit
numbers; that encoding is known as UTF-32. It has the advantage (which no
other encoding has) that it can represent /all/ Unicode characters /and/
allow constant-time access to each one. It also has a nice simple
implementation. Unfortunately, it is very space inefficient, and -- for that
reason -- tends not to be used much.
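To make the idea concrete, here's a trivial workspace sketch (plain Arrays of
Integers, not any existing Dolphin class) of what UTF-32 storage amounts to:
one full code point per element, so indexing stays constant-time even for
characters above 16rFFFF:

    | codePoints |
    codePoints := Array
                    with: $H codePoint
                    with: $i codePoint
                    with: 16r1D11E.     "a code point well outside 16 bits"
    codePoints at: 3                    "-> 119070, fetched directly by position"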
The next most "obvious" way to store <UnicodeString>s would be as sequences
of 3-byte, 24-bit, numbers. That would have most of the advantages of UTF-32
and be rather less wasteful of space. It would have the disadvantage that
access to individual characters would not be aligned on 32-bit boundaries. If
that encoding existed, then it'd presumably be called UTF-24, but there is no
such format defined. I don't know why not; frankly, I find it puzzling.

The next easy option is just to say "To Hell with half the world's
population; we're going to pretend that all Unicode characters are in the
16-bit range". That leads to an encoding as a sequence of 2-byte numbers
called UCS-16. It is impossible to represent characters with code points >
65535, and any attempt to add such a character to a UCS-16 string would cause
a runtime error. This makes it impossible to represent most written Chinese
properly, for instance, although (IIRC) most Indo-European languages can be
represented this way.

IMHO, this particular option is the most brain-dead available, but it is the
one chosen by the Java designers (though they hid it skilfully by misusing
terms like Unicode and UTF-8 all over the shop). That decision will be
changed in the forthcoming Java version; they will by fiat define that Java
strings were in UTF-16 all along. This will cause endless problems for Java
programmers which I am not going to describe, since I'd start to shout and
wave my arms around, maybe also foam at the mouth or even go into fits.
Oddly, the .NET CLR seems to have the same problem -- at least the character
type is defined as a 16-bit quantity, which makes it useless for representing
Unicode characters.

The /really/ easy approach is to go one further than UCS-16 and ignore code
points higher than 255. That means you can store a string as a sequence of
8-bit numbers (bytes). This encoding is called UCS-8. It is a tempting option
since it minimises space and is /really/ simple to implement. The problem is
that it sort of misses the point of Unicode entirely... In a sense, Dolphin
already has support for this encoding in its normal String class (or
ByteArray).

The UCS-16 and UCS-8 encodings both have the advantage that it is easy to go
from a logical character position to the physical representation of that
character. The place where a character is stored does not depend on the
previous characters in the string, only on its own position. (Actually, that
is only superficially true, because of the point about non-1-to-1
correspondence I mentioned above, but that may not matter depending on the
application.) They both have the disadvantage that they can't represent the
entire defined range of UnicodeCharacters. The next two encodings reverse
those advantages and disadvantages.

Remember that we are talking about the concrete representation of
<UnicodeString>s in memory. The same abstract <UnicodeString> could be
represented as UTF-32, or (if it happened to use a limited range of
characters) UCS-16, or UCS-8.

Probably the most common encoding is UTF-8. This is a variable-width
encoding; characters with code points < 128 are represented by single bytes;
other characters are represented by "escape sequences" of up to five bytes.
(I may have the details wrong, but that's not important here.) The encoding
is such that any code point up to 2**24 can be represented, and so any
(current or projected) Unicode character can be used in a UTF-8 string.
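As an illustration of how the variable-width scheme plays out, here's a
workspace sketch of the byte layout. This is my own code, not library code,
and it only covers the one- to three-byte forms; code points above 16rFFFF
follow the same pattern with a longer lead byte:

    | utf8BytesOf |
    utf8BytesOf := [:codePoint |
        codePoint < 16r80
            ifTrue: [ByteArray with: codePoint]
            ifFalse: [codePoint < 16r800
                ifTrue: [ByteArray
                            with: 16rC0 + (codePoint bitShift: -6)
                            with: 16r80 + (codePoint bitAnd: 16r3F)]
                ifFalse: [ByteArray
                            with: 16rE0 + (codePoint bitShift: -12)
                            with: 16r80 + ((codePoint bitShift: -6) bitAnd: 16r3F)
                            with: 16r80 + (codePoint bitAnd: 16r3F)]]].

    utf8BytesOf value: $A codePoint.   "-> #[65]           -- one byte"
    utf8BytesOf value: 16r20AC.        "-> #[226 130 172]  -- three bytes (EURO SIGN)"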
The disadvantage is that there is no longer a simple mapping from logical
character positions to the location of the corresponding number(s) in store.
Hence the implementation of UTF8String>>at: would have to scan the entire
beginning part of the byte-sequence to find the Nth character, so #at: and
#at:put: would no longer be constant-time operations. (That could be
optimised somewhat, but at the expense of even more complexity.)

(Incidentally, the way that the encoding works is rather clever, so that if
you are looking at the raw bytes somewhere in the middle of a UTF-8 string
then you can always tell whether you are looking at a directly encoded
character or into an "escape sequence", and you can always find the start of
the current character by scanning backwards not more than 4 bytes -- you
don't have to go back to the beginning of the String.)

For strings that mostly consist of characters with code points in the 0...127
range (essentially ASCII) UTF-8 can be nicely space-efficient; otherwise it
can be quite expensive.

UTF-16 is just like UTF-8 except that it uses 16-bit numbers instead of
8-bit. As a result the escape sequences are less complicated than in UTF-8,
and also occur less often (and not at all for texts in some languages).

And that last point is where problems can start. Given an API which works in
terms of "wide" (16-bit) "characters" it can be difficult to tell whether the
sequences of 16-bit numbers are intended to be interpreted as UCS-16 or
UTF-16. The fact that in many cases (especially in the West) there is no
obvious difference makes people sloppy about the matter. I still have no idea
whether the Windows *W() functions use UTF-16 or UCS-16 (I admit that I
haven't made much effort to find out -- partly because I fear the worst: that
it'd turn out to be a mixture depending on the API/OS/version in use).

Actually, there /is/ a difference between UTF-16 and UCS-16 even for
restricted character ranges (and similarly for UTF-8 and UCS-8). In UCS-16
any random bit pattern is a "valid" string in the sense that it is
unambiguous (though it may not contain defined Unicode characters). This is
not true for UTF-16. One reason is that some bit-sequences are "broken"
escape sequences which cannot be decoded according to the defined rules.
Another reason is that some bit-sequences are actually /forbidden/, and the
implementation is required to detect these cases and report an error. This is
because they would otherwise be interpreted as non-standard encodings of
strings with simpler representations (according to the rules), and that would
make security even harder than it already is (think of detecting ".."s in
URLs, for instance; if that character sequence had more than one encoding in
Unicode then you could be sure that someone would forget to check for all the
cases sooner or later). Despite this, it's still tempting to re-use an
existing library (or API) that is defined in terms of UCS-16 and just
re-interpret it as UTF-16.

(BTW, Bill, those last two paragraphs are my best answer to your question
that I started with.)
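And to show what one of those UTF-16 "escape sequences" actually looks like,
here is the surrogate pair for a code point that UCS-16 cannot hold at all.
It's just integer arithmetic in a workspace; no Dolphin string classes are
assumed:

    | cp v high low |
    cp := 16r1D11E.                        "MUSICAL SYMBOL G CLEF"
    v := cp - 16r10000.
    high := 16rD800 + (v bitShift: -10).   "leading (high) surrogate"
    low := 16rDC00 + (v bitAnd: 16r3FF).   "trailing (low) surrogate"
    Array with: high with: low             "-> 16rD834 and 16rDD1E: two 16-bit units, one character"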
One thing that isn't really relevant, but I think I should add, is that all
the above is about the representation of <UnicodeString>s in memory, where
the concept of a "number" is mostly unambiguous. In contexts where the
bit-level representation of numbers matters (e.g. when sending Unicode text
over a network, or to file) you also have to deal with whether the
representation is big-endian or little-endian. Unicode distinguishes between
the external representation and the internal, and has UTF-16BE and UTF-16LE
for the two possible external flavours of UTF-16 (and similarly for UTF-32;
ordering is not an issue for UTF-8). It also uses a "byte order mark" (known
as the BOM, or, more affectionately, as a "zero-width no-break space") which
can be (but doesn't have to be) prepended to the external representation to
indicate the byte order.

The real Unicode standard isn't very much about how strings should be
represented; the bulk of the standard is about defining characters, diacritic
marks, collating sequences, character classes, and all sorts of things that
matter to people reading the text. (Incidentally, and to be fair, the Java
designers have done a better job of mapping that stuff into Java than they
have of the storage issues.) I don't expect Dolphin to provide out-of-the-box
wrappers for all that stuff -- even assuming that it is provided in Windows
somewhere. But I would like to understand (and be able to manipulate) the
various physical representations that I'm likely to come across in practice,
and to be able to tell /which/ representation is in use in a particular
context.

> Have you looked at the evolving implementation for Squeak? I was applying
> my usual intermittent pressure about Squeak's compiler "hijacking" $_ for
> assignment (as an optional editor feature, knock yourself out, letting it
> bleed into the compiler and sources - ouch!!), and Unicode was proposed
> as a relatively painless and widely agreeable solution to the problem.

I don't really see how Unicode provides a solution to this, but anyway I
haven't looked at it for a while.

I don't suppose I'll ever have much time for Squeak until:
a) I can take part in the community without using a blasted mailing list.
b) They drop $_ completely[*].
(and, of course:
c) I lose my remaining marbles and come to like the Squeak UI ;-)

[*] I cannot see any /reason/ why they haven't done this ages ago -- it seems
so simple to do: just change the sources on the SqueakMap and update the
compiler. Done.

> To that, I replied, (paraphrasing) "ok, but are you simply going to
> double the length of every string?". I was fearing the worst and hoping
> for better. They appear to have better in mind. Take a look. It was
> sounding as though Squeak 3.8 or 4.0 should merge the effort. That is
> unlikely to be in time to help D6, but hopefully it will be in time to
> help the following version.

Currently I have the 3.6 version (is there a pre-packaged download of
anything with their emerging Unicode support?). As far as I can see from a
quick look, they'll face the same problem that Dolphin would: String is a
concrete class (and a variable byte class at that), which makes it difficult
to introduce a separation between abstraction and concrete
representation(s). (Still, at least they haven't pre-empted the name
'UnicodeString' to mean something else ;-)

    -- chris
Chris,
>> Have you looked at the evolving implementation for Squeak? I was applying
>> my usual intermittent pressure about Squeak's compiler "hijacking" $_ for
>> assignment (as an optional editor feature, knock yourself out, letting it
>> bleed into the compiler and sources - ouch!!), and Unicode was proposed
>> as a relatively painless and widely agreeable solution to the problem.
>
> I don't really see how Unicode provides a solution to this, but anyway I
> haven't looked at it for a while.
>
> I don't suppose I'll ever have much time for Squeak until:
> a) I can take part in the community without using a blasted mailing list.
> b) They drop $_ completely[*].
> (and, of course:
> c) I lose my remaining marbles and come to like the Squeak UI ;-)
>
> [*] I cannot see any /reason/ why they haven't done this ages ago -- it
> seems so simple to do: just change the sources on the SqueakMap and update
> the compiler. Done.

Thanks for the "rant" - I will read it later when I have time.

Re Squeak, I really don't care if other people insist on typing '_' instead
of ':=' - just as long as the sources and compiler aren't compromised as a
result.

I agree re Morphic, but only to a point. It can be fixed if necessary; look
at Zurgle for evidence. Sadly, Zurgle is a bit heavy, does a little too much
in one "package", and emulates XP vs. a cleaner UI, but it proves the point
and then some. Squeak needs modal dialogs, but that can be done too. Why
consider using a system that needs these and other things (re)done to make it
usable? Smalltalk, open source, portable, no runtime fees - 'nuff said.

Have a good one,

Bill

--
Wilhelm K. Schwab, Ph.D.
[hidden email]
In reply to this post by Chris Uppal-3
"Chris Uppal" <[hidden email]> wrote in message
news:[hidden email]...
> Chris,
>
> > Has anyone built a Dolphin GUI application that supports Japanese or any
> > other Unicode language?
>
> I haven't done anything like that, but I do have to interface with systems
> using 16-bit char types. I wanted to add a couple of observations and
> questions of my own.
...

Wow. I thought I knew a little about Unicode. It sounds rather complex. I
think I will pass on supporting it for now. Thanks for the information. It
will be interesting to see how Unicode support in Dolphin evolves.

Chris
In reply to this post by Chris Uppal-3
I wrote:
> That leads to an encoding as a sequence of 2-byte numbers
> called UCS-16.

Damn. Sorry. Embarrassing, but it's taken me a week to realise that I was
systematically using "UCS-16" for what is actually called "UCS-2". Similarly,
what I was miscalling "UCS-8" does not exist as a /named/ format, but would
presumably be called UCS-1 if it were.

    -- chris