Hi, all.
From now on, I'll report some bugs in Dolphin's UnicodeString, as understandably as I can.

First of all, since I'm Korean, I have to use Hangeul (Korean characters), so my environment is Korean Windows XP SP2 with code page 949 and the Korean locale. As you know, Hangeul and other Asian characters, such as Kana (Japanese), Traditional Chinese and Simplified Chinese, occupy 2 bytes each in memory. That is a big problem for programs which assume a 1-byte character set, like other Smalltalk systems, Squeak etc. :-)

Happily, Dolphin supports 2-byte locales very well; I have used it since 1998, version 1.1. Of course, I occasionally ran into font problems in the workspace, but I found a way to change the workspace font to one that supports Hangeul. But since Dolphin began implementing UnicodeString, I have had a very serious problem.

1) The #printString problem

In a workspace, I displayed:

    #[16rB0 16rA1] asString asUnicodeString

The result is a walkback with the following message: 'Index 44033 is out of bounds'. This has a big impact when I use the inspector, and in the many other cases that require printing a UnicodeString which contains Hangeul.

The problem is Character>>value:. This method assumes that the character set is composed of 256 characters. Certainly, in a 1-byte locale that's right; in a 2-byte locale, however, it's not. I have created a new method in UnicodeString:

    printOn: aStream
        aStream nextPutAll: self asString

Fortunately, this lets me avoid the critical situation; however, many methods in UnicodeString still don't operate correctly.

2) Creating a test case class

I have decided to report this buggy situation to Object Arts. But they can't use Hangeul or other 2-byte characters (double-byte character set), so I created a test case class that writes the character codes inline using ByteArrays. Here is the filed-out class:

----------------------------------------------
"Filed out from Dolphin Smalltalk XP"!

TestCase subclass: #UnicodeStringTestCase
    instanceVariableNames: 'ansiA ansiB ansiC ansiGA ansiNA ansiDA ansiABC ansiGANADA ansiA_GA_B_NA_C_DA uniA uniB uniC uniGA uniNA uniDA uniABC uniGANADA uniA_GA_B_NA_C_DA'
    classVariableNames: ''
    poolDictionaries: ''
    classInstanceVariableNames: ''!
UnicodeStringTestCase guid: (GUID fromString: '{B8626AFD-3C3F-4494-A027-5A0CBF297E6C}')!
UnicodeStringTestCase comment: ''!
!UnicodeStringTestCase categoriesForClass!Unclassified! !
!UnicodeStringTestCase methodsFor!

setUp
    "Hangeul (Korean characters) can be displayed in Korean Windows only, so I present
    Hangeul as character codes using ByteArrays. The character codes conform to Hangeul
    according to code page 949 (ANSI) and Unicode."

    "ANSI"
    ansiA := 'A'.
    ansiB := 'B'.
    ansiC := 'C'.
    ansiGA := #[16rB0 16rA1] asString.
    ansiNA := #[16rB3 16rAA] asString.
    ansiDA := #[16rB4 16rD9] asString.
    ansiABC := ansiA , ansiB , ansiC.
    ansiGANADA := ansiGA , ansiNA , ansiDA.
    ansiA_GA_B_NA_C_DA := ansiA , ansiGA , ansiB , ansiNA , ansiC , ansiDA.

    "Unicode"
    uniA := 'A' asUnicodeString.
    uniB := 'B' asUnicodeString.
    uniC := 'C' asUnicodeString.
    "ByteArray>>asUnicodeString is absent"
    "On Intel, the code is stored in reverse byte order: [xxyy] becomes [yy xx]."
    uniGA := UnicodeString fromAddress: #[16r00 16rAC] yourAddress. "#$AC00"
    uniNA := UnicodeString fromAddress: #[16r98 16rB0] yourAddress. "#$B098"
    uniDA := UnicodeString fromAddress: #[16rE4 16rB2] yourAddress. "#$B2E4"
    uniABC := uniA , uniB , uniC.
    uniGANADA := uniGA , uniNA , uniDA.
    uniA_GA_B_NA_C_DA := uniA , uniGA , uniB , uniNA , uniC , uniDA!

testConverting
    "N.B. In Delphi, conversion between AnsiString and WideString occurs automatically,
    so don't use WideChar() type casting.
    Is there any standard or guideline for handling identity/equality between a String and
    a UnicodeString with the same contents? In the Smalltalk-ish spirit, an explicit
    conversion message may be needed."

    self assert: (uniABC = ansiABC asUnicodeString).
    self assert: (uniGANADA = ansiGANADA asUnicodeString).
    self assert: (uniA_GA_B_NA_C_DA = ansiA_GA_B_NA_C_DA asUnicodeString).!

testIndexing
    self assert: ((uniABC at: 3) = uniC).
    self assert: ((uniGANADA at: 3) = uniDA).
    self assert: ((uniA_GA_B_NA_C_DA at: 3) = uniB).!

testintegrity
    self assert: (ansiABC asUnicodeString asString = ansiABC).
    self assert: (ansiGANADA asUnicodeString asString = ansiGANADA).
    self assert: (ansiA_GA_B_NA_C_DA asUnicodeString asString = ansiA_GA_B_NA_C_DA).!

testLength
    self assert: (uniA size = 1).
    self assert: (uniABC size = 3).
    self assert: (uniGA size = 1).
    self assert: (uniGANADA size = 3).
    self assert: (ansiA_GA_B_NA_C_DA size = 9).! !

!UnicodeStringTestCase categoriesFor: #setUp!public! !
!UnicodeStringTestCase categoriesFor: #testConverting!public! !
!UnicodeStringTestCase categoriesFor: #testIndexing!public! !
!UnicodeStringTestCase categoriesFor: #testintegrity!public! !
!UnicodeStringTestCase categoriesFor: #testLength!public! !
----------------------------------------------

Amazingly and humorously, the tests don't even get past the #setUp method. That method uses #, (the concatenation operation) on Unicode strings which contain Hangeul. All the tests failed and I saw a red light in my SUnit Browser.

Walkback window - http://pub.paran.com/andrea92/Unicode/Walkback.JPG
SUnit browser - http://pub.paran.com/andrea92/Unicode/FailedTest.JPG
Debugger - http://pub.paran.com/andrea92/Unicode/Debugger.JPG

What can I do? ^^

3) Does this happen only on my machine???

If all of this happens only on my computer and in my image, I'm quite worried. Please run the UnicodeStringTestCase class above to see whether the UnicodeString bug shows up in your image.

4) I leave it all to OA... and some suggestions

I have no idea how to fix this buggy situation, so I leave it all to OA. I hope that my test class will help the developers at OA to correct the many bugs in UnicodeString. If this problem does not reproduce in your image, please reply. In D6, I hope to have a smile all over my face with a real flagship toy!
^^

BTW, I have a suggestion about UnicodeString. Delphi and VC support Unicode with WideString and wide characters. Why doesn't Dolphin support a UnicodeCharacter? I know that Dolphin has some difficulty implementing Unicode natively. However, I hope that Dolphin will support Unicode fully, so that the many people who use 2-byte character sets, in Asia and on other continents, will be satisfied with Dolphin.

The source below is Delphi code which I have tested in D7. It may help you as a reference for my test class. In Delphi, the code below works perfectly.

Thanks to all and pardon my poor English.
Have a nice day!!

------------------------------------
unit Main;

interface

uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,
  Dialogs, StdCtrls;

type
  TForm1 = class(TForm)
    Button1: TButton;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.Button1Click(Sender: TObject);
var
  ansiA, ansiB, ansiC, ansiGA, ansiNA, ansiDA: AnsiString;
  uniA, uniB, uniC, uniGA, uniNA, uniDA: WideString;
  ansiABC, ansiGANADA, ansiA_GA_B_NA_C_DA: AnsiString;
  uniABC, uniGANADA, uniA_GA_B_NA_C_DA: WideString;
{
type
  PInteger = ^Integer;
var
  P: PInteger;
}
begin
  { setUp }
  { Hangeul (Korean characters) can be displayed in Korean Windows only, so I
    present Hangeul as character codes using #$xx or #$xxxx.
    The character codes conform to Hangeul according to code page 949 (ANSI)
    and Unicode. }

  // ANSI
  ansiA := 'A';
  ansiB := 'B';
  ansiC := 'C';
  ansiGA := #$B0#$A1;
  ansiNA := #$B3#$AA;
  ansiDA := #$B4#$D9;
  ansiABC := ansiA + ansiB + ansiC;
  ansiGANADA := ansiGA + ansiNA + ansiDA;
  ansiA_GA_B_NA_C_DA := ansiA + ansiGA + ansiB + ansiNA + ansiC + ansiDA;

  // Unicode
  uniA := 'A';
  uniB := 'B';
  uniC := 'C';
  uniGA := #$AC00;
  uniNA := #$B098;
  uniDA := #$B2E4;
  uniABC := uniA + uniB + uniC;
  uniGANADA := uniGA + uniNA + uniDA;
  uniA_GA_B_NA_C_DA := uniA + uniGA + uniB + uniNA + uniC + uniDA;

  // ShowMessage(ansiGA);
  {
  P := @uniGA[1];
  ShowMessage(IntToStr((P^)));
  }

  { =-=-=-=-=-=-= Test =-=-=-=-=-=-= }

  { testLength }
  assert( Length(uniA) = 1 );
  assert( Length(uniABC) = 3 );
  assert( Length(uniGA) = 1 );
  assert( Length(uniGANADA) = 3 );
  assert( Length(ansiA_GA_B_NA_C_DA) = 9 );

  { testConverting }
  { N.B. In Delphi, conversion between AnsiString and WideString occurs
    automatically, so don't use WideChar() type casting. }
  assert( uniABC = ansiABC );
  assert( uniGANADA = ansiGANADA );
  assert( uniA_GA_B_NA_C_DA = ansiA_GA_B_NA_C_DA );

  { testintegrity }
  assert( AnsiString(WideString(ansiABC)) = ansiABC );
  assert( AnsiString(WideString(ansiGANADA)) = ansiGANADA );
  assert( AnsiString(WideString(ansiA_GA_B_NA_C_DA)) = ansiA_GA_B_NA_C_DA );

  { testIndexing }
  assert( uniABC[3] = uniC );
  assert( uniGANADA[3] = uniDA );
  assert( uniA_GA_B_NA_C_DA[3] = uniB );

  Close;
end;

end.
----------------------------------------------
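Editorial sketch (not part of the original post): the byte values used in both test programs can be cross-checked outside Dolphin and Delphi. The Python snippet below assumes that Python's built-in "cp949" codec is an adequate stand-in for Windows code page 949; the names `SYLLABLES`, `describe` and `walkback_index` are made up for illustration. It decodes the three ANSI byte pairs, compares the resulting code points with the Unicode values given in the post, and shows the little-endian UTF-16 bytes. It also shows that the walkback's index 44033 is exactly one more than 16rAC00, which would be consistent with a 1-based lookup into a 256-entry character table, although the post does not confirm that mechanism.

```python
# Cross-check of the byte values used in the test cases above.
# Assumption: Python's "cp949" codec stands in for Windows code page 949.

# name -> (code page 949 bytes from the post, Unicode code point from the post)
SYLLABLES = {
    "GA": (bytes([0xB0, 0xA1]), 0xAC00),
    "NA": (bytes([0xB3, 0xAA]), 0xB098),
    "DA": (bytes([0xB4, 0xD9]), 0xB2E4),
}


def describe(name):
    """Decode one syllable and report its code point and byte layouts."""
    ansi_bytes, expected_code_point = SYLLABLES[name]
    text = ansi_bytes.decode("cp949")
    return {
        "code_point": ord(text),
        "matches_post": ord(text) == expected_code_point,
        "ansi_size": len(ansi_bytes),              # bytes in the ANSI string
        "utf16le": text.encode("utf-16-le"),       # low byte first, as on Intel
    }


ga = describe("GA")
# A 1-based table index for code point N would be N + 1 (hypothesis only).
walkback_index = ga["code_point"] + 1

if __name__ == "__main__":
    for name in SYLLABLES:
        info = describe(name)
        print(name, hex(info["code_point"]), info["utf16le"].hex(), info["matches_post"])
    print("GA code point + 1 =", walkback_index)
```

Each Hangeul syllable here is 2 bytes in code page 949 and also 2 bytes (one 16-bit unit) in UTF-16, which is why the ANSI test string has size 9 while the corresponding Unicode strings are expected to have one element per character.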
ChanHong Kim wrote:
> But When Dolphin begins implementing UnicodeString, I have very serious
> problem.

Unfortunately, despite the existence of the class called 'UnicodeString', Dolphin has almost no Unicode support. It is limited to representing, as instances of 'UnicodeString', sequences of Unicode characters with code points less than 256. Which means that you are unable to use UnicodeString for anything except Western European characters (if that).

I've been fighting with the problem for some time (for various reasons). There doesn't seem to be any easy way to create a quick fix for the problems (the main reason is that String assumes that characters have a fixed-width representation as binary, which is simply not true for any reasonable encoding of Unicode).

If your application is such that you /can/ assume that every character is 16-bit (which is not true in general, but is true for the subset of Unicode used for English, and -- I think -- for Hangul too), then you may be able to hack it. In that case, I'd be tempted to use a custom subclass of UnicodeString which:

    Overrode #at: to answer either a Character or an Integer (or maybe just create a new instance of Character on-the-fly if the code point is > 255 (I've tried it and it does work, though not without problems)).

    Overrode #at:put: to accept either a Character or Integer.

    Overrode #printOn: to be something like:
        printOn: aStream
            self displayString printOn: aStream.

There would still be problems, I'm sure, but maybe you could just fix the ones that affected you instead of trying for a general solution.

However, I think that what's really needed is a general solution. But that's /difficult/... (I know, I'm attempting to create one myself -- but it's a long way from being complete).

I suspect that the best that OA can reasonably do is to provide a more complete wrapping of the Microsoft "wide string" stuff (ideally, /not/ calling it UnicodeString -- because it isn't).
That would be a class similar to the current UnicodeString in implementation, but it shouldn't (seem to) contain Characters, since the elements of a MS wide string are not characters themselves, but are 16-bit integers which /encode/ characters in a non-obvious way.

Another thing that they could reasonably do is to allow us to create Characters (or similar) with code points through (at least) the full Unicode range (0 .. 0x10FFFF), and ideally through the full ISO/IEC 10646 range (0 .. 0x7FFFFFFF).

    -- chris
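Editorial sketch (not from the thread): the point that wide-string elements are 16-bit integers which encode characters, rather than characters themselves, can be seen by listing the UTF-16 code units for a few code points. The Python helper `utf16_units` below is hypothetical and only uses the standard library. A Hangul syllable such as U+AC00 fits in a single unit, but any code point above U+FFFF is split into two surrogate units, so indexing by 16-bit element is not the same as indexing by character.

```python
import struct


def utf16_units(code_point):
    """Return the 16-bit code units that UTF-16 uses for one code point."""
    data = chr(code_point).encode("utf-16-le")
    # Each code unit is 2 bytes; unpack them as little-endian unsigned shorts.
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


hangul_ga = utf16_units(0xAC00)          # Hangul syllable from the test case
g_clef = utf16_units(0x1D11E)            # a code point above U+FFFF
last_code_point = utf16_units(0x10FFFF)  # top of the Unicode range

if __name__ == "__main__":
    for label, units in [
        ("U+AC00", hangul_ga),
        ("U+1D11E", g_clef),
        ("U+10FFFF", last_code_point),
    ]:
        print(label, [hex(unit) for unit in units])
```

Under the "every character is 16-bit" assumption discussed above, only the one-unit case ever occurs, which is what would make the suggested subclass hack workable for English and Hangul text.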
In reply to this post by ChanHong Kim
ChanHong Kim wrote:
> 4) I put all to OA... and some suggetions
> I have no idea about this buggy situation, so I put all thing to OA. I
> hope that my test class will help developer in OA to correct many bugs
> in UnicodeString. Maybe, this problem isn't replay on your image,
> please reply.

The UnicodeString situation will be unchanged in D6.

--
Andy Bower
Dolphin Support
www.object-arts.com