BUG: Issue about UnicodeString

ChanHong Kim

BUG: Issue about UnicodeString

Hi, all.

I'd like to report some bugs in Dolphin's UnicodeString, and I'll try to
explain them as clearly as I can.

First of all, since I'm Korean, I have to use Hangeul (Korean characters),
so my environment is Korean Windows XP SP2 with code page 949 and the
Korean locale.

As you know, Hangeul and other Asian characters, such as Kana (Japanese),
Traditional Chinese, and Simplified Chinese, occupy 2 bytes in memory.
This is a big problem for programs that assume a 1-byte character set, like
other Smalltalk systems, Squeak etc.... :-)

Fortunately, Dolphin supports 2-byte locales very well. I have used it
since 1998, version 1.1. Of course, I occasionally ran into font problems
in the workspace, but I found a way to change the workspace font to one
that supports Hangeul.

But since Dolphin began implementing UnicodeString, I have had very serious
problems.

1) #printString
In a workspace, I displayed the following:

#[16rB0 16rA1] asString asUnicodeString

The result is a walkback with the following message: 'Index 44033 is out
of bounds'. This has a big impact when I use the inspector, or in the many
other situations that require printing a UnicodeString which contains
Hangeul.

The problem is Character>>value:. This method assumes that the character
set consists of 256 characters. In a 1-byte locale that is certainly
right, but in a 2-byte locale it is not.
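The reported index can be cross-checked outside Dolphin. The Python sketch below decodes the two code page 949 bytes from the workspace expression and compares the resulting code point with the index in the walkback. It assumes, which the post does not confirm, that the character lookup is a 1-based index into a 256-entry table; the names TABLE_SIZE and one_based_index are only illustrative.

```python
# Bytes from the workspace expression: #[16rB0 16rA1]
ga_bytes = bytes([0xB0, 0xA1])

# Decode with code page 949 to get the single Hangul syllable they encode.
ga = ga_bytes.decode("cp949")
code_point = ord(ga)

# Assumption (not confirmed by the post): the character lookup is a 1-based
# index into a 256-entry table, so the index is the code point plus one.
TABLE_SIZE = 256
one_based_index = code_point + 1

print(hex(code_point))                # 0xac00
print(one_based_index)                # 44033
print(one_based_index <= TABLE_SIZE)  # False
```

Under that assumption the number in the error message is simply U+AC00 plus one, which is consistent with a table sized for 1-byte characters.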

I have created a new method in UnicodeString:

printOn: aStream
aStream nextPutAll: self asString

Fortunately, this lets me avoid the critical situation, but many methods in
UnicodeString still don't operate correctly.

2) Creating a test case class
I decided to report this buggy situation to Object Arts, but they can't
use Hangeul or other 2-byte characters (double-byte character sets), so I
created a test case class that uses inline character codes in ByteArrays.

Here is the filed-out class:

----------------------------------------------

"Filed out from Dolphin Smalltalk XP"!

TestCase subclass: #UnicodeStringTestCase
instanceVariableNames: 'ansiA ansiB ansiC ansiGA ansiNA ansiDA ansiABC
ansiGANADA ansiA_GA_B_NA_C_DA uniA uniB uniC uniGA uniNA uniDA uniABC
uniGANADA uniA_GA_B_NA_C_DA'
classVariableNames: ''
poolDictionaries: ''
classInstanceVariableNames: ''!
UnicodeStringTestCase guid: (GUID fromString:
'{B8626AFD-3C3F-4494-A027-5A0CBF297E6C}')!
UnicodeStringTestCase comment: ''!
!UnicodeStringTestCase categoriesForClass!Unclassified! !
!UnicodeStringTestCase methodsFor!

setUp
"Hangeul (Korean characters) can be displayed in Korean Windows only, so
I present Hangeul as character codes using ByteArrays.
The character codes conform to Hangeul according to
code page 949 (ANSI) and Unicode."

"ANSI"

ansiA := 'A'.
ansiB := 'B'.
ansiC := 'C'.
ansiGA := #[16rB0 16rA1] asString.
ansiNA := #[16rB3 16rAA] asString.
ansiDA := #[16rB4 16rD9] asString.
ansiABC := ansiA , ansiB , ansiC.
ansiGANADA := ansiGA , ansiNA , ansiDA.
ansiA_GA_B_NA_C_DA := ansiA , ansiGA , ansiB , ansiNA , ansiC ,
ansiDA.

"Unicode"
uniA := 'A' asUnicodeString.
uniB := 'B' asUnicodeString.
uniC := 'C' asUnicodeString.

"ByteArray>>asUnicodeString is absent"
"On Intel, the code is stored in reverse byte order: [xxyy] becomes [yy xx]."
uniGA := UnicodeString fromAddress: #[16r00 16rAC]
yourAddress. "#$AC00"
uniNA := UnicodeString fromAddress: #[16r98 16rB0]
yourAddress. "#$B098"
uniDA := UnicodeString fromAddress: #[16rE4 16rB2]
yourAddress. "#$B2E4"
uniABC := uniA , uniB , uniC.
uniGANADA := uniGA , uniNA , uniDA.
uniA_GA_B_NA_C_DA := uniA , uniGA , uniB , uniNA , uniC , uniDA!

testConverting
"N.B. In Delphi, conversion between AnsiString and WideString occurs
automatically, so WideChar() type casting is not used there.
Is there any standard or guideline for handling identity/equality
between a String and a UnicodeString which contain the same contents?
In the Smalltalk spirit, an explicit conversion message may be
needed."

self assert: (uniABC = ansiABC asUnicodeString).
self assert: (uniGANADA = ansiGANADA asUnicodeString).
self assert: (uniA_GA_B_NA_C_DA = ansiA_GA_B_NA_C_DA asUnicodeString).
!

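testIndexing
"#at: answers a single element, so compare against the first element
of the expected one-character string rather than the string itself."

self assert: ((uniABC at: 3) = (uniC at: 1)).
self assert: ((uniGANADA at: 3) = (uniDA at: 1)).
self assert: ((uniA_GA_B_NA_C_DA at: 3) = (uniB at: 1)).!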

testintegrity

self assert: (ansiABC asUnicodeString asString = ansiABC).
self assert: (ansiGANADA asUnicodeString asString = ansiGANADA).
self assert: (ansiA_GA_B_NA_C_DA asUnicodeString asString =
ansiA_GA_B_NA_C_DA).
!

testLength

self assert: (uniA size = 1).
self assert: (uniABC size = 3).

self assert: (uniGA size = 1).
self assert: (uniGANADA size = 3).

self assert: (ansiA_GA_B_NA_C_DA size = 9).! !
!UnicodeStringTestCase categoriesFor: #setUp!public! !
!UnicodeStringTestCase categoriesFor: #testConverting!public! !
!UnicodeStringTestCase categoriesFor: #testIndexing!public! !
!UnicodeStringTestCase categoriesFor: #testintegrity!public! !
!UnicodeStringTestCase categoriesFor: #testLength!public! !

----------------------------------------------
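The fixture values in #setUp can be checked independently of Dolphin. The following Python sketch copies the byte values from the test class and confirms that each code page 949 pair and its little-endian UTF-16 counterpart name the same character, and that the lengths and index expected by the tests are consistent. The dictionary names ANSI and UNI_LE are illustrative only, and Python indexes from 0 where Smalltalk indexes from 1.

```python
# Fixture values copied from #setUp: CP949 bytes and little-endian UTF-16 bytes.
ANSI = {"GA": bytes([0xB0, 0xA1]), "NA": bytes([0xB3, 0xAA]), "DA": bytes([0xB4, 0xD9])}
UNI_LE = {"GA": bytes([0x00, 0xAC]), "NA": bytes([0x98, 0xB0]), "DA": bytes([0xE4, 0xB2])}

# Each CP949 pair should decode to the same character as its UTF-16 pair.
same = {name: ANSI[name].decode("cp949") == UNI_LE[name].decode("utf-16-le")
        for name in ANSI}

# Equivalent of ansiA_GA_B_NA_C_DA and its Unicode conversion.
ansi_mixed = b"A" + ANSI["GA"] + b"B" + ANSI["NA"] + b"C" + ANSI["DA"]
uni_mixed = ansi_mixed.decode("cp949")

print(same)                                     # every entry is True
print(len(ansi_mixed), len(uni_mixed))          # 9 6
print(uni_mixed[2] == "B")                      # True (Smalltalk index 3)
print(uni_mixed.encode("cp949") == ansi_mixed)  # True (round trip)
```

So the expectations encoded in testLength, testIndexing and testintegrity are at least self-consistent; the failures come from the environment, not from the test data.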

Amazingly, and humorously, the tests don't even get past the beginning of
the #setUp method. This method uses #, (the concatenation operation) on
Unicode strings that contain Hangeul. All tests failed and I saw a red
light in my SUnit Browser.

Walkback window
- http://pub.paran.com/andrea92/Unicode/Walkback.JPG

SUnit browser
- http://pub.paran.com/andrea92/Unicode/FailedTest.JPG

Debugger
- http://pub.paran.com/andrea92/Unicode/Debugger.JPG

What can I do? ^^

3) Does this happen only on my machine???
If all of this happens only on my computer and image, I'm rather
worried.
Please run the UnicodeStringTestCase class above to see whether the
UnicodeString bug shows up in your image.

4) I leave it all to OA... and some suggestions
I have no idea how to fix this buggy situation, so I leave it all to OA. I
hope that my test class will help the developers at OA to correct the many
bugs in UnicodeString. If this problem can't be reproduced in your image,
please reply.

I'm looking forward to D6 with a big smile on my face; it is a real flagship toy! ^^

BTW, I have a suggestion about UnicodeString.
Delphi and VC support Unicode through wide strings and wide characters.
Why doesn't Dolphin support a UnicodeCharacter? I know that Dolphin has
some difficulty implementing Unicode natively. However, I hope that Dolphin
will support Unicode fully, so that the many people in Asia and on other
continents who use 2-byte character sets will be satisfied with Dolphin.

The source below is Delphi code which I have tested in D7. It may help you
as a reference for my test class. In Delphi, the code below works perfectly.

Thanks to all and pardon my poor English.

Have a nice day!!

------------------------------------
unit Main;

interface

uses
Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls,
Forms,
Dialogs, StdCtrls;

type
TForm1 = class(TForm)
Button1: TButton;
procedure Button1Click(Sender: TObject);
private
{ Private declarations }
public
{ Public declarations }
end;

var
Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.Button1Click(Sender: TObject);
var
ansiA, ansiB, ansiC, ansiGA, ansiNA, ansiDA: AnsiString;
uniA, uniB, uniC, uniGA, uniNA, uniDA: Widestring;
ansiABC, ansiGANADA, ansiA_GA_B_NA_C_DA: Ansistring;
uniABC, uniGANADA, uniA_GA_B_NA_C_DA: WideString;
{
type
PInteger = ^Integer;
var
P: PInteger;
}

begin
{ setUp }
{ Hangeul (Korean characters) can be displayed in Korean Windows only, so
I present Hangeul as character codes using #$xx or #$xxxx.
The character codes conform to Hangeul according to
code page 949 (ANSI) and Unicode.}

// ANSI
ansiA := 'A';
ansiB := 'B';
ansiC := 'C';

ansiGA := #$B0#$A1;
ansiNA := #$B3#$AA;
ansiDA := #$B4#$D9;

ansiABC := ansiA + ansiB + ansiC;
ansiGANADA := ansiGA + ansiNA + ansiDA;
ansiA_GA_B_NA_C_DA := ansiA + ansiGA + ansiB + ansiNA + ansiC + ansiDA;

// Unicode
uniA := 'A';
uniB := 'B';
uniC := 'C';

uniGA := #$AC00;
uniNA := #$B098;
uniDA := #$B2E4;

uniABC := uniA + uniB + uniC;
uniGANADA := uniGA + uniNA + uniDA;
uniA_GA_B_NA_C_DA := uniA + uniGA + uniB + uniNA + uniC + uniDA;

// ShowMessage(ansiGA);

{ P := @uniGA[1];
ShowMessage(IntToStr((P^))); }

{ =-=-=-=-=-=-= Test =-=-=-=-=-=-= }
{ testLength }
assert( Length(uniA) = 1 );
assert( Length(uniABC) = 3 );

assert( Length(uniGA) = 1 );
assert( Length(uniGANADA) = 3);

assert( Length( ansiA_GA_B_NA_C_DA) = 9);

{ testConverting }
{ N.B. In Delphi, conversion between AnsiString and WideString occurs
automatically, so don't use WideChar() type casting.}
assert( uniABC = ansiABC );
assert( uniGANADA = ansiGANADA );
assert( uniA_GA_B_NA_C_DA = ansiA_GA_B_NA_C_DA );

{ testintegrity }
assert( AnsiString(WideString(ansiABC)) = ansiABC );
assert( AnsiString(WideString(ansiGANADA)) = ansiGANADA );
assert( AnsiString(WideString(ansiA_GA_B_NA_C_DA)) = ansiA_GA_B_NA_C_DA
);

{ testIndexing }
assert( uniABC[3] = uniC );
assert( uniGANADA[3] =uniDA );
assert( uniA_GA_B_NA_C_DA[3] = uniB );

Close;
end;

end.
----------------------------------------------

Chris Uppal-3

Re: Issue about UnicodeString

ChanHong Kim wrote:

> But since Dolphin began implementing UnicodeString, I have had very serious
> problems.

Unfortunately, despite the existence of the class called 'UnicodeString',
Dolphin has almost no Unicode support. It is limited to representing, as
instances of 'UnicodeString', sequences of Unicode characters with code
points less than 256. Which means that you are unable to use UnicodeString
for anything except Western European characters (if that).

I've been fighting with the problem for some time (for various reasons). There
doesn't seem to be any easy way to create a quick fix for the problems (the
main reason is that String assumes that characters have a fixed-width
representation as binary, which is simply not true for any reasonable encoding
of Unicode). If your application is such that you /can/ assume that every
character is 16-bit (which is not true in general, but is true for the subset
of Unicode used for English, and -- I think -- for Hangul too), then you may be
able to hack it.
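To make the fixed-width point concrete, here is a small Python sketch (the helper name widths is just for illustration) that reports how many UTF-8 bytes and UTF-16 code units each character needs. The sample holds a Latin letter, the Hangul syllable U+AC00, and U+1D11E, a character outside the 16-bit range.

```python
def widths(text):
    """Return (UTF-8 bytes, UTF-16 code units) needed for each character."""
    return [(len(ch.encode("utf-8")), len(ch.encode("utf-16-le")) // 2)
            for ch in text]

# Latin A, Hangul syllable GA, musical G clef (outside the 16-bit range).
sample = "A\uAC00\U0001D11E"
print(widths(sample))  # [(1, 1), (3, 1), (4, 2)]
```

Neither encoding is fixed-width in general, but the Hangul syllable does fit in a single 16-bit unit, which is the case where the hack described above could work.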

In that case, I'd be tempted to use a custom subclass of UnicodeString which:

Overrode #at: to answer either a Character or an Integer (or maybe just create
a new instance of Character on-the-fly if the code point is > 255 (I've tried it
and it does work, though not without problems)).

Overrode #at:put: to accept either a Character or Integer.

Overrode #printOn: to be something like:

printOn: aStream
self displayString printOn: aStream.

There would still be problems, I'm sure, but maybe you could just fix the ones
that affected you instead of trying for a general solution.

However, I think that what's really needed is a general solution. But that's
/difficult/... (I know, I'm attempting to create one myself -- but it's a long
way from being complete). I suspect that the best that OA can reasonably do is
to provide a more complete wrapping of the Microsoft "wide string" stuff
(ideally, /not/ calling it UnicodeString -- because it isn't). That would be a
class similar to the current UnicodeString in implementation, but it shouldn't
(seem to) contain Characters, since the elements of a MS wide string are not
characters themselves, but are 16-bit integers which /encode/ characters in a
non-obvious way. Another thing that they could reasonably do, is to allow us
to create Characters (or similar) with code points through (at least) the full
Unicode range (0 .. 0x10FFFF), and ideally through the full ISO/IEC 10646 range
(0 .. 0x7FFFFFFF).
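As an illustration of how 16-bit elements encode characters, the sketch below implements the standard UTF-16 mapping from a code point in 0 .. 0x10FFFF to one or two 16-bit integers. It is written in Python purely to show the arithmetic; the function name utf16_code_units is hypothetical and nothing here reflects how Dolphin or the Microsoft APIs are implemented.

```python
def utf16_code_units(code_point):
    """Encode one Unicode code point as a list of 16-bit UTF-16 code units."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range 0 .. 0x10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        return [code_point]          # fits in a single 16-bit unit
    offset = code_point - 0x10000    # 20 bits, split into two 10-bit halves
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]

print([hex(u) for u in utf16_code_units(0xAC00)])    # ['0xac00']
print([hex(u) for u in utf16_code_units(0x1D11E)])   # ['0xd834', '0xdd1e']
print([hex(u) for u in utf16_code_units(0x10FFFF)])  # ['0xdbff', '0xdfff']
```

A class whose elements are these integers can index and store them at fixed width; it is only the mapping from elements to characters that needs the extra step.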

-- chris

Andy Bower-3

Re: BUG: Issue about UnicodeString

ChanHong Kim wrote:

> 4) I leave it all to OA... and some suggestions
> I have no idea how to fix this buggy situation, so I leave it all to OA. I
> hope that my test class will help the developers at OA to correct the many
> bugs in UnicodeString. If this problem can't be reproduced in your image,
> please reply.

The UnicodeString situation will be unchanged in D6.

--
Andy Bower
Dolphin Support
www.object-arts.com