Unicode support in VisualWorks incomplete?


Joachim Geidel
I recently had to build a StreamEncoder for a variant of UTF-8, so I
looked at the classes UTF8StreamEncoder, UTF16StreamEncoder and
UnicodeStreamEncoder. Their implementations do not support encoding
Characters with code points beyond 16rFFFF, the largest value that fits
into two bytes. Unicode 4.0 and up call these characters "supplementary
characters". In UTF-16, each of them is encoded as a pair of surrogate
code points, both of which are in the two-byte range, but
UTF16StreamEncoder does not do this. Also, there is no
UTF32StreamEncoder.
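
To illustrate the surrogate pair arithmetic that UTF-16 requires for
supplementary characters, here is a minimal sketch in Python. It is not
VisualWorks code, and the function name encode_utf16_units is made up
for this example. It assumes the input is already a legal code point.

```python
def encode_utf16_units(code_point):
    """Return the UTF-16 code units (as integers) for one code point.

    Hypothetical helper for illustration; assumes code_point is legal,
    i.e. in 0..0x10FFFF and not itself a surrogate.
    """
    if code_point < 0x10000:
        # Fits into a single 16-bit code unit.
        return [code_point]
    # Supplementary character: split the 20-bit offset into two halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]


if __name__ == "__main__":
    # U+1D11E (MUSICAL SYMBOL G CLEF) lies beyond the two-byte range.
    print([hex(u) for u in encode_utf16_units(0x1D11E)])
```

An encoder that writes the code point as a single two-byte value would
truncate or corrupt such a character instead of emitting both units.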

It is possible to create Characters for all of the legal code points,
but also for invalid code points, and the StreamEncoders do not check
whether the code points are legal. They would produce illegal encodings
for Characters beyond the highest legal code point (16r10FFFF) and for
Characters whose code points are in the ranges reserved for surrogates
(16rD800 to 16rDFFF). The class comment of Character says that code
points beyond 65535 are not fully supported.
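
The check I would expect an encoder to make before writing a Character
can be sketched as follows. This is Python rather than VisualWorks
code, and the function name is_encodable_code_point is an assumption
made for this example.

```python
HIGHEST_CODE_POINT = 0x10FFFF
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_encodable_code_point(code_point):
    """Return True if code_point may legally be encoded on its own.

    Hypothetical helper for illustration: rejects values outside
    0..0x10FFFF and values in the surrogate range.
    """
    if code_point < 0 or code_point > HIGHEST_CODE_POINT:
        return False
    if SURROGATE_FIRST <= code_point <= SURROGATE_LAST:
        return False
    return True
```

An encoder could call such a check for each Character and signal an
error, or substitute a replacement, instead of producing an illegal
byte sequence.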

When decoding a UTF-16 encoded file, each pair of surrogate code points
must be assembled into one supplementary character, not into two
(invalid) Characters with the code points of the surrogates. Also,
isolated surrogate code points and invalid code points should be
skipped, and no Character should be created for them. This could be a
problem when dealing with documents which are produced by buggy
software, that is, documents which contain invalid codes.
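
The decoding behaviour described above can be sketched like this. It is
Python, not VisualWorks code; the function name decode_utf16_units is
made up for this example, and it works on a list of already
byte-order-resolved 16-bit code units.

```python
def decode_utf16_units(units):
    """Turn a list of UTF-16 code units into a list of code points.

    Hypothetical helper for illustration: surrogate pairs are combined
    into one supplementary code point, isolated surrogates are skipped.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: only valid if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                combined = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                code_points.append(combined)
                i += 2
            else:
                i += 1  # isolated high surrogate: skip it
        elif 0xDC00 <= unit <= 0xDFFF:
            i += 1  # isolated low surrogate: skip it
        else:
            code_points.append(unit)
            i += 1
    return code_points
```

Skipping is the policy suggested in this post; many decoders instead
substitute U+FFFD (REPLACEMENT CHARACTER) for each malformed unit, which
keeps the damage visible in the decoded text.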

This looks as if VisualWorks supports only Unicode 3.0, but not the
newer versions of the Unicode standard (5.0 will come soon). Is this
correct, or am I missing something? It might be a problem for projects
which have to deal with documents containing supplementary characters or
their surrogates, e.g. Chinese documents and documents in some other
languages.

BTW, UnicodeStreamEncoder (in Internationalization) and
UTF16StreamEncoder seem to be mostly redundant. UnicodeStreamEncoder
appears to be the older implementation and UTF16StreamEncoder the more
recent one, but their code is largely the same.

Best regards,
Joachim Geidel