[bug] UnicodeString encoding weirdness

Robin Redeker-2

[bug] UnicodeString encoding weirdness

Issue status update for
http://smalltalk.gnu.org/node/113
Post a follow up:
http://smalltalk.gnu.org/project/comments/add/113

Project: GNU Smalltalk
Version: <none>
Component: Base classes
Category: bug reports
Priority: normal
Assigned to: Unassigned
Reported by: elmex
Updated by: elmex
Status: active
Attachment: http://smalltalk.gnu.org/files/issues/unitest2.st.txt (849 bytes)

Take the attached program, which prints the following here:

3
44
E3 <-> EF
81 <-> BF
AA <-> BE
E3 <-> E6
81 <-> A8
BE <-> B0
E3 <-> E7
81 <-> B8
9F <-> B0

But it should print (at least as far as my understanding of
Unicode and encodings goes):

3
33
E3 <-> E3
81 <-> 81
AA <-> AA
E3 <-> E3
81 <-> 81
BE <-> BE
E3 <-> E3
81 <-> 81
9F <-> 9F

_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk

Paolo Bonzini

Re: [bug] UnicodeString encoding weirdness

Issue status update for
http://smalltalk.gnu.org/project/issue/113
Post a follow up:
http://smalltalk.gnu.org/project/comments/add/113

Project: GNU Smalltalk
Version: <none>
Component: Base classes
Category: bug reports
Priority: normal
Assigned to: Unassigned
Reported by: elmex
Updated by: bonzinip
Status: active
Attachment: http://smalltalk.gnu.org/files/issues/gst-encoding-lazy.patch (594 bytes)

EF-BF-BE is U+FFFE encoded in UTF-8, that is, the Unicode "byte order
mark" (BOM, U+FEFF) read with its two bytes swapped. The BOM was born as
a way to distinguish big- and little-endian UTF-16. Since it's not
really a character, Iconv tries to strip it when converting to a
UnicodeString, but it is failing to do so in this case.

Now, under Mac OS X I get the expected result; under Linux I get yours.
The reason is that my Mac is big-endian, so Iconv produces big-endian
UTF-16, while Linux produces little-endian UTF-16. Since UTF-16 is
assumed to be big-endian by default, the Mac happens to get the right
thing, while on Linux the data is read with the wrong byte order. So
later on the "pipe peekFor: $<16rFEFF>" statement, which is meant to
strip the BOM, does not match and the BOM stays in the string.
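
The byte dump in the report can be reproduced outside GNU Smalltalk. The following is only a sketch in Python, not the code path Iconv or EncodedString actually takes: it assumes the test string is the three characters whose UTF-8 bytes appear in the expected output (E3 81 AA, E3 81 BE, E3 81 9F), writes them as little-endian UTF-16 with a BOM, and then reads them back as if they were big-endian. All variable names are made up for the illustration.

```python
# Illustration only: plain Python, not GNU Smalltalk or Iconv code.
# Assumed test string, inferred from the expected byte dump in the report:
# U+306A U+307E U+305F -> UTF-8 bytes E3 81 AA E3 81 BE E3 81 9F.
text = "\u306a\u307e\u305f"
original = text.encode("utf-8")

# Step 1: a little-endian machine emits UTF-16 with a BOM, low byte first.
utf16_le = b"\xff\xfe" + text.encode("utf-16-le")

# Step 2: the reader assumes big-endian and never looks at the BOM bytes.
misread = utf16_le.decode("utf-16-be")

# Step 3: the check for U+FEFF fails, because the first code point
# is now U+FFFE, so nothing is stripped.
bom_stripped = misread[1:] if misread.startswith("\ufeff") else misread

# Step 4: encode the damaged string back to UTF-8 and compare byte by byte.
garbled = bom_stripped.encode("utf-8")
for a, b in zip(original, garbled):
    print(f"{a:02X} <-> {b:02X}")
```

Under these assumptions the right-hand column starts with EF BF BE (the unstripped U+FFFE) followed by E6 A8 B0 and E7 B8 B0, which are U+6A30 and U+7E30, i.e. U+306A and U+307E with their bytes swapped.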

The attached patch fixes this by making EncodedString look for a BOM
when retrieving the encoding, rather than when setting it.
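
The idea of looking for the BOM at the point where the data is read can be sketched as follows. This is not the contents of gst-encoding-lazy.patch, only a Python illustration of the general technique; the function name `decode_utf16` and its `default` parameter are hypothetical.

```python
import codecs


def decode_utf16(data: bytes, default: str = "utf-16-be") -> str:
    """Choose the byte order from a leading BOM, if present, and drop the BOM.

    Hypothetical helper for illustration; falls back to `default`
    (big-endian here) when the data carries no BOM.
    """
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return data[2:].decode("utf-16-be")
    return data.decode(default)


# Both byte orders now yield the same three characters, without a BOM.
text = "\u306a\u307e\u305f"
from_le = decode_utf16(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))
from_be = decode_utf16(codecs.BOM_UTF16_BE + text.encode("utf-16-be"))
print(from_le == from_be == text)
```

Because the byte order is decided from the data itself each time it is decoded, the result no longer depends on the endianness of the machine that produced the UTF-16.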

Robin Redeker-2

Re: Re: [bug] UnicodeString encoding weirdness

In reply to this post by Paolo Bonzini

On Mon, Oct 22, 2007 at 02:01:23AM -0700, Paolo Bonzini wrote:
> Issue status update for
> http://smalltalk.gnu.org/project/issue/113
>
[.snip.]
>
> The attached patch fixes this by making EncodedString look for a BOM
> when retrieving the encoding, rather than when setting it.

Thanks, it works now!

I hope you don't mind me filing so many bug reports :) I've recently
been working on my chat implementation, which uses JSON, and I'm eager
to support Unicode.

Robin
