[bug] UnicodeString encoding weirdness

Robin Redeker-2
Issue status update for
http://smalltalk.gnu.org/node/113
Post a follow up:
http://smalltalk.gnu.org/project/comments/add/113

 Project:      GNU Smalltalk
 Version:      <none>
 Component:    Base classes
 Category:     bug reports
 Priority:     normal
 Assigned to:  Unassigned
 Reported by:  elmex
 Updated by:   elmex
 Status:       active
 Attachment:   http://smalltalk.gnu.org/files/issues/unitest2.st.txt (849 bytes)

Take the attached program, which prints the following here:

3
44
E3 <-> EF
81 <-> BF
AA <-> BE
E3 <-> E6
81 <-> A8
BE <-> B0
E3 <-> E7
81 <-> B8
9F <-> B0

But it should print (at least as far as my understanding of
Unicode and encodings goes):

3
33
E3 <-> E3
81 <-> 81
AA <-> AA
E3 <-> E3
81 <-> 81
BE <-> BE
E3 <-> E3
81 <-> 81
9F <-> 9F




_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk

Re: [bug] UnicodeString encoding weirdness

Paolo Bonzini
Issue status update for
http://smalltalk.gnu.org/project/issue/113
Post a follow up:
http://smalltalk.gnu.org/project/comments/add/113

 Project:      GNU Smalltalk
 Version:      <none>
 Component:    Base classes
 Category:     bug reports
 Priority:     normal
 Assigned to:  Unassigned
 Reported by:  elmex
 Updated by:   bonzinip
 Status:       active
 Attachment:   http://smalltalk.gnu.org/files/issues/gst-encoding-lazy.patch (594 bytes)

EF-BF-BE is U+FFFE, the byte-swapped form of the Unicode "byte order
mark" (BOM, U+FEFF), encoded in UTF-8.  The BOM was born as a way to
distinguish big- and little-endian UTF-16.  Since it's not really a
character, Iconv tries to strip it when converting to a UnicodeString,
but it is failing to do so in this case.

Now, under Mac OS X I get the expected result; under Linux I get yours.
The reason is that my Mac is big-endian, so Iconv produces big-endian
UTF-16, while Linux produces little-endian UTF-16.  Since the default
byte order of UTF-16 is big-endian, the Mac happens to get the right
thing, while Linux messes up the encoding.  So later on, the "pipe
peekFor: $<16rFEFF>" statement that should strip the BOM does not match,
and the byte-swapped BOM stays in the string.
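
For illustration only, the symptom can be reproduced outside GNU Smalltalk.  The following is a minimal Python sketch, not the code from the attachment: the sample text is an assumption derived from the expected UTF-8 bytes in the report (E3 81 AA, E3 81 BE, E3 81 9F), and the variable names are hypothetical.  It reads little-endian UTF-16 with a BOM as if it were big-endian, then compares the UTF-8 bytes pairwise.

```python
# Assumed sample text: U+306A, U+307E, U+305F, whose UTF-8 form matches
# the "expected" column of the report.
text = "\u306a\u307e\u305f"

# Little-endian UTF-16 with a leading BOM (bytes FF FE), as described for Linux.
le_with_bom = b"\xff\xfe" + text.encode("utf-16-le")

# Misinterpret the bytes as big-endian: every 16-bit unit is byte-swapped,
# so the BOM U+FEFF turns into U+FFFE and is no longer recognized as a BOM.
misread = le_with_bom.decode("utf-16-be")

expected = text.encode("utf-8")
actual = misread.encode("utf-8")

# Pairwise comparison in the same "expected <-> actual" format as the report.
for e, a in zip(expected, actual):
    print(f"{e:02X} <-> {a:02X}")
```

The first three pairs show E3 81 AA against EF BF BE, which is U+FFFE in UTF-8; the misread string also has four code points instead of three because the swapped BOM is counted as a character.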

The attached patch fixes this by making EncodedString look for a BOM
when retrieving the encoding, rather than when setting it.
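
The idea of checking for a BOM when the encoding is retrieved, rather than when it is set, can be sketched as follows.  This is a hedged Python illustration and not the contents of gst-encoding-lazy.patch: the class `EncodedBytes` and its methods are hypothetical stand-ins, and falling back to big-endian when no BOM is present is an assumption based on the explanation above.

```python
import codecs


class EncodedBytes:
    """Hypothetical stand-in for an encoded string (not EncodedString itself)."""

    def __init__(self, data, encoding):
        self.data = bytearray(data)
        # Store the declared encoding as given; no BOM sniffing happens here,
        # because the bytes may still be changed after construction.
        self._declared = encoding

    @property
    def encoding(self):
        # Inspect the current bytes on every lookup, so a BOM written after
        # construction is still taken into account.
        if self._declared.upper() == "UTF-16":
            if self.data.startswith(codecs.BOM_UTF16_BE):
                return "utf-16-be"
            if self.data.startswith(codecs.BOM_UTF16_LE):
                return "utf-16-le"
            return "utf-16-be"  # assumption: no BOM means big-endian
        return self._declared

    def number_of_characters(self):
        text = bytes(self.data).decode(self.encoding)
        # The BOM is not a character, so it is excluded from the count.
        if text.startswith("\ufeff"):
            text = text[1:]
        return len(text)


s = EncodedBytes(bytes(2), "UTF-16")
s.data[0:2] = b"\xfe\xff"   # big-endian BOM written after construction
print(s.encoding, s.number_of_characters())
s.data[0:2] = b"\xff\xfe"   # little-endian BOM
print(s.encoding, s.number_of_characters())
```

Because the byte order is resolved at lookup time, a string that consists only of a BOM in either byte order reports zero characters.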





Re: [bug] UnicodeString encoding weirdness

Paolo Bonzini
In reply to this post by Robin Redeker-2
Issue status update for
http://smalltalk.gnu.org/project/issue/113
Post a follow up:
http://smalltalk.gnu.org/project/comments/add/113

 Project:      GNU Smalltalk
 Version:      <none>
 Component:    Base classes
 Category:     bug reports
 Priority:     normal
-Assigned to:  Unassigned
+Assigned to:  bonzinip
 Reported by:  elmex
 Updated by:   bonzinip
-Status:       active
+Status:       fixed

Fixed in patch-612, which is the same patch I posted plus this test case:

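  "A UTF-16 string consisting only of a BOM must contain no characters."
  str := EncodedString fromString: (String new: 2) encoding: 'UTF-16'.
  "Big-endian BOM: bytes FE FF."
  str valueAt: 1 put: 254; valueAt: 2 put: 255.
  self assert: str numberOfCharacters = 0.
  "Little-endian BOM: bytes FF FE."
  str valueAt: 1 put: 255; valueAt: 2 put: 254.
  self assert: str numberOfCharacters = 0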

Thanks!





Re: Re: [bug] UnicodeString encoding weirdness

Robin Redeker-2
In reply to this post by Paolo Bonzini
On Mon, Oct 22, 2007 at 02:01:23AM -0700, Paolo Bonzini wrote:
> Issue status update for
> http://smalltalk.gnu.org/project/issue/113
>
[.snip.]
>
> The attached patch fixes this by making EncodedString look for a BOM
> when retrieving the encoding, rather than when setting it.

Thanks, it works now!

I hope you don't mind me filing so many bug reports :) I've recently been
working on my chat implementation, which uses JSON, and I'm eager to
support Unicode.


Robin

