[Q] Bug in EncodedStream?

[Q] Bug in EncodedStream?

Chun, Sungjin

Hi,

When I run the following:

(I18N.EncodedStream encoding: (UnicodeString fromString: '전성진'))
contents !

gst emits endless garbage-collection messages and then crashes with a
segmentation fault. The string is UTF-8 encoded Korean text (9 bytes,
3 characters).

Also, is there a simple example of processing a UTF-8 encoded string?

Thanks in advance.


_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk

Re: [Q] Bug in EncodedStream?

Paolo Bonzini
Sungjin Chun wrote:

> Hi,
>
> When I run the following:
>
> (I18N.EncodedStream encoding: (UnicodeString fromString: '전성진'))
> contents !
>
> gst emits endless garbage-collection messages and then crashes
> with a segmentation fault.

Yes, it is a stupid bug.  When using the system function iconv, gst has
to split the UnicodeCharacters back into 8-bit Characters, and here it
gets stuck in an infinite loop.  The first character, for example, is
$<16rC804>, and the "C8" byte is created as a UnicodeCharacter rather
than a Character.  This causes the recursive creation of another
I18N.EncodedStream.

The attached patch fixes the bug; thanks for reporting it.

In my testing, I only used Eastern-European characters where all bytes
are < 0x80.
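
The byte splitting described above can be sketched with plain integer arithmetic. This is only an illustration under stated assumptions, not the code in Sets.st: the variable names are made up, and the remark that asCharacter can answer a UnicodeCharacter for this byte is taken from the explanation above rather than checked separately.

```smalltalk
"Hypothetical sketch: split the code point of the first character,
 16rC804, into two bytes with the same bitShift:/bitAnd: operations
 that appear in the patch."
| codePoint high low |
codePoint := 16rC804.
high := codePoint bitShift: -8.     "16rC8 = 200"
low := codePoint bitAnd: 255.       "16r04 = 4"
high printNl.
low printNl.

"Character value: is what the patched code uses to build the byte."
(Character value: high) class printNl.

"asCharacter is what the old code used; per the explanation above it
 can answer a UnicodeCharacter for this byte."
high asCharacter class printNl !
```

Comparing the two printed classes shows whether a given gst build still has the behaviour that triggered the recursion.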

> Also, is there a simple example of processing a UTF-8 encoded string?

Can you expand?

Paolo

--- orig/i18n/Sets.st
+++ mod/i18n/Sets.st
@@ -718,13 +718,13 @@ next
          been extracted."
  wch := answer := self nextInput codePoint.
  wch := (wch bitShift: -8) + 16r1000000.
- ^(answer bitAnd: 255) asCharacter
+ ^Character value: (answer bitAnd: 255)
     ].
 
     "Answer any other byte"
     answer := wch bitAnd: 255.
     wch := wch bitShift: -8.
-    ^answer asCharacter
+    ^Character value: answer
 !
 
 flush
@@ -754,7 +754,7 @@ next
  wch := answer := self nextInput codePoint.
  wch := wch bitAnd: 16rFFFFFF.
  count := 3.
- ^(answer bitShift: -24) asCharacter
+ ^Character value: (answer bitShift: -24)
     ].
 
     "Answer any other byte.  We keep things so that the byte we answer
@@ -763,7 +763,7 @@ next
     wch := wch bitAnd: 16rFFFF.
     wch := wch bitShift: 8.
     count := count - 1.
-    ^answer asCharacter
+    ^Character value: answer
 !
 
 flush


Re: [Q] Bug in EncodedStream?

Chun, Sungjin

Paolo Bonzini wrote:

>> Also, is there a simple example of processing a UTF-8 encoded string?
>
> Can you expand?
>
> Paolo

I mean that I would like example code that shows a good pattern for
dealing with multibyte strings :-) For example, I'm not sure whether this
code is good or not:

str _ UnicodeString fromString: 'Some UTF-8 Encoded String'.

It seems that code like

str _ UnicodeString fromString: 'Some UTF-8 Encoded String' encoding:
UTF8StringEncoding.

would be better, because it lets me tell UnicodeString the exact
encoding of the given string or byte array.

Thanks.



Re: [Q] Bug in EncodedStream?

Paolo Bonzini

> I mean that I would like example code that shows a good pattern for
> dealing with multibyte strings :-) For example, I'm not sure whether this
> code is good or not:
>
> str _ UnicodeString fromString: 'Some UTF-8 Encoded String'.

It is, if your default encoding is UTF-8 or if the encoded string
includes a byte-order mark (for the latter, you need the attached patch :-( ...).

For example, this works:

st> #[254 255 200 4 193 49 201 196] asString encoding!
'UTF-16BE'
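
To see what those eight bytes contain, here is a minimal sketch that checks the byte-order mark by hand and then combines the remaining bytes in big-endian pairs. It uses only ByteArray and Integer operations, not the I18N package, and the variable name is made up for the illustration.

```smalltalk
| bytes |
bytes := #[254 255 200 4 193 49 201 196].

"254 255 (16rFE 16rFF) at the start is the UTF-16 big-endian
 byte-order mark."
((bytes at: 1) = 254 and: [ (bytes at: 2) = 255 ]) printNl.

"Combine each following pair of bytes, high byte first, and show the
 resulting 16-bit code unit in hexadecimal."
3 to: bytes size by: 2 do: [ :i |
    ((((bytes at: i) bitShift: 8) + (bytes at: i + 1)) printString: 16)
        displayNl ] !
```

The three code units are 16rC804, 16rC131 and 16rC9C4, that is, the three characters 전성진 from the original report.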

> str _ UnicodeString fromString: 'Some UTF-8 Encoded String' encoding:
> UTF8StringEncoding.

The encoding name is written as the string 'UTF-8', not UTF8StringEncoding.

Paolo


--- orig/i18n/Sets.st
+++ mod/i18n/Sets.st
@@ -1289,21 +1289,21 @@ encoding
      default locale's default charset"
 
     | encoding |
-    (self size >= 4 and: [ (self at: 1) = 0 and: [ (self at: 2) = 0 and: [
-     (self at: 3) = 254 and: [
-     (self at: 4) = 255 ]]]]) ifTrue: [ ^'UTF-32BE' ].
-    (self size >= 4 and: [ (self at: 4) = 0 and: [ (self at: 3) = 0 and: [
-     (self at: 2) = 254 and: [
-     (self at: 1) = 255 ]]]]) ifTrue: [ ^'UTF-32LE' ].
+    (self size >= 4 and: [ (self valueAt: 1) = 0 and: [ (self valueAt: 2) = 0 and: [
+     (self valueAt: 3) = 254 and: [
+     (self valueAt: 4) = 255 ]]]]) ifTrue: [ ^'UTF-32BE' ].
+    (self size >= 4 and: [ (self valueAt: 4) = 0 and: [ (self valueAt: 3) = 0 and: [
+     (self valueAt: 2) = 254 and: [
+     (self valueAt: 1) = 255 ]]]]) ifTrue: [ ^'UTF-32LE' ].
     (self size >= 2 and: [
-     (self at: 1) = 254 and: [
-     (self at: 2) = 255 ]]) ifTrue: [ ^'UTF-16BE' ].
+     (self valueAt: 1) = 254 and: [
+     (self valueAt: 2) = 255 ]]) ifTrue: [ ^'UTF-16BE' ].
     (self size >= 2 and: [
-     (self at: 2) = 254 and: [
-     (self at: 1) = 255 ]]) ifTrue: [ ^'UTF-16LE' ].
-    (self size >= 3 and: [ (self at: 1) = 16rEF and: [
-     (self at: 2) = 16rBB and: [
-     (self at: 3) = 16rBF ]]]) ifTrue: [ ^'UTF-8' ].
+     (self valueAt: 2) = 254 and: [
+     (self valueAt: 1) = 255 ]]) ifTrue: [ ^'UTF-16LE' ].
+    (self size >= 3 and: [ (self valueAt: 1) = 16rEF and: [
+     (self valueAt: 2) = 16rBB and: [
+     (self valueAt: 3) = 16rBF ]]]) ifTrue: [ ^'UTF-8' ].
 
     encoding := self class defaultEncoding.
     encoding asString = 'UTF-16' ifTrue: [ ^self utf16Encoding ].
@@ -1314,9 +1314,9 @@ utf32Encoding
     "Assuming the receiver is encoded as UTF-16 with a proper
      endianness marker, answer the correct encoding of the receiver."
 
-    (self size >= 4 and: [ (self at: 4) = 0 and: [ (self at: 3) = 0 and: [
-     (self at: 2) = 254 and: [
-     (self at: 1) = 255 ]]]]) ifTrue: [ ^'UTF-32LE' ].
+    (self size >= 4 and: [ (self valueAt: 4) = 0 and: [ (self valueAt: 3) = 0 and: [
+     (self valueAt: 2) = 254 and: [
+     (self valueAt: 1) = 255 ]]]]) ifTrue: [ ^'UTF-32LE' ].
     ^'UTF-32BE'
 !
 
@@ -1325,8 +1325,8 @@ utf16Encoding
      endianness marker, answer the correct encoding of the receiver."
 
     (self size >= 2 and: [
-     (self at: 2) = 254 and: [
-     (self at: 1) = 255 ]]) ifTrue: [ ^'UTF-16LE' ].
+     (self valueAt: 2) = 254 and: [
+     (self valueAt: 1) = 255 ]]) ifTrue: [ ^'UTF-16LE' ].
     ^'UTF-16BE'
 ! !
 

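The switch from at: to valueAt: in this patch appears to matter because at: on a String answers a Character, and a Character is never equal to an Integer, so the byte-order-mark tests could not succeed. A small sketch on an ordinary ASCII string (an illustration only, not part of the patch):

```smalltalk
"at: answers a Character, so comparing it with an Integer is false."
(('abc' at: 1) = 97) printNl.

"valueAt: answers the byte value as an Integer, so the comparison
 can succeed."
(('abc' valueAt: 1) = 97) printNl !
```

The same difference applies to the 254, 255, 16rEF, 16rBB and 16rBF comparisons in the encoding, utf16Encoding and utf32Encoding methods.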