[Q] Bug in EncodedStream?

Chun, Sungjin

[Q] Bug in EncodedStream?

Hi,

When I run the following:

(I18N.EncodedStream encoding: (UnicodeString fromString: '전성진'))
contents !

gst emits endless garbage-collection messages and then crashes with a
segmentation fault. The string contains UTF-8 encoded Korean text
(9 bytes, 3 characters).
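
For reference, the "9 bytes, 3 characters" figure can be checked by hand. The
following is a minimal sketch that assumes the string is stored as the UTF-8
bytes listed below (worked out from the code points U+C804, U+C131 and
U+C9C4). It counts every byte that is not a UTF-8 continuation byte; the
variable names are illustrative.

```smalltalk
"Illustrative only: the UTF-8 bytes of the three Korean characters."
| bytes count |
bytes := #[236 160 132 236 132 177 236 167 132].
count := 0.
"A continuation byte has the bit pattern 10xxxxxx; every other byte
 starts a new character."
bytes do: [:b |
    (b bitAnd: 16rC0) = 16r80 ifFalse: [count := count + 1]].
bytes size printNl.    "9"
count printNl !        "3"
```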

Also, is there a simple example of processing a UTF-8 encoded string?

Thanks in advance.

_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk

Paolo Bonzini

Re: [Q] Bug in EncodedStream?

Sungjin Chun wrote:

> Hi,
>
> When I run the following:
>
> (I18N.EncodedStream encoding: (UnicodeString fromString: '전성진'))
> contents !
>
> gst emits endless garbage-collection messages and then crashes with a
> segmentation fault.

Yes, it is a stupid bug. When using the system function iconv, gst has
to split the UnicodeCharacters back into 8-bit Characters, and here it
gets stuck in an infinite loop. The first character for example is
$<16rC804>, and the "C8" byte is created as a UnicodeCharacter rather
than a Character. This causes a recursive creation of another
I18N.EncodedStream.

The attached patch fixes the bug; thanks for reporting it.
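
To make the byte split concrete, here is a minimal sketch of the arithmetic
for the first character. It uses only integer operations plus the
`Character value:` call that appears in the patch; the variable names are
illustrative, and what `asCharacter` answers for values above 127 may differ
between gst versions.

```smalltalk
"Illustrative only: split the code point of $<16rC804> into its two bytes."
| codePoint high low |
codePoint := 16rC804.
high := (codePoint bitShift: -8) bitAnd: 255.   "16rC8 = 200"
low := codePoint bitAnd: 255.                   "16r04 = 4"
high printNl.
low printNl.

"The patched code builds each byte with Character value:, which is meant
 to keep it an 8-bit Character."
(Character value: high) class printNl.

"The original code used asCharacter; per the explanation above, this
 answered a UnicodeCharacter for a byte such as 16rC8."
high asCharacter class printNl !
```

If the two class names printed at the end differ, the byte went through the
path that triggered the recursion.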

In my testing, I only used Eastern-European characters where all bytes
are < 0x80.

> Also, is there a simple example of processing a UTF-8 encoded string?

Can you expand?

Paolo

--- orig/i18n/Sets.st
+++ mod/i18n/Sets.st
@@ -718,13 +718,13 @@ next
been extracted."
wch := answer := self nextInput codePoint.
wch := (wch bitShift: -8) + 16r1000000.
- ^(answer bitAnd: 255) asCharacter
+ ^Character value: (answer bitAnd: 255)
].

"Answer any other byte"
answer := wch bitAnd: 255.
wch := wch bitShift: -8.
- ^answer asCharacter
+ ^Character value: answer
!

flush
@@ -754,7 +754,7 @@ next
wch := answer := self nextInput codePoint.
wch := wch bitAnd: 16rFFFFFF.
count := 3.
- ^(answer bitShift: -24) asCharacter
+ ^Character value: (answer bitShift: -24)
].

"Answer any other byte. We keep things so that the byte we answer
@@ -763,7 +763,7 @@ next
wch := wch bitAnd: 16rFFFF.
wch := wch bitShift: 8.
count := count - 1.
- ^answer asCharacter
+ ^Character value: answer
!

flush

Chun, Sungjin

Re: [Q] Bug in EncodedStream?

Paolo Bonzini wrote:

>> Also, is there a simple example of processing a UTF-8 encoded string?
>>
> Can you expand?
>
> Paolo

What I mean is that I would like example code showing a good pattern for
dealing with multibyte strings :-) For example, I'm not sure whether this
code is good or not:

str _ UnicodeString fromString: 'Some UTF-8 Encoded String'.

It seems that code like

str _ UnicodeString fromString: 'Some UTF-8 Encoded String' encoding:
UTF8StringEncoding.

would be better, because it lets UnicodeString know the exact encoding of
the given string or array of bytes.

Thanks.

Paolo Bonzini

Re: [Q] Bug in EncodedStream?

> What I mean is that I would like example code showing a good pattern for
> dealing with multibyte strings :-) For example, I'm not sure whether this
> code is good or not:
>
> str _ UnicodeString fromString: 'Some UTF-8 Encoded String'.
>
It is fine if your default encoding is UTF-8, or if the encoded string
includes a byte-order mark (for the latter, you need the attached patch :-( ...).

For example, this works:

st> #[254 255 200 4 193 49 201 196] asString encoding!
'UTF-16BE'
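
The byte array in this example is the UTF-16BE form of the same three
characters: 254 255 is the byte-order mark, followed by C8 04, C1 31 and
C9 C4. As a hedged sketch, the other byte-order marks tested in the patch
below can be tried the same way. This assumes the patch is applied and the
I18N package is loaded; the bytes after each mark are an illustrative
encoding of the single character U+C804.

```smalltalk
"UTF-16LE: byte-order mark 255 254, then U+C804 as 04 C8."
#[255 254 4 200] asString encoding printNl.

"UTF-8: byte-order mark EF BB BF, then U+C804 as EC A0 84."
#[239 187 191 236 160 132] asString encoding printNl.

"UTF-32BE: byte-order mark 0 0 254 255, then U+C804 as 00 00 C8 04."
#[0 0 254 255 0 0 200 4] asString encoding printNl !
```

Going by the checks in the patch, these should answer 'UTF-16LE', 'UTF-8'
and 'UTF-32BE' respectively.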

> str _ UnicodeString fromString: 'Some UTF-8 Encoded String' encoding:
> UTF8StringEncoding.

The encoding is written as a string, 'UTF-8', rather than
UTF8StringEncoding.

Paolo

* auto-adding [hidden email]--2004b/smalltalk--devo--2.2--patch-152 to greedy revision library /Users/bonzinip/Archives/revlib
* found immediate ancestor revision in library ([hidden email]--2004b/smalltalk--devo--2.2--patch-151)
* patching for this revision ([hidden email]--2004b/smalltalk--devo--2.2--patch-152)
--- orig/i18n/Sets.st
+++ mod/i18n/Sets.st
@@ -1289,21 +1289,21 @@ encoding
default locale's default charset"

| encoding |
- (self size >= 4 and: [ (self at: 1) = 0 and: [ (self at: 2) = 0 and: [
- (self at: 3) = 254 and: [
- (self at: 4) = 255 ]]]]) ifTrue: [ ^'UTF-32BE' ].
- (self size >= 4 and: [ (self at: 4) = 0 and: [ (self at: 3) = 0 and: [
- (self at: 2) = 254 and: [
- (self at: 1) = 255 ]]]]) ifTrue: [ ^'UTF-32LE' ].
+ (self size >= 4 and: [ (self valueAt: 1) = 0 and: [ (self valueAt: 2) = 0 and: [
+ (self valueAt: 3) = 254 and: [
+ (self valueAt: 4) = 255 ]]]]) ifTrue: [ ^'UTF-32BE' ].
+ (self size >= 4 and: [ (self valueAt: 4) = 0 and: [ (self valueAt: 3) = 0 and: [
+ (self valueAt: 2) = 254 and: [
+ (self valueAt: 1) = 255 ]]]]) ifTrue: [ ^'UTF-32LE' ].
(self size >= 2 and: [
- (self at: 1) = 254 and: [
- (self at: 2) = 255 ]]) ifTrue: [ ^'UTF-16BE' ].
+ (self valueAt: 1) = 254 and: [
+ (self valueAt: 2) = 255 ]]) ifTrue: [ ^'UTF-16BE' ].
(self size >= 2 and: [
- (self at: 2) = 254 and: [
- (self at: 1) = 255 ]]) ifTrue: [ ^'UTF-16LE' ].
- (self size >= 3 and: [ (self at: 1) = 16rEF and: [
- (self at: 2) = 16rBB and: [
- (self at: 3) = 16rBF ]]]) ifTrue: [ ^'UTF-8' ].
+ (self valueAt: 2) = 254 and: [
+ (self valueAt: 1) = 255 ]]) ifTrue: [ ^'UTF-16LE' ].
+ (self size >= 3 and: [ (self valueAt: 1) = 16rEF and: [
+ (self valueAt: 2) = 16rBB and: [
+ (self valueAt: 3) = 16rBF ]]]) ifTrue: [ ^'UTF-8' ].

encoding := self class defaultEncoding.
encoding asString = 'UTF-16' ifTrue: [ ^self utf16Encoding ].
@@ -1314,9 +1314,9 @@ utf32Encoding
"Assuming the receiver is encoded as UTF-16 with a proper
endianness marker, answer the correct encoding of the receiver."

- (self size >= 4 and: [ (self at: 4) = 0 and: [ (self at: 3) = 0 and: [
- (self at: 2) = 254 and: [
- (self at: 1) = 255 ]]]]) ifTrue: [ ^'UTF-32LE' ].
+ (self size >= 4 and: [ (self valueAt: 4) = 0 and: [ (self valueAt: 3) = 0 and: [
+ (self valueAt: 2) = 254 and: [
+ (self valueAt: 1) = 255 ]]]]) ifTrue: [ ^'UTF-32LE' ].
^'UTF-32BE'
!

@@ -1325,8 +1325,8 @@ utf16Encoding
endianness marker, answer the correct encoding of the receiver."

(self size >= 2 and: [
- (self at: 2) = 254 and: [
- (self at: 1) = 255 ]]) ifTrue: [ ^'UTF-16LE' ].
+ (self valueAt: 2) = 254 and: [
+ (self valueAt: 1) = 255 ]]) ifTrue: [ ^'UTF-16LE' ].
^'UTF-16BE'
! !
