Hi,

When I run the following:

    (I18N.EncodedStream encoding: (UnicodeString fromString: '전성진')) contents !

gst emits endless messages related to garbage collection and then crashes with a segmentation fault. The content of the string is UTF-8 encoded Korean text (9 bytes, 3 characters).

And, is there a simple example of processing a UTF-8 encoded string?

Thanks in advance.

_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk

Sungjin Chun wrote:
> Hi,
>
> When I run the following:
>
> (I18N.EncodedStream encoding: (UnicodeString fromString: '전성진'))
> contents !
>
> gst emits endless messages related to garbage collection and then crashes
> with a segmentation fault.

It gets stuck in an infinite loop. The first character, for example, is $<16rC804>, and the "C8" byte is created as a UnicodeCharacter rather than a Character. This causes a recursive creation of another I18N.EncodedStream.

The attached patch fixes the bug; thanks for reporting it. In my testing, I only used Eastern-European characters where all bytes are < 0x80.

> And, is there a simple example of processing a UTF-8 encoded string?

Can you expand?

Paolo

--- orig/i18n/Sets.st
+++ mod/i18n/Sets.st
@@ -718,13 +718,13 @@ next
      been extracted."
     wch := answer := self nextInput codePoint.
     wch := (wch bitShift: -8) + 16r1000000.
-    ^(answer bitAnd: 255) asCharacter
+    ^Character value: (answer bitAnd: 255)
     ].
 
     "Answer any other byte"
     answer := wch bitAnd: 255.
     wch := wch bitShift: -8.
-    ^answer asCharacter
+    ^Character value: answer
 !
 
 flush
@@ -754,7 +754,7 @@ next
     wch := answer := self nextInput codePoint.
     wch := wch bitAnd: 16rFFFFFF.
     count := 3.
-    ^(answer bitShift: -24) asCharacter
+    ^Character value: (answer bitShift: -24)
     ].
 
     "Answer any other byte. We keep things so that the byte we answer
@@ -763,7 +763,7 @@ next
     wch := wch bitAnd: 16rFFFF.
     wch := wch bitShift: 8.
     count := count - 1.
-    ^answer asCharacter
+    ^Character value: answer
 !
 
 flush

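As a rough illustration of the byte splitting discussed in the reply above, the following sketch breaks the code point 16rC804 into its two bytes and builds each one with `Character value:`, the selector the patch switches to. It is only a sketch for the gst prompt: the variable names `wch`, `high` and `low` are made up for this example, and the last line merely shows the `asCharacter` call that the patch replaces, which according to the reply produces a UnicodeCharacter for a byte such as 16rC8.

```smalltalk
"Hypothetical illustration; wch, high and low are names chosen for this sketch."
| wch high low |
wch := 16rC804.                "code point of the first character in the report"
high := wch bitShift: -8.      "upper byte: 16rC8, i.e. 200"
low := wch bitAnd: 255.        "lower byte: 16r04, i.e. 4"

"Character value: is what the patch uses to build each byte-sized result."
(Character value: high) class printNl.
(Character value: low) class printNl.

"For comparison, the selector that the patch replaces."
high asCharacter class printNl !
```

Printing the classes rather than the characters keeps the comparison visible without depending on how the terminal renders a non-ASCII byte.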
Paolo Bonzini wrote:
>> And, is there a simple example of processing a UTF-8 encoded string?
>
> Can you expand?
>
> Paolo

I mean that I would like example code that shows a good pattern for dealing with multibyte strings :-) For example, I'm not sure whether this code is good or not:

    str _ UnicodeString fromString: 'Some UTF-8 Encoded String'.

It seems that code like

    str _ UnicodeString fromString: 'Some UTF-8 Encoded String' encoding: UTF8StringEncoding.

would be better, because it lets UnicodeString know the exact encoding of the given string or array of bytes.

Thanks.

> I mean that I want example code which shows good pattern on dealing
> multibyte string :-) For example, I'm not sure whether this code is good
> or not:
>
> str _ UnicodeString fromString: 'Some UTF-8 Encoded String'.

It is, if your default encoding is UTF-8, or if the encoded string includes a byte-order mark (for this, you need the attached patch :-( ...). For example, this works:

    st> #[254 255 200 4 193 49 201 196] asString encoding!
    'UTF-16BE'

> str _ UnicodeString fromString: 'Some UTF-8 Encoded String' encoding:
> UTF8StringEncoding.

UTF8StringEncoding is written 'UTF-8'.

Paolo

* auto-adding [hidden email]--2004b/smalltalk--devo--2.2--patch-152 to greedy revision library /Users/bonzinip/Archives/revlib
* found immediate ancestor revision in library ([hidden email]--2004b/smalltalk--devo--2.2--patch-151)
* patching for this revision ([hidden email]--2004b/smalltalk--devo--2.2--patch-152)

--- orig/i18n/Sets.st
+++ mod/i18n/Sets.st
@@ -1289,21 +1289,21 @@ encoding
      default locale's default charset"
 
     | encoding |
-    (self size >= 4 and: [ (self at: 1) = 0 and: [ (self at: 2) = 0 and: [
-        (self at: 3) = 254 and: [
-        (self at: 4) = 255 ]]]]) ifTrue: [ ^'UTF-32BE' ].
-    (self size >= 4 and: [ (self at: 4) = 0 and: [ (self at: 3) = 0 and: [
-        (self at: 2) = 254 and: [
-        (self at: 1) = 255 ]]]]) ifTrue: [ ^'UTF-32LE' ].
+    (self size >= 4 and: [ (self valueAt: 1) = 0 and: [ (self valueAt: 2) = 0 and: [
+        (self valueAt: 3) = 254 and: [
+        (self valueAt: 4) = 255 ]]]]) ifTrue: [ ^'UTF-32BE' ].
+    (self size >= 4 and: [ (self valueAt: 4) = 0 and: [ (self valueAt: 3) = 0 and: [
+        (self valueAt: 2) = 254 and: [
+        (self valueAt: 1) = 255 ]]]]) ifTrue: [ ^'UTF-32LE' ].
     (self size >= 2 and: [
-        (self at: 1) = 254 and: [
-        (self at: 2) = 255 ]]) ifTrue: [ ^'UTF-16BE' ].
+        (self valueAt: 1) = 254 and: [
+        (self valueAt: 2) = 255 ]]) ifTrue: [ ^'UTF-16BE' ].
     (self size >= 2 and: [
-        (self at: 2) = 254 and: [
-        (self at: 1) = 255 ]]) ifTrue: [ ^'UTF-16LE' ].
-    (self size >= 3 and: [ (self at: 1) = 16rEF and: [
-        (self at: 2) = 16rBB and: [
-        (self at: 3) = 16rBF ]]]) ifTrue: [ ^'UTF-8' ].
+        (self valueAt: 2) = 254 and: [
+        (self valueAt: 1) = 255 ]]) ifTrue: [ ^'UTF-16LE' ].
+    (self size >= 3 and: [ (self valueAt: 1) = 16rEF and: [
+        (self valueAt: 2) = 16rBB and: [
+        (self valueAt: 3) = 16rBF ]]]) ifTrue: [ ^'UTF-8' ].
 
     encoding := self class defaultEncoding.
     encoding asString = 'UTF-16' ifTrue: [ ^self utf16Encoding ].
@@ -1314,9 +1314,9 @@ utf32Encoding
     "Assuming the receiver is encoded as UTF-16 with a proper
      endianness marker, answer the correct encoding of the receiver."
 
-    (self size >= 4 and: [ (self at: 4) = 0 and: [ (self at: 3) = 0 and: [
-        (self at: 2) = 254 and: [
-        (self at: 1) = 255 ]]]]) ifTrue: [ ^'UTF-32LE' ].
+    (self size >= 4 and: [ (self valueAt: 4) = 0 and: [ (self valueAt: 3) = 0 and: [
+        (self valueAt: 2) = 254 and: [
+        (self valueAt: 1) = 255 ]]]]) ifTrue: [ ^'UTF-32LE' ].
     ^'UTF-32BE'
 !
 
@@ -1325,8 +1325,8 @@ utf16Encoding
      endianness marker, answer the correct encoding of the receiver."
 
     (self size >= 2 and: [
-        (self at: 2) = 254 and: [
-        (self at: 1) = 255 ]]) ifTrue: [ ^'UTF-16LE' ].
+        (self valueAt: 2) = 254 and: [
+        (self valueAt: 1) = 255 ]]) ifTrue: [ ^'UTF-16LE' ].
     ^'UTF-16BE'
 ! !

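The byte-order-mark detection in the patch above can be tried at the gst prompt with something like the following sketch. It assumes the I18N package is already loaded and the patch is applied; the first byte array is the one from the reply, and the second is a constructed example (the UTF-8 mark EF BB BF followed by the three UTF-8 bytes EC A0 84 for code point 16rC804), which is not taken from the thread.

```smalltalk
"Assumes the I18N package is loaded and the patch above has been applied."

"From the thread: byte-order mark FE FF, followed by UTF-16BE data."
#[254 255 200 4 193 49 201 196] asString encoding printNl.

"Constructed example: byte-order mark EF BB BF (239 187 191),
 followed by 16rC804 encoded in UTF-8 as EC A0 84 (236 160 132)."
#[239 187 191 236 160 132] asString encoding printNl !
```

Going by the patched `encoding` method, the second expression should take the branch that answers 'UTF-8'; strings without a mark fall through to the default encoding instead.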