Patrick Rein uploaded a new version of Multilingual to project The Inbox:
http://source.squeak.org/inbox/Multilingual-pre.238.mcz

==================== Summary ====================

Name: Multilingual-pre.238
Author: pre
Time: 26 May 2018, 10:01:56.809551 am
UUID: 3f43777d-957b-374e-8bc6-2254763879f5
Ancestors: Multilingual-nice.237

Fixes the failing UTF-16 bug due to a missing initialization of the latin1 map. Also adds validation for overlong sequences in UTF-8. Refactors some of the UTF-8 conversion code to make the bitmasks more obvious. The performance hit from the validation seems to be negligible but further testing is required.

=============== Diff against Multilingual-nice.237 ===============

Item was changed:
  ----- Method: ByteTextConverter class>>initialize (in category 'class initialization') -----
  initialize

    self == ByteTextConverter
      ifTrue: [self allSubclassesDo: [:c | c initialize]]
+     ifFalse: [self
-     ifFalse: [self
        initializeDecodeTable;
        initializeEncodeTable;
        initializeLatin1MapAndEncodings]
  !

Item was added:
+ ----- Method: UTF16TextConverter class>>initializeLatin1MapAndEncodings (in category 'utilities') -----
+ initializeLatin1MapAndEncodings
+   "Initialize the latin1Map and latin1Encodings.
+   These variables ensure that conversions from latin1 ByteString is reasonably fast"
+
+   latin1Map := (ByteArray new: 256) atAllPut: 1.
+   latin1Encodings := (0 to: 255) collect: [:i | (ByteArray newFrom: {0 . i}) asString]!

Item was changed:
  ----- Method: UTF8TextConverter class>>decodeByteString: (in category 'conversion') -----
  decodeByteString: aByteString
    "Convert the given string from UTF-8 using the fast path if converting to Latin-1"

+   | outStream lastIndex nextIndex limit byte1 byte2 byte3 byte4 unicode continuationByteMask |
-   | outStream lastIndex nextIndex limit byte1 byte2 byte3 byte4 unicode |
    lastIndex := 1.
    (nextIndex := ByteString findFirstInString: aByteString inSet: latin1Map startingAt: lastIndex) = 0
      ifTrue: [ ^aByteString ].
    limit := aByteString size.
    outStream := (String new: limit) writeStream.
+   continuationByteMask := 2r00111111.
    [
      outStream next: nextIndex - lastIndex putAll: aByteString startingAt: lastIndex.
      byte1 := aByteString byteAt: nextIndex.
+
+     "The byte range checks are separated into single checks to allow for implementing recovery --pre
+     For the rules see: http://www.unicode.org/versions/Unicode7.0.0/UnicodeStandard-7.0.pdf page 125 table 3-7"
+     (byte1 bitAnd: 2r11100000) = 2r11000000 ifTrue: [ "two bytes"
+       nextIndex < limit ifFalse: [ ^self errorMalformedInput: aByteString ].
-     (byte1 bitAnd: 16rE0) = 192 ifTrue: [ "two bytes"
-       nextIndex < limit ifFalse: [ ^ self errorMalformedInput: aByteString ].
        byte2 := aByteString byteAt: (nextIndex := nextIndex + 1).
+       (byte1 < 2r11000010) ifTrue: [ ^ self errorMalformedInput: aByteString ]. "other requirements are covered by initial bit mask"
+       (byte2 bitAnd: 16rC0) = 16r80 ifFalse: [^ self errorMalformedInput: aByteString].
+       unicode := ((byte1 bitAnd: 2r00011111) bitShift: 6) + (byte2 bitAnd: continuationByteMask)].
+
+     (byte1 bitAnd: 2r11110000) = 2r11100000 ifTrue: [ "three bytes"
-       (byte2 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
-       unicode := ((byte1 bitAnd: 31) bitShift: 6) + (byte2 bitAnd: 63)].
-     (byte1 bitAnd: 16rF0) = 224 ifTrue: [ "three bytes"
        (nextIndex + 2) <= limit ifFalse: [ ^ self errorMalformedInput: aByteString ].
        byte2 := aByteString byteAt: (nextIndex := nextIndex + 1).
        (byte2 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
+       ((byte1 bitAnd: 2r00001111) = 2r0 and: [byte2 < 2r10100000]) ifTrue: [ ^self errorMalformedInput: aByteString ].
+       ((byte1 = 2r11101101) and: [(byte2 bitAnd: 2r00100000) = 2r00100000]) ifTrue: [
+         "reserved codepoints"
+         ^self errorMalformedInput: aByteString ].
        byte3 := aByteString byteAt: (nextIndex := nextIndex + 1).
        (byte3 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
+       unicode := ((byte1 bitAnd: 2r00001111) bitShift: 12) + ((byte2 bitAnd: continuationByteMask) bitShift: 6)
+         + (byte3 bitAnd: continuationByteMask)].
+
+     (byte1 bitAnd: 2r11111000) = 2r11110000 ifTrue: [ "four bytes"
-       unicode := ((byte1 bitAnd: 15) bitShift: 12) + ((byte2 bitAnd: 63) bitShift: 6)
-         + (byte3 bitAnd: 63)].
-     (byte1 bitAnd: 16rF8) = 240 ifTrue: [ "four bytes"
        (nextIndex + 3) <= limit ifFalse: [ ^ self errorMalformedInput: aByteString ].
        byte2 := aByteString byteAt: (nextIndex := nextIndex + 1).
        (byte2 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
+       ((byte1 = 2r11110000) and: [byte2 < 2r10010000]) ifTrue: [ ^self errorMalformedInput: aByteString ].
        byte3 := aByteString byteAt: (nextIndex := nextIndex + 1).
        (byte3 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
        byte4 := aByteString byteAt: (nextIndex := nextIndex + 1).
        (byte4 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
+       unicode := ((byte1 bitAnd: 2r00000111) bitShift: 18) +
+         ((byte2 bitAnd: continuationByteMask) bitShift: 12) +
+         ((byte3 bitAnd: continuationByteMask) bitShift: 6) +
+         (byte4 bitAnd: continuationByteMask)].
+
-       unicode := ((byte1 bitAnd: 16r7) bitShift: 18) +
-         ((byte2 bitAnd: 63) bitShift: 12) +
-         ((byte3 bitAnd: 63) bitShift: 6) +
-         (byte4 bitAnd: 63)].
      unicode ifNil: [ ^self errorMalformedInput: aByteString ].
      unicode = 16rFEFF ifFalse: [ "Skip byte order mark"
        outStream nextPut: (Unicode value: unicode) ].
      lastIndex := nextIndex + 1.
      (nextIndex := ByteString findFirstInString: aByteString inSet: latin1Map startingAt: lastIndex) = 0 ] whileFalse.
    ^outStream
      next: aByteString size - lastIndex + 1 putAll: aByteString startingAt: lastIndex;
      contents
  !
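As a rough illustration of the range checks added to decodeByteString: in the diff above, the following workspace sketch applies the same comparisons to the first two bytes of a sequence. The block name `rejects` is hypothetical and not part of the commit. The sketch assumes the second byte has already passed the continuation-byte test (16r80 to 16rBF), and because it has no surrounding three-byte branch it compares byte1 against 2r11100000 directly instead of masking the low nibble as the method does.

```smalltalk
"Hypothetical workspace sketch, not part of Multilingual-pre.238.
 Answers true if the lead byte and second byte would be rejected
 by the range checks that the diff adds."
| rejects |
rejects := [:byte1 :byte2 |
	"two bytes: lead bytes C0 and C1 can only produce overlong encodings"
	((byte1 bitAnd: 2r11100000) = 2r11000000 and: [byte1 < 2r11000010])
		or: ["three bytes: E0 followed by 80..9F is overlong"
			(byte1 = 2r11100000 and: [byte2 < 2r10100000])
				or: ["three bytes: ED followed by A0..BF is a surrogate code point"
					(byte1 = 2r11101101 and: [(byte2 bitAnd: 2r00100000) = 2r00100000])
						or: ["four bytes: F0 followed by 80..8F is overlong"
							byte1 = 2r11110000 and: [byte2 < 2r10010000]]]]].

{	rejects value: 16rC0 value: 16r80.	"true: overlong two-byte form"
	rejects value: 16rC3 value: 16rA4.	"false: regular two-byte sequence"
	rejects value: 16rE0 value: 16r80.	"true: overlong three-byte form"
	rejects value: 16rED value: 16rA0.	"true: would decode to U+D800"
	rejects value: 16rF0 value: 16r90 }	"false: first bytes of U+10000"
```

Keeping each condition as its own clause mirrors the method comment in the diff, which says the checks are kept separate so that recovery can be implemented per case later.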
Hi everyone,

this commit is in the inbox because it changes the style of the conversion method extensively, and I would like someone to review that. I personally think that having the specific bitmasks written out helps when reading the code, provided you know how UTF-8 works.

Bests
Patrick
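To make the point about written-out bitmasks concrete, here is a small workspace sketch (an illustration only, not taken from the commit). It first checks that the binary literals used in the new code denote the same numbers as the hex and decimal literals they replace, then decodes two sample sequences with them. The sample bytes are chosen for the example.

```smalltalk
"Hypothetical workspace sketch: old and new literals are the same values."
{	2r11100000 = 16rE0.
	2r11000000 = 192.
	2r00011111 = 31.
	2r00111111 = 63 }.	"all four entries are true"

"Two-byte sequence C3 A4: five payload bits from the lead byte,
 six from the continuation byte."
((16rC3 bitAnd: 2r00011111) bitShift: 6)
	+ (16rA4 bitAnd: 2r00111111).	"228, i.e. 16rE4"

"Three-byte sequence E2 82 AC: four payload bits from the lead byte,
 six from each continuation byte."
((16rE2 bitAnd: 2r00001111) bitShift: 12)
	+ ((16r82 bitAnd: 2r00111111) bitShift: 6)
	+ (16rAC bitAnd: 2r00111111).	"8364, i.e. 16r20AC"
```

In binary form, the number of set bits in each mask matches the number of payload bits in the corresponding UTF-8 byte, which is harder to see from 31, 15 or 63.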