The Inbox: Multilingual-pre.238.mcz

commits-2

The Inbox: Multilingual-pre.238.mcz

Patrick Rein uploaded a new version of Multilingual to project The Inbox:
http://source.squeak.org/inbox/Multilingual-pre.238.mcz

==================== Summary ====================

Name: Multilingual-pre.238
Author: pre
Time: 26 May 2018, 10:01:56.809551 am
UUID: 3f43777d-957b-374e-8bc6-2254763879f5
Ancestors: Multilingual-nice.237

Fixes the UTF-16 bug caused by a missing initialization of the latin1 map.
Also adds validation for overlong sequences in UTF-8 and refactors some of the UTF-8 conversion code to make the bitmasks more obvious. The performance hit from the validation seems to be negligible, but further testing is required.
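
For readers unfamiliar with the term: an overlong sequence encodes a code point with more bytes than necessary, for example the two bytes C0 AF for "/" (U+002F). The following is a minimal Python sketch, not Squeak code, that uses Python's built-in strict decoder to show which inputs a validating decoder is expected to reject. The helper name `is_valid_utf8` is hypothetical and only serves the illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# "/" is U+002F; its only well-formed encoding is the single byte 0x2F.
assert is_valid_utf8(b"\x2f")
# Overlong two- and three-byte forms of the same code point are rejected.
assert not is_valid_utf8(b"\xc0\xaf")
assert not is_valid_utf8(b"\xe0\x80\xaf")
# Overlong four-byte form of U+20AC (the well-formed encoding is E2 82 AC).
assert not is_valid_utf8(b"\xf0\x82\x82\xac")
```

Rejecting these forms matters because a decoder that accepts them lets the same character appear under several byte spellings, which can bypass byte-level checks.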

=============== Diff against Multilingual-nice.237 ===============

Item was changed:
----- Method: ByteTextConverter class>>initialize (in category 'class initialization') -----
initialize

self == ByteTextConverter
ifTrue: [self allSubclassesDo: [:c | c initialize]]
+ ifFalse: [self
- ifFalse: [self
initializeDecodeTable;
initializeEncodeTable;
initializeLatin1MapAndEncodings]
!

Item was added:
+ ----- Method: UTF16TextConverter class>>initializeLatin1MapAndEncodings (in category 'utilities') -----
+ initializeLatin1MapAndEncodings
+ "Initialize the latin1Map and latin1Encodings.
+ These variables ensure that conversions from latin1 ByteString are reasonably fast"
+
+ latin1Map := (ByteArray new: 256) atAllPut: 1.
+ latin1Encodings := (0 to: 255) collect: [:i | (ByteArray newFrom: {0 . i}) asString]!
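
As a rough illustration of what the two tables in the method above hold, here is a Python sketch (not Squeak code). It assumes, from how `latin1Map` is used with `findFirstInString:inSet:startingAt:` further down, that a 1 in the map marks a byte value that needs conversion, and that `latin1Encodings` holds the precomputed encoding for each byte value. The function name `encode_latin1_as_utf16be` is hypothetical.

```python
# 1 marks "this byte value needs conversion". For UTF-16 that is every
# byte value, because each Latin-1 character becomes two bytes.
latin1_map = bytes([1] * 256)

# Precomputed encoding per Latin-1 value: high byte 0, low byte i
# (big-endian order, matching the {0 . i} literal in the method above).
latin1_encodings = [bytes([0, i]) for i in range(256)]


def encode_latin1_as_utf16be(data: bytes) -> bytes:
    """Encode Latin-1 bytes by table lookup (hypothetical helper)."""
    out = bytearray()
    for b in data:
        # Bytes flagged in the map are replaced by their table entry;
        # unflagged bytes (none, in the UTF-16 case) would be copied as-is.
        out += latin1_encodings[b] if latin1_map[b] else bytes([b])
    return bytes(out)
```

The lookup agrees with Python's own codec for Latin-1 input, for example `encode_latin1_as_utf16be(b"A\xe9") == "A\u00e9".encode("utf-16-be")`.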

Item was changed:
----- Method: UTF8TextConverter class>>decodeByteString: (in category 'conversion') -----
decodeByteString: aByteString
"Convert the given string from UTF-8 using the fast path if converting to Latin-1"

+ | outStream lastIndex nextIndex limit byte1 byte2 byte3 byte4 unicode continuationByteMask |
- | outStream lastIndex nextIndex limit byte1 byte2 byte3 byte4 unicode |
lastIndex := 1.
(nextIndex := ByteString findFirstInString: aByteString inSet: latin1Map startingAt: lastIndex) = 0
ifTrue: [ ^aByteString ].
limit := aByteString size.
outStream := (String new: limit) writeStream.
+ continuationByteMask := 2r00111111.
[
outStream next: nextIndex - lastIndex putAll: aByteString startingAt: lastIndex.
byte1 := aByteString byteAt: nextIndex.
+
+ "The byte range checks are separated into single checks to allow for implementing recovery --pre
+ For the rules see: http://www.unicode.org/versions/Unicode7.0.0/UnicodeStandard-7.0.pdf page 125 table 3-7"
+ (byte1 bitAnd: 2r11100000) = 2r11000000 ifTrue: [ "two bytes"
+ nextIndex < limit ifFalse: [ ^self errorMalformedInput: aByteString ].
- (byte1 bitAnd: 16rE0) = 192 ifTrue: [ "two bytes"
- nextIndex < limit ifFalse: [ ^ self errorMalformedInput: aByteString ].
byte2 := aByteString byteAt: (nextIndex := nextIndex + 1).
+ (byte1 < 2r11000010) ifTrue: [ ^ self errorMalformedInput: aByteString ]. "other requirements are covered by initial bit mask"
+ (byte2 bitAnd: 16rC0) = 16r80 ifFalse: [^ self errorMalformedInput: aByteString].
+ unicode := ((byte1 bitAnd: 2r00011111) bitShift: 6) + (byte2 bitAnd: continuationByteMask)].
+
+ (byte1 bitAnd: 2r11110000) = 2r11100000 ifTrue: [ "three bytes"
- (byte2 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
- unicode := ((byte1 bitAnd: 31) bitShift: 6) + (byte2 bitAnd: 63)].
- (byte1 bitAnd: 16rF0) = 224 ifTrue: [ "three bytes"
(nextIndex + 2) <= limit ifFalse: [ ^ self errorMalformedInput: aByteString ].
byte2 := aByteString byteAt: (nextIndex := nextIndex + 1).
(byte2 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
+ ((byte1 bitAnd: 2r00001111) = 2r0 and: [byte2 < 2r10100000]) ifTrue: [ ^self errorMalformedInput: aByteString ].
+ ((byte1 = 2r11101101) and: [(byte2 bitAnd: 2r00100000) = 2r00100000]) ifTrue: [
+ "reserved codepoints"
+ ^self errorMalformedInput: aByteString ].
byte3 := aByteString byteAt: (nextIndex := nextIndex + 1).
(byte3 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
+ unicode := ((byte1 bitAnd: 2r00001111) bitShift: 12) + ((byte2 bitAnd: continuationByteMask) bitShift: 6)
+ + (byte3 bitAnd: continuationByteMask)].
+
+ (byte1 bitAnd: 2r11111000) = 2r11110000 ifTrue: [ "four bytes"
- unicode := ((byte1 bitAnd: 15) bitShift: 12) + ((byte2 bitAnd: 63) bitShift: 6)
- + (byte3 bitAnd: 63)].
- (byte1 bitAnd: 16rF8) = 240 ifTrue: [ "four bytes"
(nextIndex + 3) <= limit ifFalse: [ ^ self errorMalformedInput: aByteString ].
byte2 := aByteString byteAt: (nextIndex := nextIndex + 1).
(byte2 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
+ ((byte1 = 2r11110000) and: [byte2 < 2r10010000]) ifTrue: [ ^self errorMalformedInput: aByteString ].
byte3 := aByteString byteAt: (nextIndex := nextIndex + 1).
(byte3 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
byte4 := aByteString byteAt: (nextIndex := nextIndex + 1).
(byte4 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
+ unicode := ((byte1 bitAnd: 2r00000111) bitShift: 18) +
+ ((byte2 bitAnd: continuationByteMask) bitShift: 12) +
+ ((byte3 bitAnd: continuationByteMask) bitShift: 6) +
+ (byte4 bitAnd: continuationByteMask)].
+
- unicode := ((byte1 bitAnd: 16r7) bitShift: 18) +
- ((byte2 bitAnd: 63) bitShift: 12) +
- ((byte3 bitAnd: 63) bitShift: 6) +
- (byte4 bitAnd: 63)].
unicode ifNil: [ ^self errorMalformedInput: aByteString ].
unicode = 16rFEFF ifFalse: [ "Skip byte order mark"
outStream nextPut: (Unicode value: unicode) ].
lastIndex := nextIndex + 1.
(nextIndex := ByteString findFirstInString: aByteString inSet: latin1Map startingAt: lastIndex) = 0 ] whileFalse.
^outStream
next: aByteString size - lastIndex + 1 putAll: aByteString startingAt: lastIndex;
contents
!
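
The validation rules in the method above can be sketched outside Squeak as follows. This is a simplified Python rendering under stated assumptions, not the committed implementation: it leaves out the Latin-1 fast path and the byte-order-mark skipping, and the names `decode_utf8`, `accepts` and `_payload` are hypothetical. The final upper-bound check for U+10FFFF is an addition of this sketch; the diff shown here does not appear to include it.

```python
CONT_MASK = 0b00111111  # payload bits of a continuation byte


def _payload(byte: int) -> int:
    """Return the 6 payload bits of a continuation byte (10xxxxxx)."""
    if (byte & 0b11000000) != 0b10000000:
        raise ValueError("malformed input: expected continuation byte")
    return byte & CONT_MASK


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8, rejecting overlong forms and surrogates."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b1 = data[i]
        if b1 < 0b10000000:  # one byte (ASCII)
            size, cp = 1, b1
        elif (b1 & 0b11100000) == 0b11000000:  # two bytes
            size = 2
            if i + size > n or b1 < 0b11000010:  # truncated or overlong
                raise ValueError("malformed input")
            cp = ((b1 & 0b00011111) << 6) | _payload(data[i + 1])
        elif (b1 & 0b11110000) == 0b11100000:  # three bytes
            size = 3
            if i + size > n:
                raise ValueError("malformed input: truncated")
            b2 = data[i + 1]
            if b1 == 0b11100000 and b2 < 0b10100000:
                raise ValueError("malformed input: overlong")
            if b1 == 0b11101101 and (b2 & 0b00100000):
                raise ValueError("malformed input: surrogate range")
            cp = (((b1 & 0b00001111) << 12)
                  | (_payload(b2) << 6)
                  | _payload(data[i + 2]))
        elif (b1 & 0b11111000) == 0b11110000:  # four bytes
            size = 4
            if i + size > n:
                raise ValueError("malformed input: truncated")
            b2 = data[i + 1]
            if b1 == 0b11110000 and b2 < 0b10010000:
                raise ValueError("malformed input: overlong")
            cp = (((b1 & 0b00000111) << 18)
                  | (_payload(b2) << 12)
                  | (_payload(data[i + 2]) << 6)
                  | _payload(data[i + 3]))
            if cp > 0x10FFFF:  # not part of the diff above
                raise ValueError("malformed input: beyond U+10FFFF")
        else:  # stray continuation byte or invalid lead byte
            raise ValueError("malformed input: invalid lead byte")
        out.append(chr(cp))
        i += size
    return "".join(out)


def accepts(data: bytes) -> bool:
    """Return True if decode_utf8 decodes the bytes without error."""
    try:
        decode_utf8(data)
        return True
    except ValueError:
        return False
```

With this sketch, `accepts(b"\xc0\xaf")` and `accepts(b"\xed\xa0\x80")` are both false, while well-formed input such as `"\u20ac".encode("utf-8")` decodes to the original string.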

Patrick R.

Re: The Inbox: Multilingual-pre.238.mcz

Hi everyone,

This commit is in the inbox because it changes the style of the conversion method extensively, and I would like someone to review that. I personally think that having the specific bitmasks written out helps when reading the code, provided you know how UTF-8 works.

Bests
Patrick
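
To make the style change under review concrete, the following Python sketch (not Squeak code) pairs the old hex and decimal literals from the diff with their binary replacements and shows how the binary masks line up with the UTF-8 lead-byte patterns. The function name `sequence_length` is hypothetical.

```python
# (old literal, new binary literal) pairs taken from the diff.
OLD_TO_NEW = [
    (0xE0, 0b11100000), (192, 0b11000000), (31, 0b00011111),
    (0xF0, 0b11110000), (224, 0b11100000), (15, 0b00001111),
    (0xF8, 0b11111000), (240, 0b11110000), (0x7, 0b00000111),
    (63, 0b00111111),
]
# The refactoring changes notation only; every pair is the same number.
assert all(old == new for old, new in OLD_TO_NEW)


def sequence_length(lead: int) -> int:
    """Return the sequence length implied by a lead byte, or 0 if invalid."""
    if (lead & 0b10000000) == 0b00000000:  # 0xxxxxxx
        return 1
    if (lead & 0b11100000) == 0b11000000:  # 110xxxxx
        return 2
    if (lead & 0b11110000) == 0b11100000:  # 1110xxxx
        return 3
    if (lead & 0b11111000) == 0b11110000:  # 11110xxx
        return 4
    return 0  # continuation byte (10xxxxxx) or invalid lead byte
```

In binary form each mask and expected value can be read directly against the bit patterns in the comments, which is the readability argument made above.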