The Inbox: Multilingual-pre.238.mcz

commits-2
Patrick Rein uploaded a new version of Multilingual to project The Inbox:
http://source.squeak.org/inbox/Multilingual-pre.238.mcz

==================== Summary ====================

Name: Multilingual-pre.238
Author: pre
Time: 26 May 2018, 10:01:56.809551 am
UUID: 3f43777d-957b-374e-8bc6-2254763879f5
Ancestors: Multilingual-nice.237

Fixes the UTF-16 failure caused by a missing initialization of the latin1 map.
Also adds validation for overlong sequences in UTF-8 and refactors some of the UTF-8 conversion code to make the bitmasks more obvious. The performance hit from the validation seems to be negligible, but further testing is required.
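
For readers unfamiliar with the term: an overlong sequence encodes a code point with more bytes than it needs. The workspace sketch below is not part of the commit, and the block name decodeTwoBytes is made up for illustration. It decodes two-byte sequences by hand with the same masks the diff uses, but without any validation, to show why a lead byte below 16rC2 cannot start a well-formed two-byte sequence.

```smalltalk
"Workspace sketch, not part of the commit. decodeTwoBytes is a hypothetical helper."
| decodeTwoBytes overlong shortest |
decodeTwoBytes := [:b1 :b2 |
	((b1 bitAnd: 2r00011111) bitShift: 6) + (b2 bitAnd: 2r00111111)].

"16rC1 16r81 carries the payload 65, i.e. $A, which already fits in one byte."
overlong := decodeTwoBytes value: 16rC1 value: 16r81.

"16rC2 16r80 is the smallest well-formed two-byte sequence: code point 128."
shortest := decodeTwoBytes value: 16rC2 value: 16r80.

Transcript show: overlong printString; cr; show: shortest printString; cr.
```

Every payload reachable from lead bytes 16rC0 and 16rC1 is below 128, so rejecting those two lead bytes is enough to rule out overlong two-byte forms.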

=============== Diff against Multilingual-nice.237 ===============

Item was changed:
  ----- Method: ByteTextConverter class>>initialize (in category 'class initialization') -----
  initialize

        self == ByteTextConverter
                ifTrue: [self allSubclassesDo: [:c | c initialize]]
+               ifFalse: [self
-               ifFalse: [self
                                        initializeDecodeTable;
                                        initializeEncodeTable;
                                        initializeLatin1MapAndEncodings]
  !

Item was added:
+ ----- Method: UTF16TextConverter class>>initializeLatin1MapAndEncodings (in category 'utilities') -----
+ initializeLatin1MapAndEncodings
+       "Initialize the latin1Map and latin1Encodings.
+       These variables ensure that conversions from latin1 ByteString is reasonably fast"
+
+       latin1Map := (ByteArray new: 256) atAllPut: 1.
+       latin1Encodings := (0 to: 255) collect: [:i | (ByteArray newFrom: {0 . i}) asString]!
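
To see what the added method builds, the table expression can be evaluated on its own in a workspace. The sketch below is not part of the commit. It assumes the converter looks entries up by byte value + 1 (the table is 1-based while byte values start at 0), and the variable names are illustrative.

```smalltalk
"Workspace sketch, not part of the commit. Rebuilds the table from the method above."
| encodings encodedA |
encodings := (0 to: 255) collect: [:i | (ByteArray newFrom: {0 . i}) asString].

"Assumed lookup convention: index = byte value + 1."
encodedA := encodings at: $A asciiValue + 1.

encodedA size.		"2"
encodedA asByteArray.	"#[0 65], a zero high byte followed by the Latin-1 value"
```

Filling latin1Map with 1 marks every byte value as needing conversion, which fits UTF-16: no Latin-1 character, not even plain ASCII, can be copied through unchanged.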

Item was changed:
  ----- Method: UTF8TextConverter class>>decodeByteString: (in category 'conversion') -----
  decodeByteString: aByteString
        "Convert the given string from UTF-8 using the fast path if converting to Latin-1"

+       | outStream lastIndex nextIndex limit byte1 byte2 byte3 byte4 unicode continuationByteMask |
-       | outStream lastIndex nextIndex limit byte1 byte2 byte3 byte4 unicode |
        lastIndex := 1.
        (nextIndex := ByteString findFirstInString: aByteString inSet: latin1Map startingAt: lastIndex) = 0
                ifTrue: [ ^aByteString ].
        limit := aByteString size.
        outStream := (String new: limit) writeStream.
+       continuationByteMask := 2r00111111.
        [
                outStream next: nextIndex - lastIndex putAll: aByteString startingAt: lastIndex.
                byte1 := aByteString byteAt: nextIndex.
+
+               "The byte range checks are separated into single checks to allow for implementing recovery --pre
+               For the rules see: http://www.unicode.org/versions/Unicode7.0.0/UnicodeStandard-7.0.pdf page 125 table 3-7"
+               (byte1 bitAnd: 2r11100000) = 2r11000000 ifTrue: [ "two bytes"
+                       nextIndex < limit ifFalse: [ ^self errorMalformedInput: aByteString ].
-               (byte1 bitAnd: 16rE0) = 192 ifTrue: [ "two bytes"
-                       nextIndex < limit ifFalse: [ ^ self errorMalformedInput: aByteString ].
                        byte2 := aByteString byteAt: (nextIndex := nextIndex + 1).
+                       (byte1 < 2r11000010) ifTrue: [ ^ self errorMalformedInput: aByteString ]. "other requirements are covered by initial bit mask"
+                       (byte2 bitAnd: 16rC0) = 16r80 ifFalse: [^ self errorMalformedInput: aByteString].
+                       unicode := ((byte1 bitAnd: 2r00011111) bitShift: 6) + (byte2 bitAnd: continuationByteMask)].
+
+               (byte1 bitAnd: 2r11110000) = 2r11100000 ifTrue: [ "three bytes"
-                       (byte2 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
-                       unicode := ((byte1 bitAnd: 31) bitShift: 6) + (byte2 bitAnd: 63)].
-               (byte1 bitAnd: 16rF0) = 224 ifTrue: [ "three bytes"
                        (nextIndex + 2) <= limit ifFalse: [ ^ self errorMalformedInput: aByteString ].
                        byte2 := aByteString byteAt: (nextIndex := nextIndex + 1).
                        (byte2 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
+                       ((byte1 bitAnd: 2r00001111) = 2r0 and: [byte2 < 2r10100000]) ifTrue: [ ^self errorMalformedInput: aByteString ].
+                       ((byte1 = 2r11101101) and: [(byte2 bitAnd: 2r00100000) = 2r00100000]) ifTrue: [
+                               "reserved codepoints"
+                               ^self errorMalformedInput: aByteString ].
                        byte3 := aByteString byteAt: (nextIndex := nextIndex + 1).
                        (byte3 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
+                       unicode := ((byte1 bitAnd: 2r00001111) bitShift: 12) + ((byte2 bitAnd: continuationByteMask) bitShift: 6)
+                               + (byte3 bitAnd: continuationByteMask)].
+
+               (byte1 bitAnd: 2r11111000) = 2r11110000 ifTrue: [ "four bytes"
-                       unicode := ((byte1 bitAnd: 15) bitShift: 12) + ((byte2 bitAnd: 63) bitShift: 6)
-                               + (byte3 bitAnd: 63)].
-               (byte1 bitAnd: 16rF8) = 240 ifTrue: [ "four bytes"
                        (nextIndex + 3) <= limit ifFalse: [ ^ self errorMalformedInput: aByteString ].
                        byte2 := aByteString byteAt: (nextIndex := nextIndex + 1).
                        (byte2 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
+                       ((byte1 = 2r11110000) and: [byte2 < 2r10010000]) ifTrue: [ ^self errorMalformedInput: aByteString ].
                        byte3 := aByteString byteAt: (nextIndex := nextIndex + 1).
                        (byte3 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
                        byte4 := aByteString byteAt: (nextIndex := nextIndex + 1).
                        (byte4 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
+                       unicode := ((byte1 bitAnd: 2r00000111) bitShift: 18) +
+                                                       ((byte2 bitAnd: continuationByteMask) bitShift: 12) +
+                                                       ((byte3 bitAnd: continuationByteMask) bitShift: 6) +
+                                                       (byte4 bitAnd: continuationByteMask)].
+
-                       unicode := ((byte1 bitAnd: 16r7) bitShift: 18) +
-                                                       ((byte2 bitAnd: 63) bitShift: 12) +
-                                                       ((byte3 bitAnd: 63) bitShift: 6) +
-                                                       (byte4 bitAnd: 63)].
                unicode ifNil: [ ^self errorMalformedInput: aByteString ].
                unicode = 16rFEFF ifFalse: [ "Skip byte order mark"
                        outStream nextPut: (Unicode value: unicode) ].
                lastIndex := nextIndex + 1.
                (nextIndex := ByteString findFirstInString: aByteString inSet: latin1Map startingAt: lastIndex) = 0 ] whileFalse.
        ^outStream
                next: aByteString size - lastIndex + 1 putAll: aByteString startingAt: lastIndex;
                contents
  !
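
The second-byte restrictions behind the new checks can be restated per lead byte. The sketch below is not part of the commit and does not use the converter. It writes the restricted rows of Table 3-7 (Well-Formed UTF-8 Byte Sequences) as a lookup, and the names restrictedRanges and secondByteRange are hypothetical. The 16rF4 row is taken from the table for completeness; the diff above adds checks for the 16rE0, 16rED and 16rF0 rows and for lead bytes below 16rC2.

```smalltalk
"Workspace sketch, not part of the commit. Names are illustrative only."
| restrictedRanges secondByteRange |
restrictedRanges := Dictionary new.
restrictedRanges
	at: 16rE0 put: (16rA0 to: 16rBF);	"a lower second byte would be overlong"
	at: 16rED put: (16r80 to: 16r9F);	"a higher second byte would encode a surrogate"
	at: 16rF0 put: (16r90 to: 16rBF);	"a lower second byte would be overlong"
	at: 16rF4 put: (16r80 to: 16r8F).	"a higher second byte would exceed U+10FFFF"

"Every other multi-byte lead byte accepts the full continuation range."
secondByteRange := [:lead |
	restrictedRanges at: lead ifAbsent: [16r80 to: 16rBF]].

(secondByteRange value: 16rE0) includes: 16r9F.	"false"
(secondByteRange value: 16rED) includes: 16rA0.	"false"
(secondByteRange value: 16rE1) includes: 16r80.	"true"
```

The three-byte check `(byte2 bitAnd: 2r00100000) = 2r00100000` in the diff corresponds to the 16rED row: for a byte already known to be a continuation byte, that bit is set exactly when the byte is 16rA0 or higher.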


Re: The Inbox: Multilingual-pre.238.mcz

Patrick R.
Hi everyone,

This commit is in the inbox because it changes the style of the conversion method extensively, and I would like someone to review that. I personally think that having the specific bitmasks written out makes the code easier to read, provided you know how UTF-8 works.

Best,
Patrick
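
For reviewers comparing the two styles, the binary literals in the new version denote the same values as the hexadecimal and decimal constants they replace. The workspace sketch below is illustrative only and not part of the commit. It pairs each new literal with the old constant from the diff.

```smalltalk
"Workspace sketch, not part of the commit."
| pairs |
pairs := {
	2r11100000 -> 16rE0.	"two-byte lead mask"
	2r11000000 -> 192.	"two-byte lead pattern"
	2r11110000 -> 16rF0.	"three-byte lead mask"
	2r11100000 -> 224.	"three-byte lead pattern"
	2r11111000 -> 16rF8.	"four-byte lead mask"
	2r11110000 -> 240.	"four-byte lead pattern"
	2r00011111 -> 31.	"two-byte payload mask"
	2r00001111 -> 15.	"three-byte payload mask"
	2r00000111 -> 16r7.	"four-byte payload mask"
	2r00111111 -> 63 }.	"continuationByteMask"

pairs allSatisfy: [:each | each key = each value].	"true"
16rE0 printStringBase: 2.	"'11100000'"
```

Since the values are identical, the literal changes are purely a readability change, and the behavioural difference in the commit comes from the added range checks.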