The Trunk: Multilingual-ar.126.mcz

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

The Trunk: Multilingual-ar.126.mcz

commits-2
Andreas Raab uploaded a new version of Multilingual to project The Trunk:
http://source.squeak.org/trunk/Multilingual-ar.126.mcz

==================== Summary ====================

Name: Multilingual-ar.126
Author: ar
Time: 1 September 2010, 6:47:21.978 pm
UUID: f6128173-e8e1-374f-950f-cec8ec4c2194
Ancestors: Multilingual-nice.125

Add a preference controlling the strictness of UTF-8 conversions. While it is good to be strict in development, it can be very painful to be overly strict in production settings when the input is an external source with uncertain encoding.

=============== Diff against Multilingual-nice.125 ===============

Item was changed:
  ----- Method: UTF8TextConverter>>nextFromStream: (in category 'conversion') -----
  nextFromStream: aStream
 
+ | char1 value1 char2 value2 unicode char3 value3 char4 value4 |
+ aStream isBinary ifTrue: [^ aStream basicNext].
+ char1 := aStream basicNext.
+ char1 ifNil:[^ nil].
+ value1 := char1 asciiValue.
+ value1 <= 127 ifTrue: [
+ "1-byte char"
+ ^ char1
+ ].
+
+ "at least 2-byte char"
+ char2 := aStream basicNext.
+ char2 ifNil:[^self errorMalformedInput: (String with: char1)].
+ value2 := char2 asciiValue.
+
- | character value1 value2 unicode value3 value4 |
- aStream isBinary ifTrue: [ ^aStream basicNext ].
- character := aStream basicNext ifNil: [ ^nil ].
- value1 := character asciiValue.
- value1 <= 127 ifTrue: [ ^character ]. "1-byte character"
- "at least 2-byte character"
- character := aStream basicNext ifNil: [ ^self errorMalformedInput ].
- value2 := character asciiValue.
  (value1 bitAnd: 16rE0) = 192 ifTrue: [
+ ^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) + (value2 bitAnd: 63).
+ ].
+
+ "at least 3-byte char"
+ char3 := aStream basicNext.
+ char3 ifNil:[^self errorMalformedInput: (String with: char1 with: char2)].
+ value3 := char3 asciiValue.
- ^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) + (value2 bitAnd: 63) ].
- "at least 3-byte character"
- character := aStream basicNext ifNil: [ ^self errorMalformedInput ].
- value3 := character asciiValue.
  (value1 bitAnd: 16rF0) = 224 ifTrue: [
  unicode := ((value1 bitAnd: 15) bitShift: 12) + ((value2 bitAnd: 63) bitShift: 6)
+ + (value3 bitAnd: 63).
+ ].
+
- + (value3 bitAnd: 63) ].
  (value1 bitAnd: 16rF8) = 240 ifTrue: [
+ "4-byte char"
+ char4 := aStream basicNext.
+ char4 ifNil:[^self errorMalformedInput: (String with: char1 with: char2 with: char3)].
+ value4 := char4 asciiValue.
- "4-byte character"
- character := aStream basicNext ifNil: [ ^self errorMalformedInput ].
- value4 := character asciiValue.
  unicode := ((value1 bitAnd: 16r7) bitShift: 18) +
  ((value2 bitAnd: 63) bitShift: 12) +
  ((value3 bitAnd: 63) bitShift: 6) +
+ (value4 bitAnd: 63).
+ ].
+
+ unicode ifNil:[^self errorMalformedInput: (String with: char1 with: char2 with: char3)].
+ unicode > 16r10FFFD ifTrue: [
+ ^self errorMalformedInput: (String with: char1 with: char2 with: char3).
+ ].
+
+ unicode = 16rFEFF ifTrue: [^ self nextFromStream: aStream].
- (value4 bitAnd: 63) ].
- unicode ifNil: [ ^self errorMalformedInput ].
- unicode > 16r10FFFD ifTrue: [ ^self errorMalformedInput ].
- unicode = 16rFEFF ifTrue: [ ^self nextFromStream: aStream ].
  ^ Unicode value: unicode.
  !

Item was added:
+ ----- Method: UTF8TextConverter>>errorMalformedInput: (in category 'conversion') -----
+ errorMalformedInput: aString
+ "Invalid UTF-8 input has been detected in the given string.
+ Raise an error if strict conversions are enabled, otherwise allow
+ the original string to be returned."
+
+ ^self class errorMalformedInput: aString!

Item was changed:
  TextConverter subclass: #UTF8TextConverter
  instanceVariableNames: ''
+ classVariableNames: 'StrictUtf8Conversions'
- classVariableNames: ''
  poolDictionaries: ''
  category: 'Multilingual-TextConversion'!
 
  !UTF8TextConverter commentStamp: '<historical>' prior: 0!
  Text converter for UTF-8.  Since the BOM is used to distinguish the MacRoman code and UTF-8 code, BOM is written for UTF-8 by #writeBOMOn: which is called by client.!

Item was changed:
  ----- Method: UTF8TextConverter class>>decodeByteString: (in category 'conversion') -----
  decodeByteString: aByteString
  "Convert the given string from UTF-8 using the fast path if converting to Latin-1"
 
  | outStream lastIndex nextIndex byte1 byte2 byte3 byte4 unicode |
  lastIndex := 1.
  (nextIndex := ByteString findFirstInString: aByteString inSet: latin1Map startingAt: lastIndex) = 0
  ifTrue: [ ^aByteString ].
  outStream := (String new: aByteString size) writeStream.
  [
  outStream next: nextIndex - lastIndex putAll: aByteString startingAt: lastIndex.
  byte1 := aByteString byteAt: nextIndex.
  (byte1 bitAnd: 16rE0) = 192 ifTrue: [ "two bytes"
  byte2 := aByteString byteAt: (nextIndex := nextIndex + 1).
+ (byte2 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
- (byte2 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput ].
  unicode := ((byte1 bitAnd: 31) bitShift: 6) + (byte2 bitAnd: 63)].
  (byte1 bitAnd: 16rF0) = 224 ifTrue: [ "three bytes"
  byte2 := aByteString byteAt: (nextIndex := nextIndex + 1).
+ (byte2 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
- (byte2 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput ].
  byte3 := aByteString byteAt: (nextIndex := nextIndex + 1).
+ (byte3 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
- (byte3 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput ].
  unicode := ((byte1 bitAnd: 15) bitShift: 12) + ((byte2 bitAnd: 63) bitShift: 6)
  + (byte3 bitAnd: 63)].
  (byte1 bitAnd: 16rF8) = 240 ifTrue: [ "four bytes"
  byte2 := aByteString byteAt: (nextIndex := nextIndex + 1).
+ (byte2 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
- (byte2 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput ].
  byte3 := aByteString byteAt: (nextIndex := nextIndex + 1).
+ (byte3 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
- (byte3 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput ].
  byte4 := aByteString byteAt: (nextIndex := nextIndex + 1).
+ (byte4 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
- (byte4 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput ].
  unicode := ((byte1 bitAnd: 16r7) bitShift: 18) +
  ((byte2 bitAnd: 63) bitShift: 12) +
  ((byte3 bitAnd: 63) bitShift: 6) +
  (byte4 bitAnd: 63)].
+ unicode ifNil: [ ^self errorMalformedInput: aByteString ].
- unicode ifNil: [ ^self errorMalformedInput ].
  unicode = 16rFEFF ifFalse: [ "Skip byte order mark"
  outStream nextPut: (Unicode value: unicode) ].
  lastIndex := nextIndex + 1.
  (nextIndex := ByteString findFirstInString: aByteString inSet: latin1Map startingAt: lastIndex) = 0 ] whileFalse.
  ^outStream
  next: aByteString size - lastIndex + 1 putAll: aByteString startingAt: lastIndex;
  contents
  !

Item was added:
+ ----- Method: UTF8TextConverter class>>strictUtf8Conversions (in category 'utilities') -----
+ strictUtf8Conversions
+ "Preference setter for strict utf-8 conversions"
+
+ <preference: 'Strict utf8 conversions'
+ category: 'general' "should this be localization?"
+ description: 'If true, invalid utf-8 input will raise errors. If false, invalid utf-8 input will be allowed to pass through the conversion unchanged'
+ type: #Boolean>
+
+ ^StrictUtf8Conversions ifNil:[true]
+ !

Item was added:
+ ----- Method: UTF8TextConverter class>>errorMalformedInput: (in category 'utilities') -----
+ errorMalformedInput: aString
+ "Invalid UTF-8 input has been detected in the given string.
+ Raise an error if strict conversions are enabled, otherwise allow
+ the original string to be returned."
+
+ self strictUtf8Conversions ifTrue:[
+ self error: 'Invalid utf8: ', aString
+ ].
+
+ ^aString!

Item was added:
+ ----- Method: UTF8TextConverter class>>strictUtf8Conversions: (in category 'utilities') -----
+ strictUtf8Conversions: aBool
+ "Preference setter for strict utf-8 conversions"
+
+ StrictUtf8Conversions := aBool.
+
+ !

Item was removed:
- ----- Method: UTF8TextConverter>>errorMalformedInput (in category 'conversion') -----
- errorMalformedInput
-
- ^self class errorMalformedInput!

Item was removed:
- ----- Method: UTF8TextConverter class>>errorMalformedInput (in category 'utilities') -----
- errorMalformedInput
-
- ^self error: 'Invalid utf8 input detected'!