The Inbox: Multilingual-jr.218.mcz

commits-2

The Inbox: Multilingual-jr.218.mcz

A new version of Multilingual was added to project The Inbox:
http://source.squeak.org/inbox/Multilingual-jr.218.mcz

==================== Summary ====================

Name: Multilingual-jr.218
Author: jr
Time: 19 January 2017, 5:14:23.763655 pm
UUID: 36416c42-a4b4-554f-8203-aba25eee794f
Ancestors: Multilingual-tfel.217

support 'iso-8859-1' and do not let UTF8TextConverter expect that its input stream returns Characters from basicNext

A stream implementation might always return bytes from basicNext and expect the conversion to Character to be done solely by the TextConverter, so use asInteger instead of asciiValue to support both cases. Convert back with asCharacter.

=============== Diff against Multilingual-tfel.217 ===============

Item was changed:
----- Method: Latin1TextConverter class>>encodingNames (in category 'utilities') -----
encodingNames

+ ^ #('latin-1' 'latin1' 'iso-8859-1') copy.
- ^ #('latin-1' 'latin1') copy.
!

Item was changed:
----- Method: UTF8TextConverter>>nextFromStream: (in category 'conversion') -----
nextFromStream: aStream

| char1 value1 char2 value2 unicode char3 value3 char4 value4 |
aStream isBinary ifTrue: [^ aStream basicNext].
char1 := aStream basicNext.
char1 ifNil:[^ nil].
+ value1 := char1 asInteger.
- value1 := char1 asciiValue.
value1 <= 127 ifTrue: [
"1-byte char"
+ ^ char1 asCharacter
- ^ char1
].

"at least 2-byte char"
char2 := aStream basicNext.
+ char2 ifNil:[^self errorMalformedInput: (String with: char1 asCharacter)].
+ value2 := char2 asInteger.
- char2 ifNil:[^self errorMalformedInput: (String with: char1)].
- value2 := char2 asciiValue.

(value1 bitAnd: 16rE0) = 192 ifTrue: [
^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) + (value2 bitAnd: 63).
].

"at least 3-byte char"
char3 := aStream basicNext.
+ char3 ifNil:[^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter)].
+ value3 := char3 asInteger.
- char3 ifNil:[^self errorMalformedInput: (String with: char1 with: char2)].
- value3 := char3 asciiValue.
(value1 bitAnd: 16rF0) = 224 ifTrue: [
unicode := ((value1 bitAnd: 15) bitShift: 12) + ((value2 bitAnd: 63) bitShift: 6)
+ (value3 bitAnd: 63).
].

(value1 bitAnd: 16rF8) = 240 ifTrue: [
"4-byte char"
char4 := aStream basicNext.
+ char4 ifNil:[^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter with: char3 asCharacter)].
+ value4 := char4 asInteger.
- char4 ifNil:[^self errorMalformedInput: (String with: char1 with: char2 with: char3)].
- value4 := char4 asciiValue.
unicode := ((value1 bitAnd: 16r7) bitShift: 18) +
((value2 bitAnd: 63) bitShift: 12) +
((value3 bitAnd: 63) bitShift: 6) +
(value4 bitAnd: 63).
].

+ unicode ifNil:[^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter with: char3 asCharacter)].
- unicode ifNil:[^self errorMalformedInput: (String with: char1 with: char2 with: char3)].
unicode > 16r10FFFD ifTrue: [
+ ^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter with: char3 asCharacter).
- ^self errorMalformedInput: (String with: char1 with: char2 with: char3).
].

unicode = 16rFEFF ifTrue: [^ self nextFromStream: aStream].
^ Unicode value: unicode.
!

Tobias Pape

Re: The Inbox: Multilingual-jr.218.mcz

Thanks Jacob.

Any objections if I put this into trunk?
Looks good from here.

Best regards
-Tobias

Levente Uzonyi

Re: The Inbox: Multilingual-jr.218.mcz

On Thu, 19 Jan 2017, Tobias Pape wrote:

> Thanks Jacob.
>
> Any objections if I put this into trunk?

Yep. TextConverters are intended to work with MultiByte*Streams only.
Therefore #basicNext is expected to return a Character, provided the
stream is not binary. This is why the #isBinary check is the first thing
the method does.

If there are plans to make TextConverters work with more general streams,
then I presume these changes won't be enough.

Levente

Tobias Pape

Re: The Inbox: Multilingual-jr.218.mcz

On 19.01.2017, at 23:30, Levente Uzonyi <[hidden email]> wrote:

> On Thu, 19 Jan 2017, Tobias Pape wrote:
>
>> Thanks Jacob.
>>
>> Any objections if I put this into trunk?
>
> Yep. TextConverters are intended to work with MultiByte*Streams only.

Didn't know that.

> Therefore #basicNext is expected to return a Character, provided the stream is not binary. This is why the #isBinary check is the first thing the method does.

I see. However, using asInteger sounds more reasonable _even though_ it is a character. Put bluntly, the responsibility of the TextConverter is to make Characters from those bloody numbers in that stream.

I was confused to see that asciiValue returns something >127 in the first place.

>
> If there are plans to make TextConverters work with more general streams, then I presume these changes won't be enough.

Clearly.
But isn't this a step in the right direction?

Levente Uzonyi

Re: The Inbox: Multilingual-jr.218.mcz

On Fri, 20 Jan 2017, Tobias Pape wrote:

> I see. However, using asInteger sounds more reasonable _even though_ it is a character. Put bluntly, the responsibility of the TextConverter is to make Characters from those bloody numbers in that stream.
>
> I was confused to see that asciiValue returns something >127 in the first place.

#asInteger does the same thing as #asciiValue. While #asciiValue doesn't
do what you would expect it to do, it has the advantage of clearly marking
the class of the receiver (in this case).
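
The equivalence can be checked in a workspace. This is only a sketch: it assumes a Squeak image in which Character understands both selectors, and the code point 233 is an arbitrary value above 127 picked for illustration.

```smalltalk
| c |
"An arbitrary Character above the ASCII range (assumption: any value > 127 would do)."
c := Character value: 233.

"Both selectors should answer the same numeric value for a Character."
c asInteger = c asciiValue.   "expected to answer true"
c asciiValue.                 "expected to answer 233, i.e. not limited to 0..127"

"Numbers also understand asInteger and asCharacter, which is what the patch relies on."
233 asInteger.                "233"
233 asCharacter = c.          "expected to answer true"
```

So for a Character receiver the two selectors are interchangeable; they differ only in which other kinds of receiver they accept.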

>> If there are plans to make TextConverters work with more general streams, then I presume these changes won't be enough.
>
> Clearly.
> But isn't this a step in the right direction?

Yes and no. There are at least two ways to go:

1. Enhance the current stream library, even at the cost of breaking things.
A patch here and there won't work. There are fundamental changes required,
like stackable streams, to make it desirable to use it over other
libraries.

2. Integrate an existing stream library with better features (e.g. Xtreams)
If we were to do this, we could gradually migrate existing code to the new
library, and finally make the current stream library unloadable.

Levente

Tobias Pape

Re: The Inbox: Multilingual-jr.218.mcz

On 22.01.2017, at 16:10, Levente Uzonyi <[hidden email]> wrote:

> #asInteger does the same thing as #asciiValue. While #asciiValue doesn't do what you would expect it to do, it has the advantage of clearly marking the class of the receiver (in this case).
>

Yes, and that's exactly why we should use #asInteger: to _not_ limit the receiver.
The receiver isn't actually a Character, but some number encoded in a Character, whose meaning is to be determined by this very method.

Also, how do we know that _basic_Next will always return a Character?
(Yes, I know there's a binary check, but doesn't that only say something about #next, not #basicNext?)

>>> If there are plans to make TextConverters work with more general streams, then I presume these changes won't be enough.
>>
>> Clearly.
>> But isn't this a step in the right direction?
>
> Yes and no. There are at least two ways to go:
>
> 1. Enhance the current stream library, even at the cost of breaking things.
> A patch here and there won't work. There are fundamental changes required, like stackable streams, to make it desirable to use it over other libraries.
>
> 2. Integrate an existing stream library with better features (e.g. Xtreams)
> If we were to do this, we could gradually migrate existing code to the new library, and finally make the current stream library unloadable.

I like the idea of Xtreams, but I also like taking baby steps.

The changes here help at least one person, won't hurt others, and seem future-proof.
So?

Best regards
-Tobias

Hannes Hirzel

Re: The Inbox: Multilingual-jr.218.mcz

Below, for comparison, is the version in Pharo 5.0.

It is worth noting that one cannot speak of characters in a
UTF-8 encoded stream that is read byte by byte until one has examined
the bytes.

So the first thing I read is actually a byte. I can then check
whether it is a one-byte character and, if so, return the character.
Otherwise I read the next byte. If we then have a two-byte encoded
UTF-8 character, I can return that character.

So I should have

byte1 := aStream basicNext.

... check if we have a one byte character, if yes return the character

byte2 := aStream basicNext.

... check if we have a two byte character, if yes return the character

byte3 := aStream basicNext.

... check if we have a three byte character, if yes return the character

byte4 := aStream basicNext.

... check if we have a four byte character, if yes return the character

nextFromStream: aStream
    | character1 value1 character2 value2 unicode character3 value3 character4 value4 |
    aStream isBinary
        ifTrue: [ ^ aStream basicNext ].
    character1 := aStream basicNext.
    character1 isNil
        ifTrue: [ ^ nil ].
    value1 := character1 asciiValue.
    value1 <= 127
        ifTrue: [
            "1-byte character"
            ^ character1 ].
    "at least 2-byte character"
    character2 := aStream basicNext.
    character2 isNil
        ifTrue: [ ^ self errorMalformedInput ].
    value2 := character2 asciiValue.
    (value1 bitAnd: 16rE0) = 192
        ifTrue: [ ^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) + (value2 bitAnd: 63) ].
    "at least 3-byte character"
    character3 := aStream basicNext.
    character3 isNil
        ifTrue: [ ^ self errorMalformedInput ].
    value3 := character3 asciiValue.
    (value1 bitAnd: 16rF0) = 224
        ifTrue: [ unicode := ((value1 bitAnd: 15) bitShift: 12) + ((value2 bitAnd: 63) bitShift: 6) + (value3 bitAnd: 63) ].
    (value1 bitAnd: 16rF8) = 240
        ifTrue: [
            "4-byte character"
            character4 := aStream basicNext.
            character4 isNil
                ifTrue: [ ^ self errorMalformedInput ].
            value4 := character4 asciiValue.
            unicode := ((value1 bitAnd: 16r7) bitShift: 18) + ((value2 bitAnd: 63) bitShift: 12)
                + ((value3 bitAnd: 63) bitShift: 6) + (value4 bitAnd: 63) ].
    unicode isNil
        ifTrue: [ ^ self errorMalformedInput ].
    unicode > 16r10FFFD
        ifTrue: [ ^ self errorMalformedInput ].
    unicode = 16rFEFF
        ifTrue: [ ^ self nextFromStream: aStream ].
    ^ Unicode value: unicode
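
To make the bit masks in the method above concrete, here is a small workspace sketch that applies the same expressions to plain integers. The byte values are my own illustrative choices (the UTF-8 encodings of U+00E9 and U+20AC), not something taken from the package, and the variable names simply mirror the method's temporaries.

```smalltalk
| value1 value2 value3 |

"Two-byte sequence 16rC3 16rA9 (assumed sample input)."
value1 := 16rC3.
value2 := 16rA9.
(value1 bitAnd: 16rE0) = 192.                             "true: lead byte of a 2-byte sequence"
((value1 bitAnd: 31) bitShift: 6) + (value2 bitAnd: 63).  "233 = 16rE9"

"Three-byte sequence 16rE2 16r82 16rAC (assumed sample input)."
value1 := 16rE2.
value2 := 16r82.
value3 := 16rAC.
(value1 bitAnd: 16rF0) = 224.                             "true: lead byte of a 3-byte sequence"
((value1 bitAnd: 15) bitShift: 12)
    + ((value2 bitAnd: 63) bitShift: 6)
    + (value3 bitAnd: 63).                                "8364 = 16r20AC"
```

Note that the arithmetic only ever needs the numeric values; nothing in it depends on whether the stream handed over Characters or bytes.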

>>>>>> - value2 := char2 asciiValue.
>>>>>>
>>>>>> (value1 bitAnd: 16rE0) = 192 ifTrue: [
>>>>>> ^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) + (value2 bitAnd:
>>>>>> 63).
>>>>>> ].
>>>>>>
>>>>>> "at least 3-byte char"
>>>>>> char3 := aStream basicNext.
>>>>>> + char3 ifNil:[^self errorMalformedInput: (String with: char1
>>>>>> asCharacter with: char2 asCharacter)].
>>>>>> + value3 := char3 asInteger.
>>>>>> - char3 ifNil:[^self errorMalformedInput: (String with: char1 with:
>>>>>> char2)].
>>>>>> - value3 := char3 asciiValue.
>>>>>> (value1 bitAnd: 16rF0) = 224 ifTrue: [
>>>>>> unicode := ((value1 bitAnd: 15) bitShift: 12) + ((value2 bitAnd: 63)
>>>>>> bitShift: 6)
>>>>>> + (value3 bitAnd: 63).
>>>>>> ].
>>>>>>
>>>>>> (value1 bitAnd: 16rF8) = 240 ifTrue: [
>>>>>> "4-byte char"
>>>>>> char4 := aStream basicNext.
>>>>>> + char4 ifNil:[^self errorMalformedInput: (String with: char1
>>>>>> asCharacter with: char2 asCharacter with: char3 asCharacter)].
>>>>>> + value4 := char4 asInteger.
>>>>>> - char4 ifNil:[^self errorMalformedInput: (String with: char1 with:
>>>>>> char2 with: char3)].
>>>>>> - value4 := char4 asciiValue.
>>>>>> unicode := ((value1 bitAnd: 16r7) bitShift: 18) +
>>>>>> ((value2 bitAnd: 63) bitShift: 12) +
>>>>>> ((value3 bitAnd: 63) bitShift: 6) +
>>>>>> (value4 bitAnd: 63).
>>>>>> ].
>>>>>> + unicode ifNil:[^self errorMalformedInput: (String with: char1
>>>>>> asCharacter with: char2 asCharacter with: char3 asCharacter)].
>>>>>> - unicode ifNil:[^self errorMalformedInput: (String with: char1 with:
>>>>>> char2 with: char3)].
>>>>>> unicode > 16r10FFFD ifTrue: [
>>>>>> + ^self errorMalformedInput: (String with: char1 asCharacter with:
>>>>>> char2 asCharacter with: char3 asCharacter).
>>>>>> - ^self errorMalformedInput: (String with: char1 with: char2 with:
>>>>>> char3).
>>>>>> ].
>>>>>>
>>>>>> unicode = 16rFEFF ifTrue: [^ self nextFromStream: aStream].
>>>>>> ^ Unicode value: unicode.
>>>>>> !
>
>
>

Levente Uzonyi

Re: The Inbox: Multilingual-jr.218.mcz

The Pharo version seems to be the Squeak version optimized for
VisualWorks (ifNil: -> isNil ifTrue:).

Levente


Hannes Hirzel

Re: The Inbox: Multilingual-jr.218.mcz

Interesting in this context is the UTF-8 decoding implementation in Pharo 5,
ZnUTF8Encoder (an alternative to UTF8TextConverter, it seems):

ZnUTF8Encoder>>nextFromStream: stream
    | code byte next |
    (byte := stream next) < 128
        ifTrue: [ ^ Character codePoint: byte ].
    (byte bitAnd: 2r11100000) == 2r11000000
        ifTrue: [
            code := byte bitAnd: 2r00011111.
            ((next := stream next ifNil: [ self errorIncomplete ]) bitAnd: 2r11000000) == 2r10000000
                ifTrue: [ code := (code bitShift: 6) + (next bitAnd: 2r00111111) ]
                ifFalse: [ ^ self errorIllegalContinuationByte ].
            code < 128 ifTrue: [ self errorOverlong ].
            ^ Character codePoint: code ].
    (byte bitAnd: 2r11110000) == 2r11100000
        ifTrue: [
            code := byte bitAnd: 2r00001111.
            2 timesRepeat: [
                ((next := stream next ifNil: [ self errorIncomplete ]) bitAnd: 2r11000000) == 2r10000000
                    ifTrue: [ code := (code bitShift: 6) + (next bitAnd: 2r00111111) ]
                    ifFalse: [ ^ self errorIllegalContinuationByte ] ].
            code < 2048 ifTrue: [ self errorOverlong ].
            code = 65279 "Unicode Byte Order Mark" ifTrue: [
                stream atEnd ifTrue: [ self errorIncomplete ].
                ^ self nextFromStream: stream ].
            ^ Character codePoint: code ].
    (byte bitAnd: 2r11111000) == 2r11110000
        ifTrue: [
            code := byte bitAnd: 2r00000111.
            3 timesRepeat: [
                ((next := stream next ifNil: [ self errorIncomplete ]) bitAnd: 2r11000000) == 2r10000000
                    ifTrue: [ code := (code bitShift: 6) + (next bitAnd: 2r00111111) ]
                    ifFalse: [ ^ self errorIllegalContinuationByte ] ].
            code < 65535 ifTrue: [ self errorOverlong ].
            ^ Character codePoint: code ].
    self errorIllegalLeadingByte
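
To make the bit arithmetic in the method above concrete, here is a small workspace sketch that decodes one two-byte and one three-byte sequence by hand. It is only an illustration: the variable names (lead, continuation, code2, code3) are made up for this example, and it skips the continuation-byte and overlong checks that the real method performs.

```smalltalk
"Workspace sketch with hypothetical names; not part of ZnUTF8Encoder."
| lead continuation code2 code3 |

"Two-byte sequence 16rC3 16rA9 (U+00E9)."
lead := 16rC3.            "2r11000011: top three bits 110 mark a two-byte lead"
continuation := 16rA9.    "2r10101001: top two bits 10 mark a continuation byte"
code2 := ((lead bitAnd: 2r00011111) bitShift: 6)
    + (continuation bitAnd: 2r00111111).
"code2 = 233 = 16rE9"

"Three-byte sequence 16rE2 16r82 16rAC (U+20AC), folding in one
 continuation byte at a time, like the 2 timesRepeat: loop does."
code3 := 16rE2 bitAnd: 2r00001111.
#(16r82 16rAC) do: [ :next |
    code3 := (code3 bitShift: 6) + (next bitAnd: 2r00111111) ].
"code3 = 8364 = 16r20AC"

Array with: code2 with: code3    "print it: #(233 8364)"
```

Each continuation byte contributes its low six bits, which is why the accumulated code is shifted left by 6 before the next byte is added.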


Jakob Reschke-2

Re: The Inbox: Multilingual-jr.218.mcz

In reply to this post by Levente Uzonyi

Well, the alternative to exploiting polymorphism as in the patch would
be to have a nearly identical set of "TextConverter" classes that
can read from binary streams and answer characters (like the Zinc UTF8
"Encoder", which also happens to have decoding methods). But most of the
code would duplicate the existing TextConverters, so I think
it is the worse solution.
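
As a rough illustration of the polymorphism the patch relies on, the following workspace sketch shows that both a Character and an Integer answer #asInteger and #asCharacter, so the same code can handle either kind of stream element. The block name toByte is invented for this example and does not exist in the image.

```smalltalk
"Workspace sketch; toByte is a hypothetical helper, not part of Multilingual."
| toByte |
toByte := [ :element | element asInteger ].

toByte value: $A.                   "65, a Character answers its code"
toByte value: 65.                   "65, an Integer answers itself"
(toByte value: $A) asCharacter.     "$A"
(toByte value: 65) asCharacter.     "$A"

(toByte value: $A) = (toByte value: 65)    "print it: true"
```

This is the same idea as replacing asciiValue with asInteger in nextFromStream: and converting back with asCharacter where a Character has to be answered.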

IMHO, a fundamental thing such as a text encoding interpreter that is
shipped with a standard library should not be restricted by "it was
only designed to be used with consumer xyz". The TextConverter
hierarchy is neither marked as internal/private in Multilingual by a
category, nor by any comments, nor do the class names suggest coupling
to the MultiByte*Streams.

If I were to commit this thing from scratch, I would even drop the
support for returning the bytes as they are if the source stream
#isBinary, because it has nothing to do with text conversion. Instead,
do not use a TextConverter if you do not want the bytes converted.
But, as Levente already wrote, we cannot do this without a larger
refactoring and without causing compatibility issues.

Best regards,
Jakob

2017-01-23 19:24 GMT+01:00 H. Hirzel <[hidden email]>:

> Interesting in this context the UTF8 decoding implementation of Pharo
> 5 ZnUTF8Encoder (an alternative to UTF8TextConverter it seems)
>
> ZnUTF8Encoder>>
> nextFromStream: stream
> | code byte next |
> (byte := stream next) < 128
> ifTrue: [ ^ Character codePoint: byte ].
> (byte bitAnd: 2r11100000) == 2r11000000
> ifTrue: [
> code := byte bitAnd: 2r00011111.
> ((next := stream next ifNil: [ self errorIncomplete ]) bitAnd:
> 2r11000000) == 2r10000000
> ifTrue: [ code := (code bitShift: 6) + (next bitAnd: 2r00111111) ]
> ifFalse: [ ^ self errorIllegalContinuationByte ].
> code < 128 ifTrue: [ self errorOverlong ].
> ^ Character codePoint: code ].
> (byte bitAnd: 2r11110000) == 2r11100000
> ifTrue: [
> code := byte bitAnd: 2r00001111.
> 2 timesRepeat: [
> ((next := stream next ifNil: [ self errorIncomplete ]) bitAnd:
> 2r11000000) == 2r10000000
> ifTrue: [ code := (code bitShift: 6) + (next bitAnd: 2r00111111) ]
> ifFalse: [ ^ self errorIllegalContinuationByte ] ].
> code < 2048 ifTrue: [ self errorOverlong ].
> code = 65279 "Unicode Byte Order Mark" ifTrue: [
> stream atEnd ifTrue: [ self errorIncomplete ].
> ^ self nextFromStream: stream ].
> ^ Character codePoint: code ].
> (byte bitAnd: 2r11111000) == 2r11110000
> ifTrue: [
> code := byte bitAnd: 2r00000111.
> 3 timesRepeat: [
> ((next := stream next ifNil: [ self errorIncomplete ]) bitAnd:
> 2r11000000) == 2r10000000
> ifTrue: [ code := (code bitShift: 6) + (next bitAnd: 2r00111111) ]
> ifFalse: [ ^ self errorIllegalContinuationByte ] ].
> code < 65535 ifTrue: [ self errorOverlong ].
> ^ Character codePoint: code ].
> self errorIllegalLeadingByte
>
> On 1/23/17, Levente Uzonyi <[hidden email]> wrote:
>> The Pharo version seems to be the Squeak version optimized for
>> VisualWorks (ifNil: -> isNil ifTrue:).
>>
>> Levente
>>
>> On Mon, 23 Jan 2017, H. Hirzel wrote:
>>
>>> Below as a comparison the version in Pharo 5.0.
>>>
>>> Noteworthy to say is that one can not speak about characters in an
>>> UTF8 encoded stream which is read byte by byte until one has examined
>>> the bytes.
>>>
>>> So if I read the first thing it is actually a byte. Then I can examine
>>> if it is a one-byte character and then return the character. Then I go
>>> for the next byte. If it indicates that we have a two byte encoded
>>> UTF8 character then I can return the character.
>>>
>>> So I should have
>>>
>>> byte1 := aStream basicNext.
>>>
>>> ... check if we have a one byte character, if yes return the character
>>>
>>> byte2 := aStream basicNext.
>>>
>>> ... check if we have a two byte character, if yes return the character
>>>
>>> byte3 := aStream basicNext.
>>>
>>> ... check if we have a three byte character, if yes return the character
>>>
>>>
>>> byte4 := aStream basicNext.
>>>
>>> ... check if we have a four byte character, if yes return the character
>>>
>>>
>>>
>>>
>>>
>>> nextFromStream: aStream
>>> | character1 value1 character2 value2 unicode character3 value3
>>> character4 value4 |
>>> aStream isBinary
>>> ifTrue: [ ^ aStream basicNext ].
>>> character1 := aStream basicNext.
>>> character1 isNil
>>> ifTrue: [ ^ nil ].
>>> value1 := character1 asciiValue.
>>> value1 <= 127
>>> ifTrue: [
>>> "1-byte character"
>>> ^ character1 ]. "at least 2-byte character"
>>> character2 := aStream basicNext.
>>> character2 isNil
>>> ifTrue: [ ^ self errorMalformedInput ].
>>> value2 := character2 asciiValue.
>>> (value1 bitAnd: 16rE0) = 192
>>> ifTrue: [ ^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) +
>>> (value2 bitAnd: 63) ]. "at least 3-byte character"
>>> character3 := aStream basicNext.
>>> character3 isNil
>>> ifTrue: [ ^ self errorMalformedInput ].
>>> value3 := character3 asciiValue.
>>> (value1 bitAnd: 16rF0) = 224
>>> ifTrue: [ unicode := ((value1 bitAnd: 15) bitShift: 12) + ((value2
>>> bitAnd: 63) bitShift: 6) + (value3 bitAnd: 63) ].
>>> (value1 bitAnd: 16rF8) = 240
>>> ifTrue: [
>>> "4-byte character"
>>> character4 := aStream basicNext.
>>> character4 isNil
>>> ifTrue: [ ^ self errorMalformedInput ].
>>> value4 := character4 asciiValue.
>>> unicode := ((value1 bitAnd: 16r7) bitShift: 18) + ((value2 bitAnd:
>>> 63) bitShift: 12)
>>> + ((value3 bitAnd: 63) bitShift: 6) + (value4 bitAnd: 63) ].
>>> unicode isNil
>>> ifTrue: [ ^ self errorMalformedInput ].
>>> unicode > 16r10FFFD
>>> ifTrue: [ ^ self errorMalformedInput ].
>>> unicode = 16rFEFF
>>> ifTrue: [ ^ self nextFromStream: aStream ].
>>> ^ Unicode value: unicode

Hannes Hirzel

Re: The Inbox: Multilingual-jr.218.mcz

On 1/27/17, Jakob Reschke <[hidden email]> wrote:
> Well, the alternative to exploiting polymorphy like in the patch would
> be to have a nearly identical bunch of "TextConverter" classes that
> can read from binary streams and answer characters (like the Zinc UTF8
> "Encoder" that also happens to have decoding methods). But most of the
> code would be duplicated with the existing TextConverters, so I think
> it is the worse solution.

Having duplicate classes, or even duplicate hierarchies, for some time
is necessary if you want to make changes that cannot be done
incrementally.

See for example the class CrLfFileStream and the subclass
HtmlFileStream which are both in the package
'51-Deprecated-Files-Kernel'.

Class comment of CrLfFileStream states
'This class is now obsolete, use MultiByteFileStream instead.'

This can be carried further ...

Another option is to consider porting the Zinc UTF8 class(es).

>
> IMHO, a fundamental thing such as a text encoding interpreter that is
> shipped with a standard library should not be restricted by "it was
> only designed to be used with consumer xyz".
Yes.

> The TextConverter
> hierarchy is neither marked as internal/private in Multilingual by a
> category, nor by any comments, nor do the class names suggest coupling
> to the MultiByte*Streams.

The TextConverter hierarchy is in the 'Multilingual' category.

> If I were to commit this thing from scratch, I would even drop the
> support for returning the bytes as they are if the source stream
> #isBinary, because it has nothing to do with text conversion.

> Instead,
> do not use a TextConverter, if you do not want the bytes converted.

I do not understand: why would I want to use a TextConverter if I am
reading a binary stream?

> But, as Levente already wrote, we can not do this without a larger
> refactoring and compatibility issues.

--Hannes


Jakob Reschke-2

Re: The Inbox: Multilingual-jr.218.mcz

2017-01-31 17:40 GMT+01:00 H. Hirzel <[hidden email]>:

> On 1/27/17, Jakob Reschke <[hidden email]> wrote:
>> Well, the alternative to exploiting polymorphy like in the patch would
>> be to have a nearly identical bunch of "TextConverter" classes that
>> can read from binary streams and answer characters (like the Zinc UTF8
>> "Encoder" that also happens to have decoding methods). But most of the
>> code would be duplicated with the existing TextConverters, so I think
>> it is the worse solution.
>
> To have duplicate classes or even hierarchies for some time is
> necessary if you want to do changes which cannot be done
> incrementally.

Sure, but in the restricted case of this patch, I think it is not
necessary because the behavior should not change in the traditional
case where basicNext does indeed return characters if the stream is in
text mode. Nothing should be broken.

I acknowledge that this patch alone is not all that useful for the
Multilingual package itself, but it is useful for reusing the
UTF8TextConverter to read from other binary streams (especially if
such a stream cannot be switched to a text mode at all).
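
To illustrate why swapping the selectors should be behavior-preserving, here is a small workspace sketch. It is only an illustration and not part of the patch. It assumes a Squeak image in which both Character and Integer understand #asInteger and #asCharacter; the variable names and the example bytes (C3 A9, the UTF-8 encoding of "é") are my own choice.

```smalltalk
"Workspace sketch: the same expressions work for a Character and for a plain byte."
| fromCharacter fromByte backFromByte backFromCharacter lead continuation codePoint |

"Text-mode case: basicNext answered a Character."
fromCharacter := (Character value: 16rC3) asInteger.

"Binary-only case: basicNext answered an Integer."
fromByte := 16rC3 asInteger.

"Converting back for the 1-byte branch is defined on both sides as well."
backFromByte := 65 asCharacter.
backFromCharacter := $A asCharacter.

"Two-byte decoding with the same arithmetic as in nextFromStream:."
lead := 16rC3.
continuation := 16rA9.
codePoint := ((lead bitAnd: 31) bitShift: 6) + (continuation bitAnd: 63).
```

Since fromCharacter and fromByte are the same number, the rest of the method does not need to know which kind of object the stream answered.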

>
> See for example the class CrLfFileStream and the subclass
> HtmlFileStream which are both in the package
> '51-Deprecated-Files-Kernel'.
>
> Class comment of CrLfFileStream states
> 'This class is now obsolete, use MultiByteFileStream instead.'
>
> This can be carried further ...
>
> An option is as well to consider porting the Zinc UTF8 class(es).

I have not tried it, but one can probably just load it and start using
it right away if it fits. The actual UTF8 decoding and encoding
implementation is probably very similar; the important difference is
only in the interface (including the expected types). But I do not want
to have to copy a class from Pharo to get text conversion from a
binary-only stream going in Squeak... For a library, the additional
dependency should be avoided as well.

>
>
>>
>> IMHO, a fundamental thing such as a text encoding interpreter that is
>> shipped with a standard library should not be restricted by "it was
>> only designed to be used with consumer xyz".
> Yes.
>
>> The TextConverter
>> hierarchy is neither marked as internal/private in Multilingual by a
>> category, nor by any comments, nor do the class names suggest coupling
>> to the MultiByte*Streams.
>
> The TextConverter hierarchy is in the 'Multilingual' category.

Yes, it is, but from a documentation point of view, the TextConverters
are no more private to the Multilingual package than the
MultiByte*Streams are. Of course, TextConverters and MultiByte*Streams
must not cease to be compatible with one another.

>
>
>> If I were to commit this thing from scratch, I would even drop the
>> support for returning the bytes as they are if the source stream
>> #isBinary, because it has nothing to do with text conversion.
>
>> Instead,
>> do not use a TextConverter, if you do not want the bytes converted.
>
> I do not understand: why would I want to use a TextConverter if I am
> reading a binary stream.

Exactly my point! :-) Currently, both MultiByteStream implementations always do

self converter nextFromStream: self

in #next, even if the stream is in binary mode. That is why the text
converter currently has to check whether the stream is in binary mode
and, if it is, skip the text conversion and pass the bytes through
uninterpreted. I would expect the text converter not to be used at
all when the stream is binary.
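
As a rough sketch of that expectation, the stream itself could decide whether to consult the converter. This is hypothetical and not the actual MultiByte*Stream code: it deliberately ignores line-end conversion and anything else the real #next does, and only uses the selectors already mentioned in this thread (#isBinary, #basicNext, #converter, #nextFromStream:).

```smalltalk
next
	"Hypothetical variant for a MultiByte*Stream:
	 consult the converter only when the stream is in text mode."
	^ self isBinary
		ifTrue: [self basicNext]
		ifFalse: [self converter nextFromStream: self]
```

With something like this in the stream, the #isBinary check at the top of the converter's nextFromStream: would no longer be needed.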

For a different answer to the question "Why would I want to use a
TextConverter if I am reading a binary stream", go back to my previous
paragraphs: You need one if the data in the stream is actually text
and you know the encoding format, but the stream can only return
bytes.

Best regards,
Jakob

>
>> But, as Levente already wrote, we can not do this without a larger
>> refactoring and compatibility issues.
>
>
> --Hannes
>
>> Best regards,
>> Jakob
>>
>> 2017-01-23 19:24 GMT+01:00 H. Hirzel <[hidden email]>:
>>> Interesting in this context the UTF8 decoding implementation of Pharo
>>> 5 ZnUTF8Encoder (an alternative to UTF8TextConverter it seems)
>>>
>>> ZnUTF8Encoder>>
>>> nextFromStream: stream
>>> | code byte next |
>>> (byte := stream next) < 128
>>> ifTrue: [ ^ Character codePoint: byte ].
>>> (byte bitAnd: 2r11100000) == 2r11000000
>>> ifTrue: [
>>> code := byte bitAnd: 2r00011111.
>>> ((next := stream next ifNil: [ self
>>> errorIncomplete ]) bitAnd:
>>> 2r11000000) == 2r10000000
>>> ifTrue: [ code := (code bitShift: 6) +
>>> (next bitAnd: 2r00111111) ]
>>> ifFalse: [ ^ self
>>> errorIllegalContinuationByte ].
>>> code < 128 ifTrue: [ self errorOverlong ].
>>> ^ Character codePoint: code ].
>>> (byte bitAnd: 2r11110000) == 2r11100000
>>> ifTrue: [
>>> code := byte bitAnd: 2r00001111.
>>> 2 timesRepeat: [
>>> ((next := stream next ifNil: [ self
>>> errorIncomplete ]) bitAnd:
>>> 2r11000000) == 2r10000000
>>> ifTrue: [ code := (code bitShift:
>>> 6) + (next bitAnd: 2r00111111) ]
>>> ifFalse: [ ^ self
>>> errorIllegalContinuationByte ] ].
>>> code < 2048 ifTrue: [ self errorOverlong ].
>>> code = 65279 "Unicode Byte Order Mark" ifTrue: [
>>> stream atEnd ifTrue: [ self
>>> errorIncomplete ].
>>> ^ self nextFromStream: stream ].
>>> ^ Character codePoint: code ].
>>> (byte bitAnd: 2r11111000) == 2r11110000
>>> ifTrue: [
>>> code := byte bitAnd: 2r00000111.
>>> 3 timesRepeat: [
>>> ((next := stream next ifNil: [ self
>>> errorIncomplete ]) bitAnd:
>>> 2r11000000) == 2r10000000
>>> ifTrue: [ code := (code bitShift:
>>> 6) + (next bitAnd: 2r00111111) ]
>>> ifFalse: [ ^ self
>>> errorIllegalContinuationByte ] ].
>>> code < 65535 ifTrue: [ self errorOverlong ].
>>> ^ Character codePoint: code ].
>>> self errorIllegalLeadingByte
>>>
>>> On 1/23/17, Levente Uzonyi <[hidden email]> wrote:
>>>> The Pharo version seems to be the Squeak version optimized for
>>>> VisualWorks (ifNil: -> isNil ifTrue:).
>>>>
>>>> Levente
>>>>
>>>> On Mon, 23 Jan 2017, H. Hirzel wrote:
>>>>
>>>>> Below as a comparison the version in Pharo 5.0.
>>>>>
>>>>> Noteworthy to say is that one can not speak about characters in an
>>>>> UTF8 encoded stream which is read byte by byte until one has examined
>>>>> the bytes.
>>>>>
>>>>> So if I read the first thing it is actually a byte. Then I can examine
>>>>> if it is a one-byte character and then return the character. Then I go
>>>>> for the next byte. If it indicates that we have a two byte encoded
>>>>> UTF8 character then I can return the character.
>>>>>
>>>>> So I should have
>>>>>
>>>>> byte1 := aStream basicNext.
>>>>>
>>>>> ... check if we have a one byte character, if yes return the character
>>>>>
>>>>> byte2 := aStream basicNext.
>>>>>
>>>>> ... check if we have a two byte character, if yes return the character
>>>>>
>>>>> byte3 := aStream basicNext.
>>>>>
>>>>> ... check if we have a three byte character, if yes return the character
>>>>>
>>>>>
>>>>> byte4 := aStream basicNext.
>>>>>
>>>>> ... check if we have a four byte character, if yes return the character
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> nextFromStream: aStream
>>>>> | character1 value1 character2 value2 unicode character3 value3
>>>>> character4 value4 |
>>>>> aStream isBinary
>>>>> ifTrue: [ ^ aStream basicNext ].
>>>>> character1 := aStream basicNext.
>>>>> character1 isNil
>>>>> ifTrue: [ ^ nil ].
>>>>> value1 := character1 asciiValue.
>>>>> value1 <= 127
>>>>> ifTrue: [
>>>>> "1-byte character"
>>>>> ^ character1 ]. "at least 2-byte character"
>>>>> character2 := aStream basicNext.
>>>>> character2 isNil
>>>>> ifTrue: [ ^ self errorMalformedInput ].
>>>>> value2 := character2 asciiValue.
>>>>> (value1 bitAnd: 16rE0) = 192
>>>>> ifTrue: [ ^ Unicode value: ((value1 bitAnd: 31) bitShift:
>>>>> 6) +
>>>>> (value2 bitAnd: 63) ]. "at least 3-byte character"
>>>>> character3 := aStream basicNext.
>>>>> character3 isNil
>>>>> ifTrue: [ ^ self errorMalformedInput ].
>>>>> value3 := character3 asciiValue.
>>>>> (value1 bitAnd: 16rF0) = 224
>>>>> ifTrue: [ unicode := ((value1 bitAnd: 15) bitShift: 12) +
>>>>> ((value2
>>>>> bitAnd: 63) bitShift: 6) + (value3 bitAnd: 63) ].
>>>>> (value1 bitAnd: 16rF8) = 240
>>>>> ifTrue: [
>>>>> "4-byte character"
>>>>> character4 := aStream basicNext.
>>>>> character4 isNil
>>>>> ifTrue: [ ^ self errorMalformedInput ].
>>>>> value4 := character4 asciiValue.
>>>>> unicode := ((value1 bitAnd: 16r7) bitShift: 18) +
>>>>> ((value2 bitAnd:
>>>>> 63) bitShift: 12)
>>>>> + ((value3 bitAnd: 63) bitShift: 6) +
>>>>> (value4 bitAnd: 63) ].
>>>>> unicode isNil
>>>>> ifTrue: [ ^ self errorMalformedInput ].
>>>>> unicode > 16r10FFFD
>>>>> ifTrue: [ ^ self errorMalformedInput ].
>>>>> unicode = 16rFEFF
>>>>> ifTrue: [ ^ self nextFromStream: aStream ].
>>>>> ^ Unicode value: unicode
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 1/22/17, Tobias Pape <[hidden email]> wrote:
>>>>>>
>>>>>> On 22.01.2017, at 16:10, Levente Uzonyi <[hidden email]> wrote:
>>>>>>
>>>>>>> On Fri, 20 Jan 2017, Tobias Pape wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On 19.01.2017, at 23:30, Levente Uzonyi <[hidden email]> wrote:
>>>>>>>>
>>>>>>>>> On Thu, 19 Jan 2017, Tobias Pape wrote:
>>>>>>>>>> Thanks Jacob.
>>>>>>>>>> Any objections here I put this into trunk?
>>>>>>>>> Yep. TextConverters are intended to work with MultiByte*Streams
>>>>>>>>> only.
>>>>>>>>
>>>>>>>> Didn't know that.
>>>>>>>>
>>>>>>>>> Therefore #basicNext is expected to return a Character, provided the
>>>>>>>>> stream is not binary. This is why the #isBinary check is the first
>>>>>>>>> thing
>>>>>>>>> the method does.
>>>>>>>>
>>>>>>>> I see. however, using asInteger sounds more reasonable _even though_
>>>>>>>> it
>>>>>>>> is a character. Said bluntly, the responsibility of the TextConverter
>>>>>>>> is
>>>>>>>> to make Characters from that bloody numbers in that stream.
>>>>>>>> I was confused to see that asciiValue returns something >127 in the
>>>>>>>> first
>>>>>>>> place.
>>>>>>>
>>>>>>> #asInteger does the same thing as #asciiValue. While #asciiValue
>>>>>>> doesn't
>>>>>>> do what you would expect it to do, it has the advantage to clearly
>>>>>>> mark
>>>>>>> the class of the receiver (in this case).
>>>>>>>
>>>>>>
>>>>>> Yes, and that's exactly why we should use #asInteger. To _not_ limit
>>>>>> the
>>>>>> receiver.
>>>>>> Because the receiver isn't actually a Character, but some number,
>>>>>> encoded
>>>>>> in
>>>>>> a Character, whose meaning is to be determined by
>>>>>> this very method.
>>>>>>
>>>>>> Also, how do we know that _basic_Next will always return a Character?
>>>>>> (Yes, I know there's a binary check, but doesn't that only say
>>>>>> something
>>>>>> about #next, not #basicNext?)
>>>>>>
>>>>>>
>>>>>>>>
>>>>>>>>> If there are plans to make TextConverters work with more general
>>>>>>>>> streams, then I persume these changes won't be enough.
>>>>>>>>
>>>>>>>> Clearly.
>>>>>>>> But isn't this a step in the right direction?
>>>>>>>
>>>>>>> Yes and no. There are at least two ways to go:
>>>>>>>
>>>>>>> 1. Enhance the current stream library, even at the cost of breaking
>>>>>>> things.
>>>>>>> A patch here and there won't work. There are fundamental changes
>>>>>>> required,
>>>>>>> like stackable streams, to make it desirable to use it over other
>>>>>>> libraries.
>>>>>>>
>>>>>>> 2. Integrate an existing stream library with better features (e.g.
>>>>>>> Xtreams)
>>>>>>> If we were to do this, we could gradually migrate existing code to the
>>>>>>> new
>>>>>>> library, and finally make the current stream library unloadable.
>>>>>>
>>>>>> I like the idea of Xtreams, but I also like going baby steps.
>>>>>>
>>>>>> The changes here help at least one person, won't hurt others and seem
>>>>>> future
>>>>>> proof.
>>>>>> So?
>>>>>>
>>>>>> Best regards
>>>>>> -Tobias
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Levente
>>>>>>>
>>>>>>>>
>>>>>>>>> Levente
>>>>>>>>>> Looks good from here.
>>>>>>>>>> Best regards
>>>>>>>>>> -Tobias
>>>>>>>>>> On 19.01.2017, at 17:14, [hidden email] wrote:
>>>>>>>>>>> A new version of Multilingual was added to project The Inbox:
>>>>>>>>>>> http://source.squeak.org/inbox/Multilingual-jr.218.mcz
>>>>>>>>>>> ==================== Summary ====================
>>>>>>>>>>> Name: Multilingual-jr.218
>>>>>>>>>>> Author: jr
>>>>>>>>>>> Time: 19 January 2017, 5:14:23.763655 pm
>>>>>>>>>>> UUID: 36416c42-a4b4-554f-8203-aba25eee794f
>>>>>>>>>>> Ancestors: Multilingual-tfel.217
>>>>>>>>>>> support 'iso-8859-1' and do not let UTF8TextConverter expect that its input stream returns Characters from basicNext
>>>>>>>>>>> A stream implementation might always return bytes from basicNext and expect the conversion to Character to be done solely by the TextConverter, so use asInteger instead of asciiValue to support both cases. Convert back with asCharacter.
>>>>>>>>>>> =============== Diff against Multilingual-tfel.217 ===============
>>>>>>>>>>> Item was changed:
>>>>>>>>>>> ----- Method: Latin1TextConverter class>>encodingNames (in category 'utilities') -----
>>>>>>>>>>> encodingNames
>>>>>>>>>>>
>>>>>>>>>>> + ^ #('latin-1' 'latin1' 'iso-8859-1') copy.
>>>>>>>>>>> - ^ #('latin-1' 'latin1') copy.
>>>>>>>>>>> !
>>>>>>>>>>> Item was changed:
>>>>>>>>>>> ----- Method: UTF8TextConverter>>nextFromStream: (in category 'conversion') -----
>>>>>>>>>>> nextFromStream: aStream
>>>>>>>>>>>
>>>>>>>>>>> | char1 value1 char2 value2 unicode char3 value3 char4 value4 |
>>>>>>>>>>> aStream isBinary ifTrue: [^ aStream basicNext].
>>>>>>>>>>> char1 := aStream basicNext.
>>>>>>>>>>> char1 ifNil:[^ nil].
>>>>>>>>>>> + value1 := char1 asInteger.
>>>>>>>>>>> - value1 := char1 asciiValue.
>>>>>>>>>>> value1 <= 127 ifTrue: [
>>>>>>>>>>> "1-byte char"
>>>>>>>>>>> + ^ char1 asCharacter
>>>>>>>>>>> - ^ char1
>>>>>>>>>>> ].
>>>>>>>>>>>
>>>>>>>>>>> "at least 2-byte char"
>>>>>>>>>>> char2 := aStream basicNext.
>>>>>>>>>>> + char2 ifNil:[^self errorMalformedInput: (String with: char1 asCharacter)].
>>>>>>>>>>> + value2 := char2 asInteger.
>>>>>>>>>>> - char2 ifNil:[^self errorMalformedInput: (String with: char1)].
>>>>>>>>>>> - value2 := char2 asciiValue.
>>>>>>>>>>>
>>>>>>>>>>> (value1 bitAnd: 16rE0) = 192 ifTrue: [
>>>>>>>>>>> ^ Unicode value: ((value1 bitAnd: 31) bitShift: 6) + (value2 bitAnd: 63).
>>>>>>>>>>> ].
>>>>>>>>>>>
>>>>>>>>>>> "at least 3-byte char"
>>>>>>>>>>> char3 := aStream basicNext.
>>>>>>>>>>> + char3 ifNil:[^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter)].
>>>>>>>>>>> + value3 := char3 asInteger.
>>>>>>>>>>> - char3 ifNil:[^self errorMalformedInput: (String with: char1 with: char2)].
>>>>>>>>>>> - value3 := char3 asciiValue.
>>>>>>>>>>> (value1 bitAnd: 16rF0) = 224 ifTrue: [
>>>>>>>>>>> unicode := ((value1 bitAnd: 15) bitShift: 12) + ((value2 bitAnd: 63) bitShift: 6)
>>>>>>>>>>>     + (value3 bitAnd: 63).
>>>>>>>>>>> ].
>>>>>>>>>>>
>>>>>>>>>>> (value1 bitAnd: 16rF8) = 240 ifTrue: [
>>>>>>>>>>> "4-byte char"
>>>>>>>>>>> char4 := aStream basicNext.
>>>>>>>>>>> + char4 ifNil:[^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter with: char3 asCharacter)].
>>>>>>>>>>> + value4 := char4 asInteger.
>>>>>>>>>>> - char4 ifNil:[^self errorMalformedInput: (String with: char1 with: char2 with: char3)].
>>>>>>>>>>> - value4 := char4 asciiValue.
>>>>>>>>>>> unicode := ((value1 bitAnd: 16r7) bitShift: 18) +
>>>>>>>>>>> ((value2 bitAnd: 63) bitShift: 12) +
>>>>>>>>>>> ((value3 bitAnd: 63) bitShift: 6) +
>>>>>>>>>>> (value4 bitAnd: 63).
>>>>>>>>>>> ].
>>>>>>>>>>> + unicode ifNil:[^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter with: char3 asCharacter)].
>>>>>>>>>>> - unicode ifNil:[^self errorMalformedInput: (String with: char1 with: char2 with: char3)].
>>>>>>>>>>> unicode > 16r10FFFD ifTrue: [
>>>>>>>>>>> + ^self errorMalformedInput: (String with: char1 asCharacter with: char2 asCharacter with: char3 asCharacter).
>>>>>>>>>>> - ^self errorMalformedInput: (String with: char1 with: char2 with: char3).
>>>>>>>>>>> ].
>>>>>>>>>>>
>>>>>>>>>>> unicode = 16rFEFF ifTrue: [^ self nextFromStream: aStream].
>>>>>>>>>>> ^ Unicode value: unicode.
>>>>>>>>>>> !