TextConverter handling of binary streams

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

TextConverter handling of binary streams

Sven Van Caekenberghe
Hi,

TextConverter and its subclasses seem to break the contract of #nextFromStream: and #nextPut:toStream: when the stream #isBinary. Consider the following two examples:

ByteArray streamContents: [ :stream | | encoder |
        encoder := UTF8TextConverter new.
        'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ].

 #[233 108 232 118 101 32 101 110 32 102 114 97 110 231 97 105 115]

(String streamContents: [ :stream | | encoder |
        encoder := UTF8TextConverter new.
        'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ]) asByteArray.

 #[195 169 108 195 168 118 101 32 101 110 32 102 114 97 110 195 167 97 105 115]

The first answer is incorrect, the second is correct (as far as I understand it).

This is apparently on purpose, from the implementation of, for example, UTF8TextConverter>>#nextPut:toStream:

nextPut: aCharacter toStream: aStream
        | leadingChar nBytes mask shift ucs2code |
        aStream isBinary ifTrue: [^aCharacter storeBinaryOn: aStream].
        leadingChar := aCharacter leadingChar.
        (leadingChar = 0 and: [aCharacter asciiValue < 128]) ifTrue: [
                aStream basicNextPut: aCharacter.
                ^ aStream.
        ].

        "leadingChar > 3 ifTrue: [^ aStream]."

        ucs2code := aCharacter asUnicode.
        ucs2code ifNil: [^ aStream].

        nBytes := ucs2code highBit + 3 // 5.
        mask := #(128 192 224 240 248 252 254 255) at: nBytes.
        shift := nBytes - 1 * -6.
        aStream basicNextPut: (Character value: (ucs2code bitShift: shift) + mask).
        2 to: nBytes do: [:i |
                shift := shift + 6.
                aStream basicNextPut: (Character value: ((ucs2code bitShift: shift) bitAnd: 63) + 128).
        ].

        ^ aStream.

I would say that the contract of #nextPut:toStream: is to take a Character object and write a binary representation using a specific encoding to a stream. However, when given a #isBinary stream, it does no longer do any encoding at all !

The same is true for the other converters as well as for #nextFromStream:.

Does anyone know why that is the case ?

And if it is by design, how should one do UTF8 encoding on a binary stream ??

Thx,

Sven

PS: If others also think this is strange, I could make an issue, I am just not sure this is a bug.



Reply | Threaded
Open this post in threaded view
|

Re: TextConverter handling of binary streams

Stéphane Ducasse
sven

I'm terribly and more than that busy until mid or more dec.
Now did you check if the behavior is the same in squeak?
S.

On Nov 29, 2010, at 3:23 PM, Sven Van Caekenberghe wrote:

> Hi,
>
> TextConverter and its subclasses seem to break the contract of #nextFromStream: and #nextPut:toStream: when the stream #isBinary. Consider the following two examples:
>
> ByteArray streamContents: [ :stream | | encoder |
> encoder := UTF8TextConverter new.
> 'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ].
>
> #[233 108 232 118 101 32 101 110 32 102 114 97 110 231 97 105 115]
>
> (String streamContents: [ :stream | | encoder |
> encoder := UTF8TextConverter new.
> 'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ]) asByteArray.
>
> #[195 169 108 195 168 118 101 32 101 110 32 102 114 97 110 195 167 97 105 115]
>
> The first answer is incorrect, the second is correct (as far as I understand it).
>
> This is apparently on purpose, from the implementation of, for example, UTF8TextConverter>>#nextPut:toStream:
>
> nextPut: aCharacter toStream: aStream
> | leadingChar nBytes mask shift ucs2code |
> aStream isBinary ifTrue: [^aCharacter storeBinaryOn: aStream].
> leadingChar := aCharacter leadingChar.
> (leadingChar = 0 and: [aCharacter asciiValue < 128]) ifTrue: [
> aStream basicNextPut: aCharacter.
> ^ aStream.
> ].
>
> "leadingChar > 3 ifTrue: [^ aStream]."
>
> ucs2code := aCharacter asUnicode.
> ucs2code ifNil: [^ aStream].
>
> nBytes := ucs2code highBit + 3 // 5.
> mask := #(128 192 224 240 248 252 254 255) at: nBytes.
> shift := nBytes - 1 * -6.
> aStream basicNextPut: (Character value: (ucs2code bitShift: shift) + mask).
> 2 to: nBytes do: [:i |
> shift := shift + 6.
> aStream basicNextPut: (Character value: ((ucs2code bitShift: shift) bitAnd: 63) + 128).
> ].
>
> ^ aStream.
>
> I would say that the contract of #nextPut:toStream: is to take a Character object and write a binary representation using a specific encoding to a stream. However, when given a #isBinary stream, it does no longer do any encoding at all !
>
> The same is true for the other converters as well as for #nextFromStream:.
>
> Does anyone know why that is the case ?
>
> And if it is by design, how should one do UTF8 encoding on a binary stream ??
>
> Thx,
>
> Sven
>
> PS: If others also think this is strange, I could make an issue, I am just not sure this is a bug.
>
>
>


Reply | Threaded
Open this post in threaded view
|

Re: TextConverter handling of binary streams

Philippe Marschall-2
In reply to this post by Sven Van Caekenberghe
On 11/29/2010 03:23 PM, Sven Van Caekenberghe wrote:

> Hi,
>
> TextConverter and its subclasses seem to break the contract of #nextFromStream: and #nextPut:toStream: when the stream #isBinary. Consider the following two examples:
>
> ByteArray streamContents: [ :stream | | encoder |
> encoder := UTF8TextConverter new.
> 'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ].
>
>  #[233 108 232 118 101 32 101 110 32 102 114 97 110 231 97 105 115]
>
> (String streamContents: [ :stream | | encoder |
> encoder := UTF8TextConverter new.
> 'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ]) asByteArray.
>
>  #[195 169 108 195 168 118 101 32 101 110 32 102 114 97 110 195 167 97 105 115]
>
> The first answer is incorrect, the second is correct (as far as I understand it).

I would agree.

> This is apparently on purpose, from the implementation of, for example, UTF8TextConverter>>#nextPut:toStream:
>
> nextPut: aCharacter toStream: aStream
> | leadingChar nBytes mask shift ucs2code |
> aStream isBinary ifTrue: [^aCharacter storeBinaryOn: aStream].
> leadingChar := aCharacter leadingChar.
> (leadingChar = 0 and: [aCharacter asciiValue < 128]) ifTrue: [
> aStream basicNextPut: aCharacter.
> ^ aStream.
> ].
>
> "leadingChar > 3 ifTrue: [^ aStream]."
>
> ucs2code := aCharacter asUnicode.
> ucs2code ifNil: [^ aStream].
>
> nBytes := ucs2code highBit + 3 // 5.
> mask := #(128 192 224 240 248 252 254 255) at: nBytes.
> shift := nBytes - 1 * -6.
> aStream basicNextPut: (Character value: (ucs2code bitShift: shift) + mask).
> 2 to: nBytes do: [:i |
> shift := shift + 6.
> aStream basicNextPut: (Character value: ((ucs2code bitShift: shift) bitAnd: 63) + 128).
> ].
>
> ^ aStream.
>
> I would say that the contract of #nextPut:toStream: is to take a Character object and write a binary representation using a specific encoding to a stream. However, when given a #isBinary stream, it does no longer do any encoding at all !

There are many cases where Squeak/Pharo do silent coercion, where
obviously something went wrong (you pass a ByteArray where a String is
expected or vice versa) and the system tries to be clever about it and
do something else than raising an exception and forcing you to fix your
code. The result is often wrong because the system can not guess what
you wanted to do.

> The same is true for the other converters as well as for #nextFromStream:.
>
> Does anyone know why that is the case ?
>
> And if it is by design, how should one do UTF8 encoding on a binary stream ??
>
> Thx,
>
> Sven
>
> PS: If others also think this is strange, I could make an issue, I am just not sure this is a bug.

I find that strange as well.

Cheers
Philippe


Reply | Threaded
Open this post in threaded view
|

Re: TextConverter handling of binary streams

Sven Van Caekenberghe
In reply to this post by Stéphane Ducasse
I understand very well that you are very busy, Stéphane, I am not asking for a solution.

It is exactly the same in Squeak.

Like Philippe said, it looks wrong.

On 29 Nov 2010, at 20:12, Stéphane Ducasse wrote:

> sven
>
> I'm terribly and more than that busy until mid or more dec.
> Now did you check if the behavior is the same in squeak?
> S.
>
> On Nov 29, 2010, at 3:23 PM, Sven Van Caekenberghe wrote:
>
>> Hi,
>>
>> TextConverter and its subclasses seem to break the contract of #nextFromStream: and #nextPut:toStream: when the stream #isBinary. Consider the following two examples:
>>
>> ByteArray streamContents: [ :stream | | encoder |
>> encoder := UTF8TextConverter new.
>> 'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ].
>>
>> #[233 108 232 118 101 32 101 110 32 102 114 97 110 231 97 105 115]
>>
>> (String streamContents: [ :stream | | encoder |
>> encoder := UTF8TextConverter new.
>> 'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ]) asByteArray.
>>
>> #[195 169 108 195 168 118 101 32 101 110 32 102 114 97 110 195 167 97 105 115]
>>
>> The first answer is incorrect, the second is correct (as far as I understand it).
>>
>> This is apparently on purpose, from the implementation of, for example, UTF8TextConverter>>#nextPut:toStream:
>>
>> nextPut: aCharacter toStream: aStream
>> | leadingChar nBytes mask shift ucs2code |
>> aStream isBinary ifTrue: [^aCharacter storeBinaryOn: aStream].
>> leadingChar := aCharacter leadingChar.
>> (leadingChar = 0 and: [aCharacter asciiValue < 128]) ifTrue: [
>> aStream basicNextPut: aCharacter.
>> ^ aStream.
>> ].
>>
>> "leadingChar > 3 ifTrue: [^ aStream]."
>>
>> ucs2code := aCharacter asUnicode.
>> ucs2code ifNil: [^ aStream].
>>
>> nBytes := ucs2code highBit + 3 // 5.
>> mask := #(128 192 224 240 248 252 254 255) at: nBytes.
>> shift := nBytes - 1 * -6.
>> aStream basicNextPut: (Character value: (ucs2code bitShift: shift) + mask).
>> 2 to: nBytes do: [:i |
>> shift := shift + 6.
>> aStream basicNextPut: (Character value: ((ucs2code bitShift: shift) bitAnd: 63) + 128).
>> ].
>>
>> ^ aStream.
>>
>> I would say that the contract of #nextPut:toStream: is to take a Character object and write a binary representation using a specific encoding to a stream. However, when given a #isBinary stream, it does no longer do any encoding at all !
>>
>> The same is true for the other converters as well as for #nextFromStream:.
>>
>> Does anyone know why that is the case ?
>>
>> And if it is by design, how should one do UTF8 encoding on a binary stream ??
>>
>> Thx,
>>
>> Sven
>>
>> PS: If others also think this is strange, I could make an issue, I am just not sure this is a bug.
>>
>>
>>
>
>


Reply | Threaded
Open this post in threaded view
|

Re: TextConverter handling of binary streams

Sven Van Caekenberghe
In reply to this post by Philippe Marschall-2
Philippe,

I would even go further, the test UTF8TextConverterTest>>#testPutSingleCharacter is plain wrong, it seems to have been written as an afterthought:

testPutSingleCharacter
        | actual |
        actual := ByteArray streamContents: [ :stream |
                | converter |
                converter := UTF8TextConverter new.
                converter
                        nextPut: $a
                        toStream: stream.
                converter
                        nextPut: (Unicode value: 16r20AC)
                        toStream: stream ].
        self assert: actual = #[97 0 0 32 172]

The correct result is #[97 226 130 172] !!

If you would give these bytes to any other system, they would never be able to decode them as UTF-8.

Sven

On 30 Nov 2010, at 08:46, Philippe Marschall wrote:

> On 11/29/2010 03:23 PM, Sven Van Caekenberghe wrote:
>> Hi,
>>
>> TextConverter and its subclasses seem to break the contract of #nextFromStream: and #nextPut:toStream: when the stream #isBinary. Consider the following two examples:
>>
>> ByteArray streamContents: [ :stream | | encoder |
>> encoder := UTF8TextConverter new.
>> 'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ].
>>
>> #[233 108 232 118 101 32 101 110 32 102 114 97 110 231 97 105 115]
>>
>> (String streamContents: [ :stream | | encoder |
>> encoder := UTF8TextConverter new.
>> 'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ]) asByteArray.
>>
>> #[195 169 108 195 168 118 101 32 101 110 32 102 114 97 110 195 167 97 105 115]
>>
>> The first answer is incorrect, the second is correct (as far as I understand it).
>
> I would agree.
>
>> This is apparently on purpose, from the implementation of, for example, UTF8TextConverter>>#nextPut:toStream:
>>
>> nextPut: aCharacter toStream: aStream
>> | leadingChar nBytes mask shift ucs2code |
>> aStream isBinary ifTrue: [^aCharacter storeBinaryOn: aStream].
>> leadingChar := aCharacter leadingChar.
>> (leadingChar = 0 and: [aCharacter asciiValue < 128]) ifTrue: [
>> aStream basicNextPut: aCharacter.
>> ^ aStream.
>> ].
>>
>> "leadingChar > 3 ifTrue: [^ aStream]."
>>
>> ucs2code := aCharacter asUnicode.
>> ucs2code ifNil: [^ aStream].
>>
>> nBytes := ucs2code highBit + 3 // 5.
>> mask := #(128 192 224 240 248 252 254 255) at: nBytes.
>> shift := nBytes - 1 * -6.
>> aStream basicNextPut: (Character value: (ucs2code bitShift: shift) + mask).
>> 2 to: nBytes do: [:i |
>> shift := shift + 6.
>> aStream basicNextPut: (Character value: ((ucs2code bitShift: shift) bitAnd: 63) + 128).
>> ].
>>
>> ^ aStream.
>>
>> I would say that the contract of #nextPut:toStream: is to take a Character object and write a binary representation using a specific encoding to a stream. However, when given a #isBinary stream, it does no longer do any encoding at all !
>
> There are many cases where Squeak/Pharo do silent coercion, where
> obviously something went wrong (you pass a ByteArray where a String is
> expected or vice versa) and the system tries to be clever about it and
> do something else than raising an exception and forcing you to fix your
> code. The result is often wrong because the system can not guess what
> you wanted to do.
>
>> The same is true for the other converters as well as for #nextFromStream:.
>>
>> Does anyone know why that is the case ?
>>
>> And if it is by design, how should one do UTF8 encoding on a binary stream ??
>>
>> Thx,
>>
>> Sven
>>
>> PS: If others also think this is strange, I could make an issue, I am just not sure this is a bug.
>
> I find that strange as well.
>
> Cheers
> Philippe


Reply | Threaded
Open this post in threaded view
|

Re: TextConverter handling of binary streams

Stéphane Ducasse
In reply to this post by Stéphane Ducasse

On Nov 30, 2010, at 11:46 AM, Sven Van Caekenberghe wrote:

> I understand very well that you are very busy, Stéphane, I am not asking for a solution.

:)
but I would love to have time :)

>
> It is exactly the same in Squeak.
>
> Like Philippe said, it looks wrong.

so open a ticket

>
> On 29 Nov 2010, at 20:12, Stéphane Ducasse wrote:
>
>> sven
>>
>> I'm terribly and more than that busy until mid or more dec.
>> Now did you check if the behavior is the same in squeak?
>> S.
>>
>> On Nov 29, 2010, at 3:23 PM, Sven Van Caekenberghe wrote:
>>
>>> Hi,
>>>
>>> TextConverter and its subclasses seem to break the contract of #nextFromStream: and #nextPut:toStream: when the stream #isBinary. Consider the following two examples:
>>>
>>> ByteArray streamContents: [ :stream | | encoder |
>>> encoder := UTF8TextConverter new.
>>> 'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ].
>>>
>>> #[233 108 232 118 101 32 101 110 32 102 114 97 110 231 97 105 115]
>>>
>>> (String streamContents: [ :stream | | encoder |
>>> encoder := UTF8TextConverter new.
>>> 'élève en français' do: [ :each | encoder nextPut: each toStream: stream ] ]) asByteArray.
>>>
>>> #[195 169 108 195 168 118 101 32 101 110 32 102 114 97 110 195 167 97 105 115]
>>>
>>> The first answer is incorrect, the second is correct (as far as I understand it).
>>>
>>> This is apparently on purpose, from the implementation of, for example, UTF8TextConverter>>#nextPut:toStream:
>>>
>>> nextPut: aCharacter toStream: aStream
>>> | leadingChar nBytes mask shift ucs2code |
>>> aStream isBinary ifTrue: [^aCharacter storeBinaryOn: aStream].
>>> leadingChar := aCharacter leadingChar.
>>> (leadingChar = 0 and: [aCharacter asciiValue < 128]) ifTrue: [
>>> aStream basicNextPut: aCharacter.
>>> ^ aStream.
>>> ].
>>>
>>> "leadingChar > 3 ifTrue: [^ aStream]."
>>>
>>> ucs2code := aCharacter asUnicode.
>>> ucs2code ifNil: [^ aStream].
>>>
>>> nBytes := ucs2code highBit + 3 // 5.
>>> mask := #(128 192 224 240 248 252 254 255) at: nBytes.
>>> shift := nBytes - 1 * -6.
>>> aStream basicNextPut: (Character value: (ucs2code bitShift: shift) + mask).
>>> 2 to: nBytes do: [:i |
>>> shift := shift + 6.
>>> aStream basicNextPut: (Character value: ((ucs2code bitShift: shift) bitAnd: 63) + 128).
>>> ].
>>>
>>> ^ aStream.
>>>
>>> I would say that the contract of #nextPut:toStream: is to take a Character object and write a binary representation using a specific encoding to a stream. However, when given a #isBinary stream, it does no longer do any encoding at all !
>>>
>>> The same is true for the other converters as well as for #nextFromStream:.
>>>
>>> Does anyone know why that is the case ?
>>>
>>> And if it is by design, how should one do UTF8 encoding on a binary stream ??
>>>
>>> Thx,
>>>
>>> Sven
>>>
>>> PS: If others also think this is strange, I could make an issue, I am just not sure this is a bug.
>>>
>>>
>>>
>>
>>
>
>


Reply | Threaded
Open this post in threaded view
|

Re: TextConverter handling of binary streams

Stéphane Ducasse
In reply to this post by Philippe Marschall-2
fix fix fix :)

Stef

> I would even go further, the test UTF8TextConverterTest>>#testPutSingleCharacter is plain wrong, it seems to have been written as an afterthought:
>
> testPutSingleCharacter
> | actual |
> actual := ByteArray streamContents: [ :stream |
> | converter |
> converter := UTF8TextConverter new.
> converter
> nextPut: $a
> toStream: stream.
> converter
> nextPut: (Unicode value: 16r20AC)
> toStream: stream ].
> self assert: actual = #[97 0 0 32 172]
>
> The correct result is #[97 226 130 172] !!
>
> If you would give these bytes to any other system, they would never be able to decode them as UTF-8.


Reply | Threaded
Open this post in threaded view
|

Re: TextConverter handling of binary streams

Sven Van Caekenberghe

On 30 Nov 2010, at 12:20, Stéphane Ducasse wrote:

> fix fix fix :)

I created issue #3360.

I already knew that you would be asking for code, so yesterday I wrote an alternative UTF-8 encoder as a support class to Zinc HTTP Components.

The problem is that what we are now calling 'wrong' seems to be added 'by design'. The question is why and what will be the impact of changing that behavior ?

Sven