Issue 3360 in pharo: TextConverter handling of binary streams is wrong

pharo
Status: New
Owner: ----

New issue 3360 by sven.van.caekenberghe: TextConverter handling of binary  
streams is wrong
http://code.google.com/p/pharo/issues/detail?id=3360

It seems that the way binary streams (those answering true to #isBinary)
are handled by TextConverter and its subclasses is wrong. When given a
binary stream, the core text converter methods (#nextPut:toStream: and
#nextFromStream:) simply stop encoding or decoding altogether.

Moreover, the unit test UTF8TextConverter>>#testPutSingleCharacter seems
plainly wrong. The actual encoded bytes should be #[97 226 130 172], that
is, $a (97) followed by the three-byte UTF-8 sequence E2 82 AC for the euro
sign (U+20AC).
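
As a worked check of those four bytes, here is a minimal workspace sketch that encodes code points by hand using the standard UTF-8 bit layout. It is not the code of UTF8TextConverter or ZnUTF8Encoder; the block name encode is made up for illustration, and it deliberately covers only code points below 16r10000 (one to three bytes), without validating surrogates.

```smalltalk
"Hypothetical helper block: UTF-8 encode a single code point below 16r10000."
| encode bytes |
encode := [ :codePoint |
	ByteArray streamContents: [ :out |
		codePoint < 16r80
			ifTrue: [
				"One byte: 0xxxxxxx"
				out nextPut: codePoint ]
			ifFalse: [
				codePoint < 16r800
					ifTrue: [
						"Two bytes: 110xxxxx 10xxxxxx"
						out nextPut: (16rC0 bitOr: (codePoint bitShift: -6)).
						out nextPut: (16r80 bitOr: (codePoint bitAnd: 16r3F)) ]
					ifFalse: [
						"Three bytes: 1110xxxx 10xxxxxx 10xxxxxx"
						out nextPut: (16rE0 bitOr: (codePoint bitShift: -12)).
						out nextPut: (16r80 bitOr: ((codePoint bitShift: -6) bitAnd: 16r3F)).
						out nextPut: (16r80 bitOr: (codePoint bitAnd: 16r3F)) ] ] ] ].
"$a is code point 97, the euro sign is code point 16r20AC."
bytes := (encode value: 97) , (encode value: 16r20AC).
bytes "#[97 226 130 172]"
```

For 16r20AC the three branches give 16rE0 + 2 = 226, 16r80 + 2 = 130 and 16r80 + 16r2C = 172, which matches the byte array quoted above.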

However, this behavior seems to have been added by design, so it is hard
to estimate the impact of changing it.

It is currently very ugly to get a binary UTF-8 encoding: one has to write
to a character stream and then turn those characters into bytes.
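
A rough sketch of that two-step workaround, for illustration only. It assumes an image that still ships UTF8TextConverter and String>>#convertToWithConverter: (the same selector used in the benchmark further down this thread), and it assumes #asByteArray is the way the resulting string gets turned into bytes; the variable names are made up.

```smalltalk
"Workspace sketch of the workaround; assumes UTF8TextConverter is present in the image."
| input encodedString encodedBytes |
input := String with: $a with: (Character value: 16r20AC).
"Step 1: the converter writes to a character stream, so the result is a String whose characters hold the byte values."
encodedString := input convertToWithConverter: UTF8TextConverter new.
"Step 2: reinterpret each character of that string as a byte."
encodedBytes := encodedString asByteArray.
```

The intermediate string is what makes this awkward: the encoded data exists first as characters and only afterwards as a ByteArray.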

I wrote an alternative UTF-8 encoder as a support class to the Zinc HTTP  
Components (http://www.squeaksource.com/ZincHTTPComponents.html) together  
with the following unit test:

testUTF8Encoder
        "The examples are taken from http://en.wikipedia.org/wiki/UTF-8#Description"

        | encoder inputBytes outputBytes inputString outputString |
        encoder := ZnUTF8Encoder new.
        inputString := String
                with: $$
                with: (Unicode value: 16r00A2)
                with: (Unicode value: 16r20AC)
                with: (Unicode value: 16r024B62).
        inputBytes := #[16r24 16rC2 16rA2 16rE2 16r82 16rAC 16rF0 16rA4 16rAD 16rA2].
        outputBytes := self encodeString: inputString with: encoder.
        self assert: outputBytes = inputBytes.
        outputString := self decodeBytes: inputBytes with: encoder.
        self assert: outputString = inputString

based on the helper methods:

encodeString: string with: encoder
        ^ ByteArray streamContents: [ :stream |
                string do: [ :each |
                        encoder nextPut: each toStream: stream ] ]

decodeBytes: bytes with: encoder
        | input |
        input := bytes readStream.
        ^ String streamContents: [ :stream |
                [ input atEnd ] whileFalse: [
                        stream nextPut: (encoder nextFromStream: input) ] ]

The new encoder code is simpler, but it might not handle everything that is
needed (leading chars, language codes). Then again, is all of that still
needed?

Sven

Re: Issue 3360 in pharo: TextConverter handling of binary streams is wrong

pharo

Comment #1 on issue 3360 by sven.van.caekenberghe: TextConverter handling  
of binary streams is wrong
http://code.google.com/p/pharo/issues/detail?id=3360

ZnUTF8Encoder seems to be about 20% faster than UTF8TextConverter, based on
the following benchmark (which probably includes too many characters with
multibyte encodings):

[ 100 timesRepeat: [ ZnCharacterEncoderTests new testUTF8EncoderAuto ] ] timeToRun.

[ 100 timesRepeat: [
        | in tmp out |
        in := String withAll: ((1 to: 3072) collect: [ :each | Character value: each ]).
        tmp := in convertToWithConverter: UTF8TextConverter new.
        out := tmp convertFromWithConverter: UTF8TextConverter new.
        self assert: in = out ] ] timeToRun.

Note that tmp is a ByteString, not a ByteArray.

Sven