Streaming over a UTF-8 encoded file using upToAll:

bpi
Hi all,

I want to stream over a file which is encoded in UTF-8 and have a problem with upToAll: answering more than it should.

Here is a minimal test:
'ße' readStream upToAll: 'e'.
This answers 'ß' as expected.

I wrote those two characters to a file:
FileStream forceNewFileNamed: 'test' do: [ :stream | stream nextPutAll: 'ße' ].

I opened it using a text editor to verify the encoding. It is UTF-8 as expected.

Now if I do the following...
'test' asFileReference readStreamDo: [ :stream | stream upToAll: 'e' ].
... I get 'ße' instead of just 'ß'.

I tried explicitly setting a UTF8TextConverter:
'test' asFileReference readStreamDo: [ :stream | stream converter: UTF8TextConverter new; upToAll: 'e' ].
However, the result is still 'ße'.

If I read the entire file first using #contentsOfEntireFile, it works:
('test' asFileReference readStreamDo: [ :stream | stream contentsOfEntireFile ]) readStream upToAll: 'e'
However, that defeats the purpose of streaming.

Can someone tell me what I am doing wrong?

I am on Pharo 6.1 on a Mac.

Bernhard

Re: Streaming over a UTF-8 encoded file using upToAll:

Henrik-Nergaard
Hi,

#upTo: works fine.

'test' asFileReference readStreamDo: [ :stream | stream converter: UTF8TextConverter new; upTo: $e ].  "'ß'"

It looks like PositionableStream>>#upToAll: assumes a one-to-one mapping
between stream positions and items: once the pattern is found, it answers
as many items as the difference between the starting position and the
position of the match.
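
To make the arithmetic concrete, here is a minimal Playground sketch. It assumes that String>>#utf8Encoded is available in the image, that the file stream counts positions in bytes, and that the characters are then read back one decoded item at a time. The variable names are only for illustration.

```smalltalk
| text bytes startPosition matchEnd patternSize |
text := 'ße'.
bytes := text utf8Encoded.      "ByteArray with the UTF-8 encoding of the string"
text size.                      "2 characters"
bytes size.                     "3 bytes: ß takes two bytes, e takes one"

"Assumed bookkeeping: positions are counted in bytes."
startPosition := 0.
matchEnd := bytes size.         "position right after the pattern 'e' was matched"
patternSize := 'e' size.
matchEnd - startPosition - patternSize.   "2"
```

Under those assumptions the difference is 2, so two characters are read back and the answer is 'ße' rather than 'ß'.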

Best regards,
Henrik



--
Sent from: http://forum.world.st/Pharo-Smalltalk-Developers-f1294837.html

Re: Streaming over a UTF-8 encoded file using upToAll:

bpi
Hi Henrik,

Thanks for your answer. Sounds like a bug, then. :-/

Cheers,
Bernhard

> On 28.12.2017, at 20:31, Henrik-Nergaard <[hidden email]> wrote:
>
> Hi,
>
> #upTo: works fine.
>
> 'test' asFileReference readStreamDo: [ :stream | stream converter:
> UTF8TextConverter new; upTo: $e ].  "'ß'"
>
> It looks like PositionableStream>>#upToAll: assumes a 1 to 1 map per item,
> and only takes the difference between current position up to the pattern
> when found.
>
> Best regards,
> Henrik


Re: Streaming over a UTF-8 encoded file using upToAll:

bpi
I just checked and the bug in #upToAll: is still there in Pharo 7.

Bernhard

> On 29.12.2017, at 20:26, Bernhard Pieber <[hidden email]> wrote:
>
> Hi Henrik,
>
> Thanks for your answer. Sounds like a bug, then. :-/
>
> Cheers,
> Bernhard


Re: Streaming over a UTF-8 encoded file using upToAll:

bpi
I created an issue for this:
https://pharo.fogbugz.com/f/cases/20898

Bernhard

> On 30.12.2017, at 12:01, Bernhard Pieber <[hidden email]> wrote:
>
> I just checked and the bug in #upToAll: is still there in Pharo 7.
>
> Bernhard


Re: Streaming over a UTF-8 encoded file using upToAll:

Henrik-Nergaard
In reply to this post by bpi
Here is a fix.

------------------------------------
PositionableStream>>#upToAll: aCollection
        "Answer a subcollection from the current access position to the occurrence
        (if any, but not inclusive) of aCollection. If aCollection is not in the
        stream, answer the entire rest of the stream."

        | output pattern |

        aCollection ifEmpty: [ ^ collection species empty ].

        output := (collection species new: 100) writeStream.
        pattern := aCollection readStream.

        [ pattern atEnd ] whileFalse: [ | item |
                self atEnd ifTrue: [
                        "Stream exhausted: flush the partially matched prefix and answer everything read."
                        output next: pattern position putAll: aCollection startingAt: 1.
                        ^ output contents
                ].

                item := self next.
                (pattern peekFor: item) ifFalse: [
                        "Mismatch: flush the partially matched prefix and the current item, then restart matching."
                        output
                                next: pattern position putAll: aCollection startingAt: 1;
                                nextPut: item.

                        pattern reset.
                ].
        ].

        ^ output contents
------------------------------------
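
One case worth checking with a single-pass matcher like the one above is a pattern whose first item repeats in the stream, e.g. 'aab' readStream upToAll: 'ab'. Resetting the pattern after a mismatch discards the second $a as a possible start of the match, so by my reading the method answers 'aab' instead of 'a'. A sliding window over the last items read avoids this without repositioning the stream. The following is only a sketch written as a Playground block, not the code of the fix above and not the Squeak implementation. The names upToAll, window and found are made up for illustration, and it assumes a stream of characters.

```smalltalk
| upToAll |
upToAll := [ :stream :pattern | | output window found |
	output := WriteStream on: (String new: 100).
	window := OrderedCollection new.   "the most recent items, at most pattern size of them"
	found := pattern isEmpty.          "an empty pattern matches immediately"
	[ found or: [ stream atEnd ] ] whileFalse: [
		window addLast: stream next.
		"Items that fall out of the window can no longer be part of a match."
		window size > pattern size
			ifTrue: [ output nextPut: window removeFirst ].
		found := window asArray = pattern asArray ].
	"No match: the items still in the window belong to the answer."
	found ifFalse: [ window do: [ :each | output nextPut: each ] ].
	output contents ].

upToAll value: 'ße' readStream value: 'e'.     "'ß'"
upToAll value: 'aab' readStream value: 'ab'.   "'a'"
```

The window comparison costs one pass over the pattern per item read, which is fine for short patterns. It only uses #next and #atEnd, so it does not depend on a one-to-one mapping between positions and items.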

Best regards,
Henrik






Re: Streaming over a UTF-8 encoded file using upToAll:

Stephane Ducasse-3
In reply to this post by bpi
Thanks.
What I understood is that a positionable read stream does not work with
variable-width encodings like UTF-8 because there is no one-to-one mapping
between positions and characters.

On Sat, Dec 30, 2017 at 12:54 PM, Bernhard Pieber <[hidden email]> wrote:

> I created an issue for this:
> https://pharo.fogbugz.com/f/cases/20898
>
> Bernhard

Re: Streaming over a UTF-8 encoded file using upToAll:

bpi
In reply to this post by Henrik-Nergaard
Hi Henrik,

Thanks for the fix. I just saw it today. In the meantime I have created a pull request with another fix. To be honest, I have just taken the working implementation from Squeak:
https://github.com/pharo-project/pharo/pull/632

Alas, the CI check failed for some reason I don't understand. :-/

Happy New Year!
Bernhard

> On 30.12.2017, at 13:32, Henrik-Nergaard <[hidden email]> wrote:
>
> Here is a fix.


Re: Streaming over a UTF-8 encoded file using upToAll:

Stephane Ducasse-3
Thanks for the submission; we will check.


On Sun, Dec 31, 2017 at 4:26 PM, Bernhard Pieber <[hidden email]> wrote:

> Hi Henrik,
>
> Thanks for the fix. I just saw it today. In the meantime I have created a pull request with another fix. To be honest, I have just taken the working implementation from Squeak:
> https://github.com/pharo-project/pharo/pull/632
>
> Alas, the CI check failed for some reason I don't understand. :-/
>
> Happy New Year!
> Bernhard