Encoding and conversion problem

CyrilFerlicot
Hi,

I refactored Moose so that it uses the encoding detector that Sven
wrote a few weeks ago when reading a file.

With the latest stable version of ZincHTTPComponent, I can get the
encoding like this:

fileReference binaryReadStreamDo: [ :in |
	ZnCharacterEncoder detectEncoding: in upToEnd ]

Since we need to read the files a lot, I save the identifier of the
encoder using the #identifier method. Then when I read I just want to
get the TextConverter corresponding to the encoder in order to read
the stream.

The problem is that for a file encoded in ISO-8859-1, my instance of
ZnSimplifiedByteEncoder returns 'iso88591' as its identifier, and
Latin1TextConverter does not list that name among its encoding names,
only 'iso-8859-1'.

Should we add 'iso88591' to the Latin1TextConverter? If yes, could we
backport this to Pharo 6 please?

--
Cyril Ferlicot
https://ferlicot.fr

http://www.synectique.eu
2 rue Jacques Prévert 01,
59650 Villeneuve d'ascq France


Re: Encoding and conversion problem

Sven Van Caekenberghe-2

> On 18 Jul 2017, at 15:42, Cyril Ferlicot <[hidden email]> wrote:
>
> Hi,
>
> I refactored Moose so that it uses the encoding detector that Sven
> wrote a few weeks ago when reading a file.
>
> With the latest stable version of ZincHTTPComponent, I can get the
> encoding like this:
>
> fileReference binaryReadStreamDo: [ :in | (ZnCharacterEncoder
> detectEncoding: in upToEnd) ]
>
> Since we need to read the files a lot, I save the identifier of the
> encoder using the #identifier method. Then when I read I just want to
> get the TextConverter corresponding to the encoder in order to read
> the stream.
>
> The problem is that for a file encoded in ISO-8859-1, my instance of
> ZnSimplifiedByteEncoder returns 'iso88591' as its identifier, and
> Latin1TextConverter does not list that name among its encoding names,
> only 'iso-8859-1'.
>
> Should we add 'iso88591' to the Latin1TextConverter? If yes, could we
> backport this to Pharo 6 please?

These are all aliases [ see: https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ].

So you could add it, yes.

But why use TextConverter at all?

You could keep using the alternative (more modern, cleaner) ZnCharacterEncoder hierarchy.

Just open your streams in binary mode and wrap a ZnCharacterReadStream around them with the encoding of your choice.

fileReference binaryReadStreamDo: [ :in | (ZnCharacterReadStream on: in encoding: #latin1) ... ]
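
A minimal sketch of how the stored identifier could be reused without going through TextConverter. It assumes that the identifier returned by #identifier can be passed straight back as the encoding: argument; the in-memory byte array and the variable names are illustrative stand-ins for the bytes of a real file.

```smalltalk
"Sketch only: 'bytes' stands in for what
 fileReference binaryReadStreamDo: [ :in | in upToEnd ] would return."
| bytes storedIdentifier contents |
bytes := #[99 97 102 233].	"'café' encoded in ISO-8859-1"

"First pass: detect the encoding once and keep only its identifier
 (the thread reports 'iso88591' for ISO-8859-1 input)."
storedIdentifier := (ZnCharacterEncoder detectEncoding: bytes) identifier.

"Later passes: hand the stored identifier to ZnCharacterReadStream,
 so no TextConverter lookup is needed."
contents := (ZnCharacterReadStream
	on: bytes readStream
	encoding: storedIdentifier) upToEnd.
contents
```

With a real file, the binary stream passed to binaryReadStreamDo: would take the place of bytes readStream.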




Re: Encoding and conversion problem

CyrilFerlicot
On Tue, Jul 18, 2017 at 3:54 PM, Sven Van Caekenberghe <[hidden email]> wrote:
>
> These are all aliases [ see: https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ].
>
> So you could add it, yes
>
> But why use TextConverter at all ?
>

I used it because it is what I found while browsing the code. When I
wrote this, the CI was down and I could not access EnterprisePharo.

> You could keep on using the alternative (more modern, cleaner) ZnCharacterEncoder hierarchy.
>
> Just open your streams binary and wrap a ZnCharacterReadStream around them with the encoding of your choice.
>
> fileReference binaryReadStreamDo: [ :in | (ZnCharacterReadStream on: in encoding: #latin1) ... ]
>

I just tried this, but it broke my code. For example,
ZnCharacterReadStream does not understand #position:, while the
ReadStream hierarchy understands it.



--
Cyril Ferlicot
https://ferlicot.fr

http://www.synectique.eu
2 rue Jacques Prévert 01,
59650 Villeneuve d'ascq France


Re: Encoding and conversion problem

Sven Van Caekenberghe-2

> On 18 Jul 2017, at 16:10, Cyril Ferlicot <[hidden email]> wrote:
>
> On Tue, Jul 18, 2017 at 3:54 PM, Sven Van Caekenberghe <[hidden email]> wrote:
>>
>> These are all aliases [ see: https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ].
>>
>> So you could add it, yes
>>
>> But why use TextConverter at all ?
>>
>
> I used it because it is what I found while browsing the code. When I
> wrote this, the CI was down and I could not access EnterprisePharo.
>
>> You could keep on using the alternative (more modern, cleaner) ZnCharacterEncoder hierarchy.
>>
>> Just open your streams binary and wrap a ZnCharacterReadStream around them with the encoding of your choice.
>>
>> fileReference binaryReadStreamDo: [ :in | (ZnCharacterReadStream on: in encoding: #latin1) ... ]
>>
>
> I just tried this, but it broke my code. For example,
> ZnCharacterReadStream does not understand #position:, while the
> ReadStream hierarchy understands it.

In general, the stream API is much, much too wide, IMHO. Not all streams (hence the word stream) can see all their content all the time (think of network or encrypted streams, most work with a sliding buffer, but in general the idea of a stream is to *not* hold everything in memory at the same time). If you look at it that way, many operations stop making sense. Positioning is one of them. Even in Java not every stream is positionable, and even if they are positionable, the positioning can fail because you go too far (back).

In all parsing code that I write I try to only use 1 item look ahead (a single item buffer), which means you can #peek 1 item, that's it.
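
A minimal sketch of that style, assuming a comma-separated list of unsigned integers as input (the input format and variable names are made up for illustration). The parser only sends #atEnd, #peek and #next, so it works on a stream that cannot be repositioned.

```smalltalk
"Sketch only: parse '12,345,6' with a single character of look-ahead."
| bytes stream numbers value |
bytes := '12,345,6' asByteArray.
stream := ZnCharacterReadStream on: bytes readStream encoding: #latin1.
numbers := OrderedCollection new.
[ stream atEnd ] whileFalse: [
	value := 0.
	"Peek at exactly one character to decide whether the number continues."
	[ stream atEnd not and: [ stream peek isDigit ] ] whileTrue: [
		value := value * 10 + stream next digitValue ].
	numbers add: value.
	"Whatever follows a number must be a separator; consume it."
	stream atEnd ifFalse: [
		stream next = $,
			ifFalse: [ Error signal: 'Unexpected character' ] ] ].
numbers
```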

I know that we have parsing code that was written with other assumptions. Such code is not stream based, IMHO.

The solution (or rather workaround) is easy: read everything #upToEnd and turn it into a good old ReadStream again. That gives the user unrestricted positioning over all of the content, at the cost of more memory usage.
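
A minimal sketch of this workaround, using an in-memory byte array in place of a real file (the sample content and variable names are illustrative):

```smalltalk
"Sketch only: decode everything once, then use a plain ReadStream
 on the resulting String, which does understand #position:."
| bytes contents stream firstPart secondPart |
bytes := ('abc' , String lf , 'def') asByteArray.
contents := (ZnCharacterReadStream
	on: bytes readStream
	encoding: #latin1) upToEnd.
stream := contents readStream.

"Jump forward past 'abc' and the line feed, then read the rest."
stream position: 4.
secondPart := stream upToEnd.

"Jump back to the start; this is what ZnCharacterReadStream cannot do."
stream position: 0.
firstPart := stream next: 3.
```

The whole decoded content stays in memory for as long as the ReadStream is in use.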




Re: Encoding and conversion problem

CyrilFerlicot
On Tue, Jul 18, 2017 at 4:27 PM, Sven Van Caekenberghe <[hidden email]> wrote:
>
> In general, the stream API is much, much too wide, IMHO. Not all streams (hence the word stream) can see all their content all the time (think of network or encrypted streams, most work with a sliding buffer, but in general the idea of a stream is to *not* hold everything in memory at the same time). If you look at it that way, many operations stop making sense. Positioning is one of them. Even in Java not every stream is positionable, and even if they are positionable, the positioning can fail because you go too far (back).
>
> In all parsing code that I write I try to only use 1 item look ahead (a single item buffer), which means you can #peek 1 item, that's it.
>
> I know that we have parsing code that was written with other assumptions. Such code is not stream based, IMHO.
>
> The solution (or rather workaround) is easy: read everything #upToEnd and turn it into a good old ReadStream again. That gives the user unrestricted positioning over all of the content, at the cost of more memory usage.


When I have time I will check whether we can avoid using #position:,
but in any case we cannot use #upToEnd because we need to stay
optimized.

At the beginning we were doing that, but it takes too long for our
algorithms. Sometimes we just want to read what is between line 3,
column 5 and line 6, column 9. If the file has thousands of lines, we
lose too much time reading everything.
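
A rough sketch of one forward-only way to extract such a range, not the code Moose uses. It assumes 1-based line and column numbers, LF line endings and an inclusive end position; the sample text and all variable names are made up. It still decodes from the start of the input, but stops as soon as the end position is passed, so the remainder of a long file is never read.

```smalltalk
"Sketch only: collect the characters between (startLine, startColumn)
 and (endLine, endColumn) using just #atEnd and #next."
| bytes stream result char line column startLine startColumn endLine endColumn |
bytes := (String lf join: #('line one' 'line two' 'line three' 'line four'))
	asByteArray.
startLine := 2. startColumn := 6.
endLine := 3. endColumn := 4.
stream := ZnCharacterReadStream on: bytes readStream encoding: #latin1.
result := WriteStream on: String new.
line := 1.
column := 1.
"Stop at end of input or once the current position is past the end of the range."
[ stream atEnd or: [
	line > endLine or: [ line = endLine and: [ column > endColumn ] ] ] ]
		whileFalse: [
			char := stream next.
			"Keep the character once the start of the range has been reached."
			(line > startLine or: [
				line = startLine and: [ column >= startColumn ] ])
					ifTrue: [ result nextPut: char ].
			char = Character lf
				ifTrue: [ line := line + 1. column := 1 ]
				ifFalse: [ column := column + 1 ] ].
result contents
```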


--
Cyril Ferlicot
https://ferlicot.fr

http://www.synectique.eu
2 rue Jacques Prévert 01,
59650 Villeneuve d'ascq France