How to detect encoding of a file

CyrilFerlicot
Hello,

We have a problem using Moose because we have files whose encoding we
don't know. Currently we use this implementation to get the content
of a file:

completeText
  self fileReference exists ifFalse: [ ^ '' ].
  ^ self fileReference readStreamDo: [ :s |
    [ s contents ]
      on: Error
      do: [ [ s converter: Latin1TextConverter new; contents ]
        on: Error
        do: [ '' ] ] ]

However, some of our files at Synectique are in ISO-8859-1. The problem
is that #contents is able to read some of them without throwing an
error, but the resulting content is wrong because it was decoded with
the wrong encoding.

So I wonder whether it is possible to get the encoding of a FileReference
in Pharo, so that we can read the file with the right encoding. Something
like the bash command `file -I myFile.txt`.

--
Cyril Ferlicot
https://ferlicot.fr

http://www.synectique.eu
2 rue Jacques Prévert 01,
59650 Villeneuve d'ascq France


Re: How to detect encoding of a file

Sven Van Caekenberghe-2
Hi Cyril,

I want to try to write such a detector. I'll get back to you.

Any chance you could give me (part of) a file that causes you trouble (one that is legal Latin-1 and decodes as UTF-8 without an error, yet gives the wrong text)?

Sven

Re: How to detect encoding of a file

Ben Coman
I'm not sure, but maybe useful...

cheers -ben

Re: How to detect encoding of a file

Sven Van Caekenberghe-2

> On 3 May 2017, at 12:18, Sven Van Caekenberghe <[hidden email]> wrote:
>
> Hi Cyril,
>
> I want to try to write such a detector. I'll get back to you.

I added the following (Zn #bleedingEdge):

===
Name: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.49
Author: SvenVanCaekenberghe
Time: 3 May 2017, 4:30:44.081888 pm
UUID: fe8b083d-010b-0d00-9df5-fde304bccfdc
Ancestors: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.48

Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and unreliably guess the encoding used by a collection of bytes

Add ZnCharacterEncoderTests>>#testDetectEncoding

Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder

Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
===
Name: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.31
Author: SvenVanCaekenberghe
Time: 3 May 2017, 4:31:09.469852 pm
UUID: 30ef8b3e-010b-0d00-9df6-4a9304bccfdc
Ancestors: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.30

Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and unreliably guess the encoding used by a collection of bytes

Add ZnCharacterEncoderTests>>#testDetectEncoding

Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder

Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
===


Now you can do the following:

ZnCharacterEncoder detectEncoding: ((FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in | in upToEnd ]).

(FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in |
        | bytes encoder |
        bytes := in upToEnd.
        encoder := ZnCharacterEncoder detectEncoding: bytes.
        encoder decodeBytes: bytes ].

It works on the test file you gave me, but this process is just a guess: a heuristic that is unreliable and often wrong (especially for very similar byte encodings). See https://en.wikipedia.org/wiki/Charset_detection.

You can give the whole contents to the detector, or just a header.
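
As a minimal sketch of the "just a header" variant, assuming that the first 1024 bytes are representative of the whole file (the 1024 limit and the `header` variable are illustrative choices, not part of Zinc):

```smalltalk
(FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in |
        | bytes header encoder |
        bytes := in upToEnd.
        "Assumption: a 1024-byte prefix is enough for a reasonable guess"
        header := bytes first: (bytes size min: 1024).
        "Guess on the prefix only, then decode the full contents"
        encoder := ZnCharacterEncoder detectEncoding: header.
        encoder decodeBytes: bytes ].
```

Two caveats with a prefix: it may cut a multi-byte sequence in half, and if it happens to contain only ASCII it tells the detector nothing about the rest of the file, so the guess can be even less reliable than with the full contents.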

I was a bit too optimistic though, this is basically an unsolvable problem. It is MUCH better to somehow know up front what the encoding used is, or to know something useable about the contents (like the header of HTML or XML).

Sven

Re: How to detect encoding of a file

CyrilFerlicot
Thank you! I'll try this tomorrow. If it works well, I wonder if we can
still include it in Pharo 6. Since it's only a small feature that is unused
in Pharo, it should not break anything, but it would be a cool addition for Moose.

But since we are in feature freeze, if people do not want it I will not
push it for Pharo 6 :)

Re: How to detect encoding of a file

NorbertHartl


It shouldn't be included. There is no such thing as a side-effect-free change. Moose can load a newer version of Zinc. That is how it is supposed to be.

Norbert

Re: How to detect encoding of a file

Stephane Ducasse-3
Hi Sven,

This is cool. We are always losing time on crlf/lf/cr... I lost most of my time in SRT2VTT on that part.
Will you add a little paragraph to the Zinc chapter?

Stef

Re: How to detect encoding of a file

CyrilFerlicot
Hi,

It seems to guess right in our case and it corrects the problems we saw.

Thank you for this! We will integrate it into our tools once the
configuration is updated.
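
For illustration only, a rough sketch of how #completeText might look on top of the detector, assuming a Zinc version that already contains ZnCharacterEncoder class>>#detectEncoding: is loaded. This is not the actual Moose code, just the earlier method with the Latin-1 fallback replaced by a guess:

```smalltalk
completeText
  "Sketch: read the raw bytes, guess their encoding, decode with the guess.
   Answers an empty string when the file is missing or cannot be decoded."
  self fileReference exists ifFalse: [ ^ '' ].
  ^ self fileReference binaryReadStreamDo: [ :in |
    | bytes |
    bytes := in upToEnd.
    [ (ZnCharacterEncoder detectEncoding: bytes) decodeBytes: bytes ]
      on: Error
      do: [ '' ] ]
```

The result is still only a guess, so a file in an encoding close to Latin-1 can decode without an error and still contain wrong characters.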
