Hello,
We have a problem using Moose because we have files whose encoding we don't know. Currently we have this implementation to get the content of a file:

completeText
	self fileReference exists ifFalse: [ ^ '' ].
	^ self fileReference readStreamDo: [ :s |
		[ s contents ]
			on: Error
			do: [ [ s converter: Latin1TextConverter new; contents ]
				on: Error
				do: [ '' ] ] ]

But we have a problem: some of our files at Synectique are currently in ISO-8859-1. The issue is that #contents is able to read some of these files without throwing an error, yet the content is wrong because it is not decoded with the right encoding.

So I wonder: is it possible to get the encoding of a FileReference in Pharo, so that the file can be read with the right encoding? Something like the bash command `file -I myFile.txt`.

--
Cyril Ferlicot
https://ferlicot.fr

http://www.synectique.eu
2 rue Jacques Prévert 01,
59650 Villeneuve d'ascq France
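(As an aside, a minimal sketch, not from the original post, of why #contents can succeed and still produce the wrong text: the same bytes can be a legal decoding under two different encodings, so no error is signalled. Zinc encoders are used here for illustration; the byte values are just an example.)

| bytes |
bytes := #[195 169]. "the UTF-8 encoding of 'é'; also two perfectly legal ISO-8859-1 characters"
(ZnCharacterEncoder newForEncoding: 'utf8') decodeBytes: bytes. "=> 'é'"
(ZnCharacterEncoder newForEncoding: 'latin1') decodeBytes: bytes. "=> 'Ã©', no error signalled"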
Hi Cyril,
I want to try to write such a detector. I'll get back to you.

Any chance you could give me (part of) a file that causes you trouble (one that is legal Latin-1, yet does not fail as UTF-8 while being decoded wrongly as UTF-8)?

Sven

> On 3 May 2017, at 11:40, Cyril Ferlicot D. <[hidden email]> wrote:
>
> [...]
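(One way to fabricate such an ambiguous sample, sketched here only as an illustration and not part of the exchange: any bytes whose high-bit sequences happen to form valid UTF-8 will decode without error as both ISO-8859-1 and UTF-8, just to different strings. The file name is made up, and FileReference>>#binaryWriteStreamDo: is assumed to be available as in recent Pharo versions.)

| bytes |
bytes := 'déjà vu' utf8Encoded. "UTF-8 bytes; every byte is also a legal ISO-8859-1 character"
(FileLocator desktop / 'ambiguous-sample.txt')
	binaryWriteStreamDo: [ :out | out nextPutAll: bytes ].
"Decoding these bytes as Latin-1 also succeeds, but yields mojibake instead of 'déjà vu'."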
I'm not sure, but maybe useful...
cheers -ben

On Wed, May 3, 2017 at 6:18 PM, Sven Van Caekenberghe <[hidden email]> wrote:
> Hi Cyril,
> [...]
> On 3 May 2017, at 12:18, Sven Van Caekenberghe <[hidden email]> wrote:
>
> Hi Cyril,
>
> I want to try to write such a detector. I'll get back to you.

I added the following (Zn #bleedingEdge):

===
Name: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.49
Author: SvenVanCaekenberghe
Time: 3 May 2017, 4:30:44.081888 pm
UUID: fe8b083d-010b-0d00-9df5-fde304bccfdc
Ancestors: Zinc-Character-Encoding-Core-SvenVanCaekenberghe.48

Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and unreliably guess the encoding used by a collection of bytes

Add ZnCharacterEncoderTests>>#testDetectEncoding

Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder

Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
===
Name: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.31
Author: SvenVanCaekenberghe
Time: 3 May 2017, 4:31:09.469852 pm
UUID: 30ef8b3e-010b-0d00-9df6-4a9304bccfdc
Ancestors: Zinc-Character-Encoding-Tests-SvenVanCaekenberghe.30

Add ZnCharacterEncoder class>>#detectEncoding: to try to heuristically and unreliably guess the encoding used by a collection of bytes

Add ZnCharacterEncoderTests>>#testDetectEncoding

Add #= and #hash to ZnSimplifiedByteEncoder and ZnEndianSensitiveUTFEncoder

Always use canonical name in ZnSimplifiedByteEncoder class>>#newForEncoding:
===

Now you can do the following:

ZnCharacterEncoder detectEncoding:
	((FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in | in upToEnd ]).

(FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in |
	| bytes encoder |
	bytes := in upToEnd.
	encoder := ZnCharacterEncoder detectEncoding: bytes.
	encoder decodeBytes: bytes ].

It works on the test file you gave me, but this process is just a guess, a heuristic that is unreliable and often wrong (especially for very similar byte encodings); see https://en.wikipedia.org/wiki/Charset_detection.

You can give the whole contents to the detector, or just a header.

I was a bit too optimistic though; this is basically an unsolvable problem. It is MUCH better to somehow know up front which encoding was used, or to know something usable about the contents (like the header of HTML or XML).

Sven

> Any chance you could give me (part of) a file that causes you trouble?
>
> Sven
>
>> On 3 May 2017, at 11:40, Cyril Ferlicot D. <[hidden email]> wrote:
>>
>> [...]
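(One possible way to reconcile this with the advice to know the encoding up front, sketched here only and not part of Sven's changes; the file name is hypothetical: decode strictly as UTF-8 first, and only fall back to the heuristic when that fails.)

(FileLocator desktop / 'some.data') binaryReadStreamDo: [ :in |
	| bytes |
	bytes := in upToEnd.
	[ ZnUTF8Encoder new decodeBytes: bytes ]
		on: ZnCharacterEncodingError
		do: [ :err |
			"strict UTF-8 decoding failed, so fall back to the unreliable guess"
			(ZnCharacterEncoder detectEncoding: bytes) decodeBytes: bytes ] ]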
On 03/05/2017 at 16:41, Sven Van Caekenberghe wrote:
> [...]

Thank you! I'll try this tomorrow. If it works well I wonder if we can still include it in Pharo 6. Since it's only a little feature, unused in Pharo itself, it should not break anything, but it would be a cool addition for Moose.

But since it is feature freeze, if people do not want it I'll not push it for Pharo 6 :)

--
Cyril Ferlicot
https://ferlicot.fr

http://www.synectique.eu
2 rue Jacques Prévert 01,
59650 Villeneuve d'ascq France
> On 3 May 2017, at 18:10, Cyril Ferlicot D. <[hidden email]> wrote:
>
> [...]
> But since it is feature freeze, if people do not want it I'll not push it for Pharo 6 :)

Norbert
Hi Sven,

This is cool. We are always losing time on CRLF/LF/CR issues... I lost most of my time on that part in SRT2VTT.

Will you add a little paragraph to the Zinc chapter?

Stef

On Wed, May 3, 2017 at 8:18 PM, Norbert Hartl <[hidden email]> wrote:
> [...]
On 03/05/2017 16:41, Sven Van Caekenberghe wrote:
> I added the following (Zn #bleedingEdge):
>
> [...]
>
> It works on the test file you gave me, but this process is just a guess, a heuristic that is unreliable and often wrong (especially for very similar byte encodings); see https://en.wikipedia.org/wiki/Charset_detection.
>
> [...]

It seems to guess right in our case and it corrects the problems we saw. Thank you for this! We will integrate it into our tools when the configuration is updated.

--
Cyril Ferlicot
https://ferlicot.fr

http://www.synectique.eu
2 rue Jacques Prévert 01,
59650 Villeneuve d'ascq France
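(Purely as an illustration, not from the thread, and assuming the Zn version above is loaded, the integration into the original completeText method could look something like this sketch:)

completeText
	"Sketch: read the raw bytes, let Zinc guess the encoding, and decode.
	Falls back to an empty string on any error, like the original method."
	self fileReference exists ifFalse: [ ^ '' ].
	^ [ self fileReference binaryReadStreamDo: [ :in |
			| bytes |
			bytes := in upToEnd.
			(ZnCharacterEncoder detectEncoding: bytes) decodeBytes: bytes ] ]
		on: Error
		do: [ '' ]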