file encodings

Tudor Girba

file encodings

Hi,

I know that there was a long discussion regarding opening of files,
but I did not see a resolution.

I would like to be able to read files regardless of the encoding. I am
using this code:
(CrLfFileStream readOnlyFileNamed: fullPath) contentsOfEntireFile

but I get an error when I reach a file in Latin 1 (ISO-8859-1).

How can I safely deal with this issue without getting hurt?

Cheers,
Doru

--
www.tudorgirba.com

"When people care, great things can happen."

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project

Nicolas Cellier

Re: file encodings

You may try this:

(CrLfFileStream readOnlyFileNamed: fullPath)
    converter: Latin1TextConverter new;
    contentsOfEntireFile

Or if you want more explicit control:

(MultiByteFileStream readOnlyFileNamed: fullPath)
    lineEndConvention: #crlf;
    converter: Latin1TextConverter new;
    contentsOfEntireFile.

Nicolas


Tudor Girba

Re: file encodings

Hi Nicolas,

Thanks for the tips.

I am a bit alien to the topic of encodings, so I will ask a couple of
silly questions.

How exactly do I deal with another encoding? For example, why does
this work with UTF-8, and what would I do if I encountered another
encoding?

Or, more generally, how can I find out which converter I need for
a given file?

Cheers,
Doru


Stéphane Ducasse

Re: file encodings

In reply to this post by Tudor Girba

Doru

We integrated Nicolas's suggestion as a result of the discussion.

Stef


johnmci

Re: file encodings

Er, so what's the default behavior? Does it mean that UTF-8 encoded files on
OS X will open as ISO-8859-1?

--
===========================================================================
John M. McIntosh <[hidden email]> Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
===========================================================================


Nicolas Cellier

Re: file encodings

Currently default behaviour depends on LanguageEnvironment...
... which should have hacks depending on the underlying platform.

So I agree with John: BEWARE.
My workaround was not intended as a general fix, but just a helper for Tudor...
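
To see which converter a given image actually picks by default, a minimal workspace sketch follows. It assumes that MultiByteFileStream answers its current converter via #converter in your image, and fullPath is a hypothetical file name, not one from this thread's setup.

```smalltalk
"Sketch: print the class of the converter a freshly opened stream uses.
 Assumption: #converter answers the default when none was set explicitly."
| fullPath stream |
fullPath := 'latin-1.xml'.  "hypothetical file name; substitute your own"
stream := MultiByteFileStream readOnlyFileNamed: fullPath.
[Transcript show: stream converter class name; cr]
    ensure: [stream close].
```

The printed class name may differ between platforms and images, which is the reason for the warning above.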

Cheers

Nicolas


Tudor Girba

Re: file encodings

Thanks, Nicolas.

It seems to work for the moment. I will get back to you if I encounter
more problems :).

Cheers,
Doru


--
www.tudorgirba.com

"Problem solving should be concentrated on describing
the problem in a way that is relevant for the solution."


Philippe Marschall-2-3

Re: file encodings

In reply to this post by Tudor Girba

Tudor Girba wrote:

> Hi Nicolas,
>
> Thanks for the tips.
>
> I am a bit alien to the topic of encodings, so I will ask a couple of
> silly questions.
>
> How exactly do I deal with another encoding? For example, why does
> this work with Utf-8, and what would I do if I encountered another
> encoding?
>
> Or more general, how can I find out what kind of converter I need for
> a given file?

Welcome to a world full of shit.

Executive summary: you're fucked.

To read a file as a String instead of a ByteArray you need a way to map
bytes to characters aka the file encoding. The problem is there is no
way of knowing what the file encoding is. There just isn't because the
operating system doesn't know and it does all the file handling. The
only thing that you can know is when a file is not UTF-8: that is, when
you treat it as UTF-8 and get a UTF-8 decoding error.
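
The "is it definitely not UTF-8?" test can be sketched directly on the raw bytes, without relying on how any converter reports errors. This is a simplified, hypothetical workspace snippet: it checks only the lead-byte and continuation-byte structure, does not reject overlong forms or surrogates, and fullPath is a made-up file name. Note that a pure ASCII file passes too, so a true answer never proves the file was written as UTF-8.

```smalltalk
"Sketch: answer whether the bytes of a file are structurally valid UTF-8."
| fullPath bytes isUtf8 i n byte extra |
fullPath := 'latin-1.xml'.  "hypothetical file name; substitute your own"
bytes := (StandardFileStream readOnlyFileNamed: fullPath)
    binary;
    contentsOfEntireFile.
isUtf8 := true.
i := 1.
n := bytes size.
[isUtf8 and: [i <= n]] whileTrue: [
    byte := bytes at: i.
    "Number of continuation bytes announced by the lead byte, or nil if invalid."
    extra := byte < 16r80
        ifTrue: [0]
        ifFalse: [(byte between: 16rC2 and: 16rDF)
            ifTrue: [1]
            ifFalse: [(byte between: 16rE0 and: 16rEF)
                ifTrue: [2]
                ifFalse: [(byte between: 16rF0 and: 16rF4)
                    ifTrue: [3]
                    ifFalse: [nil]]]].
    (extra isNil or: [i + extra > n])
        ifTrue: [isUtf8 := false]
        ifFalse: [
            "Every continuation byte must look like 10xxxxxx."
            i + 1 to: i + extra do: [:k |
                ((bytes at: k) between: 16r80 and: 16rBF)
                    ifFalse: [isUtf8 := false]].
            i := i + extra + 1]].
isUtf8
```

For a Latin-1 file containing "ä" (byte 16rE4) followed by ordinary ASCII, the lead byte announces two continuation bytes that are not there, so the snippet answers false.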

You can pretend the problem doesn't exist and just use a random encoding
aka the platform default encoding. That would work if all your files
were created on your computer and all your programs used the
platform default encoding, i.e. if you never download stuff from the
internet or copy stuff off storage media (USB sticks, CDs, DVDs, ...).

The funny thing is that XML partially solves the problem because the
encoding can be put into the preamble. In fact it should be if the file is
not ASCII. You can guess whether Yaxo supports binary input and
detects the encoding from either the BOM or the preamble.
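
Reading the encoding name out of the XML preamble can be sketched as follows. This is a hypothetical workspace snippet, not what Yaxo does: it assumes an ASCII-compatible encoding (so it will not work for UTF-16 files), a declaration within the first 200 bytes, and double quotes around the name. fullPath is a made-up file name.

```smalltalk
"Sketch: extract the encoding name from an XML declaration, reading bytes only."
| fullPath stream head rest encodingName |
fullPath := 'latin-1.xml'.  "hypothetical file name; substitute your own"
stream := (StandardFileStream readOnlyFileNamed: fullPath) binary; yourself.
head := [stream next: 200] ensure: [stream close].
"Treat each byte as one character; safe for the ASCII-only declaration."
rest := head asString readStream.
rest upToAll: 'encoding="'.  "skip up to and past the attribute name, if present"
encodingName := rest atEnd
    ifTrue: [nil]  "no encoding attribute found"
    ifFalse: [rest upTo: $"].
encodingName
```

Mapping the resulting name to a converter class is left out on purpose, since which encoding names a given image recognises is not established in this thread.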

Cheers
Philippe


Michael Rueger-6

Re: file encodings

Philippe Marschall wrote:
> Tudor Girba wrote:

>> Or more general, how can I find out what kind of converter I need for
>> a given file?

...

> You can pretend the problem doesn't exist and just use a random encoding
> aka the platform default encoding. That would work if all your files
> were created on your computer and all your programs used the

At least for those files it will work in almost all cases.

> platform default encoding. E.g. you don't download stuff from the
> internet or copy stuff off storage media (USB sticks, CDs, DVDs, ...).

Well, there are some heuristics, but nothing failsafe...
If the file has CR-LF line endings, it is probably a Windows file
(and so more likely Windows-1252 than true ISO-8859-1):
http://en.wikipedia.org/wiki/ISO/IEC_8859-1#ISO-8859-1_and_Windows-1252_confusion

You can check for a Unicode BOM.

.html should be UTF-8 these days.

Try reading UTF-8, if it fails try re-reading with platform encoding.
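
The last heuristic above can be sketched as a workspace snippet. It assumes, as reported earlier in this thread, that the UTF-8 converter signals an Error on malformed input in your image. Latin1TextConverter stands in for "platform encoding" here only because it is the converter named in this thread, and fullPath is a hypothetical file name.

```smalltalk
"Sketch: try UTF-8 first; if decoding fails, re-read with a fallback converter."
| fullPath readWith text |
fullPath := 'latin-1.xml'.  "hypothetical file name; substitute your own"
readWith := [:converter | | stream |
    stream := MultiByteFileStream readOnlyFileNamed: fullPath.
    "Close the file even when decoding signals an error."
    [stream converter: converter; upToEnd] ensure: [stream close]].
text := [readWith value: UTF8TextConverter new]
    on: Error
    do: [:ex | ex return: (readWith value: Latin1TextConverter new)].
text
```

Because every byte is a valid Latin-1 character, the fallback never fails, which also means it can silently produce wrong text for files in any other encoding.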

>
> The funny thing is that XML partially solves the problem because the
> encoding can be put into the preamble. In fact it should if the file is
> not ASCII. You can make a guess whether Yaxo supports binary input and
> detecting the encoding either from the BOM or the preamble.

Hmm, to be honest I'm not sure myself, but I think I (with help from
others) did some work a while ago to make it read files according to the
encoding entry. But maybe my memory has the wrong encoding ;-)

Michael


Michael Rueger-6

Re: file encodings

...and of course you should read the classic on this:

http://www.joelonsoftware.com/articles/Unicode.html

"There Ain't No Such Thing As Plain Text."

;-)

Michael


Philippe Marschall-2-3

Re: file encodings

In reply to this post by Michael Rueger-6

Michael Rueger wrote:

> Philippe Marschall wrote:
>> Tudor Girba wrote:
>
>>> Or more general, how can I find out what kind of converter I need for
>>> a given file?
>
> ...
>
>> You can pretend the problem doesn't exist and just use a random encoding
>> aka the platform default encoding. That would work if all your files
>> were created on your computer and all your programs used the
>
> At least for those files it will work in almost all cases.

For the others you might not notice the error.

>> platform default encoding. E.g. you don't download stuff from the
>> internet or copy stuff off storage media (USB sticks, CDs, DVDs, ...).
>
> Well, there are some heuristics, but nothing failsafe...
> If the file has cr-lf, it is probably a windows file
> http://en.wikipedia.org/wiki/ISO/IEC_8859-1#ISO-8859-1_and_Windows-1252_confusion

Maybe. But maybe somebody actually got Latin-1 right, and then you might
corrupt it by treating it as code page 1252. Moreover, there are more
Windows code pages than 1252. And finally, there are other places in the
world besides central Europe.

> You can check for a Unicode BOM.

Which nobody does (in Pharo), may not be there and only works for
UTF-8/16/32.
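
For reference, a BOM check is only a few lines. The following is a hypothetical workspace sketch, not existing Pharo behaviour: it reads the first four bytes in binary mode and compares them against the standard Unicode byte order marks. The encoding labels are plain strings chosen for illustration, and fullPath is a made-up file name.

```smalltalk
"Sketch: detect a Unicode BOM from the first bytes of a file."
| fullPath stream head startsWith boms match encoding |
fullPath := 'latin-1.xml'.  "hypothetical file name; substitute your own"
stream := (StandardFileStream readOnlyFileNamed: fullPath) binary; yourself.
head := [stream next: 4] ensure: [stream close].
startsWith := [:prefix |
    head size >= prefix size
        and: [(head copyFrom: 1 to: prefix size) = prefix]].
"UTF-32LE must be tested before UTF-16LE: both start with FF FE."
boms := {
    #(16rEF 16rBB 16rBF) asByteArray -> 'UTF-8'.
    #(16rFF 16rFE 0 0) asByteArray -> 'UTF-32LE'.
    #(0 0 16rFE 16rFF) asByteArray -> 'UTF-32BE'.
    #(16rFF 16rFE) asByteArray -> 'UTF-16LE'.
    #(16rFE 16rFF) asByteArray -> 'UTF-16BE' }.
match := boms detect: [:each | startsWith value: each key] ifNone: [nil].
encoding := match isNil ifTrue: [nil] ifFalse: [match value].
encoding
```

A nil answer only means no BOM was found. As noted above, most files carry none, so this narrows the guess rather than settling it.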

> .html should be UTF-8 these days.

Yeah totally:
http://squeakland.org/

Keep in mind there are still commercial Smalltalk dialects today
struggling with UTF-8.

> Try reading UTF-8, if it fails try re-reading with platform encoding.
>
>> The funny thing is that XML partially solves the problem because the
>> encoding can be put into the preamble. In fact it should if the file is
>> not ASCII. You can make a guess whether Yaxo supports binary input and
>> detecting the encoding either from the BOM or the preamble.
>
> Hmm, to be honest I'm not sure myself, but I think I (with help from
> others) did some work a while ago to make it read files according to the
> encoding entry. But maybe my memory has the wrong encoding ;-)

It failed for the very simplistic attached file with the following code:

(XMLDOMParser parseDocumentFrom: (FileStream fileNamed: 'latin-1.xml'))
    elements first characterData

I had to run the FileStream in text mode; the parser would blow up in
binary mode. If my platform encoding were UTF-8, it would probably have
blown up. Works fine out of the box in Eclipse though.

Cheers
Philippe

<?xml version="1.0" encoding="ISO-8859-1"?>
<auo>
äüö
</auo>
