Hi,
I know that there was a long discussion regarding opening of files, but I did not see a resolution.

I would like to be able to read files regardless of the encoding. I am using this code:

(CrLfFileStream readOnlyFileNamed: fullPath) contentsOfEntireFile

but I get an error when I reach a file in Latin 1 (ISO-8859-1).

How can I safely deal with this issue without getting hurt?

Cheers,
Doru

--
www.tudorgirba.com

"When people care, great things can happen."

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
You may try this:
(CrLfFileStream readOnlyFileNamed: fullPath) converter: Latin1TextConverter new; contentsOfEntireFile

Or if you want more explicit control:

(MultiByteFileStream readOnlyFileNamed: fullPath) lineEndConvention: #crlf; converter: Latin1TextConverter new; contentsOfEntireFile.

Nicolas

2009/11/30 Tudor Girba <[hidden email]>:
> I would like to be able to read files regardless of the encoding. I am
> using this code:
> (CrLfFileStream readOnlyFileNamed: fullPath) contentsOfEntireFile
>
> but I get an error when I reach a file in Latin 1 (ISO-8859-1).
Hi Nicolas,
Thanks for the tips.

I am a bit alien to the topic of encodings, so I will ask a couple of silly questions.

How exactly do I deal with another encoding? For example, why does this work with UTF-8, and what would I do if I encountered another encoding?

Or more generally, how can I find out what kind of converter I need for a given file?

Cheers,
Doru

On 30 Nov 2009, at 12:30, Nicolas Cellier wrote:
> You may try this:
>
> (CrLfFileStream readOnlyFileNamed: fullPath) converter:
> Latin1TextConverter new; contentsOfEntireFile
>
> Or if you want more explicit control:
>
> (MultiByteFileStream readOnlyFileNamed: fullPath) lineEndConvention:
> #crlf; converter: Latin1TextConverter new; contentsOfEntireFile.
In reply to this post by Tudor Girba
Doru
We integrated Nicolas's suggestion as a result of the discussion.

Stef

On Nov 30, 2009, at 12:11 PM, Tudor Girba wrote:
> I know that there was a long discussion regarding opening of files,
> but I did not see a resolution.
Er, so what's the default behavior? Does it mean that UTF-8 encoded files on OS X will open as ISO-8859-1?

On 2009-11-30, at 5:52 AM, Stéphane Ducasse wrote:
> Doru
>
> we integrated the suggestion of nicolas as result of the discussion.
>
> Stef

--
===========================================================================
John M. McIntosh <[hidden email]>   Twitter: squeaker68882
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
===========================================================================
Currently, the default behaviour depends on LanguageEnvironment...
... which should have hacks depending on the underlying platform.

So I agree with John: BEWARE.
My workaround was not intended as a general fix, but just a helper for Tudor...

Cheers

Nicolas

2009/11/30 John M McIntosh <[hidden email]>:
> Er, so what's the default behavior, does it mean that UTF-8 encoding files on
> OS-X will open as ISO-8859-1 ?
Thanks, Nicolas.
It seems to work for the moment. I will get back to you if I encounter more problems :).

Cheers,
Doru

On 30 Nov 2009, at 16:43, Nicolas Cellier wrote:
> Currently default behaviour depends on LanguageEnvironment...
> ... which should have hacks depending on the underlying platform.
>
> So I agree with John: BEWARE.
> My workaround was not intended as a general fix, but just a helper
> for Tudor...

--
www.tudorgirba.com

"Problem solving should be concentrated on describing the problem in a way that is relevant for the solution."
In reply to this post by Tudor Girba
Tudor Girba wrote:
> Hi Nicolas,
>
> Thanks for the tips.
>
> I am a bit alien to the topic of encodings, so I will ask a couple of
> silly questions.
>
> How exactly do I deal with another encoding? For example, why does
> this work with Utf-8, and what would I do if I encountered another
> encoding?
>
> Or more general, how can I find out what kind of converter I need for
> a given file?

Welcome to a world full of shit. Executive summary: you're fucked.

To read a file as a String instead of a ByteArray you need a way to map bytes to characters, aka the file encoding. The problem is that there is no way of knowing what the file encoding is. There just isn't, because the operating system doesn't know and it does all the file handling. The only thing that you can know is when a file is not UTF-8: that is when you treat it as UTF-8 and get a UTF-8 error.

You can pretend the problem doesn't exist and just use a random encoding, aka the platform default encoding. That would work if all your files were created on your computer and all your programs used the platform default encoding, e.g. you don't download stuff from the internet or copy stuff off storage media (USB sticks, CDs, DVDs, ...).

The funny thing is that XML partially solves the problem because the encoding can be put into the preamble. In fact it should be if the file is not ASCII. You can make a guess whether Yaxo supports binary input and detecting the encoding either from the BOM or the preamble.

Cheers
Philippe
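One way to act on the point above, that a failed UTF-8 decode is the only reliable signal, is to try UTF-8 first and fall back to a single-byte encoding. This is a minimal sketch, assuming the image's UTF8TextConverter signals an Error on malformed input (as the error reported at the start of this thread suggests) and using Latin-1 as a stand-in for the platform default encoding. readWith is a hypothetical helper block, not part of Pharo.

```smalltalk
| readWith contents |

"readWith is a hypothetical helper: open the file, install the given
 converter, read everything, and always close the stream."
readWith := [:path :converter | | stream |
    stream := MultiByteFileStream readOnlyFileNamed: path.
    [stream converter: converter; upToEnd] ensure: [stream close]].

"Assumption: malformed UTF-8 raises an Error while reading.
 Latin-1 maps every byte to a character, so decoding in the
 fallback cannot fail (a missing file still signals an error)."
contents := [readWith value: fullPath value: UTF8TextConverter new]
    on: Error
    do: [:ex |
        ex return: (readWith value: fullPath value: Latin1TextConverter new)].
```

This remains a heuristic rather than detection: a Latin-1 file that contains only ASCII, or bytes that happen to form valid UTF-8 sequences, is still read as UTF-8.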
Philippe Marschall wrote:
> Tudor Girba wrote:
>> Or more general, how can I find out what kind of converter I need for
>> a given file?

...

> You can pretend the problem doesn't exist and just use a random encoding
> aka the platform default encoding. That would work if all your files
> were created on your computer and all your programs used the

At least for those files it will work in almost all cases.

> platform default encoding. E.g. you don't download stuff from the
> internet or copy stuff off storage media (USB sticks, CDs, DVDs, ...).

Well, there are some heuristics, but nothing failsafe...

If the file has cr-lf, it is probably a Windows file
http://en.wikipedia.org/wiki/ISO/IEC_8859-1#ISO-8859-1_and_Windows-1252_confusion
You can check for a Unicode BOM.
.html should be UTF-8 these days.
Try reading UTF-8, if it fails try re-reading with platform encoding.

> The funny thing is that XML partially solves the problem because the
> encoding can be put into the preamble. In fact it should if the file is
> not ASCII. You can make a guess whether Yaxo supports binary input and
> detecting the encoding either from the BOM or the preamble.

Hmm, to be honest I'm not sure myself, but I think I (with help from others) did some work a while ago to make it read files according to the encoding entry.
But maybe my memory has the wrong encoding ;-)

Michael
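The BOM heuristic mentioned above could look roughly like this. It is a sketch only: bomEncoding is a hypothetical block, the returned symbols are made-up labels rather than Pharo encoding names, and bytes is assumed to be a ByteArray holding the first few bytes of the file, read in binary mode.

```smalltalk
| bomEncoding |

"bomEncoding is a hypothetical helper. It answers a label for the
 byte order mark found at the start of bytes, or nil if there is none.
 Byte values: EF BB BF = 239 187 191, FE FF = 254 255, FF FE = 255 254."
bomEncoding := [:bytes |
    (bytes size >= 3 and: [(bytes copyFrom: 1 to: 3) = #[239 187 191]])
        ifTrue: [#utf8]
        ifFalse: [
            (bytes size >= 2 and: [(bytes copyFrom: 1 to: 2) = #[254 255]])
                ifTrue: [#utf16BigEndian]
                ifFalse: [
                    (bytes size >= 2 and: [(bytes copyFrom: 1 to: 2) = #[255 254]])
                        ifTrue: [#utf16LittleEndian]
                        ifFalse: [nil]]]].

bomEncoding value: #[239 187 191 60 63].
```

A nil answer says nothing about the encoding, since most files carry no BOM at all. The sketch also does not separate UTF-32 little-endian (FF FE 00 00) from UTF-16 little-endian, which share the first two bytes.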
...and of course you should read the classic on this:
http://www.joelonsoftware.com/articles/Unicode.html

"There Ain't No Such Thing As Plain Text."

;-)

Michael
In reply to this post by Michael Rueger-6
Michael Rueger wrote:
> Philippe Marschall wrote:
>> Tudor Girba wrote:
>
>>> Or more general, how can I find out what kind of converter I need for
>>> a given file?
>
> ...
>
>> You can pretend the problem doesn't exist and just use a random encoding
>> aka the platform default encoding. That would work if all your files
>> were created on your computer and all your programs used the
>
> At least for those files it will work in almost all cases.
>> platform default encoding. E.g. you don't download stuff from the
>> internet or copy stuff off storage media (USB sticks, CDs, DVDs, ...).
>
> Well, there are some heuristics, but nothing failsafe...
> If the file has cr-lf, it is probably a windows file
> http://en.wikipedia.org/wiki/ISO/IEC_8859-1#ISO-8859-1_and_Windows-1252_confusion

Maybe, maybe somebody actually got Latin-1 right and then you might corrupt it when treating it as code page 1252. Moreover there are more Windows code pages than 1252. And finally there are other places in the world besides central Europe.

> You can check for a Unicode BOM.

Which nobody does (in Pharo), may not be there and only works for UTF-8/16/32.

> .html should be UTF-8 these days.

Yeah totally:
http://squeakland.org/

Keep in mind there are still commercial Smalltalk dialects today struggling with UTF-8.

> Try reading UTF-8, if it fails try re-reading with platform encoding.
>
>> The funny thing is that XML partially solves the problem because the
>> encoding can be put into the preamble. In fact it should if the file is
>> not ASCII. You can make a guess whether Yaxo supports binary input and
>> detecting the encoding either from the BOM or the preamble.
>
> Hmm, to be honest I'm not sure myself, but I think I (with help from
> others) did some work a while ago to make it read files according to the
> encoding entry.
> But maybe my memory has the wrong encoding ;-)

(XMLDOMParser parseDocumentFrom: (FileStream fileNamed: 'latin-1.xml')) elements first characterData

I had to run the FileStream in text mode, the parser would blow up in binary mode. If my platform encoding was utf-8 it would probably have blown up. Works fine out of the box in Eclipse though.

Cheers
Philippe

<?xml version="1.0" encoding="ISO-8859-1"?>
<auo>
äüö
</auo>
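For XML files like the one above, the declared encoding can in principle be pulled out of the first line before choosing a converter. The following is a rough sketch, not what Yaxo does: declaredEncoding is a hypothetical block, it only handles a double-quoted value (XML also allows single quotes), and it assumes the declaration itself is readable as ASCII, which is not true for UTF-16 files.

```smalltalk
| declaredEncoding |

"declaredEncoding is a hypothetical helper. It answers the value of
 the encoding pseudo-attribute in an XML declaration line, or nil
 if the line does not contain one."
declaredEncoding := [:firstLine | | start rest |
    start := firstLine findString: 'encoding="'.
    start = 0
        ifTrue: [nil]
        ifFalse: [
            "Skip the 10 characters of encoding= plus the opening quote."
            rest := firstLine copyFrom: start + 10 to: firstLine size.
            rest copyUpTo: $"]].

declaredEncoding value: '<?xml version="1.0" encoding="ISO-8859-1"?>'.
```

The resulting name still has to be mapped to a converter class. If your image provides a lookup by encoding name on TextConverter, that would be the natural place to pass it; otherwise a small dictionary from names to converter classes is an option.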