Hi,
I need to distinguish binary files from text files, in Visualworks 7.7.1. Is there an easy way to achieve this?

Thank you in advance.

Cheers,
Alberto
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
On 10.11.10 17:31, Alberto Bacchelli wrote:
> I need to distinguish binary files from text files, in Visualworks 7.7.1.
> Is there an easy way to achieve this?

Short answer: No.

Longer answer: Text files *are* binary files, i.e. they just contain bits. Whether these bits make sense when interpreted as characters cannot be derived from the bits themselves. Different character encodings make this more difficult. You can distinguish text files from binary files only when the files are accompanied by metadata. The simplest form of metadata is based on file name conventions: use the file name suffix .txt for plain text (usually with ASCII or ISO8859P15 or another encoding which includes ASCII), .xml for XML files, .java for Java source code, .st for Smalltalk source code, .c for C source code etc.

So, unless someone or something tells you that it's a text file, in general it is not possible to distinguish text from arbitrary data. Of course, this does not mean that there aren't special cases where simple solutions are possible.

Best regards,
Joachim Geidel
I think Joachim is exactly right.
If you're okay with the "95%" case, i.e. it's just for a local tool, then as Joachim says you can focus on the type. Say it's ASCII: then I think a lot of bytes equal to 0 indicate binary, and so does finding non-printable characters. You could then just run a test to see if you're getting your guesses right.

Gary

----- Original Message -----
From: "Joachim Geidel" <[hidden email]>
To: "Alberto Bacchelli" <[hidden email]>; <[hidden email]>
Sent: Wednesday, November 10, 2010 1:19 PM
Subject: Re: [vwnc] Distinguish a binary file from a text file
In reply to this post by Alberto Bacchelli
Only the one posing the question knows the answer.
This is because what is "binary" and what is "text" is completely subjective.
A simple test (if your files are simple) is to check to see if the value of the individual bytes exceeds the range of standard ASCII, and/or if there are many bytes equal to binary zero.
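A minimal sketch of that simple test, written in Python rather than VisualWorks Smalltalk, so treat it as an illustration to port and not as a VisualWorks API. The function names, the 8 KB sample size, the 30% threshold, and the choice to treat even a single zero byte as a sign of binary data are all assumptions made for the example, not values from this thread.

```python
# Bytes that plain ASCII text normally contains: printable characters
# plus tab, line feed and carriage return.
TEXT_BYTES = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def looks_binary(data: bytes, max_suspicious_ratio: float = 0.30) -> bool:
    """Guess whether a byte sample is binary, assuming text means ASCII."""
    if not data:
        return False  # assumption: an empty file counts as text
    if b"\x00" in data:
        return True  # assumption: any zero byte is taken as a sign of binary
    suspicious = sum(1 for b in data if b not in TEXT_BYTES)
    return suspicious / len(data) > max_suspicious_ratio


def file_looks_binary(path: str, sample_size: int = 8192) -> bool:
    """Apply the heuristic to the first sample_size bytes of a file."""
    with open(path, "rb") as f:
        return looks_binary(f.read(sample_size))
```

Running this over a directory of files whose type you already know is a quick way to see how often the guess is wrong for your data.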
In more and more cases these days, however, this is too simplistic. A Byte Order Mark (BOM) as the first character in the file may be used to indicate what sort of UTF encoding is in use for text files. Sometimes. Even among programs that honor that convention the caveat is offered that other programs (including many system level tools) will not make this distinction.
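For files that do carry a BOM, the check can be sketched as below (Python again, as an illustration only). The function name `detect_bom` and the returned encoding labels are choices made for this example. The UTF-32 marks are tested before the UTF-16 ones because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

# Order matters: BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A result of None says nothing either way: plenty of UTF-8 text files are written without a BOM, and a binary file can start with the same bytes by coincidence.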
Simply using the facilities provided by Unicode to express native characters in non-Roman (non-Latin) scripts is enough to make a file appear binary.
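To illustrate: the short Python sketch below encodes a Greek word as UTF-8 and shows that the ASCII-range test rejects it even though it is ordinary text. The strict UTF-8 decode is a different technique from the byte-range test described above, added here only for contrast; the function names and the sample word are made up for the example.

```python
def all_ascii(data: bytes) -> bool:
    """True if every byte is within the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)


def decodes_as_utf8(data: bytes) -> bool:
    """True if the bytes form valid UTF-8 (strict decoding)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


sample = "Καλημέρα".encode("utf-8")  # Greek text, no ASCII characters at all
ascii_verdict = all_ascii(sample)        # the ASCII test calls this "binary"
utf8_verdict = decodes_as_utf8(sample)   # yet it is valid UTF-8 text
```

Valid UTF-8 is not proof of text either, since a run of zero bytes also decodes without error, so the decode check is a complement to the byte heuristics and not a replacement.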
You have to know your data's characteristics to make this kind of distinction.
HTH
Les

From: [hidden email] on behalf of Alberto Bacchelli
Sent: Wed 11/10/2010 8:31 AM
To: [hidden email]
Subject: [vwnc] Distinguish a binary file from a text file