Blah, sorry Les, bitten by the default reply recipient... again :(
-------- Original Message --------
On 10.11.2010 21:25, Kooyman, Les wrote:
[vwnc] Distinguish a binary file from a text file
Only the one posing the question knows the answer.
This is because what is "binary" and what is "text" is
completely subjective.
A simple test (if your files are
simple) is to check to see if the value of the individual
bytes exceeds the range of standard ASCII, and/or if there
are many bytes equal to binary zero.
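For illustration only, that simple test might look like the Python sketch below. The function name and the 10% threshold are made up for the example, not taken from any library, and the cutoff would need tuning against your own files.

```python
def looks_binary_simple(data: bytes, max_suspicious: float = 0.10) -> bool:
    """Guess 'binary' if too many bytes are NUL or outside 7-bit ASCII.

    The 10% default is an arbitrary assumption for this sketch.
    """
    if not data:
        return False  # an empty file gives no evidence either way
    # Count bytes equal to zero or above the 7-bit ASCII range.
    suspicious = sum(1 for b in data if b == 0 or b > 0x7F)
    return suspicious / len(data) > max_suspicious


print(looks_binary_simple(b"Hello, world!\n"))
print(looks_binary_simple(bytes(range(256))))
```

Note that UTF-8 text in a non-Latin script consists almost entirely of bytes above 0x7F, so this check would flag it as binary.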
In more and more cases these days,
however, this is too simplistic. A Byte Order Mark (BOM) as
the first character in the file may be used to indicate what
sort of UTF encoding is in use for text files. Sometimes.
Even among programs that honor that convention the caveat is
offered that other programs (including many system level
tools) will not make this distinction.
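A rough sketch of checking for a BOM, in Python, using the BOM constants from the standard codecs module. The helper name sniff_bom is hypothetical, and a missing BOM proves nothing about the file.

```python
import codecs

# Check the longer BOMs first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"hello"))
print(sniff_bom(b"no BOM here"))
```

The result is only a hint: arbitrary binary data can happen to start with the same bytes, so a match still calls for a trial decode.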
Simply using the facilities
provided by Unicode to express native characters in
non-Roman (non-Latin) scripts is enough to make a file
appear binary.
You have to know your data's
characteristics to make this kind of distinction.
HTH
Les
The general answer of "No" is correct.
However, most text files contain a much higher frequency of the
space character than you'd expect in any at least semi-random
binary file.
In ASCII, 1-byte-encoded ASCII supersets, and UTF-8/16/32, a space
always includes a byte with value 32 (in UTF-16/32 it is padded with
zero bytes).
The "high frequency of zero" test will misfire on UTF-16/32 text
written in a western language, since most characters then carry
zero bytes.
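To make the two points above concrete, here is a small Python sketch that measures how often the space byte (32) and the zero byte occur in the same English sentence under different encodings. The helper name byte_fraction and the sample sentence are just assumptions for the example.

```python
def byte_fraction(data: bytes, value: int) -> float:
    """Fraction of bytes in data equal to value (0.0 for empty input)."""
    return data.count(value) / len(data) if data else 0.0


sample = "The quick brown fox jumps over the lazy dog"

for encoding in ("ascii", "utf-8", "utf-16-le", "utf-32-le"):
    raw = sample.encode(encoding)
    # Print the share of space bytes, then the share of zero bytes.
    print(encoding,
          round(byte_fraction(raw, 0x20), 3),
          round(byte_fraction(raw, 0x00), 3))
```

For this sample, half of the UTF-16 bytes and three quarters of the UTF-32 bytes are zero. The share of space bytes shrinks accordingly, so any cutoff for the space test would have to account for the encoding.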
The good thing about UTF-16/32 is that a BOM is normally present
(with the exception of sources which explicitly define the byte
order some other way, iirc), so if you think it's text, it's easy
to see if it really is one of those encodings.
Cheers,
Henry
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc