Fwd: Re: Distinguish a binary file from a text file

Henrik Sperre Johansen
Blah, sorry Les, bitten by the default reply recipient... again :(

-------- Original Message --------
Subject: Re: [vwnc] Distinguish a binary file from a text file
Date: Thu, 11 Nov 2010 00:37:00 +0100
From: Henrik Sperre Johansen [hidden email]
To: Kooyman, Les [hidden email]


On 10.11.2010 21:25, Kooyman, Les wrote:
Only the one posing the question knows the answer.
 
This is because what is "binary" and what is "text" is completely subjective.
 
A simple test (if your files are simple) is to check to see if the value of the individual bytes exceeds the range of standard ASCII, and/or if there are many bytes equal to binary zero.
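A minimal sketch of that simple test, in Python purely for illustration (the thread itself contains no code). The function name and the 5% threshold are assumptions made up for this example, not anything from the thread or from a library:

```python
def looks_binary_simple(data: bytes, max_suspicious_ratio: float = 0.05) -> bool:
    """Guess "binary" when too many bytes are zero or above 7-bit ASCII.

    The 5% default threshold is an arbitrary assumption; tune it to your data.
    """
    if not data:
        return False  # an empty file gives no evidence either way
    # Count bytes that are NUL or fall outside the standard ASCII range.
    suspicious = sum(1 for b in data if b == 0 or b > 127)
    return suspicious / len(data) > max_suspicious_ratio
```

In practice you would probably run this on the first few kilobytes of a file rather than the whole thing.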
 
In more and more cases these days, however, this is too simplistic. A Byte Order Mark (BOM) as the first character in the file may be used to indicate what sort of UTF encoding is in use for text files. Sometimes. Even among programs that honor that convention the caveat is offered that other programs (including many system level tools) will not make this distinction.
 
Simply using the facilities provided by Unicode to express native characters in non-Roman (non-Latin) scripts is enough to make a file appear binary.
 
You have to know your data's characteristics to make this kind of distinction.
 
HTH
 
Les

The general answer of "No" is correct.
However, most text files contain a much higher frequency of the space character than you'd expect in any at least semi-random binary file.
In ASCII, single-byte ASCII supersets, and UTF-8/16/32, a space is always encoded with a byte of value 32 (plus zero padding bytes in UTF-16/32).
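
A rough sketch of the space-frequency idea, in Python just for illustration. The names and the 2% threshold are assumptions picked for this example: uniformly random bytes would hit value 32 about 1 time in 256 (roughly 0.4%), while English prose is far above that even after UTF-32 spreads each character over four bytes.

```python
def space_ratio(data: bytes) -> float:
    """Fraction of bytes in `data` that have the value 32 (0x20)."""
    if not data:
        return 0.0
    return data.count(b" ") / len(data)


def looks_like_text_by_spaces(data: bytes, min_ratio: float = 0.02) -> bool:
    """Guess "text" when byte 32 is clearly more common than chance.

    The 2% default is an assumption, not a measured cut-off.
    """
    return space_ratio(data) >= min_ratio
```

This counts raw bytes without decoding anything, so it works the same way whichever of those encodings the file uses. It will still be fooled by text with few spaces (long hex dumps, some CJK text) and by binary formats that embed lots of padded strings.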

The "high frequency of zero" test will misfire for UTF-16/32 text written in a Western language, since most of its characters are then padded with zero bytes.
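
To make that concrete, here is a small Python illustration (the helper name is made up for this example). It measures the share of zero bytes in the same ASCII-only string under three encodings:

```python
def zero_ratio(data: bytes) -> float:
    """Fraction of bytes in `data` that are zero."""
    if not data:
        return 0.0
    return data.count(0) / len(data)


greeting = "Hello, world"
print(zero_ratio(greeting.encode("utf-8")))      # 0.0
print(zero_ratio(greeting.encode("utf-16-le")))  # 0.5
print(zero_ratio(greeting.encode("utf-32-le")))  # 0.75
```

So a zero-byte threshold that works for ASCII or UTF-8 would call perfectly ordinary UTF-16/32 text binary.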
The good thing about UTF-16/32 is that a BOM is expected at the start of the file (except for sources which explicitly define the encoding some other way, iirc), so if you think it's text, it's easy to see if it really is one of those encodings.
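
A minimal sketch of such a BOM check in Python, using the BOM constants from the standard `codecs` module. The function name `sniff_bom` and the returned labels are assumptions for this example. The UTF-32 marks are tested first because the UTF-32-LE BOM (FF FE 00 00) starts with the UTF-16-LE BOM (FF FE):

```python
import codecs

# Order matters: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return an encoding label if `data` starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A `None` result only means no BOM was found; it says nothing about whether the data is text. There is also a corner case: UTF-16-LE data whose first character after the BOM is NUL is indistinguishable from a UTF-32-LE BOM, and this sketch reports it as UTF-32-LE.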

Cheers,
Henry
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc