Distinguish a binary file from a text file


Distinguish a binary file from a text file

Alberto Bacchelli
Hi,

 I need to distinguish binary files from text files in VisualWorks 7.7.1.
Is there an easy way to achieve this?

Thank you in advance.

Cheers,
 Alberto
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc

Re: Distinguish a binary file from a text file

Joachim Geidel
On 10.11.10 at 17:31, Alberto Bacchelli wrote:
>  I need to distinguish binary files from text files, in Visualworks 7.7.1.
> Is there an easy way to achieve this?

Short answer: No.

Longer answer: Text files *are* binary files, i.e. they just contain bits.
Whether these bits make sense when interpreted as characters cannot be
derived from the bits themselves, and the variety of character encodings
makes this even harder. You can reliably distinguish text files from binary
files only when the files are accompanied by metadata. The simplest form of
metadata is a file name convention: the suffix .txt for plain text
(usually ASCII, ISO8859P15, or another encoding which includes ASCII),
.xml for XML files, .java for Java source code, .st for Smalltalk source
code, .c for C source code, etc.

So, unless someone or something tells you that it's a text file, in general
it is not possible to distinguish text from arbitrary data. Of course, this
does not mean that there aren't special cases where simple solutions are
possible.
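
The file name convention described above can be sketched in a few lines. This is only an illustration in Python, not VisualWorks Smalltalk; the suffix table and the function name are hypothetical, and the table would need to match the files you actually handle.

```python
from pathlib import PurePath

# Hypothetical suffix table, taken from the examples in this message.
TEXT_SUFFIXES = {".txt", ".xml", ".java", ".st", ".c"}


def looks_like_text_by_name(filename: str) -> bool:
    """Guess 'text' purely from the file name suffix (case-insensitive)."""
    return PurePath(filename).suffix.lower() in TEXT_SUFFIXES
```

A file without a suffix, or with an unknown one, is reported as "not text" here, which is a guess about the name and says nothing about the contents.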

Best regards,
Joachim Geidel



Re: Distinguish a binary file from a text file

Gary Peterson
I think Joachim is exactly right.

If you're okay with the "95%" case, i.e. it's just for a local tool, then as
Joachim says you can focus on the type. Say it's ASCII: then I think a lot of
bytes equal to 0 indicate binary, and so do a lot of non-printable bytes.

You could then run the check against files whose type you already know to
see whether you're getting your guesses right.
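
A minimal sketch of that "95%" heuristic, assuming ASCII text and written in Python rather than VisualWorks Smalltalk. The function names, the 5% threshold, and the 4096-byte sample size are all assumptions chosen for illustration, not established values.

```python
# Printable ASCII plus tab, line feed and carriage return.
_ALLOWED = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def looks_binary(data: bytes, max_suspect_ratio: float = 0.05) -> bool:
    """Treat data as binary if too many bytes are NUL or otherwise non-printable."""
    if not data:
        return False  # an empty file is called text here (arbitrary choice)
    suspect = sum(1 for b in data if b not in _ALLOWED)
    return suspect / len(data) > max_suspect_ratio


def file_looks_binary(path: str, sample_size: int = 4096) -> bool:
    """Apply the heuristic to the first sample_size bytes of a file."""
    with open(path, "rb") as f:
        return looks_binary(f.read(sample_size))
```

Running `file_looks_binary` over a directory of files whose type is already known is one way to check how often the guess is right for your data.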

Gary







Re: Distinguish a binary file from a text file

Kooyman, Les
Only the one posing the question knows the answer.
 
This is because what is "binary" and what is "text" is completely subjective.
 
A simple test (if your files are simple) is to check whether any individual byte value exceeds the range of standard ASCII (0-127), and/or whether there are many bytes equal to binary zero.
 
In more and more cases these days, however, this is too simplistic. A Byte Order Mark (BOM) at the start of the file may be used to indicate which UTF encoding a text file uses, but only sometimes. Even programs that honor that convention come with the caveat that other programs (including many system-level tools) will not recognize it.
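
For illustration, a BOM check might look like the following Python sketch (not VisualWorks Smalltalk). The function name `detect_bom` and the returned encoding labels are hypothetical; the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A `None` result proves nothing either way, since many text files carry no BOM at all.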
 
Simply using the facilities provided by Unicode to express native characters in non-Roman (non-Latin) scripts is enough to make a file appear binary.
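
A small Python illustration of that point (the function name is hypothetical): the naive "outside the ASCII range" check flags ordinary UTF-8 text in a non-Latin script.

```python
def exceeds_ascii(data: bytes) -> bool:
    """True if any byte lies outside the 7-bit ASCII range (0-127)."""
    return any(b > 0x7F for b in data)


english = "Hello".encode("utf-8")
russian = "Привет".encode("utf-8")  # perfectly ordinary text

print(exceeds_ascii(english))  # False
print(exceeds_ascii(russian))  # True: the naive check would call this "binary"
```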
 
You have to know your data's characteristics to make this kind of distinction.
 
HTH
 
Les

