The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Utility to determine file types.

Does anyone know of a utility that will allow the following (more or less):

...

FileTypeEnum fte=Utility.getFileType(PathToFile)

...

with FileTypeEnum being a more-or-less comprehensive list of {plain_text, rtf, ms_word4.0, ms_word4.1, html, css, etc, jpeg, mpeg, xml, etc, etc}

???
mynameishere
Tuesday, January 16, 2007
 
 
Whoops. I put an "etc" in the middle of the list. Never seen that before...
mynameishere
Tuesday, January 16, 2007
 
 
The "file" command on various UNIX variants does this. If you happen to have a BSD-based UNIX system around, look at the contents of the /usr/share/magic file. It's a simple text database of "magic numbers" used to identify different types of files.
Mark Bessey
Tuesday, January 16, 2007
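
As a rough illustration of the magic-number idea mentioned above, here is a minimal Python sketch. It is not how "file" itself is implemented: the SIGNATURES table is a tiny hand-picked sample, and the names sniff_bytes and sniff_file and the labels they return are hypothetical. A real magic database has far more entries and richer matching rules.

```python
from typing import Optional

# Hypothetical, hand-picked table of (offset, magic bytes, label).
# Only a handful of well-known signatures; not a complete database.
SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "png"),
    (0, b"%PDF-", "pdf"),
    (0, b"PK\x03\x04", "zip"),
    (0, b"\xff\xd8\xff", "jpeg"),
    (0, b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole_compound_document"),
    (0, b"MZ", "dos_or_windows_executable"),
]


def sniff_bytes(data: bytes) -> Optional[str]:
    """Return the label of the first matching signature, or None if unknown."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read the first few bytes of a file and classify them."""
    with open(path, "rb") as f:
        return sniff_bytes(f.read(64))
```

Note that a check like this can only say "OLE compound document" or "executable"; it cannot tell Word from PowerPoint, and it returns None for anything not in the table, so callers still need an "unknown" case.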
 
 
Thanks. That seems okay. I guess I'd have to wrap it in an API. It also has the odd habit of labeling e.g. .doc and .ppt files as "Microsoft Office Document" rather than Word and PowerPoint.

...also .java is identified as .c
mynameishere
Tuesday, January 16, 2007
 
 
One implementation is File::MMagic, which should be easy to port:

http://search.cpan.org/dist/File-MMagic/
Chris Winters
Tuesday, January 16, 2007
 
 
Just use the Microsoft Shell Controls and Automation library:

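' Requires a project reference to "Microsoft Shell Controls and Automation" (shell32.dll).
Private m_DrivesFolder As Shell32.Folder

' Call once before GetType: caches the My Computer namespace folder.
Private Sub InitDrivesFolder()
    With New Shell32.Shell
        Set m_DrivesFolder = .NameSpace(ssfDRIVES)
    End With
End Sub

' Returns the shell's type description for the file at the full path FName.
Private Function GetType(ByVal FName As String) As String
    GetType = m_DrivesFolder.ParseName(FName).Type
End Function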
Codger
Tuesday, January 16, 2007
 
 
It depends on the environment. In the worst case (you're trying to determine the type of a file posted to a web server, and you need compatibility with all browsers) the only thing that will work is looking at the file's contents like the Unix "file" command does. There are several Windows binaries of it available. At the other extreme, you may be able to get away with just looking at the file extension.

In any case, you need to be prepared to deal with unknown file types. There is no comprehensive central file type list.
clcr
Tuesday, January 16, 2007
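
For the "just look at the file extension" end of that spectrum, a minimal Python sketch using the standard library's mimetypes module might look like this. The function name guess_by_extension and the choice of "application/octet-stream" as the fallback for unknown types are assumptions made for illustration.

```python
import mimetypes

# Assumed fallback label for anything the extension table does not know.
UNKNOWN = "application/octet-stream"


def guess_by_extension(filename: str) -> str:
    """Guess a MIME type from the file name alone; never inspects contents."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type if mime_type is not None else UNKNOWN
```

Because the lookup trusts the name, a renamed file is reported under whatever type its new extension implies, and unrecognized extensions fall through to the UNKNOWN value rather than raising an error.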
 
 
Windows stores the mapping from file type to MIME type in the registry. It's mentioned here: http://msdn.microsoft.com/workshop/networking/moniker/overview/appendix_a.asp and a bit of Googling might give you a better reference page too.
John Rusk
Wednesday, January 17, 2007
 
 
I rename a dll and give it a .txt extension. Is it now a text file or is it still a dll?
Peter
Wednesday, January 17, 2007
 
 
<mynameishere>
It also has the odd habit of labeling e.g. .doc and .ppt files as "Microsoft Office Document" rather than Word and PowerPoint.

...also .java is identified as .c
</mynameishere>

Yeah, that's the best you can get if you just look at a file's header bytes. All of the binary Microsoft Office formats (.doc, .xls, .ppt) are OLE compound files that start with the same signature bytes, D0 CF 11 E0 A1 B1 1A E1, so they're indistinguishable from one another unless you do some more intelligent analysis.

Likewise, if a Java file starts with a comment marker ("/*" or "//"), then it'll look just like a C/C++ file.

Last year, I implemented a statistical classification algorithm that does a very good job of differentiating between file types. It has about 98% accuracy in differentiating different MS Office types, as well as Java vs C++ vs C#. It gets about 65% to 75% accuracy in differentiating EXE and DLL files (those are pretty damn hard) and 100% accuracy with lots of unambiguous types (zip, pdf, png, jpg, etc).

The algorithm is non-trivial (my company is seeking a patent on the technique we used) and it requires a large representative training set of existing files.

Anyhoo, good luck :-)
BenjiSmith
Wednesday, January 17, 2007
 
 
"... The algorithm is non-trivial (my company is seeking a patent on the technique we used) and it requires a large representative training set of existing files."

Out of curiosity, did you try to do it with a backpropagation network as well? If so, how did the results compare to the approach you ended up using?
clcr
Thursday, January 18, 2007
 
 
Actually, I didn't try using a backpropagation network, though many of the underlying principles (stochastic gradient descent) are similar to the algorithm I designed. But I didn't use an ANN.

My research started with byte histograms and ended up using a modified Markov model for one particular part of the algorithm.

Other than those little tidbits, I'm not comfortable disclosing too many details. Juicy, huh?
BenjiSmith
Thursday, January 18, 2007
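
To give a feel for the byte-histogram starting point mentioned above, here is a toy Python sketch. It is emphatically not the undisclosed algorithm from this thread: it swaps in a plain nearest-average comparison of normalized byte histograms using L1 distance, and the function names and the tiny training samples are all made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def byte_histogram(data: bytes) -> List[float]:
    """Return 256 relative frequencies, one per possible byte value."""
    if not data:
        return [0.0] * 256
    counts = Counter(data)
    total = len(data)
    return [counts.get(b, 0) / total for b in range(256)]


def average_histogram(samples: List[bytes]) -> List[float]:
    """Average the histograms of several training samples of one type."""
    hists = [byte_histogram(s) for s in samples]
    return [sum(column) / len(hists) for column in zip(*hists)]


def classify(data: bytes, centroids: Dict[str, List[float]]) -> str:
    """Pick the label whose average histogram is closest in L1 distance."""
    hist = byte_histogram(data)

    def distance(label: str) -> float:
        return sum(abs(a - b) for a, b in zip(hist, centroids[label]))

    return min(centroids, key=distance)


# Toy "training set": two made-up samples per label, far too small for real use.
centroids = {
    "text": average_histogram(
        [b"hello world, plain text here", b"another line of prose"]
    ),
    "binary": average_histogram(
        [bytes(range(256)), bytes(range(0, 256, 3)) * 2]
    ),
}

text_guess = classify(b"some more english text", centroids)
binary_guess = classify(bytes(range(0, 256, 2)), centroids)
```

A histogram throws away byte order entirely, which is presumably one reason a sequence-aware model would be needed on top of it, and any approach along these lines depends on a large, representative training set.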
 
 

This topic is archived. No further replies will be accepted.
