Guessing text encoding

I have an app that parses text log files.  If a log file happens to start with a byte order mark (BOM), it's easy to tell whether the file is UTF-8, UTF-16 (what Windows calls "Unicode"), etc.  However, many log files have no such marker.
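
For reference, the BOM check described above can be sketched with nothing but the standard library. This is a minimal sketch, not a definitive implementation: the helper name `sniff_bom` and the `_BOMS` table are made up for illustration, while the `codecs.BOM_*` constants are part of Python's standard `codecs` module.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so checking UTF-16 first would misreport it.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(raw):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present.

    `raw` is the file content as bytes. Hypothetical helper for illustration.
    """
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0
```

Something like `encoding, skip = sniff_bom(raw)` followed by `raw[skip:].decode(encoding)` would then handle the easy case; a result of `(None, 0)` is exactly the hard case this question is about.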

I've been Googling around for a library or algorithm that might help guess the encoding of a file, but haven't come across anything.  Is there such a beast?  Is it even possible?
Tuesday, November 06, 2007
John Topley
Tuesday, November 06, 2007
Thanks John.  I guess the short answer is there is no good way to guess :(
Tuesday, November 06, 2007
You just have to know.
This is also called "out-of-band data": the encoding is communicated separately from the file's contents.
xampl
Tuesday, November 06, 2007
ICU (the open source library from IBM) has a nice class on character encoding detection:
Tuesday, November 06, 2007
The Unix "file" command is good value.
Wednesday, November 07, 2007
If you drank the COM Kool-Aid, check out IMultiLanguage2::DetectInputCodepage.

Using Python, there's a pretty good guessing module out there called chardet.

Here's a Python function to guess encodings (hoping the indentation holds...):

import chardet

def get_encoding(text):
    """Get the encoding of the current text file.
    Will also cut off the BOM at the beginning of the file.
    @param text: The raw bytes read from the file
    @returns: The bytes (less any BOM) and the encoding
    """
    # Check for a byte order mark first; it is the only reliable signal.
    if text.startswith(b'\xef\xbb\xbf'):
        return text[3:], 'utf-8'
    elif text.startswith(b'\xff\xfe'):
        # The BOM is stripped, so name the byte order explicitly.
        return text[2:], 'utf-16-le'
    elif text.startswith(b'\xfe\xff'):
        return text[2:], 'utf-16-be'
    else:
        # No BOM: let chardet make a statistical guess.
        detection = chardet.detect(text)
        if detection['encoding']:
            return text, detection['encoding']
        # chardet gave up; fall back to UTF-16 as a last resort.
        return text, 'utf-16'
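
If chardet isn't available, a rough standard-library fallback for BOM-less files might look like the sketch below. It is a different, much cruder technique than chardet's statistical detection: it looks for the NUL-byte pattern of mostly-ASCII UTF-16, then tries a strict UTF-8 decode. The function name `guess_without_bom`, the 4096-byte sample size, the 0.4 and 0.1 thresholds, and the cp1252 fallback are all assumptions picked for illustration, not anything from a library.

```python
def guess_without_bom(raw, fallback="cp1252"):
    """Guess a codec name for BOM-less bytes (hypothetical helper).

    The thresholds and the fallback codec are illustrative assumptions.
    """
    sample = raw[:4096]
    half = len(sample) // 2
    if half:
        # Mostly-ASCII UTF-16 text has a NUL in every other byte:
        # at odd offsets for little-endian, at even offsets for big-endian.
        even_nuls = sample[0::2].count(0)
        odd_nuls = sample[1::2].count(0)
        if odd_nuls > 0.4 * half and even_nuls < 0.1 * half:
            return "utf-16-le"
        if even_nuls > 0.4 * half and odd_nuls < 0.1 * half:
            return "utf-16-be"
    try:
        # Decode the whole input: truncating could split a multibyte
        # sequence. Text in legacy 8-bit encodings rarely passes this
        # check by accident.
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Cannot tell 8-bit code pages apart; return the caller's default.
        return fallback
```

This only separates UTF-16, UTF-8, and "something else"; telling one 8-bit code page from another still needs a statistical detector like chardet, or out-of-band knowledge.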
Ryan Ginstrom
Wednesday, November 07, 2007

