The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

Guessing text encoding

I have an app that parses text log files. If a log file happens to start with a Byte Order Mark (BOM), it's easy to tell whether the file is UTF-8, UTF-16 (little- or big-endian), etc. However, many log files don't have such a marker.

I've been Googling around for some sort of library, best algorithm, etc. that might help guess the encoding of a file, but haven't come across anything. Is there such a beast? Is it even possible?
Tuesday, November 06, 2007
John Topley
Tuesday, November 06, 2007
Thanks John.  I guess the short answer is there is no good way to guess :(
Tuesday, November 06, 2007
You just have to know. The encoding has to be communicated separately from the file's contents, which is also called "out-of-band data".
xampl
Tuesday, November 06, 2007
ICU (the open source library from IBM) has a nice class on character encoding detection:
Tuesday, November 06, 2007
The Unix "file" command is good value.
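From a script, one way to borrow that guess is to shell out to the command. This is only a sketch: it assumes a `file` binary on the PATH that understands the `--brief` and `--mime-encoding` options, and `file_mime_encoding` is a made-up helper name, not part of any library.

```python
import shutil
import subprocess

def file_mime_encoding(path):
    """Return file(1)'s encoding guess for path, or None if unavailable.

    Hypothetical helper: assumes a `file` binary on the PATH that
    supports the --brief and --mime-encoding options.
    """
    exe = shutil.which('file')
    if exe is None:
        return None  # no `file` command on this machine
    try:
        result = subprocess.run(
            [exe, '--brief', '--mime-encoding', '--', path],
            capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None  # the command could not run or rejected the options
    # The output is a short label such as an encoding name.
    return result.stdout.strip() or None
```

The result is still a guess, so it is probably best treated as a hint rather than as the final answer.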
Wednesday, November 07, 2007
If you drank the COM Kool-Aid, check out IMultiLanguage2::DetectInputCodepage.

In Python, there's a pretty good guessing module out there called chardet.

Here's a Python function to guess encodings (hoping the indentation holds...):

import chardet

def get_encoding(text):
    """Get the encoding of the current text file.
    Will also cut off the BOM at the beginning of the file.
    @param text: The raw bytes read from the file (opened in binary mode)
    @returns: The text (less any BOM) and the encoding
    """
    if text.startswith(b'\xef\xbb\xbf'):      # UTF-8 BOM
        return text[3:], 'utf-8'
    elif text.startswith(b'\xff\xfe'):        # UTF-16 little-endian BOM
        # The BOM is stripped, so name the byte order explicitly.
        return text[2:], 'utf-16-le'
    elif text.startswith(b'\xfe\xff'):        # UTF-16 big-endian BOM
        return text[2:], 'UTF-16BE'
    else:
        # No BOM: let chardet guess from the byte statistics.
        detection = chardet.detect(text)
        if detection['encoding']:
            return text, detection['encoding']
        else:
            # chardet had no guess; fall back to UTF-16.
            return text, 'utf-16'
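
If chardet isn't available, a much cruder guess can be made with the standard library alone. The following is only a sketch under stated assumptions: the name `guess_encoding`, the 4096-byte sample, the NUL-byte thresholds, and the `cp1252` fallback are all illustrative choices, and the heuristic will be wrong for some files.

```python
import codecs

# Longest BOMs first, so a UTF-32 LE mark (FF FE 00 00) is not
# mistaken for a UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def guess_encoding(raw, fallback='cp1252'):
    """Return (bytes_without_bom, encoding_name) for a bytes object.

    Hypothetical helper; the thresholds and fallback are assumptions.
    """
    # 1. A BOM, when present, settles the question.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):], name

    # 2. Mostly-ASCII UTF-16 text has a NUL in every other byte.
    sample = raw[:4096]
    pairs = len(sample) // 2
    if pairs:
        even_nuls = sample[0:pairs * 2:2].count(b'\x00')
        odd_nuls = sample[1::2].count(b'\x00')
        if odd_nuls >= 0.4 * pairs and even_nuls <= 0.1 * pairs:
            return raw, 'utf-16-le'
        if even_nuls >= 0.4 * pairs and odd_nuls <= 0.1 * pairs:
            return raw, 'utf-16-be'

    # 3. Bytes that decode as strict UTF-8 are very likely UTF-8
    #    (plain ASCII also lands here).
    try:
        raw.decode('utf-8')
        return raw, 'utf-8'
    except UnicodeDecodeError:
        # 4. Otherwise fall back to a single-byte code page.
        return raw, fallback
```

Usage would be along the lines of reading the file in binary mode, calling `payload, enc = guess_encoding(data)`, and then `payload.decode(enc, errors='replace')`.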
Ryan Ginstrom
Wednesday, November 07, 2007

This topic is archived. No further replies will be accepted.
