The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Guessing text encoding

I have an app that parses text log files. If a log file happens to start with a Byte Order Mark, it's easy to detect whether the file is UTF-8, UTF-16, UTF-32, etc. However, many log files don't have such a marking.

I've been Googling around for some sort of library, best algorithm, etc. that might help guess the encoding of a file, but haven't come across anything. Is there such a beast? Is it even possible?
Doug
Tuesday, November 06, 2007
 
 
John Topley
Tuesday, November 06, 2007
 
 
Thanks John.  I guess the short answer is there is no good way to guess :(
Doug
Tuesday, November 06, 2007
 
 
Yup.
You just have to know.
Also called "out-of-band data"
xampl
Tuesday, November 06, 2007
 
 
ICU (the open source library from IBM) has a nice class on character encoding detection:

http://www.icu-project.org/userguide/charsetDetection.html
Glitch
Tuesday, November 06, 2007
 
 
The Unix "file" command is good value.
Tim
Wednesday, November 07, 2007
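
For anyone who wants to call "file" from a script, here is a minimal sketch. It assumes the installed "file" supports the --brief and --mime-encoding options (current GNU and BSD versions do, as far as I know). The function name guess_encoding_with_file is made up for illustration, and the answer is only a guess, not a guarantee.

```python
import shutil
import subprocess


def guess_encoding_with_file(path):
    """Ask the Unix `file` command for its guess at a file's encoding.

    Returns a string such as 'utf-8' or 'us-ascii', or None when the
    `file` command is unavailable or fails. (Hypothetical helper.)
    """
    # Not every system ships `file` (e.g. stock Windows), so check first.
    if shutil.which("file") is None:
        return None
    try:
        result = subprocess.run(
            # Assumption: this build of `file` accepts these two options.
            ["file", "--brief", "--mime-encoding", path],
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return None
    # Empty output is treated the same as "no answer".
    return result.stdout.strip() or None
```

Treat the result as a hint: "file" only samples the data, so a mostly-ASCII log can be reported as plain ASCII even when it contains a few non-ASCII bytes further in.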
 
 
If you drank the COM Kool-Aid, check out IMultiLanguage2::DetectInputCodepage.

Using Python, there's a pretty good guessing module out there called chardet
http://chardet.feedparser.org/

Here's a Python function to guess encodings (hoping the indentation holds...):

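import codecs

import chardet

def get_encoding(data):
    """Get the encoding of the current text file.

    Will also cut off the BOM at the beginning of the file.

    @param data: The raw bytes read from the file (open it in 'rb' mode)

    @returns: The bytes (less any BOM) and the encoding
    """

    # BOMs are compared as bytes, so the file must be read in binary mode.
    if data.startswith(codecs.BOM_UTF8):
        return data[3:], 'utf-8'

    # With the BOM stripped, the byte order has to be named explicitly;
    # plain 'utf-16' would assume the platform's native order.
    elif data.startswith(codecs.BOM_UTF16_LE):
        return data[2:], 'utf-16-le'

    elif data.startswith(codecs.BOM_UTF16_BE):
        return data[2:], 'utf-16-be'

    else:
        # No BOM: let chardet guess from the bytes themselves.
        detection = chardet.detect(data)
        if detection['encoding']:
            return data, detection['encoding']
        else:
            # chardet could not decide; fall back to UTF-16.
            return data, 'utf-16'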
Ryan Ginstrom
Wednesday, November 07, 2007
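
If pulling in a third-party module is not an option, a cruder heuristic that nobody suggested above can still go a long way: try a strict UTF-8 decode first, and only if that fails fall back to a single-byte code page. A rough sketch follows. The name decode_with_fallback and the cp1252 default are assumptions for illustration; pick whatever legacy encoding your logs are most likely to use.

```python
def decode_with_fallback(data, fallback="cp1252"):
    """Decode bytes as UTF-8 when they are valid UTF-8, else use `fallback`.

    Returns a (text, encoding_used) tuple. (Hypothetical helper.)
    """
    try:
        # 'utf-8-sig' behaves like UTF-8 but also drops a leading BOM.
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: the file is in one legacy single-byte encoding.
        # errors="replace" keeps the parser going on undefined bytes.
        return data.decode(fallback, errors="replace"), fallback
```

This works reasonably well because text in other encodings that contains non-ASCII bytes rarely happens to be valid UTF-8. It cannot tell one single-byte code page from another, though, so the fallback is still something you just have to know.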
 
 

This topic is archived. No further replies will be accepted.
