The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Unicode standardization proposal

Look here everybody, UTF-8 and all the UTF-16s are huge design mistakes. Variable-width characters are a COMPRESSION technique, not a reasonable character standard.

Use ASCII if you can, such as for configuration files and internal scripting. If you need a customer-facing field, use UTF-32 and have your strings be arrays of longs. None of this insanity that makes processing strings suddenly turn into the most complex part of any program, ok?

Thus spoke I, the emperor of all standards.
Scott
Wednesday, February 20, 2008
 
 
UTF-8 should not have been made a superset of ASCII.
dev1
Thursday, February 21, 2008
 
 
Well spoken.

Just yesterday I had to deal with the insanity of UTF-8, with some chars being 1 byte, some being 2 bytes, and some being 3 bytes. It causes all String processing to be difficult, even though String processing is one of the most common programming tasks, and therefore should be kept simple and fast.
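
(For what it's worth, the variable width is at least mechanical: the lead byte alone tells you how long each sequence is. A minimal C++ sketch, assuming well-formed input; the function name is just for illustration.)

#include <cstddef>

// How many bytes a UTF-8 sequence occupies, read off the lead byte alone.
// Real code must still validate that the continuation bytes are present and well formed.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1; // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx (4-byte sequences exist too)
    return 0;                            // stray continuation byte or invalid lead byte
}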
Steve McLeod
Thursday, February 21, 2008
 
 
There is nothing wrong with UTF-8!

I suppose you are living under the delusion that all chars have a fixed width as well!?

If you really must operate under the (flawed) assumption that you can treat each character as a distinct, fixed-width unit, then you can consider UTF-32, but that is a waste of space because EVERY Unicode character fits in 16 bits! (Which is no longer true, and hasn't been true for quite a few years now, but Windows is slow to change.)

Does anyone else consider the _irony_ of this 'proposal'? (Hint: Google for "The Absolute Minimum Every Software Developer Must Know About Unicode").

I, for one, wish that everybody would standardize on UTF-8!
Arafangion
Thursday, February 21, 2008
 
 
UTF-8 or UTF-32 makes sense, but UTF-16 no longer does. With UTF-8 you're optimizing for space; with UTF-32 you're optimizing for simplicity.

The problem with UTF-16 is that *sometimes* it's okay to look at individual code units and *sometimes* it's not. This is far less likely to trip you up under UTF-8. In fact, I wish it were *never* possible to examine individual bytes in UTF-8, just to be consistent :)

The original poster is right, though: UTF-8 is essentially a compression algorithm, and it makes little sense to use it in the long run because (correct me if I'm wrong) space is a lot cheaper than CPU power. Always has been and always will be. The only time you're concerned about space is when you transfer over the wire, but in that case you can use compression across your entire protocol.
Gili
Thursday, February 21, 2008
 
 
Wrong. UTF-32 is not simple either, because of diacriticals and combining characters. The idea that the number of characters equals the memory length is ultimately just wishful thinking.

UTF-8 is the best because ASCII processing stays simple, unlike other multi-byte systems, which can have ASCII values (<128) as trailing bytes that do not represent those ASCII characters.
Ben Bryant
Thursday, February 21, 2008
 
 
Consider using a language that doesn’t let you know how it stores strings in memory.
DAH
Thursday, February 21, 2008
 
 
UTF-8 and UTF-16 make sense. UTF-8 is not just for storage. It can also be used to extend programs written long ago, when only ASCII strings existed in the programmer's mind. Old programs can be refactored without modifying their external interfaces. The NUL bytes that appear inside UTF-16 and UTF-32 text always cause trouble in legacy programs.

For example, a file system API originally designed for ASCII filenames only can be extended to accept Unicode filenames (as UTF-8) while keeping its interface intact.
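
A minimal sketch of that idea with a hypothetical legacy function (open_file is made up for illustration): because UTF-8 never produces an embedded NUL byte and never reuses byte values below 128, a UTF-8 path flows through the old char* signature untouched. Whether the underlying OS then interprets the bytes as UTF-8 is a separate assumption.

#include <cstdio>

// Hypothetical legacy API, originally documented as taking an "ASCII filename".
// The signature stays exactly the same when callers start passing UTF-8.
bool open_file(const char* path)
{
    std::FILE* f = std::fopen(path, "rb"); // assumes the file system treats paths as UTF-8
    if (!f) return false;
    std::fclose(f);
    return true;
}

// Usage: the bytes "r\xC3\xA9sum\xC3\xA9.txt" are the UTF-8 form of "résumé.txt"
// bool ok = open_file("r\xC3\xA9sum\xC3\xA9.txt");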

UTF-32 is a "politically correct" enhancement and overkill. We probably won't see many programs supporting it.
Glitch
Thursday, February 21, 2008
 
 
UTF-32 is the same mistake UTF-16 was -- a futile notion that a character should fit in fixed memory (due to Unicode combining characters it never can).

The idea that variable length character storage makes string manipulation more complicated is simply untrue. It is a throwback mindset to the days of character based monitors and mainframe green screen programming where you always dealt with character lengths that corresponded to display lengths. But that is not the case anymore. Granted, developers are still hung up on the relationship between database column widths and string lengths, but that is also backwards thinking (just convert your database charsets to see how arbitrary column widths are).

Generally, all you really need to know is how long the string is in memory, not how many characters there are. And for most parsing purposes you are looking for specific ASCII characters and with UTF-8 you don't even need to know where character boundaries are because all non-ASCII bytes are >=128.
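
A minimal sketch of that point (the function name is just illustrative), assuming the buffer is valid, NUL-terminated UTF-8:

#include <cstring>

// Scanning a UTF-8 string for an ASCII delimiter needs no decoding at all:
// every byte of a multi-byte sequence is >= 0x80, so it can never match a byte < 128.
const char* find_ascii_delimiter(const char* utf8, char delimiter) // delimiter must be ASCII (< 128)
{
    return std::strchr(utf8, delimiter); // cannot match inside a multi-byte sequence
}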
Ben Bryant
Thursday, February 21, 2008
 
 
I've never had to worry about this.  Don't you have a good library or API that is already more thoroughly tested than anything you could write yourself?
Cade Roux
Thursday, February 21, 2008
 
 
Even if you _don't_ have a good library, and you're not allowed to use one for some stupid reason, and your only interface to strings is plain C-style null-terminated char pointers, it's STILL utterly trivial to handle UTF-8 just fine.  It's a beautifully designed encoding with none of the difficulties or ambiguities of legacy MBCS encodings.

Anyone who can't take the spec and produce a working conformant set of UTF-8 string-handling routines within an hour should not be employed as a programmer.
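
A stripped-down sketch of what such a routine might look like (decode one code point and advance the pointer). A genuinely conformant version must also reject overlong forms, surrogate code points and values above U+10FFFF; those checks are omitted here.

#include <cstdint>

std::uint32_t decode_utf8(const char** p) // caller stops at the terminating NUL
{
    const unsigned char* s = reinterpret_cast<const unsigned char*>(*p);
    std::uint32_t cp;
    int extra;

    if (s[0] < 0x80)                { cp = s[0];        extra = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; extra = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; extra = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; extra = 3; }
    else { ++*p; return 0xFFFD; }   // invalid lead byte: emit U+FFFD and move on

    for (int i = 1; i <= extra; ++i) {
        if ((s[i] & 0xC0) != 0x80) { *p += i; return 0xFFFD; } // truncated sequence
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *p += extra + 1;
    return cp;
}

// Usage: const char* p = some_utf8; while (*p) { std::uint32_t cp = decode_utf8(&p); /* ... */ }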
Iago
Thursday, February 21, 2008
 
 
Does that hour include reading the spec?
clcr
Thursday, February 21, 2008
 
 
Folks, you store in UTF-8, you process in UCS-2/UCS-4. Note, UCS, not UTF. UTF is a variable-length encoding; UCS is a fixed code unit size.

It turns out that UCS-2 and UTF-16 are identical over most of the character range, so there's lots of confusion. Win32 uses UCS-2.
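
On Win32 that conversion is one API call; a minimal sketch using MultiByteToWideChar with CP_UTF8 (error handling mostly omitted):

#include <windows.h>
#include <string>
#include <vector>

// Convert stored UTF-8 to the wide-char form Win32 works with in memory.
std::wstring utf8_to_wide(const std::string& utf8)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, NULL, 0); // includes the NUL
    if (len == 0) return std::wstring();                                  // invalid input
    std::vector<wchar_t> buf(len);
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &buf[0], len);
    return std::wstring(&buf[0]);                                         // drop the trailing NUL
}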
Chris Tavares
Thursday, February 21, 2008
 
 
I have some reasons to believe that Unicode somehow went wrong. But you have to understand that many languages are not structured like Western languages, and for practical purposes, Unicode works.

I agree that same-count-of-bytes-per-char is better than variable-count-of-bytes-per-char.

The UTF-XX encodings use a variable-count-of-bytes-per-char approach. There are alternative UCS-XX encodings that use the same count of bytes per char, which is useful for programming.

The same problem already existed with the standard one-byte-per-char ASCII encoding, and the variable-count-of-bytes-per-char MBCS encoding...

UCS-4 ALWAYS uses 4 bytes per char, and as far as I know, it supports all known commonly used languages, from English to Chinese. Many developers currently use UCS-4 encoding internally in their applications, even if they have to load and save files as their UTF-XX counterparts.

Just my 2 cents.

Thursday, February 21, 2008
 
 
There is a misconception and misunderstanding throughout this whole thread, making the discussion quite futile. ANYWHERE the term "character" is used here, it should be replaced with "Unicode code point".

http://www.joelonsoftware.com/articles/Unicode.html

http://en.wikipedia.org/wiki/Unicode
Secure
Sunday, February 24, 2008
 
 
When you use wide characters, it's only a data representation; what the data holds is what matters more.

Take a string literal in your code, for example: C or C++ will treat "Hello world" as an ASCII string literal. When you place an L in front of the text, C/C++ will treat it as a wide (Unicode) string literal.

On Windows, the data type wchar_t is effectively defined as follows (its size is implementation-defined in standard C++):
typedef unsigned short wchar_t;

So it holds 2^16 values, i.e. 0x0000 through 0xFFFF. Now the problem arises that a 16-bit wchar_t does not actually have the space required to HOLD every code point in the current Unicode 5.0 specification, which defines code points up to U+10FFFF (requiring about 21 bits).

But it doesn't really matter for most applications, because wchar_t (or WCHAR) can hold the necessary code points for the Basic Multilingual Plane. So you can render nearly all the known writing systems on Earth, granted the user has the correct fonts installed to render those code points.

This is not even getting into encoding yet. Encoding is something you only have to worry about when you need to serialise the Unicode code points or write them to disk. Mind you, if you used wchar_t you should be basically fine, but if you read some UTF-8 that actually uses code points above 0xFFFF, you're kind of screwed.

The trouble I think most people get confused about is that when they look at the Unicode specification, they see the ASCII characters represented in the same range as the normal ASCII table. So they assume Unicode just holds characters, when it actually holds code points instead, which lead to the rendering of a specific character.
Entity
Tuesday, February 26, 2008
 
 
Think of the operations you perform on a string. There are those where the encoding doesn't matter, and those where it does.

a) the encoding does NOT matter in the following operations because all you may need is the memory length for passing to system functions:
- moving the string around in memory
- comparing strings for the purpose of sorting
- searching for a substring gives you an offset

b) the encoding DOES matter in the following operations because you are parsing the string:
- extracting the first character
- locating the last character
- finding the next or previous character
- truncating

The main issue the OP is pointing to is the notion that each memory unit of the string holds one "character," not "code point." The OP thinks that moving to UTF-32 solves this problem for parsing operations but it does not.

I don't claim to know the perfectly accurate terminology, but by talking about a character the OP is obviously implying a unit of text that can be separated from the text around it with regard to the parsing operations listed above. But Unicode defines combinations of code points ("combining characters") that cannot be separated. This means there can be points in a UTF-32 string where it cannot be divided, necessitating "NextChar" types of functions for iterating through and parsing UTF-32 strings. So the idea that UTF-32 makes things simpler for the developer is ultimately just not true.
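
A deliberately simplified sketch of such a function over a UTF-32 buffer, checking only the common combining diacritical block U+0300..U+036F; a real implementation would consult the Unicode character database and the full grapheme cluster rules:

#include <cstddef>
#include <cstdint>

bool is_combining(std::uint32_t cp)             // grossly simplified, for illustration only
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Return the index of the next "user character" boundary at or after pos + 1.
std::size_t next_user_character(const std::uint32_t* s, std::size_t len, std::size_t pos)
{
    ++pos;                                      // step over the base code point
    while (pos < len && is_combining(s[pos]))   // absorb attached combining marks
        ++pos;
    return pos;
}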
Ben Bryant
Tuesday, February 26, 2008
 
 
If nothing else, UTF8 being a superset of ASCII made it much more realistic for people to adopt it. This is a good thing.

UTF16, um, don't ask me.  :)

Tuesday, February 26, 2008
 
 
UTF-32 is still, in effect, a variable-length encoding: one "user character" can still occupy two 32-bit values. When you are selecting text, you cannot select between these two; you get either none or both, and indeed they usually appear as a single thing. For plenty of examples, see the link below: most of the entries in the Tamil "alphabet" are two code points each (try dragging your mouse across one of these "letters" to see that it is treated as an atomic thing):

http://blogs.msdn.com/michkap/archive/2008/02/26/7898303.aspx

This is a recent post in Michael Kaplan's blog, which I highly recommend. Joel Spolsky does too:

http://www.joelonsoftware.com/items/2005/08/10.html
Ben Bryant
Wednesday, February 27, 2008
 
 
>a) the encoding does NOT matter in the following operations because all you may need is the memory length for passing to system functions:
>- moving the string around in memory
>- comparing strings for the purpose of sorting
>- searching for a substring gives you an offset


I am pretty sure encoding matters for the last two. You can't do strictly numerical sorting, since the base character and the character with accents should sort next to each other. Similarly for substring searching, certain sequences should be counted as equivalent. Hopefully your language comes with routines to do both of these for you.

Wednesday, February 27, 2008
 
 
"You can't do strictly numerical sorting since the base character and the the character with the accents should be next to each other."

It is MUCH more complicated than that. It often depends on culture and location.

http://en.wikipedia.org/wiki/Collation

http://en.wikipedia.org/wiki/Unicode_Collation_Algorithm
Secure
Thursday, February 28, 2008
 
 
> I am pretty sure encoding matters for the last two

What I said was "for passing to system functions". Of course the encoding matters to the system function (not to mention that the programmer must pass a string of the right encoding). But the subject of this thread is about what the programmer has to deal with in general, and not the system programmer who is writing common system routines.
Ben Bryant
Thursday, February 28, 2008
 
 
Shouldn't your system functions just take wchar_t Unicode code points instead of the insanity of system functions taking UTF-8/UTF-16/etc.?

I see the point of having a conversion from UTF-8 to Unicode as a system function, but moving around UTF-8 as the internal string representation seems a little foolish, does it not?
Entity
Friday, February 29, 2008
 
 
Entity,

"conversion from UTF8 to unicode"

Now that's an interesting statement. What does this "unicode" encoding look like and how is it stored in memory? What are you talking about?
Secure
Friday, February 29, 2008
 
 
I mean hold the Unicode code points in memory represented by two bytes each, using wchar_t. Then, when you need to serialize the Unicode code points out to disk, convert them to the UTF-8 encoding to save storage space. When reading from disk, have your system functions convert from UTF-8 back to the internal code point representation.

Though two bytes only covers the BMP and not the full range of Unicode code points; you would have to expand that to a 32-bit integer to cover them all.
Entity
Friday, February 29, 2008
 
 
What I meant to say was:

Hold the internal string as UTF-16/UCS-2, then when serializing to disk use UTF-8. When loading from disk or a database, convert from UTF-8 back to UCS-2 for your internal string representation.

I mixed up the meaning of Unicode code points and how to physically represent them in memory using an encoding. My bad.
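
The serialization half of that round trip, sketched with the Win32 WideCharToMultiByte call (error handling mostly omitted; with CP_UTF8 the last two arguments must be NULL):

#include <windows.h>
#include <string>
#include <vector>

// UTF-16/UCS-2 in memory, UTF-8 for storage: encode before writing to disk.
std::string wide_to_utf8(const std::wstring& wide)
{
    int len = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1, NULL, 0, NULL, NULL);
    if (len == 0) return std::string();
    std::vector<char> buf(len);
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1, &buf[0], len, NULL, NULL);
    return std::string(&buf[0]); // the UTF-8 bytes to write out (trailing NUL dropped)
}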
Entity
Friday, February 29, 2008
 
 
Observation: Unicode is about handling indirections. The "characters" are represented by abstract code points. The code points are encoded into some abstract byte values like UTF-8. Double indirection. Most people don't seem to get this and are completely confused -- just like the problems with pointers.
Secure
Friday, February 29, 2008
 
 
Yes, in Windows you use wide char (UTF-16) in memory because that is what Win32 Unicode system functions require (otherwise you will have to convert strings before and after system calls). Although a code point usually fits in 2 bytes, it does not always; it did in the days of UCS-2, but that was in the 90s. When you serialize to disk, you have options, but single and multi-byte encoding are most common.

But again, the OP said to use UTF-32 in memory because a Unicode character would always fit in a long (32 bits), which is just not true. A code point will always fit in a long, but not a "user character" or "grapheme cluster" which is what matters when parsing strings.

The OP states variable length encoding is a compression technique and makes no sense for in-memory string manipulation, but since EVERY Unicode encoding is essentially a variable length encoding, then you might as well use the ones that require less memory.
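
A concrete non-BMP example (U+1D11E, MUSICAL SYMBOL G CLEF, chosen purely for illustration):

UTF-32: 0x0001D11E (one 32-bit unit)
UTF-16: 0xD834 0xDD1E (a surrogate pair, two 16-bit units)
UTF-8:  0xF0 0x9D 0x84 0x9E (four bytes)

Only UTF-32 is fixed-width per code point, and none of the three is fixed-width per user character once combining marks enter the picture.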
Ben Bryant
Friday, February 29, 2008
 
 
@Ben

It just seems a bit more complex if you use UTF-8 as the in-memory string representation. For example, just to get to the code points you first need to decode UTF-8, and then to find the character or "grapheme cluster" you need to decode the next UTF-8 portion to get the next code point, which, if I'm not mistaken, you need to store as UTF-32 anyway.

Whereas if you hold the in-memory string as UTF-32 you can go straight to finding the grapheme cluster by examining the code points themselves, without having to decode UTF-8, which is a variable-length encoding. The same goes for inserting: you have to work out where to insert into that variable-length encoding.

I guess it's just designer preference, trading memory for processing speed. It just seems a lot simpler and more straightforward to deal with UTF-32/UTF-16 than UTF-8 for in-memory string manipulation.
Entity
Friday, February 29, 2008
 
 
Entity, thanks that was very clearly stated. I was thinking more along the lines of ignoring the code points altogether and having a system function like NextChar which takes you to the boundary of the next user character (grapheme cluster). I shouldn't have to know the details of code points if all I want is the boundary of user characters. And corresponding functions to go to previous character, last character, etc.

If you are using some grapheme cluster lookup, with UTF-8 you would probably have to convert N code points to UTF-16 or UTF-32. But all of this makes assumptions about the system functions you are using to get your character information.

Ultimately we don't have a choice of UTF-8 or UTF-32 in Windows because it is geared to UTF-16. In .NET you have functions that encapsulate the surrogate pair handling, and I assume the user character handling. So it is hardly worth trying to use UTF-32 or UTF-8 in memory when all of the APIs, Windows messages and string classes work with UTF-16, unless there is another compelling reason.
Ben Bryant
Friday, February 29, 2008
 
 

This topic is archived. No further replies will be accepted.
