The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

question on validating utf-8 input

Best practice for validating input to a web form or similar suggests that the validation should be inclusive. E.g. the regex


checks that the content is alphanumeric plus underscore. Also we might include whitespace and limited punctuation if appropriate.

But suppose we have a form where the input can be in ANY language, encoded as UTF-8? What then?

Obviously the first thing is to check that the input is valid UTF-8. But then? How do we keep out the nasty escape sequences and whatever other nastiness exists, while knowing that we won't reject someone who submits perfectly normal text in Korean?
revert my buffer
Saturday, January 28, 2006
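
One possible way to approach the question above, sketched in Python: decode the bytes strictly as UTF-8, then reject control characters (Unicode general category Cc, which includes ESC) instead of whitelisting an alphabet. The name `decode_and_check` and the choice of which category to reject are assumptions made for illustration; other categories, such as Cf format characters, would need a decision per application.

```python
import unicodedata

# Control characters that free text from a web form normally needs.
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def decode_and_check(raw: bytes) -> str:
    """Decode raw bytes as strict UTF-8 and reject control characters."""
    # Strict decoding raises UnicodeDecodeError on malformed UTF-8
    # (overlong forms, encoded surrogates, stray continuation bytes).
    text = raw.decode("utf-8", errors="strict")
    for ch in text:
        if ch in ALLOWED_CONTROLS:
            continue
        # "Cc" covers C0/C1 controls such as ESC (U+001B) and DEL (U+007F).
        # Assumption: only this category is rejected in this sketch.
        if unicodedata.category(ch) == "Cc":
            raise ValueError(f"disallowed control character U+{ord(ch):04X}")
    return text
```

Korean text passes unchanged, while a terminal escape sequence such as ESC `[31m` is rejected. Note that `UnicodeDecodeError` is a subclass of `ValueError`, so a caller can catch both with one handler.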
That's one of my pet peeves with people who think that just by mentioning that they use Unicode their souls are saved. :-)

The world is still too ASCII for my taste and Unicode is used mostly for translations. And Unicode with 8 bits? Why not 32 bits to fix the problem once and for all? Performance optimization? Once again? Blah...

To answer your question, handling text in a very specific way will probably cause problems if Unicode is to be universal. For instance, some symbols already represent whole words if not entire phrases in some languages. So, someone could write his nickname in only one symbol, like "A", or "B", and it would mean "Fire" or "I'm on fire". :-) Take the many RFCs that describe the Internet protocols, and I wouldn't be surprised if 99% of them use ASCII encoding, and once you start escaping characters, you don't know for sure if some text is escaped on purpose or if it was escaped by some automatic conversion.

Summing up, maybe it's better to allow only some characters than to falsely say that you support Unicode when you don't.
Lost in a code jungle
Saturday, January 28, 2006
You usually want to use the input for some other action, e.g. a database query, an external program called with exec, system or whatever there is, or anything else. Find out which characters are dangerous for the specific case (and you had better get a COMPLETE list) and how you can escape them, e.g. substitute any " in the input with \" and surround the complete input with "", after checking that it is valid UTF-8.

This is no real validation of the input, of course, but a validation requires a definition of the correct inputs. Thus your alternatives are either to find out what is a correct input for ANY single part of the world - for ANY input field, or to search and possibly pay for a library doing this for you.
Saturday, January 28, 2006
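
A minimal sketch of the per-destination escaping described in the post above, in Python. `quote_double` is a hypothetical helper for a destination where only `"` and `\` are special inside double quotes; it also escapes the backslash itself, which a complete list for that case has to include. For a POSIX shell, the standard library's `shlex.quote` already does the job, so `quote_for_shell` (also a hypothetical name) simply wraps it.

```python
import shlex


def quote_double(text: str) -> str:
    """Wrap text in double quotes, escaping backslash and double quote.

    Assumes a destination where these two are the only special
    characters inside a double-quoted string.
    """
    # Escape backslashes first so the backslashes added for quotes
    # are not escaped a second time.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def quote_for_shell(text: str) -> str:
    """Quote text as a single argument for a POSIX shell."""
    return shlex.quote(text)
```

Where the external program can be started with an argument list instead of a shell command line, no shell quoting is needed at all, which avoids the problem of keeping the list of dangerous characters complete.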
Going to and looking up "validating input" in the search box gave me:

which might help.

But it may be that you need locale-specific regexps and need to set them up locale by locale.
EKB
Sunday, January 29, 2006
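
To illustrate the locale-by-locale idea from the post above, here is a rough Python sketch. The table `LOCALE_PATTERNS`, the function `is_valid`, and the character ranges chosen are all assumptions for the example; a real application would need a carefully reviewed range per locale.

```python
import re

# Hypothetical inclusive patterns, one per locale. The Korean entry uses
# the Hangul Syllables block (U+AC00 to U+D7A3) plus digits, underscore
# and space; real Korean input may need more than this.
LOCALE_PATTERNS = {
    "en": re.compile(r"[A-Za-z0-9_ ]+"),
    "ko": re.compile(r"[0-9_ \uAC00-\uD7A3]+"),
}


def is_valid(locale: str, text: str) -> bool:
    """Return True if the whole text matches the pattern for the locale."""
    pattern = LOCALE_PATTERNS.get(locale)
    if pattern is None:
        # Unknown locale: reject rather than guess.
        return False
    return pattern.fullmatch(text) is not None
```

Using `fullmatch` keeps the check inclusive: every character must be in the allowed set, so anything not listed is rejected by default.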
good link. Thanks. I'll keep it for future reference.
revert my buffer
Monday, January 30, 2006
I'm guessing you are going to go squirting this into a database?  This is my rather narrower interpretation of your OP :-)

Is it that you want to stop a dumb SQL-isation of user input turning into an SQL injection attack?
new nick, new rep
Monday, January 30, 2006
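
If SQL injection is the concern, a common alternative to escaping is to pass the user text as a bound parameter so that it is never parsed as SQL. The sketch below uses Python's standard `sqlite3` module; the `comments` table and the function names are hypothetical, chosen only for the example.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical comments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (body TEXT)")
    return conn


def store_comment(conn: sqlite3.Connection, text: str) -> None:
    """Insert user text using a placeholder, not string concatenation."""
    # The "?" placeholder sends the text as data; quotes and semicolons
    # inside it are stored literally and never executed.
    conn.execute("INSERT INTO comments (body) VALUES (?)", (text,))


def all_comments(conn: sqlite3.Connection) -> list:
    """Return stored comment bodies in insertion order."""
    return [row[0] for row in conn.execute(
        "SELECT body FROM comments ORDER BY rowid")]
```

With this approach the text can be in any language and contain any punctuation, because the database driver, not the application, decides how it is transmitted.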
The question is academic right now. There is one place in my app where the user gets to type in "free text" that can't be restricted as per my original post. But it doesn't go anywhere - it's not used in a db query, saved to a file, or whatever. Just used to construct another page which is displayed and then forgotten.

But I can foresee future extensions where I will want to do this, so I am thinking in advance.
revert my buffer
Monday, January 30, 2006
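
Even when the text is only used to construct another page, it still ends up inside HTML, so it needs escaping for that destination. A minimal Python sketch using the standard `html.escape`; the function `render_echo_page` and the page layout are assumptions for illustration.

```python
import html


def render_echo_page(user_text: str) -> str:
    """Build a small HTML page that displays user text as plain text."""
    # html.escape replaces &, < and >; with quote=True it also replaces
    # double and single quotes, so the result is safe in attribute values.
    safe = html.escape(user_text, quote=True)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>Echo</title></head>'
        f"<body><p>{safe}</p></body></html>"
    )
```

Non-ASCII text such as Korean passes through untouched, since only the five HTML-significant characters are replaced; the page declares UTF-8 so the browser decodes it correctly.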

This topic is archived. No further replies will be accepted.
