The Design of Software (CLOSED)
A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main
Joel on Software discussion group.
The archives will remain online indefinitely. 
How unique are the strings, how is the variation distributed, how many are there, and how fast does it have to be?
If the ints must be unique, just store the char memory as an int! Or, if space matters more than speed, use base64 encoding. Or, if speed doesn't matter, use MD5. Or read Sedgewick.
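A minimal Java sketch of the first suggestion (the four-character limit and the assumption that characters fit in a non-NUL byte are mine, for illustration):

<code language="java">
// Packs a string of at most four 8-bit characters into a single int.
// Distinct inputs give distinct ints, assuming each character fits in
// one byte and is non-NUL (so leading bytes are never ambiguous).
static int packChars(String s) {
    if (s.length() > 4) throw new IllegalArgumentException("max 4 chars");
    int v = 0;
    for (int i = 0; i < s.length(); i++) {
        v = (v << 8) | (s.charAt(i) & 0xFF);
    }
    return v;
}
</code>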
Assuming you have an appropriate encoding, just multiply together:
2^(first character) * 3^(second character) * 5^(third character) * 7^(fourth character) * 11^(fifth character), and so on. You will need a list of primes or a means of producing one, of course. The fact that the integers are a unique factorization domain guarantees you a one-to-one mapping between integers and strings.

Oh, you meant a *32-bit* integer? Well, you aren't going to be able to get *unique* integers then; there are more than 4294967296 strings...
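A sketch of that scheme in Java, using BigInteger since the products overflow any fixed-width integer almost immediately (the hard-coded prime list and the ten-character cap are just for illustration):

<code language="java">
import java.math.BigInteger;

// Prime-exponent ("Godel numbering") encoding as described above.
public class GodelEncode {
    private static final int[] PRIMES = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 };

    // Maps a string (up to PRIMES.length characters) to a unique BigInteger:
    // primes[i] raised to the code point of character i, all multiplied together.
    static BigInteger encode(String s) {
        BigInteger result = BigInteger.ONE;
        for (int i = 0; i < s.length(); i++) {
            result = result.multiply(
                BigInteger.valueOf(PRIMES[i]).pow(s.charAt(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Distinct strings always give distinct integers, by unique factorization.
        System.out.println(encode("ab")); // 2^97 * 3^98
        System.out.println(encode("ba")); // 2^98 * 3^97 (a different value)
    }
}
</code>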
Just try to enter the topic title "string hashing functions" or "string hash functions" (without "") into Google... ;)
Secure Thursday, March 23, 2006
A CRC makes a reasonable hash for noncryptographic purposes. Faster than MD5. VERY fast if memory accesses don't incur a penalty (e.g. on slower embedded systems). Slower than other methods that may produce less desirable results.
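For what it's worth, a quick sketch using Java's built-in CRC32 (the choice of Java and UTF-8 here is just an assumption for the example):

<code language="java">
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hashes a string to a 32-bit value using java.util.zip.CRC32.
static int crcHash(String s) {
    CRC32 crc = new CRC32();
    crc.update(s.getBytes(StandardCharsets.UTF_8));
    // getValue() returns the CRC in the low 32 bits of a long.
    return (int) crc.getValue();
}
</code>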
I often do something like:

    hv = s.size();
    for (int i = 0; i < s.size(); i++)
        hv = hv * 37 + s[i];

This may be far from theoretically sound, but at least it's reasonably fast on modern systems.
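The same idea rendered in Java, for comparison; java.lang.String.hashCode has the same accumulate-and-multiply shape, except it uses 31 as the multiplier and starts from zero rather than the length:

<code language="java">
// Multiplicative string hash: seed with the length, then
// fold each character in with a small prime multiplier.
static int simpleHash(String s) {
    int hv = s.length();
    for (int i = 0; i < s.length(); i++) {
        hv = hv * 37 + s.charAt(i);
    }
    return hv;
}
</code>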
If the strings are known in advance, then what you are looking for is a "perfect hash function". As usual, Google is your friend.
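Real perfect-hash generators (gperf, for example) are far more sophisticated, but as a toy sketch of the idea: when the string set is small and fixed, you can often just search for a seed that happens to map every string to a distinct bucket. This brute-force approach is my own illustration, not a standard algorithm:

<code language="java">
// Searches for a seed that makes the hash collision-free over a
// fixed set of keys, mapping them into exactly keys.length buckets.
// Practical only for small sets; real tools build such functions
// constructively rather than by trial.
static int findPerfectSeed(String[] keys) {
    for (int seed = 1; seed < Integer.MAX_VALUE; seed++) {
        boolean[] used = new boolean[keys.length];
        boolean ok = true;
        for (String k : keys) {
            int h = seed;
            for (int i = 0; i < k.length(); i++) {
                h = h * 31 + k.charAt(i);
            }
            int bucket = Math.floorMod(h, keys.length);
            if (used[bucket]) { ok = false; break; }
            used[bucket] = true;
        }
        if (ok) return seed;
    }
    throw new IllegalStateException("no seed found");
}
</code>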
CodeJunkie Thursday, March 23, 2006
Incidentally...if you want to compare one hash implementation with another hash implementation, here's some code:
<code langauge="java"> int[] hashes = generateHashes(myStrings); double[] vector = new double[32]; // Aggregate the values of the bits for (int i = 0; i < hashes.length; i++) { int hash = hashes[i]; for (int j = 0; j < 32; j++) { int thisBit = ((hash >> j) & 1); int magnitude = (thisBit == 0) ? 1 : 1; vector[j] += (double) thisBitValue; } } // Calculate the vector distance, using each bit's average // value as a dimension into the vector. for (int i = 0; i < 32; i++) { // In an ideal hash function, the dimensional // distance should be zero across all 32 bits. double dimensionDistance = Math.abs(vector[i] / hashes.length); runningSum += dimensionDistance; } // Determine the overall effectiveness of // the hashing algorithm. double vectorDistance = Math.sqrt(runningSum); double maxDistance = Math.sqrt(32); double wastePercentage = vectorDistance / maxDistance; double hashQuality = 1.0d  wastePercentage; </code> Basically, this code treats each hash value as a vector into 32dimensional space (with magnitudes normalized to a domain of +1 or 1). Averaging a large number of those vectors, you should expect a perfect hash to have an average magnitude of zero across all 32 demensions, and a vectorDistance of zero. The worst hash function ever (in which all strings hash to the same int value), would produce either a 1 or a 1 for the average bit value, in every dimension. So the vector distance would be the square root of 32. Using this information, it's easy to get an exact measurement of the amount of waste in your hashing algorith. By dividing the vectorDistance by the maximum possilbe vector distance tells you what percentage of your hash bits are being wasted on redundancy. Depending on your applicationand the types of strings you typically seea CRC might be the best algorithm. But maybe MD5 is better. Do you tend to see really short strings, or really long strings, or a mixture of the two? Which type of hashing algorithm produces better results based on YOUR dataset? You won't know until you measure. Have fun :)
Thanks. I really want to avoid having to look up values in a table, because Oracle won't let me then use fast snapshot refreshes (why, I can't tell you, but it won't).
I was summing the products of the characters with the next prime number; it worked well, but required a lookup.
Be aware that the algorithm behind dbms_utility.get_hash_value is not guaranteed to remain unchanged between versions, so storing values across an upgrade may be "an issue".
Look at DBMS_OBFUSCATION_TOOLKIT.DESEncrypt also.
Thanks. Turned out that all of these functions prevent a materialized view from fast refreshing, much to my chagrin. Therefore, going to have to hand code the CDC logic. Uggh.
Almost got there. At least I'll be able to use a straightforward sequence to generate the surrogate keys, instead of some obtuse hashing function.
Sorry, maybe I'm being a bit obtuse here, but plenty of work has been put into the development of cryptographic hash functions. Unless you need to do trillions of these hashes at a time, why not use something from the SHA family? (Even MD5 would be better than trying to roll your own):
MD5 - broken, but fine for noncryptographic purposes
SHA1 - broken, but still standard (128bit)
SHA256 - better (256bit)
SHA512 - even better (512bit)

I wouldn't suggest using SHA192 or SHA384, since they are just truncated versions of the SHA256 and SHA512 algorithms respectively, and I am not aware of any research done to assure me that such a truncation doesn't significantly weaken the algorithm.

The above functions should be available on just about any platform/language you can find, and many products have them built in.
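As an illustration, a minimal Java sketch using the standard MessageDigest API; folding the digest down to an int by taking its first four bytes is my own simplification for the 32-bit use case discussed in this thread:

<code language="java">
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hashes a string with SHA-256 via the JCA, then takes the first
// four bytes of the digest as a 32-bit hash value.
static int sha256ToInt(String s) throws NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    byte[] d = md.digest(s.getBytes(StandardCharsets.UTF_8));
    return ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
         | ((d[2] & 0xFF) << 8)  |  (d[3] & 0xFF);
}
</code>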
"SHA1  broken, but still standard (128bit)"
160 bit.

"I wouldn't suggest using SHA192 or SHA384, since they are just truncated versions of the SHA256 and SHA512 algorithms respectively, and I am not aware of any research done to assure me that such a truncation doesn't significantly weaken the algorithm."

The truncation is done on the final result; the calculation is exactly the same, besides the different initialization vectors. At least for SHA224, that is; I haven't looked at SHA192 for now. It DOES significantly weaken the algorithm, of course... from 256 down to 224 and from 512 down to 384 bits. ;)

Additionally, note that SHA512 uses the same algorithm as SHA256, just optimized for 64 bits instead of 32 bits, with adjustments to constants and rotations.
Secure Friday, March 31, 2006 