The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

String Hashing Functions

Hi,

Anybody got a source for a good algorithm to convert strings into unique integers, preferably without lookups?

Thanks,

Steve
Steve Hirsch
Wednesday, March 22, 2006
 
 
How unique are the strings, how is the variation distributed, how many are there, how fast does it have to be?

If the ints must be unique, just store the char memory as an int (see the sketch below)!
Or if space matters more than speed, use base64 encoding.
Or if speed doesn't matter, use MD5.
Or read Sedgewick.
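
For the first option, a minimal Java sketch (the method name is just for illustration; this only gives unique values for strings of at most four single-byte characters):

<code language="java">

// "Store the char memory as an int": pack up to four 8-bit characters
// straight into a 32-bit value. Anything longer simply won't fit, and
// strings that differ only by leading NUL bytes would still collide.
static int packChars(String s) {
  if (s.length() > 4) throw new IllegalArgumentException("won't fit in 32 bits");
  int hv = 0;
  for (int i = 0; i < s.length(); i++) {
    hv = (hv << 8) | (s.charAt(i) & 0xFF);
  }
  return hv;
}

</code>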
Martin
Thursday, March 23, 2006
 
 
Assuming you have an appropriate encoding, just multiply together:

2^(first character)
3^(second character)
5^(third character)
7^(fourth character)
11^(fifth character)

and so on. You will need a list of primes, or a means of producing them, of course. The fact that the integers are a unique factorization domain guarantees a one-to-one mapping from strings to integers.
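
A rough Java sketch of the idea (the method name and prime table are purely illustrative), using BigInteger since these values get enormous:

<code language="java">

import java.math.BigInteger;

// Prime-power encoding: multiply together prime[i] raised to the i-th character.
// Unique factorization means no two different strings produce the same value.
// The table would need extending (or generating) for longer strings.
static final int[] PRIMES = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29};

static BigInteger encode(String s) {
  BigInteger result = BigInteger.ONE;
  for (int i = 0; i < s.length(); i++) {
    result = result.multiply(BigInteger.valueOf(PRIMES[i]).pow(s.charAt(i)));
  }
  return result;
}

</code>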

Oh, you meant a *32-bit* integer? Well, you aren't going to be able to get *unique* integers then, there are more than 4294967296 strings...
Larry Lard
Thursday, March 23, 2006
 
 
Just try to enter the topic title "string hashing functions" or "string hash functions" (without "") into Google... ;)
Secure
Thursday, March 23, 2006
 
 
Sedgewick, "Algorithms in C," 3rd Edition, p.579
Jeff Phillips
Thursday, March 23, 2006
 
 
A CRC makes a reasonable hash for non-cryptographic purposes.  Faster than MD5.  VERY fast if memory accesses don't incur a penalty (e.g. on slower embedded systems). Slower than other methods that may produce less desirable results.
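
In Java, for instance, java.util.zip.CRC32 gives you this almost for free; a quick sketch (the helper name is just for illustration):

<code language="java">

import java.util.zip.CRC32;

// CRC-32 of a string via the standard library.
static long crc32Of(String s) {
  CRC32 crc = new CRC32();
  crc.update(s.getBytes());  // platform default charset; pick one explicitly if it matters
  return crc.getValue();     // the 32-bit checksum, returned in a long
}

</code>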

I often do something like:

// Simple multiplicative hash, seeded with the string's length:
unsigned int hv = s.size();
for (size_t i = 0; i < s.size(); i++)
  hv = hv * 37 + s[i];

This may be far from theoretically sound, but at least it's reasonably fast on modern systems.
David Jones
Thursday, March 23, 2006
 
 
If the strings are known in advance, then what you are looking for is a "perfect hash function".  As usual, Google is your friend.
CodeJunkie
Thursday, March 23, 2006
 
 
Incidentally... if you want to compare one hash implementation with another, here's some code:

<code language="java">

int[] hashes = generateHashes(myStrings);  // however you produce hash values for your sample strings
double[] vector = new double[32];
// Aggregate the values of the bits
for (int i = 0; i < hashes.length; i++) {
  int hash = hashes[i];
  for (int j = 0; j < 32; j++) {
    int thisBit = ((hash >> j) & 1);
    int magnitude = (thisBit == 0) ? -1 : 1;
    vector[j] += (double) magnitude;
  }
}
// Calculate the vector distance, using each bit's average
// value as a dimension into the vector.
double runningSum = 0.0;
for (int i = 0; i < 32; i++) {
  // In an ideal hash function, the dimensional
  // distance should be zero across all 32 bits.
  double dimensionDistance = Math.abs(vector[i] / hashes.length);
  runningSum += dimensionDistance;
}
// Determine the overall effectiveness of
// the hashing algorithm.
double vectorDistance = Math.sqrt(runningSum);
double maxDistance = Math.sqrt(32);
double wastePercentage = vectorDistance / maxDistance;
double hashQuality = 1.0d - wastePercentage;

</code>

Basically, this code treats each hash value as a vector into 32-dimensional space (with magnitudes normalized to a domain of +1 or -1).

If you average a large number of those vectors, a perfect hash should have an average magnitude of zero across all 32 dimensions, and a vectorDistance of zero.

The worst hash function ever (in which all strings hash to the same int value) would produce either a 1 or a -1 for the average bit value in every dimension, so the vector distance would be the square root of 32.

Using this information, it's easy to get an exact measurement of the amount of waste in your hashing algorithm. Dividing the vectorDistance by the maximum possible vector distance tells you what percentage of your hash bits are being wasted on redundancy.

Depending on your application--and the types of strings you typically see--a CRC might be the best algorithm. But maybe MD5 is better. Do you tend to see really short strings, or really long strings, or a mixture of the two? Which type of hashing algorithm produces better results based on YOUR dataset?

You won't know until you measure.

Have fun :)
BenjiSmith
Thursday, March 23, 2006
 
 
Thanks. I really want to avoid having to look up values in a table, because Oracle won't let me then use fast snapshot refreshes (why, I can't tell you, but it won't).

I was summing the products of the characters with the next prime number; it worked well, but required a lookup.
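
Roughly, the idea is something like this, in Java terms (names are illustrative; the PRIMES table is the lookup in question):

<code language="java">

// Each character is multiplied by the next prime and the products are summed.
static final int[] PRIMES = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37};

static int primeWeightedHash(String s) {
  int hv = 0;
  for (int i = 0; i < s.length(); i++) {
    hv += s.charAt(i) * PRIMES[i % PRIMES.length];
  }
  return hv;
}

</code>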
Steve Hirsch
Thursday, March 23, 2006
 
 
Thanks, found it. RTFM. In PL/SQL, it's

dbms_utility.get_hash_value

if you're interested.
Steve Hirsch
Thursday, March 23, 2006
 
 
Be aware that the algorithm behind dbms_utility.get_hash_value is not guaranteed to remain unchanged between versions, so storing values across an upgrade may be "an issue".

Look at DBMS_OBFUSCATION_TOOLKIT.DESEncrypt also.
David Aldridge
Friday, March 24, 2006
 
 
Thanks. Turned out that all of these functions prevent a materialized view from fast refreshing, much to my chagrin. Therefore, going to have to hand code the CDC logic. Uggh.

Almost got there. At least I'll be able to use a straightforward sequence to generate the surrogate keys, instead of some obtuse hashing function.
Steve Hirsch
Monday, March 27, 2006
 
 
Sorry, maybe I'm being a bit obtuse here, but there has been plenty of work put into the development of the cryptographic hash functions. Unless you need to do trillions of these hashes at a time, why not use something from the SHA family? (Even MD5 would be better than trying to roll your own):

MD5 - broken, but fine for non-cryptographic purposes
SHA-1 - broken, but still standard (128-bit)
SHA-256 - better (256-bit)
SHA-512 - even better (512-bit)

I wouldn't suggest using SHA-192 or SHA-384, since they are just truncated versions of the SHA-256 and SHA-512 algorithms respectively, and I am not aware of any research done to assure me that such a truncation doesn't significantly weaken the algorithm.

The above functions should be available on just about any platform/language you can find, and many products have them built in.
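
For example, Java's standard MessageDigest API does it in a few lines. A quick sketch (the helper name is illustrative) that folds the first four bytes of the digest into an int, if an int is really what's needed:

<code language="java">

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// SHA-256 of a string, folded down to a 32-bit value.
// (Folding away most of the digest obviously discards its collision resistance.)
static int sha256ToInt(String s) throws NoSuchAlgorithmException {
  MessageDigest md = MessageDigest.getInstance("SHA-256");
  byte[] digest = md.digest(s.getBytes());  // platform default charset; pick one explicitly if it matters
  int hv = 0;
  for (int i = 0; i < 4; i++) {
    hv = (hv << 8) | (digest[i] & 0xFF);
  }
  return hv;
}

</code>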
Michael Scovetta
Friday, March 31, 2006
 
 
"SHA-1 - broken, but still standard (128-bit)"

160 bit.

"I wouldn't suggest using SHA-192 SHA-384, since they are just truncated versions of the SHA-256 and SHA-512 algorithms respectively, and I am not aware of any research done to assure me that such a truncation doesn't significantly weaken the algorithm."

The truncation is done on the final result, the calculation is exactly the same, beside the different initialization vectors. At least for SHA-224, I haven't looked at SHA-192 for now. It DOES significantly weaken the algorithm, of course... from 256 down to 224 and from 512 down to 384 bits. ;)

Additionally, note that SHA-512 uses the same algorithm as SHA-256, just optimized for 64 bits instead of 32, with adjustments to the constants and rotations.
Secure
Friday, March 31, 2006
 
 

This topic is archived. No further replies will be accepted.

 