Discussion

(Note, this was written for MurmurHash 1.0 and is only loosely applicable to 2.0)

I'm going to put random brain dump stuff here.

The constants for MurmurHash were found by searching for parameters that fit the following conditions -

1. The mixing step

x *= m;
x ^= x >> r1;

should achieve nearly complete avalanche after two iterations.

2. The mixing step

x *= m;
x ^= x >> r1
x *= m;
x ^= x >> r2
x *= m;
x ^= x >> r3

should achieve nearly complete avalanche.

3. The distribution of the hash as a whole should produce a mimimal chi-square value on both easy (dictionary) and hard (sparse) keysets.

A bit of testing indicated that r1 pretty much had to be 16. Once I'd found a constant that passed condition 1, exhaustive search determined that r2 and r3 were best set at 10 and 17. From there I searched iteratively - find a value of m that improves on condition 3, test it against conditions 1 and 2 (searching for new r2 and r3 values), keep the new m if it's acceptable, rinse and repeat.

The current value of m (0xc6a4a793) produces an avalanche bias of 0.15% for condition 2 - I've found constants that produce values as low as 0.09%, but they don't fare as well on the chi-square test.

Andres Valloud mentioned that the avalanche condition isn't sufficient for proving a hash is random, and he's clearly right -

unsigned int PassesAvalancheButIsAwful ( const void * blob, int len )
{
return (MD5(blob,len) & 1) ? 0xFFFFFFFF : 0;
}

will pass the avalanche test even though it can only produce two possible values (it of course would fail the chi-square test).

Similarly, for a given hash table size you can create a hash function that passes the chi-square test with flying colors but fails the avalanche test -

unsigned int PassesChiSquaredButIsAwful ( const void * blob, int len )
{
return MD5(blob,len) % table_size;
}

I suspect you can construct one that would pass for all table sizes but still fail avalance, but I'm not certain how to go about that.

Anyhow, it seems that the two tests together do a good job of weeding out poor hash functions - chi-square to catch "random" but non-uniform distributions, avalanche to catch good distributions but poor mixing.

You could also throw the bit independence critera (BIC) into the mix, which is similar to avalance but adds an extra dimension - for each 1-bit input differential, compute the output differential and see how often each possible pair of output bits flip - the values 00, 01, 10, and 11 should appear equally.

I'm not certain how applicable this is to hash functions though - Murmur actually has some significant weaknesses in the BIC test, yet they don't appear to affect the quality of its output. Flipping the final mix shift values from (10,17) to (17,10) seems to fix this, at a cost of a few slightly biased (2%) bits in the avalanche result. Doing so doesn't improve any of the actual test results though, so I don't think it's worth worrying about.