A friend of mine wrote me up on IM the other day, asking how to use a hash algorithm to generate short, unique identifiers out of a longer piece of information. I suggested that she ditch the hash idea and just generate random numbers. It's pretty much the same thing, after all, since she was going to be storing the lookup information in a database anyway. Her first (and fairly natural) reaction was "What about collisions?" After thinking about it for a second, she remembered that a hash isn't guaranteed to be unique - how can you come up with a one-to-one mapping from all possible four-page documents to a single 32-bit number? - so she had to have the code in place to check for duplicates anyway. But it did raise the question, "How likely is a collision to occur?" The answer turns out to be a variation of the Birthday Problem. You might have heard this one: in any group of 23 people, the odds of at least two people having the same birthday is about 50%, which is a lot higher than most people would expect. The relative likelihood of this sort of coincidence comes from the fairly large number of possible pairings: in order for there to be no shared birthday, each of the 506 possible pairs of people has to have a different birthday. So given that humans are terrible at doing probability in their heads, I wanted to make sure I could give my friend an analytic answer to whether or not she had to worry about collisions. After all, she had 500,000 documents to generate unique IDs for, and if collisions were going to be super common, it might make sense to think about doing it a different way. The basic formula for figuring out if there will ever be a collision between any two randomly selected items goes like this: P(n, m) = 1 - [n(n-1)(n-2)...(n-m+1)]/(n^m) where n is the number of possible choices (365 days for the birthday problem) and m is the number of selections we make (23 people, for example). In words, you could express this as "The odds of a collision at any time during m choices the same is one minus the odds no collisions at any time during m choices." The odds of there being no collisions is basically, "The odds that there will be no collision after one choice and no collisions after two choices and no collisions after three choices, etc. etc." Much easier to calculate that way - basically all we have to do is figure out the odds that every choice will be different from every other choice before it, and subtract that from one. Note the assumption I'm making here: that every choice is equally likely. If you're generating IDs, make sure you use a good random number generator - some of the ones out there I hear are truly awful. I personally use System.Security.Cryptography.RNGCryptoServiceProvider when I need cryptographically secure random number generation. Very simple code for doing this calculation is included at the end of this article. Run it from the command line specifying the number of possible choices first, then the number that we'll be choosing. My friend was generating a 8-character ID, where each character was one of 24 different choices letters/numbers, giving her a total of 24^8 or 110075314176 possible IDs. She had 500,000 documents that she wanted to generate IDs for. Running the program with these two arguments shows that there is a 67% chance of a collision ever - so in other words she should code for it. If she only had 10,000 documents, then the odds would be much lower, at around 0.045%. In this case she might just be able to ignore the problem altogether. Note my use of the Decimal type to allow for large numbers like 24^8. Using double should make it fairly accurate, but beware floating point rounding errors - for large numbers inaccuracies can creep in. I've found this code useful several times even since just last week. Perhaps you will too. using System; class CollisionCalculator { static void Main(string[] args) { Decimal possible = Decimal.Parse(args[0]); Decimal choose = Decimal.Parse(args[1]); double odds = CalculateOdds(possible, choose); Console.WriteLine("If there are {0} possibilities and you choose {1} times, the odds of a collision are {2}", possible, choose, odds); } static double CalculateOdds(Decimal possible, Decimal choose) { double odds = 1.0; for (Decimal n = 0; n < choose; ++n) { odds *= (double) (possible - n) / (double) possible; } return 1.0 - odds; } } ## AddendumKarsten, a math-saavy reader sent me the following comment: Just totally chanced into your page about how collisions between document IDs boil down to the birthday problem ... a jolly good approximation for the collison probability your program is calculating is 1- exp(- (#documents)^2/(2*#ID's)) eg. with 500,000 documents it's 1-exp(-500000^2/(2*24^8)) = 0.6788 To which I responded: That's one I didn't know - thanks! I'm looking at rearranging that page (amongst others) at some point - I'd love to include the approximation. Do you know of anywhere that has the proof so I can a) understand the derivation, and b) figure out what the error term is? And to which he responded with the following excellent explanation: ## The ExplanationAh, I'm afraid I don't have a reference - we did it in college yonks ago, and I was looking for somewhere to read up about the birthday problem myself when I found your page. The derivation runs like this though: Say I have N days prob(2 people have different b'days) = (N-1)/N (second person can have any day but the first one's) prob(3 people have different b'days) = (N-1)/N * (N-2)/N (IN ADDITION, third person can have any day but the first two) etc prob(k people have all different b'days) = (N-1)/N *(N-2)/N * ... * (N-(k-1))/N You'll find this documented frequently on the web. If you want the prob to be less than half and N=365 you can run a spreadsheet or spend 5 minutes on a calculator to find k>=23. To get a better handle on the long product, we need some approximations. This is slightly more ugly so you might not find it outside a math book. Apologies if this makes little sense! First divide out each term i.e. (N-1)/N = 1-1/N (this is still exact) to get prob(k all different) = (1- 1/N) * (1- 2/N) * ... * (1- (k-1)/N) Now we approximate (1-i/N) prob(k all different) = (1- 1/N) * (1- 1/N)^2 * ... * (1- 1/N)^(k-1) = (1- 1/N) ^ (1+2+...+(k-1)) = (1- 1/N) ^ (k*(k-1)/2) k-1 isn't going to be very different from k so this is basically ~=~ (1- 1/N) ^ ((k^2)/2) This is good enough already to solve the birthday problem in one go on a calculator - "just take logs". The only drawback is, raising a number extremely close to 1 to a large power (like N=24^8, k=500000) is going to be dangerous numerically. So, we can do one final approximation: you may have seen that (1+ x/N) formula: prob(k all different) = (1+ (-1)/N) ^ (N * (k^2)/(2*N)) ~= (exp(-k^2)) ^(2*N) = exp(-k^2/(2*N)) It's the end of a long work day so I can't guarantee I haven't made any typos. But I'm pretty sure the result is right. For example, you get an approximation to the general birthday problem: exp(-k^2/(2*N) < 0.5 <=> -k^2 /(2*N) < -log(2) <=> k > sqrt(2*log(2)*N) = 1.1774 * sqrt(N) which explains the approximation of 1.2* sqrt(N) which I found on a couple of webpages (look for "birthday 23 people" on google). Good luck with the error term!!! I think it's pretty small though... |

Craig's Writings > Odds & Ends >