The Law of Small Numbers

Random Numbers

Consider the following series of randomly selected 1's and 2's. Each selection was made with a random number generator that assigned a 0.5 probability (50%) to each outcome, 1 or 2 (just as one would expect with a fair coin flip).

Each line represents one "trial" and each trial has 20 repetitions. Five trials were performed.

11222111211121121121

11111112111212222212

22112222211122211121

22212212121121222211

21221121121222122222

The first trial suggests that a 1 occurs 13 out of 20 times. The second suggests that a 1 occurs 12 out of 20 times. It is not until all five trials (100 events) are combined that we begin to see statistics closer to the 50-50 frequency that we should expect. Specifically, we see:

1's = 49%

2's = 51%

We would have been fooled about the true nature of the event if we stopped at just one trial or even after two.
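The effect above is easy to reproduce. Here is a minimal Python sketch (the seed, trial sizes, and function name are my own choices, not from the original experiment) that runs a handful of small trials and one large one:

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

def trial(n):
    """Flip a fair 'coin' n times (1 or 2) and return the fraction of 1's."""
    flips = [random.choice((1, 2)) for _ in range(n)]
    return flips.count(1) / n

# Five small trials of 20 flips each: the observed frequency swings widely.
small = [trial(20) for _ in range(5)]
print("small trials:", [f"{p:.2f}" for p in small])

# One large trial of 100,000 flips: the frequency settles near 0.50.
large = trial(100_000)
print("large trial: ", f"{large:.3f}")
```

Each run of the small trials scatters well away from 0.50, while the large trial reliably lands close to it.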

Now consider what might happen if we were motivated to show that a 1 result is more likely than a 2. We could choose to only include the first trial in our report. However, we might be accused of choosing only the trials that support our desired result and not including all of the data ("cherry-picking"). Instead, we could decide to divide our five trials into smaller trials after-the-fact (post hoc).

For instance, the first trial could be divided as follows:

11222111211121121121

becomes

112 22111 2111 211 21121

Here, we present 5 smaller "trials". This appears to demonstrate that 5 out of 5 "trials" show that 1's are more common than 2's. Now let's do this with all of the five original trials:

112 22111 2111 211 21121 5 out of 5 favor 1

111 111 121 11212 222212 4 out of 5 favor 1

22 112 2222 111 22 211 121 4 out of 7 favor 1

222122 121 21121 2222 11 3 out of 5 favor 1

2122 112 11212 22122222 2 out of 4 favor 1

18 "trials" out of 26 favor 1 = 69%
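The splitting trick itself can even be automated. The sketch below is a hypothetical greedy splitter (its cuts differ from the hand-picked ones above, but the effect is the same): it closes each group as soon as 1's pull ahead, so nearly every resulting "trial" ends up favoring 1:

```python
def split_to_favor(seq, target="1"):
    """Greedily cut seq into chunks, closing a chunk as soon as the
    target digit outnumbers the other digit. Leftover digits that never
    reach a majority form one final chunk."""
    chunks, start = [], 0
    for i in range(1, len(seq) + 1):
        chunk = seq[start:i]
        if chunk.count(target) > len(chunk) - chunk.count(target):
            chunks.append(chunk)
            start = i
    if start < len(seq):
        chunks.append(seq[start:])
    return chunks

trials = [
    "11222111211121121121",
    "11111112111212222212",
    "22112222211122211121",
    "22212212121121222211",
    "21221121121222122222",
]
groups = [c for t in trials for c in split_to_favor(t)]
favor = sum(c.count("1") > c.count("2") for c in groups)
print(f"{favor} of {len(groups)} 'trials' favor 1")
```

By construction, every chunk the splitter closes favors 1; only the leftovers can disagree. The same data, carved differently, tells whatever story the carver wants.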

The Law of Small Numbers and Hasty Generalizations

In order to really appreciate the honest mistakes and dishonest shenanigans that result from the clustering illusion, we need to appreciate the law of small numbers.

The law of small numbers is the fuel of the "Hasty Generalization" logical fallacy. The law basically states that we cannot estimate the actual frequency of certain events if the sample size is too small. In fact, small sample sizes will likely lead to completely wrong impressions about event frequency due to random clustering. Making an inference about an event's frequency from too small a sample is a "hasty generalization".

For instance, in the above example, small series of random coin flips are not sufficient to demonstrate the true frequency of heads or tails (or in the above series, 1's and 2's). The first series of 20 would lead one to think that a 1 should occur 13 out of 20 times.
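In fact, we can compute exactly how often a fair 20-flip trial looks as lopsided as that first one. A short calculation with the exact binomial distribution (the variable names here are my own) shows that a 13-to-7 split or worse is not unusual at all:

```python
from math import comb

# Exact binomial probability that a fair 20-flip trial contains at least
# 13 ones, i.e. an imbalance at least as lopsided as the first trial.
n = 20
p_at_least_13 = sum(comb(n, k) for k in range(13, n + 1)) / 2**n
p_either_side = 2 * p_at_least_13  # 13+ ones OR 13+ twos

print(f"P(>=13 ones in 20 flips) = {p_at_least_13:.3f}")  # about 0.132
print(f"P(13-7 split either way) = {p_either_side:.3f}")  # about 0.263
```

Roughly one small trial in four will show a 13-7 split (or worse) in one direction or the other purely by chance.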

The Texas Sharpshooter Fallacy

The law of small numbers can be further exploited by dividing a small series into even smaller series post hoc, especially if one is motivated to demonstrate a false conclusion (in this case, we are trying to manipulate the data to show that 1's are more frequent than 2's). This is akin to the legendary "Texas sharpshooter" who shoots wildly at the side of a barn, finds 3 bullet holes that just happen to be close together, and then draws a target around them to show how accurate (and precise) his shooting really (falsely) was.

By dividing the above series of numbers (after-the-fact) into convenient smaller groups (or "trials"), we can then count the number of groups that seem to obey the rule we want to convey. In the above example, we see that 18 out of 26 "trials" agree that 1's are more likely than 2's. By this post-hoc manipulation, we are led to believe that any given trial in the future should see more 1's than 2's about 69% of the time.

The Law of Large Numbers states that for any series of trials of random events (like flipping a coin), the observed frequency of the events in question will not reliably approach the true probability until a large enough number of trials is performed. A study's "power" refers to whether its number of trials can reasonably be expected to give a reliable result. Studies looking at events that are expected to be rare need a much larger number of trials than those looking for events expected to be common.
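This convergence is easy to watch directly. The following sketch (the checkpoint sizes and seed are arbitrary choices of mine) tracks the running frequency of 1's as the flip count grows:

```python
import random

random.seed(7)  # fixed seed for a reproducible run

# Track the running frequency of 1's as the number of flips grows:
# the estimate wanders for small samples and tightens around 0.5.
ones = flips = 0
for target in (20, 100, 1_000, 10_000, 100_000):
    while flips < target:
        ones += random.choice((1, 2)) == 1
        flips += 1
    print(f"after {flips:>7,} flips: frequency of 1's = {ones / flips:.4f}")
```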

It is not hard to imagine such trickery being used in medicine or other fields to convey false impressions to the public. What if, instead of a 50-50 chance of a 1 or a 2, we were talking about a side effect of a drug? Or an opinion poll on the ethics of a public health policy?

The Clustering Illusion

We have seen how easy it is to manipulate raw data to convey a desired result. It is also easy for well-meaning researchers to see clusters in data that are not actually meaningful.

Consider the third of our trials above:

22112222211122211121

What if this represented the distribution of a common ailment across a geographic area? Let's pretend that a 1 means the person at that location has normal blood pressure and a 2 means the person there has high blood pressure (hypertension is not actually 50-50, but this is a thought experiment). We see a larger-than-expected cluster of hypertension in the bracketed group:

2211[22222]11122211121

We might be tempted to think that this area on the map has some risk factor that increases the odds of a condition. Why would 100% of this population have hypertension? We would then be tempted to look for potential causes in this area. Perhaps there is a factory nearby. Perhaps there are more bars in the area than others. If one truly thinks that a cluster is significant, one may be tempted to find some correlation that "fits" the data.

This "clustering illusion" is the result of the law of small numbers. Of course, a cluster of cases may represent a real correlation. However, the law of small numbers predicts that we should expect to see such clusters and that we should be skeptical of their significance. That is...unless it is our desire to deceive. Or perhaps it is our desire to prove a pre-existing belief without skepticism.
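How often should we expect such clusters? A quick Monte Carlo estimate (the simulation size and seed are arbitrary choices of mine) suggests that a run of five or more identical outcomes appears in roughly half of all 20-flip sequences, so a cluster like the one above is entirely unremarkable:

```python
import random

random.seed(3)  # fixed seed for a reproducible run

def has_run(seq, length=5):
    """True if seq contains a run of `length` identical values."""
    run = 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        if run >= length:
            return True
    return False

# Monte Carlo estimate: how often does a 20-flip sequence contain a
# run of five or more identical outcomes, purely by chance?
n_sims = 100_000
hits = 0
for _ in range(n_sims):
    seq = [random.choice((1, 2)) for _ in range(20)]
    hits += has_run(seq)

print(f"P(run of 5+ in 20 flips) ~ {hits / n_sims:.3f}")  # typically around 0.46
```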

Bob Carroll at the Skeptic's Dictionary puts it this way:

"Politicians, lawyers and some scientists tend to isolate clusters of diseases from their context, thereby giving the illusion of a causal connection between some environmental factor and the disease. What appears to be statistically significant (i.e., not due to chance) is actually expected by the laws of chance."

Conclusion

Richard Feynman is famous for saying that the trick is not to fool yourself, and that you are the easiest one to fool (or something like that). By not considering a large enough sample size, we will naturally misjudge the true frequency of events. By not recognizing the natural clustering of small random samples, we may mistake clusters of events as significant where no significance exists. More notoriously, it is possible to manipulate data to misrepresent the true frequency of events.

A good skeptic should be aware of the Law of Small Numbers, the Law of Large Numbers, the Texas Sharpshooter Fallacy and the Clustering Illusion. They are all parts of the same, common error in our everyday thinking.

References

http://en.wikipedia.org/wiki/Hasty_generalization

http://updates.pain-topics.org/2011/03/new-research-links-nsaids-to-erectile.html

http://www.crab.rutgers.edu/~mbravo/cluster.pdf

http://www.pbs.org/wgbh/pages/frontline/programs/transcripts/1319.html

http://skepdic.com/texas.html