Here we extend the concepts presented in the earlier article on the binomial distribution. The hypergeometric distribution characterizes the process of selecting a subset of items without replacement. An outcome is defined by the number of successes (picks of the desired type) within that subset. It requires defining a set of three parameters:
N: population size
K: number of items of the desired type in the population (successes)
n: number of trials per process
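For reference, these three parameters plug into the standard hypergeometric probability mass function. The symbol k, the number of successes observed in the sample, is introduced here for illustration and does not appear in the parameter list above.

```latex
P(X = k) = \frac{\binom{K}{k}\,\binom{N-K}{n-k}}{\binom{N}{n}},
\qquad \max\bigl(0,\; n-(N-K)\bigr) \le k \le \min(n,\, K)
```

The numerator counts the ways to pick k items of the desired type and n - k of the other type; the denominator counts all possible samples of size n.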
Continuing with the bag/balls example, the total number of balls in the bag is the population size, N = 10. The number of red balls, the desired type, sets K = 4. Unlike the binomial distribution, this distribution does not replace items between trials, so the value of n can also be called the "sample" size. The above plot shows the results of 1000 processes where the sample size is n = 3. As such, randomly choosing three items without replacement most often yields exactly one success per three picks; the exact probability of that outcome is 50%.
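A simulation of this kind can be sketched in a few lines. The following is a minimal illustration using only Python's standard library, not the code used to produce the plot; the function name simulate_hypergeometric and the fixed seed are assumptions made here for reproducibility.

```python
import random
from collections import Counter

def simulate_hypergeometric(N, K, n, processes=1000, seed=0):
    """Tally the number of successes over repeated samples without replacement."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatable runs
    # Label the K desired items 1 and the remaining N - K items 0.
    population = [1] * K + [0] * (N - K)
    outcomes = Counter()
    for _ in range(processes):
        sample = rng.sample(population, n)  # draws n items without replacement
        outcomes[sum(sample)] += 1
    return outcomes

# Bag of 10 balls, 4 of them red, 3 picks per process.
counts = simulate_hypergeometric(N=10, K=4, n=3)
for successes in sorted(counts):
    print(successes, counts[successes])
```

Each printed row is an outcome (number of red balls in the sample) followed by how many of the 1000 processes produced it.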
To illustrate how the parameters affect the distribution, now assume that the system being evaluated has 100 items. A sample of 6% of the population (n = 6) is taken. The following plots illustrate the effect of changing the number of items of the desired type, K.
With a larger value of K, we see that it is more likely to randomly select multiple successes within the sample of six items.
A lower proportion of successes, setting K = 6, results in finding significantly fewer successes within the sample of six items.
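The same comparison can be made with exact probabilities rather than simulation. This is a rough sketch: the larger K used for the first plot is not stated in the text, so K = 30 below is an assumed placeholder, and pmf_table is a helper name invented here.

```python
from math import comb

def pmf_table(N, K, n):
    """Exact probabilities of 0..n successes in n draws without replacement."""
    total = comb(N, n)
    # comb(K, k) is 0 when k > K, so impossible outcomes get probability 0.
    return [comb(K, k) * comb(N - K, n - k) / total for k in range(n + 1)]

N, n = 100, 6
for K in (30, 6):  # K = 30 is an assumed "larger" value; K = 6 is from the text
    probabilities = pmf_table(N, K, n)
    print(f"K = {K}:", [round(p, 3) for p in probabilities])
```

With the smaller K, most of the probability mass should sit on zero successes, which matches the shift described above.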
This distribution is most applicable to quality testing. Here we explore a couple of use cases, such as ...
Example A
After serving as CFO of your previous company, you are now an independent businessperson developing an industrial-strength vacuum. One of your assembly lines has been identified as the root cause of pre-warranty failures. Your current quality check inspects 1% of each batch of 1485 sub-components. Historical company data from full-batch tests shows that an improperly seated primary sucking engine results in a pre-warranty failure rate of about 3%. Is a bad batch likely to be discovered given the current quality checks?
You define a hypergeometric distribution where:
N = 1485
K = 1485 x 0.03 = 44.55, rounded down to 44
n = 1485 x 0.01 = 14.85, rounded down to 14
You plot the outcomes of 1000 processes.
You determine that a bad batch is unlikely to be identified: in approximately 65% of the simulated processes, zero improperly seated sucking engines were found in the sample.
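The simulated figure can be cross-checked against the exact hypergeometric probability of drawing no faulty units. This is a minimal sketch using the standard library; the variable names p_zero and p_detect are chosen here for illustration.

```python
from math import comb

N, K, n = 1485, 44, 14  # batch size, faulty units, inspected units

# P(X = 0): every one of the n inspected units comes from the N - K good ones.
p_zero = comb(N - K, n) / comb(N, n)
p_detect = 1 - p_zero  # at least one faulty unit appears in the sample

print(f"P(no faulty units in sample) = {p_zero:.3f}")
print(f"P(at least one faulty unit)  = {p_detect:.3f}")
```

The exact value should land close to the roughly 65% seen across the 1000 simulated processes, with the small difference attributable to sampling noise.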
Example B
You want to modify your quality check process to improve the likelihood that the improperly seated primary sucking engine error is caught. What modification should be made?
You define and plot two more hypergeometric distributions of 1000 processes, altering the sample size, n.
N = 1485, K = 1485 x 0.03 ≈ 44, n = 32 (about 2.2% of 1485)
You step the inspection rate down and find that the break-even sample size is about 2.2% of the batch (n = 32). At this level, identifying zero errors and identifying exactly one error per sample are about equally likely. Thus, the rate should be larger than this. (The current level of n = 14 is not at all sufficient.)
N = 1485, K = 1485 x 0.03 ≈ 44, n = 1485 x 0.03 ≈ 44
Increasing the inspection rate to match the historical fault rate of 3% (n = 44) makes it more likely than not that a sample contains one or more errors. This would improve the chance of identifying a bad batch. However, there remains a large probability of identifying zero errors in a sample. (That is not good.)
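To compare the three sample sizes side by side without relying on simulation, the exact probabilities can be computed directly. This is a sketch under the parameters used above; hypergeom_pmf is a helper name introduced here, not part of any library.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """Exact probability of k successes in n draws without replacement."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

N, K = 1485, 44
for n in (14, 32, 44):  # current check, break-even point, 3% inspection
    p0 = hypergeom_pmf(0, N, K, n)
    p1 = hypergeom_pmf(1, N, K, n)
    print(f"n = {n}: P(0 errors) = {p0:.3f}, "
          f"P(1 error) = {p1:.3f}, P(1 or more) = {1 - p0:.3f}")
```

The row for n = 32 is where P(0 errors) and P(1 error) should nearly coincide, and the row for n = 44 shows how much probability of a clean-looking sample still remains.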
When appropriately applied to systems models, the hypergeometric distribution can provide powerful insights. In most cases, though, it is not possible to know the exact error rate as it was provided in the examples. Determining the value of K requires specific system knowledge. With a process expert/owner, K can be estimated. Then the model can be manipulated, varying the estimated value of K and the sample size, n, to meet your business's objectives.
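One way to carry out that kind of exploration is to fix a target detection probability and search for the smallest sample size that reaches it for each candidate K. The sketch below is illustrative only: the 90% target and the range of K estimates are assumptions, not values from the examples, and both function names are invented here.

```python
from math import comb

def detection_probability(N, K, n):
    """P(at least one faulty item in a sample of n drawn without replacement)."""
    return 1 - comb(N - K, n) / comb(N, n)

def smallest_sample_size(N, K, target):
    """Smallest n whose detection probability reaches target, or None."""
    for n in range(1, N + 1):
        if detection_probability(N, K, n) >= target:
            return n
    return None  # e.g. K = 0: there is nothing to detect

N = 1485
target = 0.90  # assumed business objective, not from the article
for K in (30, 44, 60):  # assumed spread of expert estimates around K = 44
    n_needed = smallest_sample_size(N, K, target)
    print(f"K = {K}: inspect at least {n_needed} items "
          f"({100 * n_needed / N:.1f}% of the batch)")
```

A linear search is used because it is simple and the batch is small; the required sample size falls as the estimated K rises, since a more heavily contaminated batch is easier to catch.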
Using the hypergeometric distribution can aid in modelling your business processes, allowing for better data-driven decisions.
YHWH, though we may be part of a bad batch, we thank you for forgetting that and giving us your goodness.