1. Concepts & Definitions
1.2. Central Limit Theorem (CLT)
1.5. Confidence interval and normal distribution
1.6. Applying normal confidence interval
1.7. Normal versus Student's T distributions
1.8. Confidence interval and Student T distribution
1.9. Applying Student T confidence interval
1.10. Estimating sample size using normal distribution
1.11. Estimating sample size using Student T distribution
1.12. Estimating proportion using samples
2. Problem & Solution
2.1. Confidence interval for weight of HS6 code
The following development employs confidence interval estimation to compute the proportion of containers with issues. This proportion will then be used, via a hypergeometric distribution, to compute the probability of finding at least one container with an issue in an inspection batch of sample size n. Four main steps are carried out to attain this objective:
Create artificial data of a population of containers in which the proportion of ones with issues is known.
Extract a random sample from the population.
From the sample, create a confidence interval to estimate the proportion of products with some kind of issue.
Use the mean, lower, or upper bound of the confidence interval as the probability that a container has an issue. Then employ a hypergeometric distribution to obtain the probability of rejecting the entire lot, under the rule that the lot is rejected if at least one container with an issue is found.
The next code randomly produces 10000 containers labeled 'Ok' or 'Issue' in the proportions 99% and 1%, respectively:
import numpy as np
N = 10000
po = 0.99
pi = 1-po
np.random.seed(42)
port_seq = np.random.choice(['Ok','Issue'], p=[po, pi], size=N)
port_seq
array(['Ok', 'Ok', 'Ok', ..., 'Ok', 'Ok', 'Ok'], dtype='<U5')
The next code employs collections.Counter to count the occurrences of containers classified as 'Ok' or 'Issue':
import collections
containers_counter = collections.Counter(port_seq)
print(containers_counter)
Counter({'Ok': 9914, 'Issue': 86})
To extract only the counts, the following command can be employed:
containers_counter.values()
dict_values([9914, 86])
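As a side note, NumPy offers np.unique with return_counts=True as an alternative to collections.Counter. A minimal sketch, assuming the same population generated above with seed 42:

```python
import numpy as np

# Rebuild the population exactly as above (legacy seed 42).
np.random.seed(42)
port_seq = np.random.choice(['Ok', 'Issue'], p=[0.99, 0.01], size=10000)

# np.unique returns the distinct labels (sorted) and their counts.
labels, counts = np.unique(port_seq, return_counts=True)
counts_by_label = {str(l): int(c) for l, c in zip(labels, counts)}
print(counts_by_label)
```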
These counts can be used to compute the proportion of containers with and without an issue:
proportion_sample = list(containers_counter.values())
pok = proportion_sample[0]/N
pis = proportion_sample[1]/N
print('Proportion ok = ',pok)
print('Proportion issue = ',pis)
Proportion ok = 0.9914
Proportion issue = 0.0086
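Note that indexing list(containers_counter.values()) by position assumes 'Ok' happens to appear before 'Issue' in the sequence, since Counter preserves first-seen order. Looking the counts up by key avoids that assumption; a sketch using the same population:

```python
import collections
import numpy as np

np.random.seed(42)
N = 10000
port_seq = np.random.choice(['Ok', 'Issue'], p=[0.99, 0.01], size=N)

containers_counter = collections.Counter(port_seq)
# Key lookup works regardless of which label is seen first,
# and returns 0 for a label that never occurs.
pok = containers_counter['Ok'] / N
pis = containers_counter['Issue'] / N
print('Proportion ok = ', pok)
print('Proportion issue = ', pis)
```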
The next code extracts a sample of size n = 100 using the command random.sample:
import random
n = 100 # sample size
sample_seq = random.sample(list(port_seq), n)
sample_seq
['Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Issue', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Issue', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Issue', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok']
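random.sample draws without replacement. The same kind of draw can be sketched with NumPy's Generator API (the seed 0 below is an arbitrary choice, so the resulting sample differs from the one shown above):

```python
import numpy as np

np.random.seed(42)
port_seq = np.random.choice(['Ok', 'Issue'], p=[0.99, 0.01], size=10000)

# replace=False mirrors random.sample: no container is picked twice.
rng = np.random.default_rng(0)
sample_seq = list(rng.choice(port_seq, size=100, replace=False))
print(len(sample_seq))
```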
The next code computes the proportion of containers with and without issues:
containers_counter = collections.Counter(sample_seq)
proportion_sample = list(containers_counter.values())
pok = proportion_sample[0]/n
pis = proportion_sample[1]/n
print('Proportion ok = ',pok)
print('Proportion issue = ',pis)
Proportion ok = 0.97
Proportion issue = 0.03
With the sample data, it is possible to compute the confidence interval employing the developments and code explained in Track 07, sections 1.5 and 1.6:
from scipy.stats import norm
# Proportion of each event
pc = pok
qc = pis
# Confidence level and its related z value (z alpha/2).
p = 0.90
alfa = 1-p
pr = 1 - alfa/2
muz = 0
sigmaz = 1
z = norm.ppf(pr,muz,sigmaz)
sigmax = (pc*qc/n)**(0.5)
marg = z*sigmax
mux1 = pc - marg
mux2 = pc + marg
print("CI with ",p*100," % = [",round(mux1*100,2),"%, ",round(mux2*100,2),"%] ")
CI with 90.0 % = [ 94.19 %, 99.81 %]
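The same interval can be cross-checked with scipy's norm.interval, which returns the central interval covering the requested probability; the bounds for the issue proportion then follow from qc = 1 - pc. A sketch using the sample proportions above:

```python
from scipy.stats import norm

n = 100
pc, qc = 0.97, 0.03                # sample proportions from above
sigmax = (pc * qc / n) ** 0.5      # standard error of the proportion

# Central 90% interval for the 'Ok' proportion.
lo, hi = norm.interval(0.90, loc=pc, scale=sigmax)
print('CI ok    = [', round(lo * 100, 2), '%,', round(hi * 100, 2), '%]')

# The interval for the 'Issue' proportion mirrors it around 1.
print('CI issue = [', round((1 - hi) * 100, 2), '%,', round((1 - lo) * 100, 2), '%]')
```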
The following code computes the probability of finding at least one issue as a function of the sample size and plots the result. This code is based on the developments explained in Track 05, section 2.3. The proportion of containers with issues is the point estimate stored in the variable qc.
from scipy.stats import hypergeom
import matplotlib.pyplot as plt
import numpy as np
# computing P(X >= 1) = 1 - P(X = 0) for each sample size n.
N = 10000
K = int(N*qc)
print('K = ',K)
ns = list(range(0,100))
y = []
for n in ns:
    res = hypergeom(N, K, n).pmf(0)  # P(X = 0)
    y.append(1-res)
plt.plot(ns,y,'-r')
plt.xlabel('n (Sample Size)')
plt.ylabel('Probability find issue (P(X>=1))')
plt.grid()
plt.show()
One interesting observation is that P(X >= 1) grows non-linearly with the sample size: a sample size of n = 50 already corresponds to a probability of finding at least one issue of almost 80%, but increasing the sample size from 50 to 100 raises P(X >= 1) by less than 20 percentage points.
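Because the population N is much larger than the sample sizes considered, the hypergeometric probability is well approximated by the with-replacement expression 1 - (1 - qc)^n, which makes this non-linearity easy to see. A sketch comparing the two at n = 50 and n = 100, using the values from the code above:

```python
from scipy.stats import hypergeom

N, K = 10000, 300       # population size and containers with issues (qc = 0.03)
results = {}
for n in (50, 100):
    exact = 1 - hypergeom(N, K, n).pmf(0)    # hypergeometric P(X >= 1)
    approx = 1 - (1 - K / N) ** n            # with-replacement approximation
    results[n] = (exact, approx)
    print(n, round(exact, 3), round(approx, 3))
```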
The complete code above is available at the following link:
https://colab.research.google.com/drive/1uZM1ikUD2AGlAXO_wUn4aDw7nU7gnThA?usp=sharing