1. Concepts & Definitions
1.2. Central Limit Theorem (CLT)
1.5. Confidence interval and normal distribution
1.6. Applying normal confidence interval
1.7. Normal versus Student's T distributions
1.8. Confidence interval and Student T distribution
1.9. Applying Student T confidence interval
1.10. Estimating sample size using normal distribution
1.11. Estimating sample size using Student T distribution
1.12. Estimating proportion using samples
2. Problem & Solution
2.1. Confidence interval for weight of HS6 code
The following development employs confidence interval estimation to compute the proportion of containers with issues. This proportion will then be used, via a hypergeometric distribution, to compute the probability of finding at least one container with an issue in an inspection batch of sample size n. Four main steps are carried out to attain this objective:
Create artificial data of a population of containers in which the proportion of ones with issues is known.
Extract a random sample from the population.
From the sample, create a confidence interval to estimate the proportion of products with some kind of issue.
Use the mean, lower, or upper bound of the confidence interval as the probability that a container has an issue. Then employ a hypergeometric distribution to obtain the probability of rejecting the entire lot, under the rule that the lot is rejected if at least one container with an issue is found.
The next code randomly produces 10000 containers labeled 'Ok' or 'Issue' in the proportions 99% and 1%, respectively:
import numpy as np
N = 10000
po = 0.99
pi = 1-po
np.random.seed(42)
port_seq = np.random.choice(['Ok','Issue'], p=[po, pi], size=N)
port_seq
array(['Ok', 'Ok', 'Ok', ..., 'Ok', 'Ok', 'Ok'], dtype='<U5')
The next code employs collections.Counter to count the occurrences of containers classified as 'Ok' or 'Issue':
import collections
containers_counter = collections.Counter(port_seq)
print(containers_counter)
Counter({'Ok': 9914, 'Issue': 86})
To extract only the counts, the following command can be employed:
containers_counter.values()
dict_values([9914, 86])
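As a side note, NumPy offers np.unique with return_counts=True as an alternative to collections.Counter. A minimal sketch, assuming the same population generated above with seed 42:

```python
import numpy as np

# Rebuild the population exactly as above (legacy seed 42).
np.random.seed(42)
port_seq = np.random.choice(['Ok', 'Issue'], p=[0.99, 0.01], size=10000)

# np.unique returns the distinct labels (sorted) and their counts.
labels, counts = np.unique(port_seq, return_counts=True)
counts_by_label = {str(l): int(c) for l, c in zip(labels, counts)}
print(counts_by_label)
```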
These counts can be used to compute the proportion of containers with and without an issue:
proportion_sample = list(containers_counter.values())
pok = proportion_sample[0]/N
pis = proportion_sample[1]/N
print('Proportion ok = ',pok)
print('Proportion issue = ',pis)
Proportion ok = 0.9914
Proportion issue = 0.0086
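Note that indexing list(containers_counter.values()) by position assumes 'Ok' happens to appear before 'Issue' in the sequence, since Counter preserves first-seen order. Looking the counts up by key avoids that assumption; a sketch using the same population:

```python
import collections
import numpy as np

np.random.seed(42)
N = 10000
port_seq = np.random.choice(['Ok', 'Issue'], p=[0.99, 0.01], size=N)

containers_counter = collections.Counter(port_seq)
# Key lookup works regardless of which label is seen first,
# and returns 0 for a label that never occurs.
pok = containers_counter['Ok'] / N
pis = containers_counter['Issue'] / N
print('Proportion ok = ', pok)
print('Proportion issue = ', pis)
```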
The next code extracts a sample of size n = 100 using the command random.sample:
import random
n = 100 # sample size
sample_seq = random.sample(list(port_seq), n)
sample_seq
['Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Issue', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Issue', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Issue', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok', 'Ok']
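random.sample draws without replacement. The same kind of draw can be sketched with NumPy's Generator API (the seed 0 below is an arbitrary choice, so the resulting sample differs from the one shown above):

```python
import numpy as np

np.random.seed(42)
port_seq = np.random.choice(['Ok', 'Issue'], p=[0.99, 0.01], size=10000)

# replace=False mirrors random.sample: no container is picked twice.
rng = np.random.default_rng(0)
sample_seq = list(rng.choice(port_seq, size=100, replace=False))
print(len(sample_seq))
```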
The next code computes the proportion of containers with and without issues:
containers_counter = collections.Counter(sample_seq)
proportion_sample = list(containers_counter.values())
pok = proportion_sample[0]/n
pis = proportion_sample[1]/n
print('Proportion ok = ',pok)
print('Proportion issue = ',pis)
Proportion ok = 0.97
Proportion issue = 0.03
With the sample data, it is possible to compute the confidence interval employing the developments and code explained in Track 07, sections 1.5 and 1.6:
from scipy.stats import norm
# Proportion of each event
pc = pok
qc = pis
# Confidence level and its related z value (z alpha/2).
p = 0.90
alfa = 1-p
pr = 1 - alfa/2
muz = 0
sigmaz = 1
z = norm.ppf(pr,muz,sigmaz)
sigmax = (pc*qc/n)**(0.5)
marg = z*sigmax
mux1 = pc - marg
mux2 = pc + marg
print("CI with ",p*100," % = [",round(mux1*100,2),"%, ",round(mux2*100,2),"%] ")
CI with 90.0 % = [ 94.19 %, 99.81 %]
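The same interval can be cross-checked with scipy's norm.interval, which returns the central interval covering the requested probability; the bounds for the issue proportion then follow from qc = 1 - pc. A sketch using the sample proportions above:

```python
from scipy.stats import norm

n = 100
pc, qc = 0.97, 0.03                # sample proportions from above
sigmax = (pc * qc / n) ** 0.5      # standard error of the proportion

# Central 90% interval for the 'Ok' proportion.
lo, hi = norm.interval(0.90, loc=pc, scale=sigmax)
print('CI ok    = [', round(lo * 100, 2), '%,', round(hi * 100, 2), '%]')

# The interval for the 'Issue' proportion mirrors it around 1.
print('CI issue = [', round((1 - hi) * 100, 2), '%,', round((1 - lo) * 100, 2), '%]')
```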
The following code computes the probability of finding at least one issue as a function of the sample size and plots the result. This code is based on the developments explained in Track 05, section 2.3. The proportion of containers with issues is the point estimate stored in the variable qc.
from scipy.stats import hypergeom
import matplotlib.pyplot as plt
import numpy as np
# computing P(X >= 1) = 1 - P(X = 0) for each sample size n.
N = 10000
K = int(N*qc)
print('K = ',K)
ns = list(range(0,100))
y = []
for n in ns:
    res = hypergeom(N, K, n).pmf(0)  # P(X = 0)
    y.append(1-res)
plt.plot(ns,y,'-r')
plt.xlabel('n (Sample Size)')
plt.ylabel('Probability find issue (P(X>=1))')
plt.grid()
plt.show()
One interesting observation is that P(X >= 1) grows non-linearly with the sample size: a sample size of n = 50 already corresponds to a probability of finding at least one issue of almost 80%, but increasing the sample size from 50 to 100 raises P(X >= 1) by less than 20 percentage points.
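Because the population N is much larger than the sample sizes considered, the hypergeometric probability is well approximated by the with-replacement expression 1 - (1 - qc)^n, which makes this non-linearity easy to see. A sketch comparing the two at n = 50 and n = 100, using the values from the code above:

```python
from scipy.stats import hypergeom

N, K = 10000, 300       # population size and containers with issues (qc = 0.03)
results = {}
for n in (50, 100):
    exact = 1 - hypergeom(N, K, n).pmf(0)    # hypergeometric P(X >= 1)
    approx = 1 - (1 - K / N) ** n            # with-replacement approximation
    results[n] = (exact, approx)
    print(n, round(exact, 3), round(approx, 3))
```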
The complete code above is available at the following link:
https://colab.research.google.com/drive/1uZM1ikUD2AGlAXO_wUn4aDw7nU7gnThA?usp=sharing