1. Concepts & Definitions
1.2. Central Limit Theorem (CLT)
1.5. Confidence interval and normal distribution
1.6. Applying normal confidence interval
1.7. Normal versus Student's T distributions
1.8. Confidence interval and Student T distribution
1.9. Applying Student T confidence interval
1.10. Estimating sample size using normal distribution
1.11. Estimating sample size using Student T distribution
1.12. Estimating proportion using samples
2. Problem & Solution
2.1. Confidence interval for weight of HS6 code
The following code illustrates what the Central Limit Theorem leads us to expect when the mean of the sample means is compared with the mean of the population.
import random
import numpy as np

random.seed(12) # Setting the initial value for the random generator
pop_data=range(100) # assume a population with 100 values
samples_means=[] # store the mean of each sample
for i in range(10): # create 10 samples from the 100 pop values
    # random.sample draws the 10 values of each sample without replacement
    sample_data=random.sample(pop_data, k=10)
    samples_means.append(np.mean(sample_data)) # compute sample mean and store
mean_samples_means = np.mean(samples_means) # mean of the sample means
mean_pop = np.mean(pop_data) # mean of the population
print('Mean of sample means = ',mean_samples_means)
print('Mean of population = ',mean_pop)
Mean of sample means = 48.489999999999995
Mean of population = 49.5
As can be seen, the mean of the sample means is a good approximation of the population mean, as expected. The next code investigates how the quality of this approximation changes with the number of samples and the sample size.
import random
import numpy as np

def createPop(pop_size):
    pop_data = range(pop_size)
    return pop_data

def createSampleMean(pop_data, n_samples, sample_size):
    samples_means=[]
    for i in range(n_samples):
        sample_data=random.sample(pop_data, k=sample_size)
        samples_means.append(np.mean(sample_data))
    mean_samples_means = np.mean(samples_means)
    return mean_samples_means

pop_size = 1000
number_samples = [1, 5, 10, 20, 30]
sample_size = [0.1, 0.15, 0.20]
sample_size = [int(item*pop_size) for item in sample_size]
random.seed(12)
pop_data = createPop(pop_size)
for n_samples in number_samples:
    print('Number of samples = ' + str(n_samples))
    print('--------------------------------------------')
    for s_size in sample_size:
        mean_samples_means = createSampleMean(pop_data, n_samples, s_size)
        mean_pop = np.mean(pop_data)
        print('Sample size = ' + str(s_size))
        print('Mean of sample means = ',mean_samples_means)
        print('Mean of population = ',mean_pop)
    print('--------------------------------------------')
Number of samples = 1
--------------------------------------------
Sample size = 100
Mean of sample means = 484.64
Mean of population = 499.5
Sample size = 150
Mean of sample means = 465.0733333333333
Mean of population = 499.5
Sample size = 200
Mean of sample means = 502.14
Mean of population = 499.5
--------------------------------------------
Number of samples = 5
--------------------------------------------
Sample size = 100
Mean of sample means = 500.434
Mean of population = 499.5
Sample size = 150
Mean of sample means = 486.8293333333333
Mean of population = 499.5
Sample size = 200
Mean of sample means = 507.47799999999995
Mean of population = 499.5
--------------------------------------------
Number of samples = 10
--------------------------------------------
Sample size = 100
Mean of sample means = 487.133
Mean of population = 499.5
Sample size = 150
Mean of sample means = 498.32
Mean of population = 499.5
Sample size = 200
Mean of sample means = 499.7755
Mean of population = 499.5
--------------------------------------------
Number of samples = 20
--------------------------------------------
Sample size = 100
Mean of sample means = 500.1955
Mean of population = 499.5
Sample size = 150
Mean of sample means = 506.05600000000004
Mean of population = 499.5
Sample size = 200
Mean of sample means = 499.24699999999996
Mean of population = 499.5
--------------------------------------------
Number of samples = 30
--------------------------------------------
Sample size = 100
Mean of sample means = 497.74600000000004
Mean of population = 499.5
Sample size = 150
Mean of sample means = 498.0259999999999
Mean of population = 499.5
Sample size = 200
Mean of sample means = 499.26233333333323
Mean of population = 499.5
--------------------------------------------
The complete code above is available at the following link:
https://colab.research.google.com/drive/1oqvEzWoshQngyh8dpv6qHkpkm0gFYRcI?usp=sharing
The results suggest that, once the number of samples is 10 or greater, increasing the sample size matters more than increasing the number of samples further. When the number of samples is 1, it is important to increase the sample size to obtain a more precise estimate of the population mean.
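The output above comes from a single seed, so individual values are noisy. One way to check how robust the observation is would be to repeat the experiment many times and average the absolute error for each combination. The sketch below does that; the helper names (mean_of_sample_means, mean_abs_error) and the choice of 200 repetitions are assumptions made for illustration and are not part of the original notebook.

```python
import random
import numpy as np

def mean_of_sample_means(pop_data, n_samples, sample_size, rng):
    """Mean of the means of n_samples samples, each drawn without replacement."""
    means = [np.mean(rng.sample(pop_data, k=sample_size)) for _ in range(n_samples)]
    return float(np.mean(means))

def mean_abs_error(pop_data, n_samples, sample_size, n_repeats=200, seed=12):
    """Average |mean of sample means - population mean| over n_repeats repetitions."""
    rng = random.Random(seed)  # local generator, so the global seed is left untouched
    pop_mean = float(np.mean(pop_data))
    errors = [
        abs(mean_of_sample_means(pop_data, n_samples, sample_size, rng) - pop_mean)
        for _ in range(n_repeats)
    ]
    return float(np.mean(errors))

pop_data = range(1000)  # same population as in the experiment above
for n_samples in [1, 10, 30]:
    for sample_size in [100, 200]:
        err = mean_abs_error(pop_data, n_samples, sample_size)
        print(f"n_samples={n_samples:2d}  sample_size={sample_size}  mean abs error={err:.2f}")
```

Comparing rows with the same number of samples shows the effect of the sample size, and comparing rows with the same sample size shows the effect of the number of samples.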
Another important point is that the CLT (at least in some of its various forms) tells us that, in the limit as n→∞, the distribution of a single standardized sample mean converges to a normal distribution (under some conditions):
$$\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty$$
This means the CLT applies perfectly well to a single sample: it is not about "samples of samples" or anything like that. In its classic version, it says that the distribution of the sample mean of n iid random variables, each with mean μ and standard deviation σ, approaches a normal distribution as the sample size n gets bigger.
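The statement above can be checked numerically with a small simulation. The sketch below is only an illustration under assumed choices: it uses an Exponential(1) population (for which μ = σ = 1) because it is clearly non-normal, and the function names and the number of repetitions are hypothetical. For each sample size n it standardizes the mean of a single sample, repeats this many times to see the distribution, and reports the skewness and the fraction of values within ±1.96, which should approach 0 and 0.95 if the distribution approaches a standard normal.

```python
import numpy as np

def standardized_sample_means(n, n_repeats=20000, seed=0):
    """Standardized means sqrt(n) * (mean - mu) / sigma of samples of size n.

    Samples are iid draws from Exponential(1), so mu = sigma = 1 (assumed population).
    """
    rng = np.random.default_rng(seed)
    mu, sigma = 1.0, 1.0
    samples = rng.exponential(scale=1.0, size=(n_repeats, n))
    return np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma

def skewness(z):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    z = np.asarray(z)
    return float(np.mean((z - z.mean()) ** 3) / z.std() ** 3)

for n in [1, 5, 30, 200]:
    z = standardized_sample_means(n)
    coverage = float(np.mean(np.abs(z) <= 1.96))
    print(f"n={n:3d}  skewness={skewness(z):.3f}  P(|Z|<=1.96)={coverage:.3f}")
```

Each row describes the distribution of the mean of one sample of size n; no averaging across several samples is involved, which is the point made in the paragraph above.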
For a more detailed discussion I would recommend: