1.4. More results on CLT

The following code illustrates what the Central Limit Theorem (CLT) leads us to expect when the mean of the sample means is compared with the mean of the population.

import random

import numpy as np

random.seed(12)  # set the initial value (seed) of the random generator

pop_data = range(100)  # assume a population with 100 values
samples_means = []  # store the mean of each sample

for i in range(10):  # create 10 samples, each drawn without replacement from the 100 population values
    sample_data = random.sample(pop_data, k=10)  # each sample extracts 10 values
    samples_means.append(np.mean(sample_data))  # compute the sample mean and store it

mean_samples_means = np.mean(samples_means)  # mean of the sample means
mean_pop = np.mean(pop_data)  # mean of the population

print('Mean of sample means = ',mean_samples_means)
print('Mean of population = ',mean_pop)

Mean of sample means = 48.489999999999995

Mean of population = 49.5
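Note that random.sample draws each sample without replacement. If sampling with replacement is wanted instead, a minimal variant could use random.choices from the standard library. This is only a sketch under that assumption; the name mean_samples_means_wr is introduced here for illustration and is not part of the original code.

```python
import random

import numpy as np

random.seed(12)  # same seed as above; the draws differ because the sampler differs

pop_data = range(100)  # same population of 100 values
samples_means = []

for _ in range(10):
    # random.choices samples WITH replacement, so a value may repeat within a sample
    sample_data = random.choices(pop_data, k=10)
    samples_means.append(np.mean(sample_data))

mean_samples_means_wr = np.mean(samples_means)  # hypothetical name for this variant
print('Mean of sample means (with replacement) = ', mean_samples_means_wr)
print('Mean of population = ', np.mean(pop_data))
```

With replacement, the values inside one sample are independent draws, which is the setting assumed by the classic i.i.d. form of the CLT.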

As the output of the first example shows (48.49 versus 49.5), the mean of the sample means is a good approximation to the population mean. The next code investigates how the quality of this approximation changes with the number of samples and the sample size.

import random

import numpy as np


def createPop(pop_size):
    # population made of the integers 0, 1, ..., pop_size - 1
    pop_data = range(pop_size)
    return pop_data


def createSampleMean(pop_data, n_samples, sample_size):
    # draw n_samples samples (each without replacement) and return the mean of their means
    samples_means = []
    for i in range(n_samples):
        sample_data = random.sample(pop_data, k=sample_size)
        samples_means.append(np.mean(sample_data))
    mean_samples_means = np.mean(samples_means)
    return mean_samples_means


pop_size = 1000
number_samples = [1, 5, 10, 20, 30]
sample_size = [0.1, 0.15, 0.20]  # sample sizes as fractions of the population size
sample_size = [int(item*pop_size) for item in sample_size]  # converted to counts: 100, 150, 200

random.seed(12)
pop_data = createPop(pop_size)

for n_samples in number_samples:
    print('Number of samples = ' + str(n_samples))
    print('--------------------------------------------')
    for s_size in sample_size:
        mean_samples_means = createSampleMean(pop_data, n_samples, s_size)
        mean_pop = np.mean(pop_data)
        print('Sample size = ' + str(s_size))
        print('Mean of sample means = ',mean_samples_means)
        print('Mean of population = ',mean_pop)
    print('--------------------------------------------')

Number of samples = 1

--------------------------------------------

Sample size = 100

Mean of sample means = 484.64

Mean of population = 499.5

Sample size = 150

Mean of sample means = 465.0733333333333

Mean of population = 499.5

Sample size = 200

Mean of sample means = 502.14

Mean of population = 499.5

--------------------------------------------

Number of samples = 5

--------------------------------------------

Sample size = 100

Mean of sample means = 500.434

Mean of population = 499.5

Sample size = 150

Mean of sample means = 486.8293333333333

Mean of population = 499.5

Sample size = 200

Mean of sample means = 507.47799999999995

Mean of population = 499.5

--------------------------------------------

Number of samples = 10

--------------------------------------------

Sample size = 100

Mean of sample means = 487.133

Mean of population = 499.5

Sample size = 150

Mean of sample means = 498.32

Mean of population = 499.5

Sample size = 200

Mean of sample means = 499.7755

Mean of population = 499.5

--------------------------------------------

Number of samples = 20

--------------------------------------------

Sample size = 100

Mean of sample means = 500.1955

Mean of population = 499.5

Sample size = 150

Mean of sample means = 506.05600000000004

Mean of population = 499.5

Sample size = 200

Mean of sample means = 499.24699999999996

Mean of population = 499.5

--------------------------------------------

Number of samples = 30

--------------------------------------------

Sample size = 100

Mean of sample means = 497.74600000000004

Mean of population = 499.5

Sample size = 150

Mean of sample means = 498.0259999999999

Mean of population = 499.5

Sample size = 200

Mean of sample means = 499.26233333333323

Mean of population = 499.5

--------------------------------------------

The complete code above is available at the following link:

https://colab.research.google.com/drive/1oqvEzWoshQngyh8dpv6qHkpkm0gFYRcI?usp=sharing

In these runs, increasing the sample size matters more than increasing the number of samples once the number of samples is 10 or greater. When the number of samples is 1, it is important to increase the sample size to obtain a more precise estimate of the population mean.
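Because each figure in the output comes from a single random draw, the comparison is noisy. One way to check the pattern is to repeat every configuration many times and average the absolute error. The following is a minimal sketch of that idea, not part of the linked notebook; the helper names mean_of_sample_means and average_abs_error and the parameter n_repeats are assumptions introduced here for illustration.

```python
import random

import numpy as np


def mean_of_sample_means(pop_data, n_samples, sample_size):
    """Mean of n_samples sample means, each sample drawn without replacement."""
    means = [np.mean(random.sample(pop_data, k=sample_size)) for _ in range(n_samples)]
    return float(np.mean(means))


def average_abs_error(pop_data, n_samples, sample_size, n_repeats=200):
    """Average |estimate - population mean| over n_repeats independent repetitions."""
    mean_pop = np.mean(pop_data)
    errors = [abs(mean_of_sample_means(pop_data, n_samples, sample_size) - mean_pop)
              for _ in range(n_repeats)]
    return float(np.mean(errors))


random.seed(12)
pop_data = range(1000)  # same population as in the experiment above

results = {}
for n_samples in [1, 5, 10, 20, 30]:
    for s_size in [100, 150, 200]:
        results[(n_samples, s_size)] = average_abs_error(pop_data, n_samples, s_size)
        print(f'samples = {n_samples:2d}, size = {s_size}: '
              f'average abs. error = {results[(n_samples, s_size)]:.2f}')
```

Comparing rows of this table (same sample size, more samples) against columns (same number of samples, larger size) gives a steadier picture than a single run.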

Another important point is that the CLT (at least in some of its various forms) tells us that, in the limit as n→∞, the distribution of a single standardized sample mean converges to a normal distribution (under some conditions).
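For reference, one common way to write this statement is the classical i.i.d. form below. The notation is an assumption made here and is not taken from the original text: $\bar{X}_n$ is the mean of $n$ observations, $\mu$ the population mean, $\sigma$ the population standard deviation, and $\xrightarrow{d}$ denotes convergence in distribution.

```latex
\[
Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}
\;\xrightarrow{d}\; \mathcal{N}(0, 1)
\quad \text{as } n \to \infty .
\]
```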

This means the CLT applies perfectly well to a single sample; it is not about "samples of samples" or anything like that. In the classic version, it says that the distribution of the sample mean of n iid random variables, each with mean μ and standard deviation σ, approaches a normal distribution as the sample size n gets bigger.
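A small simulation can make this concrete. The sketch below assumes iid draws from an Exponential(1) distribution, a skewed distribution chosen here only for illustration, and uses NumPy's seeded generator. The many repetitions serve only to reveal the distribution of one standardized sample mean for each n; the CLT itself concerns the growth of n.

```python
import numpy as np

rng = np.random.default_rng(12)  # seeded NumPy generator for reproducibility

mu, sigma = 1.0, 1.0   # mean and standard deviation of an Exponential(1) variable
n_repeats = 20000      # number of independent sample means simulated per sample size

lower_tail = {}
for n in [2, 10, 50, 200]:
    # each row is one sample of n iid exponential values
    samples = rng.exponential(scale=1.0, size=(n_repeats, n))
    z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))  # standardized sample means
    # under a standard normal, about 5% of values fall below -1.645
    lower_tail[n] = float(np.mean(z <= -1.645))
    print(f'n = {n:3d}: share of Z_n <= -1.645 is {lower_tail[n]:.3f}')
```

For small n the lower tail is far from the normal value of 0.05 because the exponential distribution is skewed; as n grows the share moves toward 0.05.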

For a more detailed discussion I would recommend:

[1] https://stats.stackexchange.com/questions/211499/why-does-the-central-limit-theorem-work-with-a-single-sample

[2] https://math.stackexchange.com/questions/4742029/does-the-central-limit-theorem-work-for-a-single-sample