1. Concepts & Definitions
1.1. Defining statistical test of hypothesis
1.2. Numerical example of test of hypothesis for mean
1.3. Code for test of hypothesis for mean
1.4. Code for right tailed test of hypothesis for mean
1.5. Code for left tailed test of hypothesis for mean
1.6. Code for small sample hypothesis for mean
1.7. P-Value and test of hypothesis
1.8. Statistical power and power analysis
1.9. Shapiro-Wilk for normality test
2. Problem & Solution
2.1. Shapiro-Wilk to verify CLT Simulator
Load the notebook with the commands developed in Track 06, step 2.1 (click on the link):
https://colab.research.google.com/drive/1Xo-2dWDgL-gmDJH3QmB6b4YMlntgQqtu?usp=sharing
Remember the graph obtained in the previous section:
Now, instead of filtering for values under 100000, let's filter for values under 40000:
filter = df1['weight_kg'] < 40000
df1.loc[filter]
The following will appear:
The next code helps to draw the new distribution related to the filtered data frame:
weight = df1.loc[filter]['weight_kg']
weight.hist()
The next code computes the mean and standard deviation for all the filtered data; it will be referred to as a population, since it includes all the available data.
import numpy as np
pop_mean_weight = sum(weight)/len(weight)
#calculate standard deviation of list
pop_std_weight = np.std(list(weight))
print(pop_mean_weight)
print(pop_std_weight)
16105.766274318656
9747.901629742
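As a side note, `np.std` defaults to the population standard deviation (`ddof=0`, dividing by N), which is why it is appropriate here where the filtered data is treated as the whole population. A minimal sketch of the contrast with the sample standard deviation (`ddof=1`), using hypothetical data:

```python
import numpy as np

data = [10.0, 12.0, 23.0, 23.0, 16.0, 23.0, 21.0, 16.0]

# ddof=0 (the default): divide by N -> population standard deviation
pop_std = np.std(data)

# ddof=1: divide by N-1 (Bessel's correction) -> sample standard deviation
sample_std = np.std(data, ddof=1)

print(pop_std)     # population SD
print(sample_std)  # always slightly larger than the population SD
```

If the data were only a sample of a larger population, `ddof=1` would be the usual choice.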
The next code takes the filtered weight data and draws 30 samples, each with 100 randomly chosen values, storing the mean of each sample.
sample_size = 100
number_samples = 30
list_means_samples = []
for s in range(0, number_samples):
    # random_state works like an initial seed, so each iteration
    # draws a different but reproducible random sample
    sample = list(weight.sample(n=sample_size, random_state=s+3))
    mean_sample = sum(sample)/len(sample)
    list_means_samples.append(mean_sample)
list_means_samples
[15947.769067000001, 15870.988500000001, 17208.357780000002, 16519.316757, 15275.26839, 16086.101976000004, 16479.651276999997, 15510.297471000003, 14861.697939999998, 17155.882599999997, 16775.280599999995, 17941.656986, 16126.398980000004, 16058.165267, 15960.026557, 16497.744357, 16601.3141, 14700.071201000002, 15717.66677, 15991.896503999998, 16637.84862, 15807.00809, 15787.055264, 16646.368070000004, 15422.104810000004, 16776.087270000004, 16518.539460000004, 16536.24759, 17053.044540000003, 16993.828380000003]
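Note that the sample means cluster around the population mean of about 16105. The CLT predicts exactly this: the mean of the sample means approximates the population mean, and their spread shrinks to roughly σ/√n. A minimal, self-contained sketch of that check, using a synthetic skewed population in place of the weight data (the names and the exponential distribution are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic skewed "population" standing in for the filtered weights
population = rng.exponential(scale=16000, size=50_000)
pop_mean = population.mean()
pop_std = population.std()

sample_size = 100     # observations per sample, as in the notebook
number_samples = 30   # how many sample means to collect

means = [rng.choice(population, size=sample_size).mean()
         for _ in range(number_samples)]

# CLT: mean of sample means ~ population mean,
# std of sample means ~ pop_std / sqrt(sample_size)
print(np.mean(means), pop_mean)
print(np.std(means), pop_std / np.sqrt(sample_size))
```

Even though the population is strongly skewed, the distribution of the sample means comes out approximately normal, which is what the Shapiro-Wilk test below verifies.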
The next code draws the corresponding distribution for the sampling mean.
from matplotlib import pyplot as plt
plt.hist(list_means_samples, 8)
plt.show()
The next code employs the Shapiro-Wilk test to verify if the sampling distribution follows a normal distribution.
from scipy.stats import shapiro
stat,p = shapiro(list_means_samples)
print("The Test-Statistic and p-value are as follows:\nTest-Statistic = %.3f , p-value = %.3f"%(stat,p))
The Test-Statistic and p-value are as follows:
Test-Statistic = 0.983 , p-value = 0.902
Recall the rule for interpreting the p-value:
High p-values: your sample results are consistent with a true null hypothesis.
Low p-values: your sample results are not consistent with the null hypothesis.
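As a minimal sketch, this rule compares the p-value with a chosen significance level α (the 0.05 here is an assumed, conventional choice, and the helper function is purely illustrative):

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a given p-value and significance level."""
    if p_value < alpha:
        return "Reject H0: results are not consistent with the null hypothesis"
    return "Fail to reject H0: results are consistent with the null hypothesis"

print(decide(0.902))  # high p-value, as in the Shapiro-Wilk result above
print(decide(0.001))  # low p-value
```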
Also, remember that the Shapiro-Wilk test is used to assess whether a random sample of data comes from a normal distribution, which is a common assumption in many statistical tests [1]. This means the following hypotheses will be formulated [2]:
Ho = The sample comes from a normal distribution.
Ha = The sample does not come from a normal distribution.
So, since the p-value (0.902) is high, we fail to reject Ho, and it can be concluded that the sampling distribution follows a normal distribution.
The complete code is available in the following link:
https://colab.research.google.com/drive/1x59Luhf0j8H1l4BjcqzdmiclSsz0nZrz?usp=sharing