Extra Activity 5
Use the dataset to apply the Central Limit Theorem (see the Google Colab link and detailed analysis below)
Here is my Google Colab notebook
(please click on the link; the Python code is given there)
Screenshot is attached above, analysis below
The population consists of 10,000 height measurements with:
Population Mean (μ) = 66.37 inches
Population Standard Deviation (σ) = 3.85 inches
As seen in the figure above, the population distribution is approximately bell-shaped and symmetric around the mean of 66.37 inches, ranging from roughly 55 to 80 inches. Because the population is already close to normal, the sampling distributions are expected to look normal even at small sample sizes, which is confirmed in the subsequent analyses.
Case I – Sampling Distribution of Sample Means
The theory is as below
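For reference, the standard results that both cases rely on can be written as follows. This is a sketch in assumed notation: X_1, ..., X_n are independent draws from a population with mean μ and standard deviation σ, and the symbols X̄_n (sample mean) and S_n (sample sum) are introduced here for illustration.

```latex
% Case I: sample means
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
E[\bar{X}_n] = \mu, \qquad
\mathrm{SE}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}, \qquad
\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0,1)

% Case II: sample sums
S_n = \sum_{i=1}^{n} X_i, \qquad
E[S_n] = n\mu, \qquad
\mathrm{SD}(S_n) = \sigma\sqrt{n}, \qquad
\frac{S_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0,1)
```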
Convergence of the Mean: Across all sample sizes, the empirical mean closely tracks the theoretical population mean of 66.37. The largest deviation is only 0.07 inches (at n = 5), which is consistent with the sample mean being an unbiased estimator of the population mean regardless of sample size.
Shrinking Standard Error: The empirical SE decreases as n increases (from 1.6769 at n = 5 down to 0.3943 at n = 100), closely matching the theoretical SE = σ/√n at every step (about 1.72 at n = 5 and about 0.385 at n = 100). This confirms the 1/√n relationship: quadrupling the sample size halves the standard error.
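The comparison between empirical and theoretical SE can be reproduced with a minimal sketch like the one below. It is not the Colab notebook's code: the population is a synthetic stand-in drawn from a normal distribution with the reported μ and σ, and the number of replications (`N_REPS`), the seed, and sampling with replacement are all assumptions. Only the Python standard library is used so the block runs anywhere.

```python
import math
import random
import statistics

MU, SIGMA = 66.37, 3.85            # population parameters reported above
SAMPLE_SIZES = [5, 10, 30, 50, 100]
N_REPS = 2000                      # assumption: replications per sample size


def simulate_sample_means(population, n, reps, rng):
    """Draw `reps` samples of size n (with replacement) and return their means."""
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(reps)]


rng = random.Random(42)
# Assumption: synthetic stand-in for the 10,000 height measurements.
population = [rng.gauss(MU, SIGMA) for _ in range(10_000)]
pop_mu = statistics.fmean(population)
pop_sigma = statistics.pstdev(population)

results = {}
for n in SAMPLE_SIZES:
    means = simulate_sample_means(population, n, N_REPS, rng)
    empirical_se = statistics.stdev(means)        # spread of the sample means
    theoretical_se = pop_sigma / math.sqrt(n)     # sigma / sqrt(n)
    results[n] = (statistics.fmean(means), empirical_se, theoretical_se)
    print(f"n={n:>3}  mean of means={results[n][0]:.4f}  "
          f"empirical SE={empirical_se:.4f}  theoretical SE={theoretical_se:.4f}")
```

Because the data here are simulated, the printed values will not match the figures quoted above exactly; the point is that the two SE columns stay close at every n.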
Shape Convergence (Figure below): At n = 5, the standardised histogram is already roughly bell-shaped but shows slight irregularity. By n = 10 and n = 30, the histogram aligns very closely with the N(0,1) curve (red). At n = 50 and n = 100, the fit is nearly perfect. The Shapiro-Wilk p-values are well above 0.05 for every n, so the test gives no evidence against normality even at n = 5 (p = 0.2299). This is attributable to the population already being near-normal.
Q-Q Plots (Figure below, top row): The Q-Q plots for n = 5, 30, and 100 all show points lying tightly along the 45° reference line, with only minor deviations in the tails at n = 5. By n = 100, the alignment is almost perfect, providing strong visual evidence of normality.
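The idea behind a Q-Q plot can be sketched numerically: sort the standardised sample means, pair them with the matching N(0,1) quantiles, and check how close the pairs lie to a straight line. The sketch below is illustrative only. The helper names `qq_points` and `pearson_r`, the plotting positions (i + 0.5)/m, the replication count, and drawing directly from a normal distribution with the reported μ and σ are all assumptions rather than the notebook's method.

```python
import math
import random
import statistics


def qq_points(values):
    """Return (theoretical N(0,1) quantiles, sorted observed values)."""
    std_normal = statistics.NormalDist()
    ordered = sorted(values)
    m = len(ordered)
    # Plotting positions (i + 0.5) / m stay strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i + 0.5) / m) for i in range(m)]
    return theoretical, ordered


def pearson_r(xs, ys):
    """Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


MU, SIGMA = 66.37, 3.85   # reported population parameters
n, reps = 30, 1000        # assumption: sample size and replications
rng = random.Random(0)

z_scores = []
for _ in range(reps):
    # Assumption: heights drawn from a synthetic N(MU, SIGMA) population.
    sample = [rng.gauss(MU, SIGMA) for _ in range(n)]
    sample_mean = statistics.fmean(sample)
    z_scores.append((sample_mean - MU) / (SIGMA / math.sqrt(n)))

theoretical, observed = qq_points(z_scores)
r = pearson_r(theoretical, observed)
print(f"Q-Q correlation at n={n}: {r:.4f}")
```

A correlation close to 1 corresponds to points lying along the reference line; deviations in the tails pull it down slightly.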
Case II – Sampling Distribution of Sample Sums
Convergence of the Sum Mean: The empirical sum mean closely matches the theoretical nμ at every n. For instance, at n = 100, the empirical sum mean is 6634.74 vs. the theoretical 6636.76 (a difference of about 0.03%), in line with the theoretical prediction.
Growing Sum Standard Deviation: Unlike Case I (where SE shrinks), the raw sum standard deviation grows with n following the σ√n rule. At n = 5, it is 8.62; at n = 100, it reaches 38.36, closely matching the theoretical values. After standardisation (subtracting nμ and dividing by σ√n), the distribution is approximately N(0,1) in all cases.
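The behaviour of the sums, and the effect of standardising them, can be illustrated with a short sketch. As before, this is not the notebook's code: the population is a synthetic normal stand-in, and the helper name `standardised_sums`, the seed, and the 2,000 replications are assumptions.

```python
import math
import random
import statistics

MU, SIGMA = 66.37, 3.85   # reported population parameters
N_REPS = 2000             # assumption: replications per sample size

rng = random.Random(7)
# Assumption: synthetic stand-in for the 10,000 height measurements.
population = [rng.gauss(MU, SIGMA) for _ in range(10_000)]
pop_mu = statistics.fmean(population)
pop_sigma = statistics.pstdev(population)


def standardised_sums(population, n, reps, rng, mu, sigma):
    """Return raw sample sums and their z-scores (S - n*mu) / (sigma*sqrt(n))."""
    sums = [sum(rng.choices(population, k=n)) for _ in range(reps)]
    z = [(s - n * mu) / (sigma * math.sqrt(n)) for s in sums]
    return sums, z


summary = {}
for n in (5, 10, 30, 50, 100):
    sums, z = standardised_sums(population, n, N_REPS, rng, pop_mu, pop_sigma)
    summary[n] = (statistics.fmean(sums), statistics.stdev(sums),
                  statistics.fmean(z), statistics.stdev(z))
    print(f"n={n:>3}  sum mean={summary[n][0]:.2f} (theory {n * pop_mu:.2f})  "
          f"sum SD={summary[n][1]:.2f} (theory {pop_sigma * math.sqrt(n):.2f})  "
          f"z mean={summary[n][2]:+.3f}  z SD={summary[n][3]:.3f}")
```

The raw sum SD grows with n, while the standardised values keep a mean near 0 and an SD near 1 at every sample size.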
Shape Convergence (Figure below): At n = 5 and n = 10, the standardised sum distributions show a slightly wider spread compared to the N(0,1) curve, with some asymmetry. By n = 30, the fit improves substantially (Shapiro-Wilk p = 0.1033, still > 0.05). At n = 50 and n = 100, the Shapiro-Wilk p-values rise to 0.87, giving no evidence against normality, and the histograms align tightly with the red N(0,1) curve.
Q-Q Plots (Figure above, bottom row): The Q-Q plots for Case II mirror Case I. At n = 5 there is a slight S-curve deviation in the tails, suggesting mild non-normality at the extremes. By n = 30 and n = 100, the points track the diagonal reference line almost perfectly, confirming convergence to normality.
The SE convergence plot for Case I shows that both empirical and theoretical SE follow the same decreasing curve. The two lines are nearly indistinguishable across all sample sizes, validating the theoretical formula σ/√n empirically. The rate of decrease is steep between n = 5 and n = 30, and flattens after n = 50, illustrating the law of diminishing returns in precision gained by increasing sample size.
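The diminishing returns can be seen directly from the formula σ/√n. The short calculation below uses only the reported σ = 3.85 (rounded, so the values may differ slightly from the notebook's table) and prints the theoretical SE at each sample size together with the drop from the previous size.

```python
import math

SIGMA = 3.85                       # reported population standard deviation
SAMPLE_SIZES = [5, 10, 30, 50, 100]

# Theoretical standard error sigma / sqrt(n) for each sample size.
theoretical_se = {n: SIGMA / math.sqrt(n) for n in SAMPLE_SIZES}

previous = None
for n in SAMPLE_SIZES:
    se = theoretical_se[n]
    drop = "" if previous is None else f"  (down {previous - se:.4f} from previous n)"
    print(f"n={n:>3}  SE={se:.4f}{drop}")
    previous = se
```

Going from n = 5 to n = 30 removes roughly one inch of standard error, whereas doubling from n = 50 to n = 100 removes well under a fifth of an inch.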
Both Case I (sample means) and Case II (sample sums) strongly support the Central Limit Theorem. Even at n = 5, the near-normal population ensures the sampling distribution is approximately normal. By n = 30, the approximation is excellent, and by n = 100, the standardised empirical distributions are statistically indistinguishable from the standard normal distribution across all tests applied.