Extra Activity 5

Use the dataset to apply the Central Limit Theorem (see the Google Colab link and detailed analysis below)

Here is my Google Colab

(Please click on the link; the Google Colab Python code is given there.)

Screenshot is attached above, analysis below

Central Limit Theorem Applied to Weight Height Data: Analysis Report

Dataset: Weight-Height.csv | Variable: Height (inches) | N = 10,000

Population Statistics & Distribution

The population consists of 10,000 height measurements with:

Population Mean (μ) = 66.37 inches
Population Standard Deviation (σ) = 3.85 inches

As seen in the figure above, the population distribution is approximately bell-shaped and symmetric around the mean of 66.37 inches, ranging from roughly 55 to 80 inches. Because the population is already near-normal, the sampling distribution is expected to look normal even at small sample sizes, which the subsequent analyses bear out.

Case I – Sampling Distribution of Sample Means

The theory is as below
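The theory figure is not reproduced in this text. As a hedged stand-in, the standard statement of the CLT for sample means, which the observations below rely on, can be written as follows. The symbols \bar{X}_n and Z_n are notation introduced here, not taken from the original figure.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
E[\bar{X}_n] = \mu, \qquad
\mathrm{SE}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}, \qquad
Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0,1)
```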

Observations

Convergence of the Mean: Across all sample sizes, the empirical mean closely tracks the theoretical population mean of 66.37. The largest deviation is only 0.07 inches (at n = 5), which is consistent with the sample mean being an unbiased estimator of the population mean regardless of sample size.

Shrinking Standard Error: The empirical SE decreases as n increases (from 1.6769 at n = 5 down to 0.3943 at n = 100), almost perfectly matching the theoretical SE = σ/√n at every step. This confirms the 1/√n relationship: quadrupling the sample size halves the standard error.
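The comparison of empirical and theoretical SE can be sketched roughly as below. This is a minimal illustration and not the Colab code: it assumes a synthetic normal population with the reported μ and σ in place of Weight-Height.csv, sampling with replacement, and 2,000 replications per sample size (none of these are stated in the report). The helper name sample_means is hypothetical.

```python
import math
import random
import statistics

# Assumption: a synthetic normal population stands in for the Height column.
MU, SIGMA = 66.37, 3.85  # values reported in the text
rng = random.Random(0)
population = [rng.gauss(MU, SIGMA) for _ in range(10_000)]

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)  # population SD (divides by N)


def sample_means(pop, n, reps, rng):
    """Draw `reps` samples of size n with replacement and return their means."""
    return [statistics.fmean(rng.choices(pop, k=n)) for _ in range(reps)]


results = {}
for n in (5, 10, 30, 50, 100):
    means = sample_means(population, n, reps=2000, rng=rng)
    empirical_se = statistics.stdev(means)   # spread of the simulated means
    theoretical_se = pop_sd / math.sqrt(n)   # sigma / sqrt(n)
    results[n] = (statistics.fmean(means), empirical_se, theoretical_se)
    print(f"n={n:>3}  mean={results[n][0]:.2f}  "
          f"empirical SE={empirical_se:.4f}  theoretical SE={theoretical_se:.4f}")
```

Because the population here is simulated, the printed values will not reproduce the report's figures exactly; only the pattern (empirical SE tracking σ/√n) is expected to carry over.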

Shape Convergence (Figure below): At n = 5, the standardised histogram is already roughly bell-shaped but shows slight irregularity. By n = 10 and n = 30, the histogram aligns very closely with the N(0,1) curve (red). At n = 50 and n = 100, the fit is nearly perfect. The Shapiro-Wilk p-values are well above 0.05 for every n, so the test gives no evidence against normality even at n = 5 (p = 0.2299). This is attributable to the population already being near-normal.

Q-Q Plots (Figure below, top row): The Q-Q plots for n = 5, 30, and 100 all show points lying tightly along the 45° reference line, with only minor deviations in the tails at n = 5. By n = 100, the alignment is almost perfect, providing strong visual evidence of normality.
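The idea behind the Q-Q plots can be illustrated numerically. The sketch below pairs sorted standardised sample means with N(0,1) quantiles and reports their correlation, which is close to 1 when the points lie along the reference line. It assumes samples drawn directly from a normal distribution with the reported μ and σ, 1,000 replications, and plotting positions (i − 0.5)/k; the helpers qq_points and pearson_r are hypothetical names, not functions from the Colab notebook.

```python
import math
import random
import statistics
from statistics import NormalDist

MU, SIGMA = 66.37, 3.85  # values reported in the text
rng = random.Random(1)
std_normal = NormalDist()  # N(0, 1)


def qq_points(values):
    """Return (theoretical, observed) quantile pairs for standardised values."""
    m = statistics.fmean(values)
    s = statistics.stdev(values)
    observed = sorted((v - m) / s for v in values)
    k = len(observed)
    # Plotting positions (i - 0.5) / k mapped through the N(0,1) inverse CDF.
    theoretical = [std_normal.inv_cdf((i - 0.5) / k) for i in range(1, k + 1)]
    return theoretical, observed


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


qq_r = {}
for n in (5, 30, 100):
    # Assumption: samples come straight from N(MU, SIGMA), not from the CSV.
    means = [statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(n)])
             for _ in range(1000)]
    theo, obs = qq_points(means)
    qq_r[n] = pearson_r(theo, obs)
    print(f"n={n:>3}  Q-Q correlation = {qq_r[n]:.4f}")
```

Plotting obs against theo (for example with a scatter plot) would give the visual version of the same check.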

Case II – Sampling Distribution of Sample Sums

Observations

Convergence of the Sum Mean: The empirical sum mean closely matches the theoretical nμ at every n. For instance, at n = 100, the empirical sum mean is 6634.74 vs. the theoretical 6636.76 (a difference of 2.02, or about 0.03%), in close agreement with the theoretical value.

Growing Sum Standard Deviation: Unlike Case I (where SE shrinks), the raw sum standard deviation grows with n following the σ√n rule. At n = 5, it is 8.62; at n = 100, it reaches 38.36, closely matching the theoretical values. After standardisation (subtracting nμ and dividing by σ√n), the distribution is approximately N(0,1) in all cases.
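A rough sketch of the standardisation step for sums follows. It assumes samples drawn directly from a normal distribution with the reported μ and σ and 2,000 replications per n; the function name standardised_sums is hypothetical. Each sum S is converted to z = (S − nμ) / (σ√n), so the raw sums should have a spread near σ√n while the z-values have mean near 0 and SD near 1.

```python
import math
import random
import statistics

MU, SIGMA = 66.37, 3.85  # values reported in the text
rng = random.Random(2)


def standardised_sums(n, reps, rng):
    """Simulate `reps` sample sums of size n and standardise each one."""
    raw, z = [], []
    for _ in range(reps):
        # Assumption: draws come from N(MU, SIGMA) rather than the CSV.
        total = sum(rng.gauss(MU, SIGMA) for _ in range(n))
        raw.append(total)
        z.append((total - n * MU) / (SIGMA * math.sqrt(n)))
    return raw, z


sum_results = {}
for n in (5, 10, 30, 50, 100):
    raw, z = standardised_sums(n, reps=2000, rng=rng)
    sum_results[n] = (statistics.stdev(raw), SIGMA * math.sqrt(n),
                      statistics.fmean(z), statistics.stdev(z))
    sd_raw, sd_theory, z_mean, z_sd = sum_results[n]
    print(f"n={n:>3}  sum SD={sd_raw:.2f} (theory {sd_theory:.2f})  "
          f"z mean={z_mean:+.3f}  z SD={z_sd:.3f}")
```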

Shape Convergence (Figure below): At n = 5 and n = 10, the standardised sum distributions show a slightly wider spread compared to the N(0,1) curve, with some asymmetry. By n = 30, the fit improves substantially (SW p = 0.1033, still > 0.05). At n = 50 and n = 100, the Shapiro-Wilk p-values rise to 0.87, so the test gives no evidence against normality, and the histograms align tightly with the red N(0,1) curve.

Q-Q Plots (Figure above, bottom row): The Q-Q plots for Case II mirror Case I. At n = 5 there is a slight S-curve deviation in the tails, suggesting mild non-normality at the extremes. By n = 30 and n = 100, the points track the diagonal reference line almost perfectly, confirming convergence to normality.

Standard Error Convergence (Figure below)

The SE convergence plot for Case I shows that both empirical and theoretical SE follow the same decreasing curve. The two lines are nearly indistinguishable across all sample sizes, validating the theoretical formula σ/√n empirically. The rate of decrease is steep between n = 5 and n = 30, and flattens after n = 50, illustrating the law of diminishing returns in precision gained by increasing sample size.
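The diminishing returns described above follow directly from the formula σ/√n. As a small worked example, using only the σ = 3.85 reported earlier, the sketch below tabulates the theoretical SE at each sample size and the reduction gained per additional observation between consecutive sizes.

```python
import math

SIGMA = 3.85  # population SD reported in the text

sizes = [5, 10, 30, 50, 100]
theoretical_se = {n: SIGMA / math.sqrt(n) for n in sizes}

for n in sizes:
    print(f"n={n:>3}  theoretical SE = {theoretical_se[n]:.4f}")

# Reduction in SE per extra observation between consecutive sample sizes.
gain_per_obs = {}
for prev, curr in zip(sizes, sizes[1:]):
    drop = theoretical_se[prev] - theoretical_se[curr]
    gain_per_obs[(prev, curr)] = drop / (curr - prev)
    print(f"{prev:>3} -> {curr:>3}: SE falls by {drop:.4f} "
          f"({gain_per_obs[(prev, curr)]:.5f} per added observation)")
```

With σ = 3.85 the formula gives roughly 1.72 at n = 5 and 0.385 at n = 100, which can be compared with the empirical values of 1.6769 and 0.3943 reported in Case I.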

Final comparison table given below

Summary:

Both Case I (sample means) and Case II (sample sums) are strongly consistent with the Central Limit Theorem. Even at n = 5, the near-normal population ensures the sampling distribution is approximately normal. By n = 30, the approximation is excellent, and by n = 100, the standardised empirical distributions show no detectable departure from the standard normal distribution in any of the tests applied.
