1. Concepts & Definitions
1.2. Central Limit Theorem (CLT)
1.5. Confidence interval and normal distribution
1.6. Applying normal confidence interval
1.7. Normal versus Student's T distributions
1.8. Confidence interval and Student T distribution
1.9. Applying Student T confidence interval
1.10. Estimating sample size using normal distribution
1.11. Estimating sample size using Student T distribution
1.12. Estimating proportion using samples
2. Problem & Solution
2.1. Confidence interval for weight of HS6 code
Given a stated value of a population parameter, suppose we collect some sample data points from the same population and calculate the statistic for the same measure. The difference between the stated value and the calculated sample value is called the “Sampling Error” [1].
Sampling Error = population parameter - sample statistic
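As a minimal sketch of this definition, the Python snippet below builds a small synthetic population, draws a random sample from it, and computes the sampling error of the mean. The population values, the seed, and the sample size of 40 are illustrative assumptions, not data from this article.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical population: 10,000 values with mean ~50 and std dev ~10
population = [random.gauss(50, 10) for _ in range(10_000)]
population_mean = statistics.fmean(population)  # population parameter

# Draw a random sample of 40 values and compute the same measure
sample = random.sample(population, 40)
sample_mean = statistics.fmean(sample)  # sample statistic

# Sampling Error = population parameter - sample statistic
sampling_error = population_mean - sample_mean
print(f"population mean = {population_mean:.2f}")
print(f"sample mean     = {sample_mean:.2f}")
print(f"sampling error  = {sampling_error:.2f}")
```

Re-running with a different seed gives a different sampling error, sometimes positive and sometimes negative, which is the behaviour the following paragraphs build on.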
This introduces the different types of estimates and their characteristics. A “point estimate” says that the mean of the population is a single, specific value. This needs a closer look: unless we take the whole population and calculate its mean, the value we obtain is only an estimate. Since we are estimating it through a sample of the population, it is bound to have errors in either direction. This is the drawback of “point estimation”.
[1] https://towardsai.net/p/data-science/inferential-statistics-for-data-science-explained
To overcome this, we give a range around the point estimate, with a “lower limit” on the negative side and an “upper limit” on the positive side, sized according to the error magnitude, and say that the population mean can lie between the two limits. This is called “interval estimation”.
Let's define the standard deviation of the sample means as the standard error (SE). The sampling distribution tells us that roughly 95% of random samples will have a sample mean that is within 2SE of the population mean. Equivalently, for roughly 95% of samples the unknown population mean lies within 2SE of the sample mean. So, the 95% confidence interval for the population mean is approximately the sample mean +/- 2SE.
This lets us make the estimate more informative by assigning a probability value to the interval, for example: “I am 95% confident that the population mean falls within this range.” An interval estimate with such a value assigned to it becomes a “Confidence Interval” estimate.
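The "within 2SE" claim can be checked with a small simulation. The sketch below assumes a normally distributed population with known mean and standard deviation (the values 100 and 15, the sample size, and the number of trials are all hypothetical), and uses the standard result SE = σ/√n for the standard error of the sample mean.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the illustration is repeatable

MU, SIGMA = 100.0, 15.0  # assumed population mean and standard deviation
N = 50                   # size of each sample
TRIALS = 2000            # number of repeated samples

se = SIGMA / math.sqrt(N)  # standard error of the sample mean

within = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    sample_mean = statistics.fmean(sample)
    # Does the interval sample_mean +/- 2SE contain the population mean?
    if sample_mean - 2 * se <= MU <= sample_mean + 2 * se:
        within += 1

coverage = within / TRIALS
print(f"SE = {se:.3f}, share of intervals containing the mean = {coverage:.3f}")
```

The printed share should land close to 0.95, which is what "95% confident" means here: about 95 out of 100 intervals built this way contain the population mean.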
The next figure helps to show what this means in terms of the statistical distribution of values, since the sample means can be treated as approximately normally distributed. As a rule of thumb, for this assumption to hold the sample should contain at least 30 values.
A consequence of using a 95% confidence interval is that 5% of the values lie outside the interval, which means:
2.5% of the values are lower than the lower bound of the confidence interval.
2.5% of the values are greater than the upper bound of the confidence interval.
The sum of both percentages is called the significance level (α = alpha). Common significance levels include α = 0.05 and α = 0.01. The relationship between the confidence level and the significance level is expressed as:
Confidence level = 1 - Significance level (α).
For example, if our significance level is 0.05, there is a 5% probability of rejecting the null hypothesis when it is true. The corresponding confidence level is 1 - 0.05 = 0.95, or 95%.
This raises the question of how to determine the critical value z_{α/2} that is used to set the lower and upper bounds of the confidence interval. The inverse of the standard normal cumulative distribution is useful for this purpose, since it tells us what the critical value z_{α/2} must be to cover a given percentage of the population. The next figure illustrates this aspect.
The next table shows the relation between the confidence level, alpha (α), and the critical value z_{α/2}.
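The relation between confidence level, α, and the critical value can also be computed directly. The sketch below uses `statistics.NormalDist.inv_cdf` from the Python standard library as the inverse of the standard normal distribution; the helper name `z_critical` is introduced here for illustration only.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def z_critical(confidence_level: float) -> float:
    """Two-sided critical value z_{alpha/2} for a given confidence level."""
    alpha = 1.0 - confidence_level
    # inv_cdf returns the z that has cumulative probability 1 - alpha/2 below it
    return standard_normal.inv_cdf(1.0 - alpha / 2.0)

print("confidence  alpha  z_alpha/2")
for level in (0.90, 0.95, 0.99):
    print(f"{level:>10.2f}  {1 - level:>5.2f}  {z_critical(level):>9.3f}")
```

For confidence levels of 90%, 95%, and 99% this gives critical values of about 1.645, 1.960, and 2.576, which is why "2SE" is a convenient approximation for the 95% case.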
The distance from the point estimate to either limit of the interval is called the “Margin of Error”. It tells us how far the error can extend on either side of the point estimate.
μ = x̄ ± z_{α/2} · σ/√n, with Margin of Error = z_{α/2} · σ/√n
where: μ is the mean of the population, x̄ is the sample mean, z_{α/2} is the critical value that leaves an area of α/2 in the upper tail of the standard normal distribution, σ is the standard deviation of the population (use the sample standard deviation s if the population standard deviation is unknown), and n is the sample size.
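Putting the pieces together, a normal confidence interval for the mean can be sketched as follows. The function name `normal_confidence_interval` and the sample values are hypothetical; the function falls back to the sample standard deviation s when no population σ is supplied, as described above.

```python
import math
import statistics
from statistics import NormalDist

def normal_confidence_interval(sample, confidence_level=0.95, sigma=None):
    """Return (lower, upper, margin_of_error) for the population mean."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    # Use the sample standard deviation s when sigma is unknown
    sd = sigma if sigma is not None else statistics.stdev(sample)
    alpha = 1.0 - confidence_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # critical value z_{alpha/2}
    margin = z * sd / math.sqrt(n)               # margin of error
    return x_bar - margin, x_bar + margin, margin

# Hypothetical sample of 36 measurements (illustrative values only)
sample = [48.2, 51.7, 49.9, 50.4, 52.1, 47.8] * 6
lower, upper, margin = normal_confidence_interval(sample)
print(f"95% CI: [{lower:.2f}, {upper:.2f}], margin of error = {margin:.2f}")
```

Raising the confidence level widens the interval, while a larger sample size n narrows it, since the margin of error shrinks with √n.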
Additional references:
[1] https://medium.com/omarelgabrys-blog/statistics-probability-the-clt-ci-3316a0bae5e6
[2] https://medium.com/analytics-vidhya/verifying-central-limit-theorem-using-python-f57cf4691e8c
[3] https://www.scribbr.com/statistics/central-limit-theorem/
[4] https://www.calculators.org/math/z-critical-value.php
[5] https://www.statisticshowto.com/probability-and-statistics/find-critical-values/z-alpha2-za2/
[6] https://towardsai.net/p/data-science/inferential-statistics-for-data-science-explained