1. Concepts & Definitions
1.2. Central Limit Theorem (CLT)
1.5. Confidence interval and normal distribution
1.6. Applying normal confidence interval
1.7. Normal versus Student's T distributions
1.8. Confidence interval and Student T distribution
1.9. Applying Student T confidence interval
1.10. Estimating sample size using normal distribution
1.11. Estimating sample size using Student T distribution
1.12. Estimating proportion using samples
2. Problem & Solution
2.1. Confidence interval for weight of HS6 code
Two types of quality inspection are conducted on production lines: 100% inspection and sampling inspection. Inspecting every single part produced in massive quantities at low cost, such as nuts and bolts, light bulbs, and electronic components, can dramatically increase labor and cost, making that approach neither economical nor realistic. 100% inspection is also not viable in cases where the assessment may break the product. It is therefore generally reserved for life-supporting products, expensive products, and products that are not consumed or damaged by inspection. This means that a wide variety of products undergo sampling inspection at different stages of production [1].
In sampling inspection, samples are taken from a target lot (the inspection lot) and examined in order to decide whether the lot is acceptable according to its quality standards. Because far fewer items are inspected than in 100% inspection, manufacturers save on inspection costs and time. Sampling inspection also makes it possible to examine items, such as those destroyed by testing, that could never all be covered by 100% inspection [1]. The figure below illustrates sampling inspection in a customs context.
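To make the procedure concrete, the following minimal sketch (Python with NumPy) simulates a sampling inspection; the lot size, true defect rate, and 3% acceptance threshold are all illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical inspection lot: 10,000 items, 2% of which are defective.
# Both numbers are assumptions made for illustration.
lot_size, true_defect_rate = 10_000, 0.02
lot = rng.random(lot_size) < true_defect_rate  # True marks a defective item

# Sampling inspection: examine only a random subset of the lot.
sample = rng.choice(lot, size=200, replace=False)
observed_rate = sample.mean()

# Accept the lot if the observed defect rate stays below an assumed
# quality threshold of 3%.
print(f"observed defect rate: {observed_rate:.3f}")
print("lot accepted" if observed_rate < 0.03 else "lot rejected")
```

Inspecting 200 items instead of 10,000 is exactly where the cost and time savings come from; the rest of this section is about quantifying how much we can trust such a small sample.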
Before moving any further, we need to understand two important terms: parameter and statistic [2]:
Parameter: a numerical measure, such as the mean, median, or variance, computed on population data.
Statistic: the same kind of numerical measure computed on sample data.
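A quick way to see the difference is to compute the same measure at both levels. The sketch below is a toy example with an assumed population of one million ages (the numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy population: ages of 1,000,000 hypothetical citizens (assumed data).
population_ages = rng.integers(0, 100, size=1_000_000)

# Parameter: the measure computed on the whole population.
population_mean = population_ages.mean()

# Statistic: the same measure computed on a random sample.
sample_ages = rng.choice(population_ages, size=500)
sample_mean = sample_ages.mean()

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```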
In the figure, the population is on the left and the sample, obtained by random sampling, is on the right; the sample mean provides a point estimate of the population mean. A brief explanation of the terminology employed follows [3]:
(i) Population: the group of items you want to learn about. It can be people (e.g. people of a country), products (all products sold on amazon.com), or whatever. There’s no real size limit here, but you assume the population is really huge, so you can’t measure all the items and need to take a sample. We’ll use all citizens in the United States as our suitably-huge population.
(ii) Variable of interest: Imagine each item in the population has an associated row of values, like an Excel table. For a population of US citizens, maybe the values you have are: “age”, “blood_type”, etc. You are interested in one particular variable, let’s say the age of US citizens.
(iii) (Random) Sample: a subset of your population, selected at random. “At random” here means something specific: every item of the population should have an equal probability of being selected (prob = 1/population_size) and the items must be selected independently, i.e. the selection of one item should not make the selection of another item more or less probable.
(iv) Sample size: typically we fix the sample size to some value, denoted by the variable “N”. Importantly, N is not a random variable; it stays the same no matter how many samples we take. Additionally, items are selected with replacement, meaning that across multiple samples we can select the same items again and again. This is analogous to randomly drawing N cards (from a “population” of 52 cards), replacing them, drawing another N cards, and repeating this many times.
(v) Sampling and biases: Why do we use random samples? As mentioned, it’s not feasible (or at least, not convenient) to gather information about all US citizens, every time you need to run some analysis about their age. Instead, you must take a random subset and use that. For a population as huge as US citizens, random sampling is itself very challenging, and almost every approach you can think of is prone to one or another kind of statistical bias.
Typically, we pick a random-enough sample, which, while not totally random, is the best one we have on hand. This is called a “convenience sample”. Its quality is left to the judgment of the researcher, and a bad researcher might use a convenience sample that is terribly biased.
(vi) Sample statistics vs population statistics: we want to know something about the population, e.g. the average “age” of US citizens. We cannot practically get the age of all US citizens for our experiment, so we take a sample and estimate the average age (since average age is a number, it’s called a point estimate). Here, “average age” when measured on the sample is an example of a “sample statistic”, whereas the actual “average age” of the population is the corresponding “population statistic”. We don’t know the population value (if we did, sampling would be unnecessary), so we instead want to ensure the sample statistic is close to the population value by reducing bias (see the sketch after this list).
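Pulling items (iii), (iv), and (vi) together, here is a minimal sketch of repeated sampling with replacement and point estimation; the population is the same assumed toy data as above:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Assumed toy population of ages, as in the earlier sketch.
population_ages = rng.integers(0, 100, size=1_000_000)

N = 500  # fixed sample size, as described in item (iv)

# Draw many samples *with replacement*, like repeatedly drawing N cards
# from a deck and putting them back each time.
sample_means = [
    rng.choice(population_ages, size=N, replace=True).mean()
    for _ in range(1_000)
]

# Each sample mean is a point estimate of the population mean; with an
# unbiased sampling scheme the estimates cluster around the true value.
print(f"population mean:         {population_ages.mean():.2f}")
print(f"average of sample means: {np.mean(sample_means):.2f}")
```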
Additional References
CLT Simulation
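For readers who prefer to experiment directly, here is a minimal CLT simulation sketch of our own, assuming a deliberately skewed (exponential) population:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# A deliberately non-normal population: exponential with mean 1 (assumed).
population = rng.exponential(scale=1.0, size=1_000_000)
mu, sigma = population.mean(), population.std()

N = 50  # size of each sample
sample_means = rng.choice(population, size=(10_000, N)).mean(axis=1)

# CLT: the sampling distribution of the mean is approximately normal
# with mean mu and standard deviation sigma / sqrt(N), even though the
# population itself is far from normal.
print(f"mean of sample means: {sample_means.mean():.3f} (mu = {mu:.3f})")
print(f"std of sample means:  {sample_means.std():.3f} "
      f"(sigma/sqrt(N) = {sigma / np.sqrt(N):.3f})")
```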
A simple CLT application: remember the transformation from any normal distribution to the standard normal, and its application to point and interval estimation:
https://towardsai.net/p/data-science/inferential-statistics-for-data-science-explained
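As a worked illustration of that idea, the sketch below standardizes the sample mean via z = (x_bar - mu) / (sigma / sqrt(N)) and inverts it into a 95% normal confidence interval, x_bar ± z_{alpha/2} * sigma / sqrt(N); the sample data and the known sigma are assumptions made for the example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)

# Assumed setup: N = 100 observations from a population whose standard
# deviation sigma is treated as known (required for the z-interval).
sigma, N = 15.0, 100
sample = rng.normal(loc=50.0, scale=sigma, size=N)
x_bar = sample.mean()

# Standardization maps any normal variable to the standard normal:
# z = (x_bar - mu) / (sigma / sqrt(N)).  Its quantiles give the interval.
z_crit = stats.norm.ppf(0.975)  # two-sided 95%, so alpha/2 = 0.025
margin = z_crit * sigma / np.sqrt(N)

print(f"point estimate:      {x_bar:.2f}")
print(f"95% CI for the mean: ({x_bar - margin:.2f}, {x_bar + margin:.2f})")
```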