A random variable (RV) is a function that assigns a numerical value to each outcome in a sample space of a random experiment. Random variables can be of two types:
Discrete Random Variable: Takes on a countable number of distinct values.
Example: The number of heads in 10 coin tosses (can be 0, 1, 2, ..., 10).
Continuous Random Variable: Takes on an uncountable number of values, typically within a range.
Example: The height of students in a class (can be any value within a range, like 150.5 cm, 151.7 cm, etc.).
Discrete RVs are described by a probability mass function (PMF), which gives the probability that a discrete random variable is exactly equal to some value.
Example: For a binomial random variable X, which counts the number of successes in n independent trials, each succeeding with probability p, the PMF is:
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$
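To make the formula concrete, here is a minimal sketch that evaluates this PMF for the coin-toss example above (assuming scipy is available; `scipy.stats.binom.pmf` computes exactly this quantity):

```python
from scipy.stats import binom

# P(X = k) for X ~ Binomial(n=10, p=0.5): number of heads in 10 fair coin tosses
n, p = 10, 0.5
for k in range(n + 1):
    print(f"P(X = {k}) = {binom.pmf(k, n, p):.4f}")
```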
Continuous RVs are described by a probability density function (PDF). The PDF is not itself a probability; rather, the probability that the variable falls within a range of values is the area under the PDF over that range.
Example: For a normal random variable X, the PDF is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
where $\mu$ is the mean and $\sigma$ is the standard deviation.
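As a quick check, a sketch comparing `scipy.stats.norm.pdf` against the closed-form expression above (the values of mu and sigma here are arbitrary):

```python
import numpy as np
from scipy.stats import norm

# Evaluate the normal PDF at a few points and compare with the closed form
mu, sigma = 0.0, 1.0
x = np.array([-2.0, 0.0, 1.5])
closed_form = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
print(norm.pdf(x, loc=mu, scale=sigma))  # matches closed_form
print(closed_form)
```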
Expectation (Mean): The long-run average value of a random variable over repeated experiments; for a discrete RV, $E[X] = \sum_x x \, P(X = x)$.
Variance: A measure of the spread of the distribution around its mean, $\mathrm{Var}(X) = E[(X - E[X])^2]$.
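These discrete definitions translate directly into code; a minimal sketch using a hypothetical four-point PMF:

```python
import numpy as np

# Expectation and variance of a discrete RV from its PMF:
# E[X] = sum_x x * P(X = x),  Var(X) = E[(X - E[X])^2]
values = np.array([0, 1, 2, 3])          # hypothetical support
probs  = np.array([0.1, 0.3, 0.4, 0.2])  # hypothetical PMF (sums to 1)

mean = np.sum(values * probs)
var = np.sum((values - mean) ** 2 * probs)
print(f"E[X] = {mean:.2f}, Var(X) = {var:.2f}")
```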
An Estimator is a rule or function that provides an estimate of a population parameter (like the mean or variance) based on sample data. Because an estimator is computed from random sample data, it is itself a random variable, and estimators are widely used in inferential statistics to make predictions or inferences about the population from a sample.
Example: The sample mean $\bar{X}$ is an estimator of the population mean $\mu$.
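A small simulation illustrating the idea (the population parameters here are assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
population_mean = 5.0  # assumed true parameter mu

# Draw one sample and estimate mu with the sample mean X-bar
sample = rng.normal(loc=population_mean, scale=2.0, size=100)
print(f"sample mean = {sample.mean():.3f}  (estimates mu = {population_mean})")
```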
A Statistic is any quantity calculated from a sample of data. It is a numerical value that describes or summarizes some characteristic of the sample, such as the mean, median, or variance. A statistic is used to estimate population parameters and make inferences about the population based on the sample.
The sampling distribution of a statistic (like the sample mean or variance) is the distribution of that statistic over many samples drawn from the same population.
The Central Limit Theorem states that, regardless of the population distribution (provided it has finite variance), the sampling distribution of the sample mean approaches a normal distribution as the sample size n grows.
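A short simulation sketch of the theorem: even for a heavily skewed population (exponential), the sample means concentrate normally around the true mean (sample sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: exponential (heavily skewed, clearly non-normal)
n, n_samples = 50, 10_000
sample_means = rng.exponential(scale=1.0, size=(n_samples, n)).mean(axis=1)

# The distribution of sample means is approximately normal:
# roughly 95% should fall within 2 standard errors of the true mean (1.0)
se = 1.0 / np.sqrt(n)  # population sd of Exp(1) is 1
within = np.mean(np.abs(sample_means - 1.0) < 2 * se)
print(f"fraction within 2 SE: {within:.3f}")  # close to 0.95
```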
The standard error of a statistic is the standard deviation of its sampling distribution and is a crucial component in inferential statistics. For the sample mean, it equals $\sigma/\sqrt{n}$.
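The $\sigma/\sqrt{n}$ formula can be checked empirically by repeated sampling; a sketch with assumed population values:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n = 2.0, 25  # assumed population sd and sample size

# Empirical standard error: sd of the sample mean over many repeated samples
means = rng.normal(loc=0.0, scale=sigma, size=(10_000, n)).mean(axis=1)
print(f"empirical SE:   {means.std(ddof=1):.4f}")
print(f"theoretical SE: {sigma / np.sqrt(n):.4f}")  # sigma / sqrt(n)
```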
Quantifying Uncertainty: A significant part of data science is about making predictions. Since these predictions are derived from sample data, they too are statistics and have their own sampling distributions. Understanding these distributions allows us to model uncertainty. For instance, we can calculate confidence intervals for our predictions of y, giving us a sense of how much our predictions might vary in repeated samples.
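As an illustration of interval estimation, here is a sketch of a 95% t-based confidence interval for a population mean, computed from a single simulated sample (the data and parameters are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(loc=10.0, scale=3.0, size=40)  # hypothetical sample

# 95% t-based confidence interval for the population mean
n = len(sample)
mean, se = sample.mean(), sample.std(ddof=1) / np.sqrt(n)
lo, hi = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)
print(f"95% CI for mu: ({lo:.2f}, {hi:.2f})")
```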
Properties of Estimators: Many predictions are based on estimators (e.g., estimating the slope and intercept in linear regression or the probability of a class in classification). Sampling distributions are essential in assessing the properties of these estimators, such as whether they are unbiased or have a low mean squared error (MSE). This insight helps ensure that our models are reliable and robust.
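A simulation in this spirit: comparing the biased (divide-by-n) and unbiased (divide-by-(n-1)) variance estimators over many repeated samples (the true variance is assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
true_var, n = 4.0, 10  # assumed population variance and sample size

# Compare two variance estimators over many samples:
# dividing by n (biased) vs. by n-1 (unbiased)
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(100_000, n))
biased   = samples.var(axis=1, ddof=0).mean()
unbiased = samples.var(axis=1, ddof=1).mean()
print(f"mean of biased estimator:   {biased:.3f}  (true variance = {true_var})")
print(f"mean of unbiased estimator: {unbiased:.3f}")
```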