A random variable (RV) is a function that assigns a numerical value to each outcome in a sample space of a random experiment. Random variables can be of two types:
Discrete Random Variable: Takes on a countable number of distinct values.
Example: The number of heads in 10 coin tosses (can be 0, 1, 2, ..., 10).
Continuous Random Variable: Takes on an uncountable number of values, typically within a range.
Example: The height of students in a class (can be any value within a range, like 150.5 cm, 151.7 cm, etc.).
Discrete RVs are described by a probability mass function (PMF), which gives the probability that a discrete random variable is exactly equal to some value.
Example: For a binomial random variable X, which counts the number of successes in n independent trials, each succeeding with probability p, the PMF is:
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$
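To make the formula concrete, here is a minimal sketch that evaluates this PMF for the coin-toss example above (assuming scipy is available; `scipy.stats.binom.pmf` computes exactly this quantity):

```python
from scipy.stats import binom

# P(X = k) for X ~ Binomial(n=10, p=0.5): number of heads in 10 fair coin tosses
n, p = 10, 0.5
for k in range(n + 1):
    print(f"P(X = {k}) = {binom.pmf(k, n, p):.4f}")
```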
Continuous RVs are described by a probability density function (PDF). The PDF is not itself a probability; rather, the probability that the variable falls within a range of values is the area under the PDF over that range.
Example: For a normal random variable X, the PDF is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
where $\mu$ is the mean and $\sigma$ is the standard deviation.
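As a quick check, a sketch comparing `scipy.stats.norm.pdf` against the closed-form expression above (the values of mu and sigma here are arbitrary):

```python
import numpy as np
from scipy.stats import norm

# Evaluate the normal PDF at a few points and compare with the closed form
mu, sigma = 0.0, 1.0
x = np.array([-2.0, 0.0, 1.5])
closed_form = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
print(norm.pdf(x, loc=mu, scale=sigma))  # matches closed_form
print(closed_form)
```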
Expectation (Mean): The long-run average value of a random variable over repeated experiments; for a discrete RV, $E[X] = \sum_x x \, P(X = x)$.
Variance: A measure of the spread of the distribution around its mean, $\mathrm{Var}(X) = E[(X - E[X])^2]$.
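These discrete definitions translate directly into code; a minimal sketch using a hypothetical four-point PMF:

```python
import numpy as np

# Expectation and variance of a discrete RV from its PMF:
# E[X] = sum_x x * P(X = x),  Var(X) = E[(X - E[X])^2]
values = np.array([0, 1, 2, 3])          # hypothetical support
probs  = np.array([0.1, 0.3, 0.4, 0.2])  # hypothetical PMF (sums to 1)

mean = np.sum(values * probs)
var = np.sum((values - mean) ** 2 * probs)
print(f"E[X] = {mean:.2f}, Var(X) = {var:.2f}")
```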
An Estimator is a rule or function that provides an estimate of a population parameter (like the mean or variance) based on sample data. Because an estimator is computed from random sample data, it is itself a random variable, and estimators are widely used in inferential statistics to make predictions or inferences about the population from a sample.
Example: The sample mean $\bar{X}$ is an estimator of the population mean $\mu$.
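A small simulation illustrating the idea (the population parameters here are assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
population_mean = 5.0  # assumed true parameter mu

# Draw one sample and estimate mu with the sample mean X-bar
sample = rng.normal(loc=population_mean, scale=2.0, size=100)
print(f"sample mean = {sample.mean():.3f}  (estimates mu = {population_mean})")
```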
A Statistic is any quantity calculated from a sample of data. It is a numerical value that describes or summarizes some characteristic of the sample, such as the mean, median, or variance. A statistic is used to estimate population parameters and make inferences about the population based on the sample.
The sampling distribution of a statistic (like the sample mean or variance) is the distribution of that statistic over many samples drawn from the same population.
The Central Limit Theorem states that, regardless of the population distribution (provided it has finite variance), the sampling distribution of the sample mean approaches a normal distribution as the sample size n grows.
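A short simulation sketch of the theorem: even for a heavily skewed population (exponential), the sample means concentrate normally around the true mean (sample sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: exponential (heavily skewed, clearly non-normal)
n, n_samples = 50, 10_000
sample_means = rng.exponential(scale=1.0, size=(n_samples, n)).mean(axis=1)

# The distribution of sample means is approximately normal:
# roughly 95% should fall within 2 standard errors of the true mean (1.0)
se = 1.0 / np.sqrt(n)  # population sd of Exp(1) is 1
within = np.mean(np.abs(sample_means - 1.0) < 2 * se)
print(f"fraction within 2 SE: {within:.3f}")  # close to 0.95
```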
The standard error of a statistic is the standard deviation of its sampling distribution and is a crucial component in inferential statistics. For the sample mean, it equals $\sigma/\sqrt{n}$.
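The $\sigma/\sqrt{n}$ formula can be checked empirically by repeated sampling; a sketch with assumed population values:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n = 2.0, 25  # assumed population sd and sample size

# Empirical standard error: sd of the sample mean over many repeated samples
means = rng.normal(loc=0.0, scale=sigma, size=(10_000, n)).mean(axis=1)
print(f"empirical SE:   {means.std(ddof=1):.4f}")
print(f"theoretical SE: {sigma / np.sqrt(n):.4f}")  # sigma / sqrt(n)
```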
Quantifying Uncertainty: A significant part of data science is about making predictions. Since these predictions are derived from sample data, they too are statistics and have their own sampling distributions. Understanding these distributions allows us to model uncertainty. For instance, we can calculate confidence intervals for our predictions of y, giving us a sense of how much our predictions might vary in repeated samples.
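As an illustration of interval estimation, here is a sketch of a 95% t-based confidence interval for a population mean, computed from a single simulated sample (the data and parameters are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(loc=10.0, scale=3.0, size=40)  # hypothetical sample

# 95% t-based confidence interval for the population mean
n = len(sample)
mean, se = sample.mean(), sample.std(ddof=1) / np.sqrt(n)
lo, hi = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)
print(f"95% CI for mu: ({lo:.2f}, {hi:.2f})")
```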
Properties of Estimators: Many predictions are based on estimators (e.g., estimating the slope and intercept in linear regression or the probability of a class in classification). Sampling distributions are essential in assessing the properties of these estimators, such as whether they are unbiased or have a low mean squared error (MSE). This insight helps ensure that our models are reliable and robust.
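A simulation in this spirit: comparing the biased (divide-by-n) and unbiased (divide-by-(n-1)) variance estimators over many repeated samples (the true variance is assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
true_var, n = 4.0, 10  # assumed population variance and sample size

# Compare two variance estimators over many samples:
# dividing by n (biased) vs. by n-1 (unbiased)
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(100_000, n))
biased   = samples.var(axis=1, ddof=0).mean()
unbiased = samples.var(axis=1, ddof=1).mean()
print(f"mean of biased estimator:   {biased:.3f}  (true variance = {true_var})")
print(f"mean of unbiased estimator: {unbiased:.3f}")
```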