Statistics (L9–20)

Notes

L9-11 – Central tendency

Some single numbers (or ranges/bins) that can describe a whole dataset: Mode, median, mean (average).

Mode

Value where the frequency is highest/most common value (tallest bar in a histogram or bar chart).
The mode of different samples of a population can often differ.
If all values appear exactly once, there is no mode.

Median

The value in the middle of an ordered list of data.
With an even number of items, the median is the average of the two middle numbers.
Better statistic than the average when dealing with highly skewed distributions.

Mean

The average. Notation: x̄ = Σx / n (for a sample), μ = Σx / N (for the population).
The mean of a sample can be a good indicator for the mean of the whole population.
Samples from the same population will often have similar means.
Mean can be misleading when we have outliers.
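
As a quick illustration of the three measures above, here is a minimal Python sketch using the standard library's statistics module. The data values are made up for the example.

```python
import statistics

# Made-up sample, only for illustration.
data = [2, 3, 3, 5, 7, 8, 14]

mode = statistics.mode(data)      # most frequent value
median = statistics.median(data)  # middle value of the sorted data
mean = statistics.mean(data)      # sum of the values divided by n

# With an even number of items, the median is the average
# of the two middle values.
even_median = statistics.median([1, 2, 3, 4])

print(mode, median, mean, even_median)
```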

Uniform distribution: Dataset with no mode. All values occur about equally often, like a histogram with no "peaks".

Bi-modal distribution: Some distributions have multiple modes. Two modes would be a bi-modal distribution, and can occur when a dataset has two local "peaks" in a histogram.

Normal distribution: Bell-shaped curve or histogram, that is symmetrical around the middle point.

Skewed distribution: Asymmetrical, with the higher frequencies either to the left and a long tail to the right (positively skewed), or to the right and a long tail to the left (negatively skewed).

Outliers

Values that differ a lot from the others.
Create skewed distributions by pulling the mean away from the center of the data, making the mean less representative of the data.
Funny example: Average salary of geography majors when Michael Jordan graduated as a geography major.

Robust statistic: Strong and sturdy. For instance the median, which is not affected much by deviations from the norm (outliers).
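
To see why the median counts as robust while the mean does not, the sketch below adds one extreme value to a small list. The salary figures are invented for the example and are not real data.

```python
import statistics

# Made-up salaries, only for illustration.
salaries = [40_000, 42_000, 45_000, 47_000, 50_000]
with_outlier = salaries + [30_000_000]  # one extreme earner

mean_before = statistics.mean(salaries)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(salaries)
median_after = statistics.median(with_outlier)

# The mean jumps by millions, the median barely moves.
print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```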

L12-13 – Variability

Spread out: Less concentrated data, uses a bigger range on the X-axis.

Consistent: More concentrated data, uses a smaller range on the X-axis.

Range:

Maximum value - minimum value. Measurement of how spread out the data is.
Changes when extreme values are added, like outliers.

Quartiles: Quarters of the ordered data, for instance the lowest 25% of the data is the first quarter. The quartile values Q1, Q2 and Q3 are the boundaries between the quarters.

Q1: Median of the first (lower) half of the data (boundary between the first and second quarter)

Q2: Median

Q3: Median of the second (upper) half of the data (boundary between the third and fourth quarter)

Q1, Q2 and Q3 split the data into four equal parts.

Interquartile Range (IQR) = Q3-Q1

Not necessarily affected by every value in the data: changing the middle values or the outliers often leaves the IQR unchanged.

Outliers definition: x < Q1 - 1.5 × IQR or x > Q3 + 1.5 × IQR
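
The range, quartiles, IQR and outlier limits described above can be sketched in a few lines of Python. The quartiles helper is a hypothetical function written for this note. It uses the "median of each half" rule from the definitions and assumes that, with an odd number of items, the median itself is left out of both halves. Other tools use other conventions and can give slightly different quartiles. The data is made up.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n the median is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

# Made-up data with one suspiciously large value.
data = [1, 3, 4, 5, 6, 7, 9, 30]

value_range = max(data) - min(data)
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
low_limit = q1 - 1.5 * iqr
high_limit = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_limit or x > high_limit]

print(value_range, q1, q2, q3, iqr, outliers)
```

Removing the value 30 shrinks the range a lot, while the IQR changes much less.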

Boxplots: Used to visualize data's quartiles and outliers. Box for the IQR, lines for min-max (outlier limit) values, dots for outliers.

Average deviation: Σ(xi - x̄) / n. Distance from the mean is a good measurement for variability, but positive and negative deviations cancel each other out (their sum is always 0). So we should use the absolute value to measure distance, and ignore negatives.

Absolute deviation: |xi-x̄|, ignoring negatives.

SS: Sum of Squares. Sum of squared deviations.

Variance: Average (mean) of squared deviations: Σ(xi - x̄)² / n = SS / n

Standard deviation:

Sigma (𝜎). The most common measure of spread.
Square root of the variance. Not the same as the average of the absolute deviations, which is a different measure and is never larger than 𝜎.
Approx. 68% of the data in a normal distribution falls within ±1𝜎
Approx. 95% of the data in a normal distribution falls within ±2𝜎
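
The chain from deviations to standard deviation can be followed step by step in code. This is a minimal sketch with made-up numbers that divides by n (no Bessel's correction), and it also computes the average absolute deviation for comparison.

```python
import math

# Made-up data, only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

deviations = [x - mean for x in data]
sum_of_deviations = sum(deviations)               # cancels out to 0
avg_abs_deviation = sum(abs(d) for d in deviations) / n

ss = sum(d ** 2 for d in deviations)              # sum of squares
variance = ss / n                                 # mean squared deviation
std_dev = math.sqrt(variance)                     # back to original units

print(sum_of_deviations, avg_abs_deviation, ss, variance, std_dev)
```

For this data the standard deviation (2.0) and the average absolute deviation (1.5) differ, which shows that they are two separate measures.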

Bessel's correction

Variability in samples is often lower than the variability in the population, so we divide by n-1 instead of n to try to correct for this.

Sample standard deviation (s) is an approximation of the standard deviation of the population: s ≈ 𝜎
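
A small sketch of the difference between dividing by n and by n-1, using made-up sample values. In Python's statistics module, pvariance divides by n and variance divides by n-1, so they can be used to check the manual calculation.

```python
import statistics

# Made-up sample, only for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)
x_bar = sum(sample) / n
ss = sum((x - x_bar) ** 2 for x in sample)

variance_n = ss / n             # treats the data as a whole population
variance_bessel = ss / (n - 1)  # Bessel's correction for a sample
s = variance_bessel ** 0.5      # sample standard deviation

print(variance_n, statistics.pvariance(sample))
print(variance_bessel, statistics.variance(sample))
print(s)
```

Dividing by the smaller number n-1 always gives a slightly larger variance, and the difference shrinks as n grows.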

L14-16 – Standardizing

z, or z-score: Number of standard deviations away from the mean. z = (x-μ)/𝜎

Standardizing a distribution: Converting the values in a normal distribution to z-scores.

Standard normal distribution:

Has a mean of 0.
Has a 𝜎 (standard deviation) of 1, since, per the z-score formula, a value one standard deviation above the mean (x = μ + 𝜎) gets z = ((μ + 𝜎) - μ)/𝜎 = 𝜎/𝜎 = 1
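
Standardizing can be checked numerically: after converting every value to a z-score, the mean becomes 0 and the standard deviation becomes 1. The sketch below uses made-up data and treats it as the whole population (so it divides by N).

```python
import statistics

# Made-up data, treated as a complete population.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mu = statistics.mean(data)
sigma = statistics.pstdev(data)  # population standard deviation

# z = (x - mu) / sigma for every value
z_scores = [(x - mu) / sigma for x in data]

z_mean = statistics.mean(z_scores)
z_sigma = statistics.pstdev(z_scores)

print(z_scores)
print(z_mean, z_sigma)
```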

L17-18 – Normal distribution

PDF - Probability Density Function. The curve we draw for our distributions. The area under the curve determines probability, and the total area under the curve is 1.

Horizontal asymptote: The graph for the normal distribution never touches the x-axis, since we are never sure whether there are outliers far out. So the x-axis is called a horizontal asymptote, and the curve extends from negative infinity to positive infinity :)

Probability in a normal distribution:

Approx. 68% within ±1𝜎 of the mean
Approx. 95% within ±2𝜎 of the mean
So approx. 5% are under -2𝜎 or over +2𝜎

Z-table

A table to look up areas under the standard normal curve. See the table. We look up the z-score in the table to find the corresponding probability. We find the
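
A Z-table lookup can also be done in code. The sketch below assumes Python 3.8 or newer, where the standard library has statistics.NormalDist. Its cdf method returns the area under the curve to the left of a given z, which is what many Z-tables list. The z value of 1.5 is an arbitrary example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # standard normal distribution

# Area within 1 and 2 standard deviations of the mean.
within_1 = std_normal.cdf(1) - std_normal.cdf(-1)
within_2 = std_normal.cdf(2) - std_normal.cdf(-2)
in_tails = 1 - within_2  # under -2 or over +2

# "Table lookup": proportion of values below z = 1.5 (arbitrary example).
below = std_normal.cdf(1.5)

print(round(within_1, 4), round(within_2, 4), round(in_tails, 4))
print(round(below, 4))
```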

L19 – Sampling distributions

SE = Standard Error, or Standard deviation of sampling distribution

Central Limit Theorem

The distribution of sample means is approximately normal
The standard deviation of the sample means ≈ 𝜎/sqrt(n)
The mean of the sample means ≈ μ (mean of the population)

𝜎/SE = sqrt(n)
population standard deviation / standard deviation of distribution of sample means (sampling distribution) = the square root of the sample size
so if we reorganize: SE = 𝜎/sqrt(n)
This relationship follows from the Central Limit Theorem
We need a sample size > 1 to get a SE ≠ 𝜎, since sqrt(1) = 1.
as the sample size increases, the standard error decreases
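
The relation SE = 𝜎/sqrt(n) can be checked with a small simulation. This is only a sketch: the population is made-up uniform random data (so not normal), the sample size of 25 and the number of samples are arbitrary choices, and samples are drawn with replacement.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Made-up, non-normal population.
population = [random.uniform(0, 100) for _ in range(10_000)]
mu = statistics.mean(population)
sigma = statistics.pstdev(population)

n = 25  # sample size
sample_means = [
    statistics.mean(random.choices(population, k=n))  # with replacement
    for _ in range(5_000)
]

mean_of_means = statistics.mean(sample_means)
observed_se = statistics.pstdev(sample_means)  # spread of the sample means
expected_se = sigma / math.sqrt(n)             # what the formula predicts

print(round(mu, 2), round(mean_of_means, 2))
print(round(observed_se, 2), round(expected_se, 2))
```

The two numbers on each printed line should be close to each other. Increasing n makes both standard errors smaller.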

We describe the location of the sample mean by calculating how many standard errors it is away from the center of the sampling distribution. This will give us a z-score for our sample mean.

Z-score = (x̄ - μ) / SE, where x̄ is the sample mean and μ is the mean of the sampling distribution
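
A worked example of the z-score for a sample mean, with invented numbers: a population with mean 100 and standard deviation 15, and a sample of 25 items with a mean of 106. Here mu stands for the center of the sampling distribution. The last step assumes Python 3.8 or newer for statistics.NormalDist.

```python
import math
from statistics import NormalDist

# Invented numbers, only for illustration.
mu = 100      # population mean, center of the sampling distribution
sigma = 15    # population standard deviation
n = 25        # sample size
x_bar = 106   # mean of our sample

se = sigma / math.sqrt(n)  # standard error
z = (x_bar - mu) / se      # standard errors away from the center

# Proportion of sample means expected to fall below ours.
p_below = NormalDist().cdf(z)

print(se, z, round(p_below, 4))
```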
