Some single numbers (or ranges/bins) that can describe a whole dataset: Mode, median, mean (average).
Mode
Median
Mean
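A small sketch of the three measures of center, using Python's standard `statistics` module. The values in `data` are made up for illustration.

```python
from statistics import mean, median, multimode

data = [2, 3, 3, 5, 7, 10]  # made-up example values

modes = multimode(data)   # most frequent value(s); a list, since there can be several
middle = median(data)     # middle of the sorted data; average of the two middle values when n is even
average = mean(data)      # sum of the values divided by how many there are

print(modes, middle, average)
```

`multimode` (Python 3.8+) is used here rather than `mode` because it returns every mode, which also covers the bi-modal case.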
Uniform distribution: Dataset with no mode. Like a histogram with no "peaks".
Bi-modal distribution: Some distributions have multiple modes. Two modes would be a bi-modal distribution, and can occur when a dataset has two local "peaks" in a histogram.
Normal distribution: Bell-shaped curve or histogram, that is symmetrical around the middle point.
Skewed distribution: Not symmetrical. Higher frequencies sit either to the left, with a tail to the right (positively skewed), or to the right, with a tail to the left (negatively skewed).
Outliers
Robust statistic: Strong and sturdy. For instance, the median is robust because it is not affected much by deviations from the norm (outliers).
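To see what robust means in practice, here is a minimal sketch comparing the mean and the median before and after adding one outlier. The numbers are invented for the example.

```python
from statistics import mean, median

values = [10, 11, 12, 13, 14]     # made-up values without an outlier
with_outlier = values + [100]     # same values plus one extreme value

mean_before, median_before = mean(values), median(values)
mean_after, median_after = mean(with_outlier), median(with_outlier)

# The mean is pulled far towards the outlier, the median barely moves.
print(mean_before, "->", mean_after)
print(median_before, "->", median_after)
```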
Spread out: Less concentrated data, uses a bigger range on the X-axis.
Consistent: More concentrated data, uses a smaller range on the X-axis.
Range:
Quartiles: Quarters of the data, for instance the first 25% of the data is the first quartile.
Q1: Median of the first half of data (between first and second quartile)
Q2: Median
Q3: Median of the second half of data (between third and fourth quartile)
Q1, Q2 and Q3 split the data into four equal parts.
Interquartile Range (IQR) = Q3-Q1
Outliers definition: < Q1 - 1.5 (IQR) or > Q3 + 1.5 (IQR)
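The quartile, IQR and outlier rules above can be sketched as follows. This assumes the "median of each half" convention from these notes; library functions such as `statistics.quantiles` or `numpy.percentile` use interpolation and may give slightly different quartiles. The function name `quartiles` and the data are made up for the example.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    s = sorted(data)
    n = len(s)
    half = n // 2
    lower = s[:half]            # first half of the data
    upper = s[half + n % 2:]    # second half; skips the middle value when n is odd
    return median(lower), median(s), median(upper)

data = [1, 3, 4, 5, 6, 7, 8, 9, 40]  # made-up values
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

low_fence = q1 - 1.5 * iqr    # anything below this is an outlier
high_fence = q3 + 1.5 * iqr   # anything above this is an outlier
outliers = [x for x in data if x < low_fence or x > high_fence]

print(q1, q2, q3, iqr, outliers)
```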
Boxplots: Used to visualize data's quartiles and outliers. Box for the IQR, lines for min-max (outlier limit) values, dots for outliers.
Average deviation: Sum of (xi - x̄) / n. Distance from the mean is a good measurement of variability, but positive and negative deviations cancel each other out. So we should use the absolute value to measure distance, and ignore negatives.
Absolute deviation: |xi-x̄|, ignoring negatives.
SS: Sum of Squares. Sum of squared deviations.
Variance: Average (mean) of squared deviations: sum((xi - x̄)²) / n
Standard deviation:
Bessel's correction
Variability in a sample is often lower than the variability in the population, so we divide by n-1 instead of n to try to correct for this.
Sample standard deviation (s) approximates the standard deviation of the population: s ≈ 𝜎
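A worked sketch of the chain from deviations to standard deviation, with and without Bessel's correction. The data is made up, and the standard deviation is taken as the square root of the variance.

```python
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values
n = len(data)
x_bar = sum(data) / n            # mean

deviations = [x - x_bar for x in data]              # these always sum to 0
avg_abs_dev = sum(abs(d) for d in deviations) / n   # average absolute deviation
ss = sum(d ** 2 for d in deviations)                # SS, sum of squared deviations

pop_variance = ss / n            # divide by n when the data is the whole population
pop_sd = sqrt(pop_variance)

sample_variance = ss / (n - 1)   # Bessel's correction: divide by n-1 for a sample
sample_sd = sqrt(sample_variance)

print(avg_abs_dev, ss, pop_variance, pop_sd, sample_sd)
```

The standard library offers the same results through `statistics.pvariance`, `statistics.pstdev`, `statistics.variance` and `statistics.stdev`; the long form is only there to show each step.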
z, or z-score: Number of standard deviations away from the mean. z = (x-μ)/𝜎
Standardizing a distribution: Converting the values in a normal distribution to z-scores.
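The z-score formula as a tiny sketch. The function name `z_score` and the example numbers (mean 100, standard deviation 15) are made up for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies away from the mean mu."""
    return (x - mu) / sigma

values = [85, 100, 130]  # made-up values from a distribution with mu=100, sigma=15
standardized = [z_score(x, 100, 15) for x in values]

print(standardized)  # negative z = below the mean, positive z = above the mean
```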
Standard normal distribution:
PDF - Probability Density Function. The curve we draw for our distribution. The area under the curve determines probability.
Horizontal asymptote: The graph for the normal distribution never touches the x-axis, since we are never sure whether there are outliers far out. So the x-axis is called a horizontal asymptote, and it goes from negative infinity to positive infinity :)
Probability in a normal distribution:
Z-table
A table for looking up areas under the standard normal curve. See the table. We look up the z in the table to find the probability. We find the
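Instead of a printed Z-table, the same areas can be looked up in code. This sketch uses `NormalDist` from Python's standard `statistics` module (Python 3.8+), whose `cdf` method gives the area to the left of a value. The chosen z of 1.0 is just an example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)  # mean 0, standard deviation 1

z = 1.0
p_below = standard_normal.cdf(z)   # area to the left of z, like a Z-table lookup
p_above = 1 - p_below              # area to the right of z
p_between = standard_normal.cdf(1) - standard_normal.cdf(-1)  # area within 1 SD of the mean

print(round(p_below, 4), round(p_above, 4), round(p_between, 4))
```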
SE = Standard Error, or Standard deviation of sampling distribution
Central Limit Theorem
We describe the location of the sample mean by calculating how many standard errors it is away from the center of the sampling distribution. This will give us a z-score for our sample mean.
Z-score = (X-bar - M) / SE
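A sketch of the z-score for a sample mean. It assumes the usual formula SE = 𝜎 / sqrt(n), which is not written out in these notes, and the function name and numbers are made up for the example.

```python
from math import sqrt

def sample_mean_z(x_bar, m, sigma, n):
    """z-score of a sample mean x_bar, given the sampling distribution's mean m."""
    se = sigma / sqrt(n)        # assumed formula: standard error shrinks as n grows
    return (x_bar - m) / se

# Made-up example: population sigma 15, sample of 36, sample mean 105, center 100
se = 15 / sqrt(36)
z = sample_mean_z(x_bar=105, m=100, sigma=15, n=36)

print(se, z)
```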