Correlation measures the strength and direction of the linear relationship between two variables. It is a number between -1 and 1. A correlation of 1 indicates a perfect positive relationship, -1 a perfect negative relationship, and 0 means no linear relationship. Correlation doesn't imply causation.
Key terms:
Positive correlation: As one variable increases, the other increases.
Negative correlation: As one variable increases, the other decreases.
No correlation: No consistent relationship between the variables.
The correlation coefficient (r): Values range from -1 to +1, indicating strength and direction.
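As a quick illustration, here is a minimal sketch in Python showing how r can be computed; the height and weight values are made up for the example.

```python
import numpy as np

# Hypothetical height (cm) and weight (kg) measurements
height = np.array([150, 160, 165, 170, 180, 185])
weight = np.array([52, 60, 63, 68, 78, 82])

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is r for the pair
r = np.corrcoef(height, weight)[0, 1]
print(f"Pearson r = {r:.2f}")  # close to +1: strong positive linear relationship
```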
Regression, on the other hand, models the relationship between a dependent variable and one or more independent variables. It estimates how the dependent variable changes as the independent variable(s) change. The most common form is linear regression, which fits a straight line to the data; like correlation it captures the direction of the relationship, but its focus is on predicting one variable from the other(s).
Pearson/Spearman correlation is used for two variables that are continuous (Pearson) or ordinal/ranked (Spearman), for example height and weight. It measures the strength and direction of the relationship but doesn't imply causation.
Chi-square is used for categorical data (nominal or ordinal) with 2 categorical variables (e.g., gender and voting preference), testing for association without indicating direction.
Regression models the relationship between a dependent variable and one or more independent variables, predicting the outcome and estimating the strength and direction of the relationship. Unlike correlation, it distinguishes predictors from an outcome and is used for prediction; it can hint at possible causal links, but only careful study design can establish causation.
| | Chi-Square | Pearson Correlation | Spearman Correlation | Regression |
|---|---|---|---|---|
| Data type | Categorical | Continuous | Ordinal or continuous | Continuous (and sometimes categorical) |
| Purpose | Tests for association / independence | Measures strength and direction of a linear relationship | Measures a monotonic relationship | Models relationships; predicts an outcome |
| Measures strength | No | Yes | Yes | Yes |
| Measures direction | No | Yes | Yes | Yes |
| Predicts outcome | No | No | No | Yes |
| Independent variables | 2 | 1 | 1 | 1 or more |
| Dependent variables | — | 1 | 1 | 1 |
Imagine running a lemonade stand.
Chi-Square Statistic (χ²): Tells you whether the weather (sunny or rainy) is associated with lemonade sales (high or low).
p-value: Tells you whether the relationship between weather and sales is statistically significant (e.g., a p-value of 0.03 means an association this strong would be unlikely to occur by chance if weather and sales were unrelated).
Contingency Table: Shows the number of high and low sales for sunny vs. rainy days.
Effect Size (Cramér’s V or Phi φ): Tells you how strong the relationship is between weather and sales (small, medium, or large effect).
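A minimal sketch of how these pieces might be computed in Python, using SciPy's chi2_contingency on a weather-vs-sales contingency table; the counts are invented for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = weather, columns = sales level
#                 high sales   low sales
table = np.array([[30, 10],    # sunny days
                  [12, 28]])   # rainy days

chi2, p, dof, expected = chi2_contingency(table)

# Cramér's V: effect size for a contingency table (for a 2x2 table it equals phi)
n = table.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

print(f"chi2 = {chi2:.2f}, p = {p:.3f}, Cramér's V = {v:.2f}")
```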
Pearson/Spearman Correlation (r): Measures how price and sales are related.
+1 means as price goes up, sales always go up.
-1 means as price goes up, sales always go down.
0 means no linear relationship.
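A minimal sketch in Python, using SciPy's pearsonr and spearmanr on made-up price and sales figures.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical prices (£) and daily lemonade sales
price = np.array([0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0])
sales = np.array([120, 110, 95, 80, 70, 55, 40])

r, p_r = pearsonr(price, sales)       # linear relationship
rho, p_rho = spearmanr(price, sales)  # monotonic relationship (based on ranks)

print(f"Pearson r = {r:.2f} (p = {p_r:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
# Both near -1: higher prices go with lower sales
```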
Regression Coefficients (β): Show how much sales change for each unit change in the price of lemonade.
Intercept (β₀): The expected sales when the price is zero.
R²: Tells you how much of the variation in sales can be explained by price changes.
p-values: Tests if price significantly affects sales.
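These quantities can all be read off a fitted model. Here is a minimal sketch using statsmodels OLS on made-up price and sales data.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical prices (£) and daily lemonade sales
price = np.array([0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0])
sales = np.array([120, 110, 95, 80, 70, 55, 40])

X = sm.add_constant(price)   # adds the intercept term beta_0
model = sm.OLS(sales, X).fit()

print(model.params)     # [intercept, slope]: expected sales at price 0, and change per £1
print(model.rsquared)   # share of the variation in sales explained by price
print(model.pvalues)    # does price significantly affect sales?
```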
Regression helps you understand how multiple factors (predictors) affect one outcome (dependent variable). In the case of selling lemonade, the dependent variable could be sales, and the independent variables (predictors) could include price, temperature, and advertising effort.
Description: Linear regression models the relationship between two variables with a straight line.
Example: If price increases by £1, sales decrease by 50 lemons. This suggests a linear relationship: for each £1 increase in price, you lose 50 lemons in sales.
Key Idea: It shows a constant rate of change between the independent and dependent variables (price and sales).
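A minimal sketch of that constant rate of change, fitting a straight line with NumPy's polyfit to invented data chosen so the slope comes out near -50.

```python
import numpy as np

# Hypothetical data consistent with "each £1 increase loses about 50 lemons"
price = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
sales = np.array([150, 126, 99, 76, 50])

slope, intercept = np.polyfit(price, sales, deg=1)  # fit a straight line
print(f"sales ≈ {intercept:.0f} + ({slope:.0f}) * price")  # slope near -50
```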
Description: Multiple regression models the relationship between two or more independent variables and a dependent variable.
Example: Predicting lemonade sales based on price and marketing spend. The model could show how each factor (price and marketing spend) influences sales.
Key Idea: It helps you understand how multiple factors work together to affect the outcome (sales).
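A minimal sketch of a multiple regression in Python, again with statsmodels and invented price, marketing-spend, and sales figures.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: price (£), marketing spend (£), and daily sales
price     = np.array([1.0, 1.0, 1.5, 1.5, 2.0, 2.0, 2.5, 2.5])
marketing = np.array([0,   10,  0,   10,  0,   10,  0,   10 ])
sales     = np.array([120, 140, 100, 118, 78,  101, 60,  79 ])

X = sm.add_constant(np.column_stack([price, marketing]))
model = sm.OLS(sales, X).fit()

# One coefficient per predictor: the effect of each factor with the other held fixed
print(model.params)   # [intercept, price effect, marketing effect]
```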
Description: Logistic regression is used when the dependent variable is binary (e.g., yes/no or high/low).
Example: Predicting if sales are above or below a threshold (e.g., whether you will sell more than 100 lemons) based on factors like price and temperature.
Key Idea: It predicts the probability of one outcome or the other (like a yes/no decision).
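A minimal sketch using scikit-learn's LogisticRegression; the price, temperature, and "sold more than 100 lemons" values are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: [price in £, temperature in °C] and whether >100 lemons were sold
X = np.array([[1.0, 30], [1.0, 20], [1.5, 32], [1.5, 18],
              [2.0, 31], [2.0, 19], [2.5, 33], [2.5, 17]])
y = np.array([1, 1, 1, 0, 1, 0, 0, 0])  # 1 = sold more than 100 lemons

clf = LogisticRegression().fit(X, y)

# Predicted probability of selling more than 100 lemons at £1.20 on a 28 °C day
print(clf.predict_proba([[1.2, 28]])[0, 1])
```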
Description: Logarithmic regression models a relationship in which the rate of change shrinks as the independent variable increases (i.e., diminishing returns).
Example: If the price of lemonade increases, initially sales drop sharply, but as price continues to rise, the effect on sales becomes smaller (diminishing returns).
Key Idea: It’s useful when the effect of one variable (like price) has a large impact at first, but that impact tapers off as the independent variable increases.
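A minimal sketch in Python: take the logarithm of price and fit an ordinary least-squares line to it, using invented data that show diminishing returns.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data showing diminishing returns of price increases on sales
price = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
sales = np.array([150, 118, 100, 88, 79, 71, 65, 60])

# Logarithmic model: sales = b0 + b1 * ln(price)
X = sm.add_constant(np.log(price))
model = sm.OLS(sales, X).fit()

# b1 is negative; each doubling of price reduces sales by roughly the same amount (b1 * ln 2)
print(model.params)
```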
In summary, regression models help predict sales based on various factors and show the strength and direction of these relationships, whether linear or non-linear, and even when the outcome is binary.