2. Correlation and Regression

Learning objectives (and summaries)

Describe the relationship between two quantitative variables.

Create and read a scatterplot. If there is an explanatory variable, place it on the horizontal axis.
- Dots for each individual placed on an x,y plot. Each axis is a different variable.
Identify linear and nonlinear patterns in a scatterplot.
- Association is finding a pattern in the data. Sometimes the association is linear. Other times the data needs to be transformed (it may be parabolic, exponential, logarithmic, etc.) before it appears linear.
Visually estimate the correlation of a scatterplot. Calculate the exact coefficient using StatKey. Know how to interpret the correlation coefficient, r, and the percent explained, r².
- -1 ≤ r ≤ 1, r tends towards 0 when there is no correlation, and r tends towards -1 or +1 when there is strong correlation (negative or positive, respectively).
- Taking the square of the correlation coefficient, r², tells you what fraction of the variation in y is explained (predicted) by x.
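To check a StatKey result by hand, the calculation of r and r² can be sketched in a few lines of Python. This is only an illustration: the data are the smoking table from practice problem 15, and the function name `correlation` is a made-up helper, not something StatKey provides.

```python
from math import sqrt

# Data from practice problem 15: age at which smoking stopped vs. cumulative risk (%)
ages = [0, 30, 40, 50, 60, 75]
risks = [0.2, 1.1, 2.6, 5.6, 11.1, 15.7]

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = correlation(ages, risks)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")  # r = 0.896, r^2 = 0.803
```

Here r² of about 0.80 means the age at quitting explains roughly 80% of the variation in cumulative risk.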
Visually identify outliers and influential points and understand their influence on a regression line.
- Outliers are far from the trend line. Sometimes they have a lot of influence on the regression line.
- Influential points fall along the trend line, but have an unusual amount of influence over where the regression line goes.
Visually estimate the location of the regression line on a scatterplot. Calculate the exact line using StatKey. Interpret the slope of a regression line with units.
- It is the line that best fits the data.
- Using StatKey, find the slope and intercept and write out the equation of the line.
- Explain the slope as a measure of change in y for a given change in x. Use units for x and y. For example, "the salary of a student is expected to increase $3000 for each additional year in school".
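The slope and intercept that StatKey reports can also be reproduced directly. The sketch below assumes the smoking data from practice problem 15 and uses a hypothetical helper named `least_squares`; it is one straightforward way to do the arithmetic, not a description of how StatKey computes it.

```python
# Data from practice problem 15: age at which smoking stopped vs. cumulative risk (%)
ages = [0, 30, 40, 50, 60, 75]
risks = [0.2, 1.1, 2.6, 5.6, 11.1, 15.7]

def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line always passes through (mean_x, mean_y)
    return slope, intercept

slope, intercept = least_squares(ages, risks)
print(f"y = {slope:.3f}x + ({intercept:.3f})")  # y = 0.212x + (-2.955)
```

Read with units, the slope says that each additional year of age before quitting is linked to about 0.212 more percentage points of cumulative lung cancer risk.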
Predict new values using interpolation and extrapolation with the regression line. Understand each of their limitations.
- Interpolation is predicting values within the range of the data. If the data form a strong association, this is often reliable.
- Extrapolation is predicting values above or below the range of the data. Making predictions into the future is extrapolation. It is far less reliable, especially as you go further out.
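The difference between the two kinds of prediction can be made concrete with a short Python sketch. It assumes the union-membership data from practice problem 17; the helpers `least_squares` and `predict` are illustrative names only.

```python
# Data from practice problem 17: year vs. percent of workers in a union
years = [1945, 1950, 1960, 1970, 1980, 1986, 1993]
percents = [35.5, 31.5, 31.4, 27.3, 21.9, 17.5, 15.8]

def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = least_squares(years, percents)

def predict(year):
    """Return (predicted percent, kind of prediction) for a given year."""
    inside = min(years) <= year <= max(years)
    kind = "interpolation" if inside else "extrapolation"
    return slope * year + intercept, kind

for year in (1990, 2012, 2035):
    value, kind = predict(year)
    print(f"{year}: {value:.1f}% ({kind})")
```

The 1990 prediction falls inside the observed years, while the 2035 prediction is a negative percent, which shows how extrapolation can drift into impossible values.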
Understand logarithmic and power transformations in one or both variables to achieve linearity.
- Logarithmic/power transformations make a geometric variable (increases by continually multiplying) into an arithmetic variable (increases by continually adding). The transformation makes exponential patterns become linear and thus usable for regression. This is easiest to play with in Gapminder.
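Outside of Gapminder, the effect of a logarithmic transformation can be checked numerically. The sketch below is an illustration under assumptions: it reuses the smoking data from practice problem 15 (which curves upward) and takes the natural log of the y variable only. The helper `correlation` is a made-up name.

```python
from math import log, sqrt

# Data from practice problem 15: age at which smoking stopped vs. cumulative risk (%)
ages = [0, 30, 40, 50, 60, 75]
risks = [0.2, 1.1, 2.6, 5.6, 11.1, 15.7]

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r_raw = correlation(ages, risks)

# Transform only the response: multiplying steps in y become adding steps in log(y)
log_risks = [log(y) for y in risks]
r_log = correlation(ages, log_risks)

print(f"r before transforming y: {r_raw:.3f}")
print(f"r after taking log(y):   {r_log:.3f}")
```

If the transformed correlation is closer to 1 than the original, the points lie closer to a straight line after the transformation, so a regression line is a better description of the transformed data.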
Calculate a confidence interval of the slope. Explain how bootstrapping is used to create the interval. Interpret your result in a sentence.
- Use StatKey to resample the original set many times. Observe the different slopes calculated from each of the resamples. Find the middle __% of the calculated slopes to create an interval.
- Template sentence: "We're __% confident that an increase of 1 [x-variable's units] is linked to an average [increase or decrease] of [interval of the slope and y-variable's units] for [the relevant population]."
- Example sentence: "We're 95% confident that an increase of 1 year of education is linked to an average increase of $2950 to $3420 in salary for Americans."
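The resampling idea described above can be sketched in Python as follows. This is a minimal percentile bootstrap, assuming the BAC data from practice problem 16; the function names, the seed, and the number of resamples are arbitrary illustrative choices, and StatKey's own resampling will give slightly different endpoints.

```python
import random

# Data from practice problem 16: hours since drinking stopped vs. current BAC
hours = [2, 1, 1, 3, 3, 2, 1, 2, 3, 4, 1, 2]
bac = [0.041, 0.093, 0.083, 0.017, 0.006, 0.023,
       0.092, 0.044, 0.018, 0.002, 0.085, 0.056]

def slope_of(xs, ys):
    """Least-squares slope, or None if every x is identical (no line can be fit)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx

def bootstrap_interval(xs, ys, level=0.95, resamples=5000, seed=1):
    """Percentile bootstrap interval for the slope."""
    rng = random.Random(seed)
    pairs = list(zip(xs, ys))
    slopes = []
    while len(slopes) < resamples:
        # Resample whole (x, y) points with replacement, same size as the original
        sample = [rng.choice(pairs) for _ in pairs]
        s = slope_of([p[0] for p in sample], [p[1] for p in sample])
        if s is not None:
            slopes.append(s)
    slopes.sort()
    cut = round((1 - level) / 2 * resamples)  # number of slopes trimmed from each tail
    return slopes[cut], slopes[resamples - cut - 1]

low, high = bootstrap_interval(hours, bac)
print(f"95% bootstrap interval for the slope: {low:.3f} to {high:.3f}")
```

The interval is simply the middle 95% of the sorted resampled slopes, which is what "find the middle __%" means in the summary above.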
Execute a hypothesis test that the slope is not zero. Explain how the randomization test works. Interpret your p-value in a sentence.
- The randomization test shuffles the y values so they no longer correspond to their original x values. This simulates a world where x and y are independent, and thus have a true slope of 0. It creates many shuffled data sets like this and calculates the slope of the regression line for each one. The goal is to see how unlikely it is to find a slope as extreme as, or more extreme than, the slope we actually found.
- Template sentence: "Assuming [x-variable] and [y-variable] are independent for [population], there is a [p-value] probability of finding a slope as extreme as we did."
- Example sentence: "Assuming years of education and salary are independent for Americans, there is a .001 probability of finding a slope as extreme as we did."
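The shuffling procedure described above can be sketched in Python. This is a rough illustration, assuming the BAC data from practice problem 16; the function names, the seed, and the number of shuffles are arbitrary choices, and StatKey's result will differ slightly because the shuffles are random.

```python
import random

# Data from practice problem 16: hours since drinking stopped vs. current BAC
hours = [2, 1, 1, 3, 3, 2, 1, 2, 3, 4, 1, 2]
bac = [0.041, 0.093, 0.083, 0.017, 0.006, 0.023,
       0.092, 0.044, 0.018, 0.002, 0.085, 0.056]

def slope_of(xs, ys):
    """Least-squares slope for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def randomization_p_value(xs, ys, shuffles=5000, seed=1):
    """Two-sided p-value: share of shuffled slopes at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = slope_of(xs, ys)
    shuffled = list(ys)
    extreme = 0
    for _ in range(shuffles):
        rng.shuffle(shuffled)  # break any link between x and y
        if abs(slope_of(xs, shuffled)) >= abs(observed):
            extreme += 1
    return extreme / shuffles

p_value = randomization_p_value(hours, bac)
print(f"p-value: {p_value:.3f}")
```

A p-value near 0 here means that almost none of the shuffled data sets produced a slope as steep as the real one, so independence is hard to believe.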

Assessment

- Test (18pts): 16 questions (9 MC, 5 numeric, 2 written/drawn) and one of these free response questions (2pts):
  - The y-intercept is rarely useful on its own in a regression equation. However, it is still necessary. Why?
  - Imagine that you performed linear regression on an American family’s income and how many cars they own. A hypothesis test that the slope is not zero results in a p-value of 0.03. In a sentence explain exactly what that p-value means (not the conclusion about independence).
  - In more advanced regression with multiple explanatory variables, you may end up with an equation like the following: y = 26000 + 4000x₁ + 800x₂, where y is the projected income, x₁ is the number of years of education after high school, and x₂ is the number of years of experience in a single industry. How much would you expect your income to change if you go back to school for a 2-year master’s degree without any additional years of experience? Explain and show your work.

Gapminder Video: examples and rubric

Instruction

Guided Notes

Vocabulary

Scatter plot: a set of (x,y) points graphed to show a relationship between x and y
Strong association: the data points form a clear pattern
Weak association: the data points do not form a clear pattern
Outlier: a point that is far away from the pattern of the other data points
Explanatory variable / independent variable: the variable that causes the other variable to change; it is plotted on the x-axis
Response variable / dependent variable: the variable that reacts to the explanatory variable; it is plotted on the y-axis
Lurking variable: a variable that is not displayed but explains (causes) both of the displayed variables
Correlation: linear association (how well the data points form a straight line)
Correlation coefficient: (r) a measure of how well the data points form a straight line; its value is between -1 and 1, where -1 is a perfectly straight line with a negative slope, 0 means there is no correlation, and 1 is a perfectly straight line with a positive slope
Least-squares regression line / best-fit line: a straight line that best matches the pattern of the data (technically, the line where the sum of the squares of all the residuals is smallest, which is why it is called a "least-squares" regression line)
Percent explained: the square of the correlation coefficient (r²) -- when converted to a percent, it tells you what percent of the variation in y is explained by x
Residual: the error in the best-fit line's prediction of y (actual y-value - predicted y-value)
Interpolation: using a best-fit line to predict a y value given an x-value that falls between the min and max x-values (within the min and max x-values)
Extrapolation: using a best-fit line to predict a y value given an x-value that falls outside the min and max x-values (external from the min and max x-values)
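The definitions of residual and least-squares regression line above can be checked with a small Python sketch. It assumes the smoking data from practice problem 15; the helper names and the "nudged" comparison line (the best-fit line with its slope changed by 0.02) are made up purely for illustration.

```python
# Data from practice problem 15: age at which smoking stopped vs. cumulative risk (%)
ages = [0, 30, 40, 50, 60, 75]
risks = [0.2, 1.1, 2.6, 5.6, 11.1, 15.7]

def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals(xs, ys, slope, intercept):
    """Residual = actual y-value - predicted y-value, for each point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

def sum_of_squares(values):
    return sum(v ** 2 for v in values)

slope, intercept = least_squares(ages, risks)
best_fit = residuals(ages, risks, slope, intercept)
residual_at_30 = best_fit[ages.index(30)]

# Any other line, such as one with a slightly different slope, has a larger sum of squares
best = sum_of_squares(best_fit)
nudged = sum_of_squares(residuals(ages, risks, slope + 0.02, intercept))

print(f"residual at age 30: {residual_at_30:.2f}")
print(f"sum of squared residuals: best-fit {best:.2f}, nudged line {nudged:.2f}")
```

A negative residual means the line over-predicted that point, and the comparison of the two sums shows why the line is called "least-squares".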

Practice

For each of the graphs above (figures 1-4):

a. Is there a strong association to this shape?
b. Is the correlation strong or weak? Positive or negative? Guess ‘r’, the correlation coefficient.

Answer the following general questions:

5. If there is an explanatory variable, which axis should it be placed on? What should you do if one variable does not clearly explain the other?
6. Is time a quantitative variable or a categorical variable? Based on this answer, can you make a scatter plot of time and another quantitative variable?
7. Imagine that I have a distribution of 40 heights that I collected in a survey. I also have another distribution of 40 weights that I collected from a different survey. Are both of these quantitative variables? Based on this, can I use this data to create a scatter plot? Explain.
8. What does the value of r² tell you?

For each scenario below, do the following:

a. What would each dot on this scatter plot represent? Be as specific as possible.
b. Do you think the two variables are associated? Is it a positive or negative association?
c. Of the two quantitative variables, which is the explanatory (independent) variable and which is the response (dependent) variable? Explain why. If neither is the cause, is the association a coincidence or is there a lurking variable? If a lurking variable, explain your theory.

9. The weekly grocery bill is associated with the number of family members.

10. Life insurance companies base their premiums (monthly price) partially on the age of the applicant.

11. Shoe size is highly correlated with reading scores before age 14.

12. A study is done to determine if elderly drivers are involved in more motor vehicle fatalities than all other drivers. The number of fatalities per 100,000 drivers is compared to the age of drivers.

13. On a summer day, the number of people sleeping at a given hour is associated with the temperature outside.

14. Utility bills vary according to power consumption.

For each table below (15 and 16), do the following:

a) Which variable in the table is the likely cause? Which axis should it be plotted on?
b) For #15, create a scatter plot of the data on paper. Remember to accurately label the graph. Don't be a slacker, actually do it.
c) Create a scatter plot in StatKey.
d) Is there a relationship? What is the shape? Take a guess at ‘r’, the correlation coefficient.
e) Calculate r and r². Interpret r².
f) Find the equation of the line of best fit.
g) For #15, draw this equation on the graph. To do this, use the equation with a low x-value (such as 0) and a high x-value (such as 75) and find the predicted y-values. Use these 2 points to draw your line. Don't just guess.
h) Explain the slope in context (with units).
i) Explain the y-intercept in context. Is this intercept extrapolation? Why or why not?
j) For #15 find the residual when age is 30. What does this tell you?
k) For #15, imagine a new point is added at (45, 2.1). Is this an outlier? Is it an influential point?
l) For #15, imagine a new point is added at (100, 17). Is this an outlier? Is it an influential point?
m) Find a 95% confidence interval of the slope.
n) Explain the interval in a clear sentence.
o) Write out your null and alternative hypotheses for a hypothesis test that the slope is not 0.
p) Perform the hypothesis test. What is your p-value?
q) Based on this test, do you conclude that the variables are independent or dependent? How does this compare to your earlier guess? If it is different, why?

15. Use the chart below to compare the age at which someone quit smoking to their cumulative risk of lung cancer.

Age, CumRisk
0, 0.2
30, 1.1
40, 2.6
50, 5.6
60, 11.1
75, 15.7

16. Use the chart below to compare the current blood alcohol content (BAC) of different people who each started with a BAC of 0.12 a different number of hours ago.

Hrs, BAC
2, 0.041
1, 0.093
1, 0.083
3, 0.017
3, 0.006
2, 0.023
1, 0.092
2, 0.044
3, 0.018
4, 0.002
1, 0.085
2, 0.056

17. Percentage of workers in a union over time (see sub-questions below)

Year, Percent
1945, 35.5
1950, 31.5
1960, 31.4
1970, 27.3
1980, 21.9
1986, 17.5
1993, 15.8

a) Calculate the least squares line. Write down the equation in the form of y=mx+b.
b) Identify the slope of the regression line and explain what it means in this problem.
c) Identify the intercept of the regression line. Does this have any realistic meaning?
d) Does it appear that a line is the best way to fit the data? Why or why not?
e) Find the correlation coefficient. Does this imply that the line is a good fit?
f) How much (what percent) does the year explain the variation in the percent of workers in a union?
g) Use interpolation to estimate the percent of people in a union in 1990.
h) Use extrapolation to predict the percent of people in a union in 2012. Do you think this is a reasonable prediction?
i) Use extrapolation to predict the percent of people in a union in 2035. Does this prediction make sense?

Practice solutions

1. a) yes, strong association
   b) medium correlation, positive (it is going up), so r ≈ 0.8
2. a) yes, strong association
   b) weak correlation, barely negative (goes down AND up), so r ≈ -0.1 [note: the only reason I wouldn't guess 0 is because of the extra point on the top left]
3. a) yes, strong association
   b) very strong correlation, negative (it is going down), so r ≈ -0.99
4. a) yes, strong association
   b) medium correlation, negative (it is going down), so r ≈ -0.8

5. Explanatory variables always go on the x-axis. If neither variable clearly explains the other, it doesn't matter which axis each one goes on.
6. Time is quantitative -- you can count it numerically. Yes, you can plot time against another number (such as weight at different ages). These are often called time plots.
7. Both are quantitative variables. However, this data cannot be used for a scatter plot. A scatter plot uses two quantitative variables measured on the SAME individual. Thus, every dot on a scatter plot represents one individual in the study.
8. r² tells you how much x explains (or predicts) the variation in y. Assuming the explanatory variable is on the x-axis, this is a good measure of how well you can predict your response variable when you know your explanatory variable. As r² gets bigger, the response variable gets more predictable.
9. a) every dot represents an individual; in this case, a family
   b) yes -- they seem related, and both should go up together (positive association)
   c) family size explains the grocery bill because you have more people to feed (and a high grocery bill clearly doesn't cause your family to grow)
10. a) every dot represents an individual; in this case, a person
    b) yes -- older people seem like they should get different rates for life insurance than younger people, and older should mean more expensive (so positive association)
    c) age explains premium because, on average, older people have fewer years left to live than younger people
11. a) every dot represents an individual; in this case, a kid
    b) yes, as kids grow they wear bigger shoes and read better (so positive association)
    c) the lurking variable is age, because older kids read better and have bigger feet!
12. a) every dot represents an age (or age group) and its fatality rate; this is not as easy as the other questions because we are plotting a rate (fatalities per 100,000 drivers) for people of a given age, instead of just plotting individuals
    b) yes, I think that older people are less likely to see well or drive the right speed and so are more likely to get in accidents
    c) age explains the fatality rate (see above)
13. a) every dot represents a new measurement of # of people taken at a different time (when we look at time instead of individuals, it is often called a time plot)
    b) yes -- more people sleep at night, and the temperature is usually lower at night, so there is an inverse relationship (negative association)
    c) both are a response to time of day, though temperature may be considered more a cause than # of sleepers
14. a) every dot represents a new monthly bill -- how much it cost and how much power was used
    b) yes -- more power = larger bill, so positive correlation
    c) the power company charges you based on how much you use, so usage causes the bill

15. a) age when smoking stopped (x-axis) affects the risk of lung cancer (y-axis)
    b) do what you see below, but do it by hand.
    c) [Figure: age when 75-year-olds stopped smoking vs. percent with lung cancer]
    d) yes there is a relationship -- it looks exponential; r ≈ 0.9 (the points still form a pretty straight line)
    e) r = 0.896, r² = 0.803
       r is a measure of how well the data form a straight line; in this problem, the data form a fairly straight line with a positive slope
       r² is how much x explains the variation in y; in this problem: the age at which smoking ceased explains 80% of the variation in cumulative risk of lung cancer
    f) y = 0.212x - 2.955
    g) To do the drawing, calculate endpoints of a line:
       for x=0: y=0.212(0)-2.955=-2.955
       for x=75: y=0.212(75)-2.955=12.945
       So plot the points (0,-2.955) and (75,12.945) and draw a line through them.
    h) Each additional year that someone keeps smoking is linked to an increased cumulative risk of lung cancer of 0.212%.
    i) If a person quit smoking at age 0 (i.e. they never started smoking), they would have a cumulative risk of lung cancer of -2.955%. This does not make sense (0% risk is the lowest you can go), but it is NOT extrapolation. Our lowest x-value is an age of 0, so it is interpolation. The impossible value comes from fitting a straight line to curved data: the residual at age 0 is 0.2 - (-2.955) = 3.155.
    j) Predicted: y=0.212(30)-2.955=3.405, Actual: 1.1
       Residual: 1.1 - 3.405 = -2.305 (note that it is negative...residuals do that sometimes; it means the line over-predicted the risk)
    k) It is an outlier because it is well off the pattern (especially the trend line). It is not super influential because it falls near other data points, not off in space alone.
    l) It is not an outlier -- it is almost on the trend line. However, it has a lot of influence over the trend line.
    m) 95% interval is approximately .084 to .395
    n) We're 95% confident that an increase of one year of smoking is linked to an increase of .084% to .395% in the risk of lung cancer.
    o) Null: slope = 0, Alt: slope ≠ 0. You could also say Null: the variables are independent, Alt: the variables are dependent
    p) 0.000 or something close. That means: "assuming the age you quit smoking and lung cancer risk are independent, there is a 0% chance of finding a slope as extreme as I did". It feels weird to say there is no chance, but that's just because we have only 3 decimal places (maybe there is a 0.0004 probability, for example).
    q) Clearly dependent. We just found there is almost no chance of finding our slope when we assume independence, so we reject that assumption and decide it is dependent.
16. a) time since drinking ceased (x-axis) affects BAC (y-axis)
    b) n/a
    c) do it!
    d) yes -- linear, perhaps r ≈ -0.9 (the correlation is strong and the slope is negative)
    e) r = -0.94, r² = 0.88
       r² is how much x explains the variation in y; in this problem: the time since you stopped drinking explains 88% of the variation of BAC
       (contrary to popular belief, this relationship is in fact very linear -- a person cannot speed up a decrease in BAC; many people drive under the influence as a result)
    f) y = -0.032x + 0.114
    g) n/a
    h) For every additional hour after drinking stops, BAC reduces by 0.032.
    i) When someone has stopped drinking for 0 hours (the moment they stop), their BAC is 0.114. This is close, but not exactly what we know to be true (since the study had people start with exactly 0.12 BAC)
    j) n/a
    k) n/a
    l) n/a
    m) -0.042 to -0.025
    n) We are 95% confident that each additional hour after drinking stops is linked to a decrease in BAC of 0.025 to 0.042.
    o) Null: slope = 0, Alt: slope ≠ 0
    p) 0.00
    q) Based on the p-value, it is clear that they are dependent. The p-value is approximately 0, so we reject the null and conclude that BAC depends on the time since drinking stopped.

17. a) y = -0.406x + 825.384
    b) slope = -0.406. This means that each year, the percent of workers in a union drops by about 0.4 percentage points.
    c) The intercept is 825.384. This means that around the birth of Christ (year 0), 825.384% of workers were in a union. Since that doesn’t make a lot of sense, all the y-intercept does in this problem is position the line so that the predictions for years in the 1900s make sense.
    d) Yes -- there is a clear linear relationship.
    e) r = -0.980. This is very good.
    f) r² = .960. Thus, the year explains 96.0% of the variation in the percent of workers in a union.
    g) (-.406)(1990) + 825.384 = 17.4%
    h) (-.406)(2012) + 825.384 = 8.51%. WikiAnswers said 9.4%, so 8.51% is not a bad guess. Note that linear predictions of things like percentages tend to under-estimate when they get close to 0%.
    i) (-.406)(2035) + 825.384 = -0.83%. Since you cannot have a negative percent of workers in unions, this estimate is obviously too low. However, the actual answer will probably approach 0%.

Notes

Sources

Table for #15: http://info.cancerresearchuk.org/cancerstats/causes/lifestyle/tobacco/#Lung

Parts of problems 5-14: Dean, S., & Illowsky, B. 2011. Linear Regression and Correlation: Homework. Connexions, August 11, 2011. http://cnx.org/content/m17085/1.12/.