- Scatter plot: a set of (x,y) points graphed to show a relationship between x and y
- Strong association: the data points form a clear pattern
- Weak association: the data points do not form a clear pattern
- Outlier: a point that is far away from the pattern of the other data points
- Explanatory variable / independent variable: the variable used to explain or predict changes in the other variable; it is plotted on the x-axis
- Response variable / dependent variable: the variable that reacts to the explanatory variable; it is plotted on the y-axis
- Lurking variable: a variable that is not displayed but explains (causes) both of the displayed variables
- Correlation: linear association (how well the data points form a straight line)
- Correlation coefficient: (r) a measure of how well the data points form a straight line; its value is between -1 and 1, where -1 is a perfectly straight line with a negative slope, 0 means there is no correlation, and 1 is a perfectly straight line with a positive slope
- Least-squares regression line / best-fit line: a straight line that best matches the pattern of the data (technically, the line where the sum of the squares of all the residuals is smallest, which is why it is called a "least-squares" regression line)
- Percent explained: the square of the correlation coefficient (r²) -- when converted to a percent, it tells you how much of the variation in y is explained by x
- Residual: the error in the best-fit line's prediction of y (actual y-value - predicted y-value)
- Interpolation: using a best-fit line to predict a y-value for an x-value that falls between the smallest and largest observed x-values
- Extrapolation: using a best-fit line to predict a y-value for an x-value that falls outside the range of the observed x-values
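The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the assignment: the five data points are made up, and the helper name `is_interpolation` is hypothetical.

```python
import math

# Hypothetical (x, y) data, made up purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sums of squares and cross-products around the means.
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Correlation coefficient: how well the points form a straight line.
r = sxy / math.sqrt(sxx * syy)

# Least-squares regression line: y = intercept + slope * x.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Percent explained: r squared, converted to a percent.
percent_explained = 100 * r ** 2

# Residual: actual y-value minus predicted y-value, one per point.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def is_interpolation(x):
    """True if x lies within the observed x-range, False if predicting there is extrapolation."""
    return min(xs) <= x <= max(xs)


print(f"r = {r:.3f}, line: y = {slope:.3f}x + {intercept:.3f}")
print(f"percent explained = {percent_explained:.1f}%")
print("residuals:", [round(e, 3) for e in residuals])
print("x = 3.5 interpolation?", is_interpolation(3.5))
print("x = 9.0 interpolation?", is_interpolation(9.0))
```

For a least-squares line with an intercept, the residuals always sum to zero (up to rounding), which is a quick sanity check on a hand calculation.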
For each table below (15 and 16), do the following:
- m) Find a 95% confidence interval of the slope.
- n) Explain the interval in a clear sentence.
- o) Write out your null and alternative hypotheses for a hypothesis test that the slope is not 0.
- p) Perform the hypothesis test. What is your p-value?
- q) Based on this test, do you conclude that the variables are independent or dependent? How does this compare to your earlier guess? If it is different, why?
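Parts m) through q) can be sketched by hand as follows. This is only an illustration under stated assumptions: the data are made up, and the critical value 3.182 is the two-sided 95% t-table value for 3 degrees of freedom, which matches this toy data set only (n = 5). An exact p-value needs the t-distribution from a calculator or statistics package, so the sketch compares the test statistic to the critical value instead.

```python
import math

# Hypothetical (x, y) data, made up purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Sum of squared residuals, then the standard error of the slope.
# The residual variance uses n - 2 degrees of freedom.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
se_slope = math.sqrt(sse / (n - 2) / sxx)

# Two-sided 95% critical value for t with n - 2 = 3 degrees of freedom (from a t-table).
T_CRIT_DF3 = 3.182

# Part m): 95% confidence interval for the slope.
ci_low = slope - T_CRIT_DF3 * se_slope
ci_high = slope + T_CRIT_DF3 * se_slope

# Parts o) and p): test Null: slope = 0 against Alt: slope != 0.
t_stat = slope / se_slope
reject_null = abs(t_stat) > T_CRIT_DF3

print(f"slope = {slope:.3f}, standard error = {se_slope:.3f}")
print(f"95% CI for the slope: {ci_low:.3f} to {ci_high:.3f}")
print(f"t = {t_stat:.3f}, reject the null at the 5% level? {reject_null}")
```

With so few made-up points the interval contains 0, so this toy example fails to reject the null. Rejecting the null at the 5% level and having a 95% interval that excludes 0 always go together.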
15. Use the chart below to compare the age at which someone quit smoking to their cumulative risk of lung cancer.
16. Use the chart below to compare the number of hours since people had a BAC of 0.12 with their current blood alcohol content (BAC).
17. Percentage of workers in a union over time (see sub-questions below)
- a) age when smoking stopped (x-axis) affects the risk of lung cancer (y-axis)
b) do what you see below, but do it by hand.
Age when 75-year-olds stopped smoking vs. percent with lung cancer
d) yes there is a relationship -- it looks exponential/logarithmic; r ≈ 0.9 (the points still form a pretty straight line)
e) r = 0.896, r² = 0.803
r is a measure of how well the data form a straight line; in this problem, the data form a fairly straight line with a positive slope
r² is how much of the variation in y is explained by x; in this problem, the age at which smoking ceased explains 80% of the variation in cumulative risk of lung cancer
f) y = 0.212x - 2.955
g) To do the drawing, calculate endpoints of a line:
for x=0: y=0.212(0)-2.955=-2.955
for x=75: y=0.212(75)-2.955=12.945
So plot the points (0,-2.955) and (75,12.945) and draw a line through them.
h) Each additional year that someone keeps smoking is linked to an increase of 0.212 percentage points in cumulative risk of lung cancer.
i) If a person quit smoking at age 0 (i.e. they never started smoking), they would have a cumulative risk of lung cancer of -2.955%. This does not make sense (0% risk is the lowest you can go), but it is NOT extrapolation. Our lowest x-value is an age of 0, so it is interpolation and it has a small residual.
j) Predicted: y=0.212(30)-2.955=3.405, Actual: 1.1
Residual: 1.1 - 3.405 = -2.305 (note that it is negative, which means the line predicted a higher value than the actual one)
k) It is an outlier because it is a ways off of the pattern (especially the trend line). It is not super influential because it falls nearby other data points, not off in space alone.
l) It is not an outlier -- it is almost on the trend line. However, it has a lot of influence over the trend line.
m) 95% interval is approximately 0.084 to 0.395
n) We're 95% confident that each additional year of smoking is linked to an increase of 0.084 to 0.395 percentage points in cumulative risk of lung cancer
o) Null: slope = 0, Alt: slope ≠ 0. You could also say Null: the variables are independent, Alt: the variables are dependent
p) 0.000 or something close. That means: "assuming the age you quit smoking and lung cancer risk are independent, there is a 0% chance of finding a slope as extreme as I did". It feels weird to say there is no chance, but that's just because we have only 3 decimal places (maybe there is a 0.0004 probability, for example).
q) Clearly dependent. We just found there is almost no chance of finding our slope when we assume independence, so we reject that assumption and decide it is dependent.
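The arithmetic in parts e), g), and j) above can be double-checked with a short script. This is a minimal sketch that uses only the numbers reported in the answers; the function name `predicted_risk` is made up for illustration.

```python
# Fitted line from part f): y = 0.212x - 2.955
SLOPE = 0.212
INTERCEPT = -2.955


def predicted_risk(age_quit):
    """Predicted cumulative lung cancer risk (%) for a given age at which smoking stopped."""
    return SLOPE * age_quit + INTERCEPT


# Part g): two endpoints for drawing the line.
endpoints = [(x, round(predicted_risk(x), 3)) for x in (0, 75)]

# Part j): residual = actual y-value - predicted y-value at x = 30.
actual_at_30 = 1.1
residual_at_30 = actual_at_30 - predicted_risk(30)

# Part e): percent explained is the square of r.
r = 0.896
r_squared = r ** 2

print("endpoints:", endpoints)
print("residual at x = 30:", round(residual_at_30, 3))
print("r squared:", round(r_squared, 3))
```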
- a) time since drinking ceased (x-axis) affects BAC (y-axis)
c) do it!
d) yes -- linear, perhaps r ≈ -0.9 (the correlation is strong and the slope is negative)
e) r = -0.94, r² = 0.88
r² is how much of the variation in y is explained by x; in this problem, the time since you stopped drinking explains 88% of the variation in BAC
(contrary to popular belief, this relationship is in fact very linear -- a person cannot speed up a decrease in BAC; many people drive under the influence as a result)
f) y = -0.032x +0.114
h) For every additional hour after drinking stops, BAC is predicted to decrease by 0.032.
i) When someone has stopped drinking for 0 hours (the moment they stop), their predicted BAC is 0.114. This is close to, but not exactly, what we know to be true (since the study had people start with exactly 0.12 BAC).
m) -0.042 to -0.025
n) We are 95% confident that a person's BAC decreases by 0.025 to 0.042 each hour.
o) Null: slope = 0, Alt: slope ≠ 0
q) Based on the p-value, it is clear that they are dependent. The p-value is approximately 0, so we reject the null hypothesis and conclude that the variables are dependent.
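The numbers in the BAC answers can be checked the same way. This is a rough sketch using only values reported above; the function name `predicted_bac` is made up for illustration.

```python
def predicted_bac(hours_since_drinking):
    """Predicted BAC from the fitted line in part f): y = -0.032x + 0.114."""
    return -0.032 * hours_since_drinking + 0.114


# Part i): the intercept is the predicted BAC at hour 0.
# Everyone in the study started at exactly 0.12, so the residual there is small.
known_start = 0.12
residual_at_start = known_start - predicted_bac(0)

# Part e): percent explained is the square of r.
r = -0.94
r_squared = r ** 2

# Parts m) and q): a 95% interval for the slope that excludes 0
# agrees with rejecting the null hypothesis that the slope is 0.
ci_low, ci_high = -0.042, -0.025
zero_in_interval = ci_low <= 0 <= ci_high

print("residual at hour 0:", round(residual_at_start, 3))
print("r squared:", round(r_squared, 2))
print("interval contains 0?", zero_in_interval)
```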
a. y= -0.406x+825.384
b. slope = -0.406. This means that each year, the percent of workers in a union drops by about 0.4 percentage points.
c. The intercept is 825.384. This means that around the birth of Christ, 825.384% of workers were in a union. Since that doesn’t make a lot of sense, all the y-intercept does in this problem is make sure that the predicted percents for years in the 1900s make sense.
d. Yes – there is a clear linear relationship.
e. r = -0.980. This indicates a very strong negative linear relationship.
f. r² = 0.960. Thus, the year explains 96.0% of the variation in the percent of workers in a union.
g. (-.406)(1990) + 825.384 = 17.4%
h. (-.406)(2012) + 825.384 = 8.51%. WikiAnswers said 9.4%, so 8.51% is not a bad guess. Note that linear predictions of things like percentages tend to under-estimate when they get close to 0%.
i. (-.406)(2035) + 825.384 = -0.83%. Since you cannot have a negative percent of the population in unions, this estimate is obviously too low. However, the actual answer will probably approach 0%.
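The predictions in parts g, h, and i can be reproduced with a short script. This is a minimal sketch that uses the fitted line from part a; the function name `predicted_union_percent` is made up for illustration.

```python
def predicted_union_percent(year):
    """Predicted percent of workers in a union from the line in part a: y = -0.406x + 825.384."""
    return -0.406 * year + 825.384


predictions = {year: predicted_union_percent(year) for year in (1990, 2012, 2035)}

for year, percent in predictions.items():
    # A percent below 0 is impossible, so flag it as a sign that the
    # straight-line pattern cannot continue that far.
    note = " (impossible: below 0%)" if percent < 0 else ""
    print(f"{year}: {percent:.2f}%{note}")
```

The negative prediction for 2035 is the same warning sign described in part i: a straight line keeps falling even after the real percent has to level off near 0%.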
Table for #15: http://info.cancerresearchuk.org/cancerstats/causes/lifestyle/tobacco/#Lung
Parts of problems 5-14: Dean, S., & Illowsky, B. 2011. Linear Regression and Correlation: Homework. Connexions, August 11, 2011. http://cnx.org/content/m17085/1.12/.
Powerful tool to play with regression: