 Scatter plot: a set of (x,y) points graphed to show a relationship between x and y
 Strong association: the data points form a clear pattern
 Weak association: the data points do not form a clear pattern
 Outlier: a point that is far away from the pattern of the other data points
 Explanatory variable / independent variable: the variable that causes the other variable to change; it is plotted on the x-axis
 Response variable / dependent variable: the variable that reacts to the explanatory variable; it is plotted on the y-axis
 Lurking variable: a variable that is not displayed but explains (causes) both of the displayed variables
 Correlation: linear association (how well the data points form a straight line)
 Correlation coefficient: (r) a measure of how well the data points form a straight line; its value is between -1 and 1, where -1 is a perfectly straight line with a negative slope, 0 means there is no correlation, and 1 is a perfectly straight line with a positive slope
 Least-squares regression line / best-fit line: a straight line that best matches the pattern of the data (technically, the line where the sum of the squares of all the residuals is smallest, which is why it is called a "least-squares" regression line)
 Percent explained: the square of the correlation coefficient (r^{2}); when converted to a percent, it tells you how much of the variation in y is explained by x
 Residual: the error in the best-fit line's prediction of y (actual y-value - predicted y-value)
 Interpolation: using a best-fit line to predict a y-value given an x-value that falls between the min and max x-values (within the min and max x-values)
 Extrapolation: using a best-fit line to predict a y-value given an x-value that falls outside the min and max x-values (external from the min and max x-values)
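As a rough illustration of the definitions above, the following Python sketch computes the correlation coefficient, the least-squares line, the percent explained, and the residuals by hand. The data are the age and cumulative-risk values from table 15 in this handout; the helper name `least_squares` is made up for this example and is not part of any library.

```python
import math

def least_squares(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squared deviations and cross-deviations from the means
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Table 15: age at which smoking stopped, cumulative risk of lung cancer (%)
ages = [0, 30, 40, 50, 60, 75]
risks = [0.2, 1.1, 2.6, 5.6, 11.1, 15.7]

slope, intercept, r = least_squares(ages, risks)

# Residual = actual y-value - predicted y-value
residuals = [y - (slope * x + intercept) for x, y in zip(ages, risks)]

print(f"line: y = {slope:.3f}x + ({intercept:.3f})")
print(f"r = {r:.3f}, percent explained = {100 * r ** 2:.1f}%")
print("residuals:", [round(e, 3) for e in residuals])
```

Because the sketch keeps the unrounded slope and intercept, its residuals can differ in the third decimal place from residuals computed with the rounded equation.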
For each table below (15 and 16), do the following:
 m) Find a 95% confidence interval for the slope.
 n) Explain the interval in a clear sentence.
 o) Write out your null and alternative hypotheses for a hypothesis test that the slope is not 0.
 p) Perform the hypothesis test. What is your p-value?
 q) Based on this test, do you conclude that the variables are independent or dependent? How does this compare to your earlier guess? If it is different, why?
15. Use the chart below to compare the age at which someone quit smoking to their cumulative risk of lung cancer.
Age, CumRisk
0, 0.2
30, 1.1
40, 2.6
50, 5.6
60, 11.1
75, 15.7
16. Use the chart below to compare the current blood alcohol content (BAC) of different people who started with a BAC of 0.12 different numbers of hours ago.
Hrs, BAC
2, 0.041
1, 0.093
1, 0.083
3, 0.017
3, 0.006
2, 0.023
1, 0.092
2, 0.044
3, 0.018
4, 0.002
1, 0.085
2, 0.056
17. Percentage of workers in a union over time (see subquestions below)
Year, Percent
1945, 35.5
1950, 31.5
1960, 31.4
1970, 27.3
1980, 21.9
1986, 17.5
1993, 15.8
15. a) Age when smoking stopped (x-axis) affects the risk of lung cancer (y-axis).
b) Do what you see below, but do it by hand.
[Figure: Age when 75-year-olds stopped smoking vs. percent with lung cancer]
c)
d) Yes, there is a relationship. It looks exponential/logarithmic; r ≈ 0.9 (the points still form a pretty straight line).
e) r = 0.896, r^{2} = 0.803
r is a measure of how well the data form a straight line; in this problem, the data form a fairly straight line with a positive slope.
r^{2} is how much x explains the variation in y; in this problem, the age at which smoking ceased explains 80% of the variation in cumulative risk of lung cancer.
f) y = 0.212x - 2.955
g) To do the drawing, calculate the endpoints of the line:
for x = 0: y = 0.212(0) - 2.955 = -2.955
for x = 75: y = 0.212(75) - 2.955 = 12.945
So plot the points (0, -2.955) and (75, 12.945) and draw a line through them.
h) Each additional year that someone keeps smoking is linked to an increase of 0.212 percentage points in the cumulative risk of lung cancer.
i) If a person quit smoking at age 0 (i.e. they never started smoking), they would have a cumulative risk of lung cancer of -2.955%. This does not make sense (0% risk is the lowest you can go), but it is NOT extrapolation. Our lowest x-value is an age of 0, so it is interpolation (the residual at that point is 0.2 - (-2.955) = 3.155).
j) Predicted: y = 0.212(30) - 2.955 = 3.405; Actual: 1.1; Residual: 1.1 - 3.405 = -2.305 (note that it is negative; residuals do that sometimes).
k) It is an outlier because it is a ways off of the pattern (especially the trend line). It is not very influential because it falls near other data points, not off in space alone.
l) It is not an outlier; it is almost on the trend line. However, it has a lot of influence over the trend line.
m) The 95% interval is approximately 0.084 to 0.395.
n) We're 95% confident that each additional year of smoking is linked to an increase of 0.084 to 0.395 percentage points in the risk of lung cancer.
o) Null: slope = 0, Alt: slope ≠ 0. You could also say Null: the variables are independent, Alt: the variables are dependent.
p) 0.000 or something close. That means: "assuming the age you quit smoking and lung cancer risk are independent, there is a 0% chance of finding a slope as extreme as I did". It feels weird to say there is no chance, but that's just because we have only 3 decimal places (maybe there is a 0.0004 probability, for example).
q) Clearly dependent. We just found there is almost no chance of finding our slope when we assume independence, so we reject that assumption and decide the variables are dependent.
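Parts m) through p) can be sketched in Python under some loudly stated assumptions. The handout does not say how its interval or p-value was obtained, so the two methods below are substitutes chosen for illustration: a t-based interval (using the t-table critical value 2.776 for 95% confidence with n - 2 = 4 degrees of freedom) and an exact permutation test that tries every reordering of the six risk values. Their output need not match the answers above exactly. The helper names `fit` and `t_interval_for_slope` are hypothetical.

```python
import math
from itertools import permutations

# Table 15: age at which smoking stopped, cumulative risk of lung cancer (%)
ages = [0, 30, 40, 50, 60, 75]
risks = [0.2, 1.1, 2.6, 5.6, 11.1, 15.7]

def fit(xs, ys):
    """Return (slope, intercept, sxx) for the least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar, sxx

def t_interval_for_slope(xs, ys, t_star):
    """Slope plus or minus t_star standard errors (n - 2 degrees of freedom)."""
    slope, intercept, sxx = fit(xs, ys)
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(sse / (len(xs) - 2) / sxx)
    return slope - t_star * std_err, slope + t_star * std_err

# Assumption: t_star = 2.776 is read from a t-table (95%, 4 degrees of freedom)
low, high = t_interval_for_slope(ages, risks, t_star=2.776)

# Exact permutation test of "slope = 0": if age and risk were independent,
# every pairing of the risk values with the ages would be equally likely.
observed = abs(fit(ages, risks)[0])
shuffled_slopes = [abs(fit(ages, perm)[0]) for perm in permutations(risks)]
# Small tolerance so the original ordering counts as "at least as extreme"
p_value = sum(s >= observed - 1e-12 for s in shuffled_slopes) / len(shuffled_slopes)

print(f"95% t-interval for slope: {low:.3f} to {high:.3f}")
print(f"exact permutation p-value: {p_value:.4f}")
```

With only six points there are 720 orderings, so the exact test is cheap; for larger tables a random sample of shuffles is the usual substitute.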
16. a) Time since drinking ceased (x-axis) affects BAC (y-axis).
b) n/a
c) Do it!
d) Yes, it is linear, perhaps r ≈ -0.9 (the correlation is strong and the slope is negative).
e) r = -0.94, r^{2} = 0.88
r^{2} is how much x explains the variation in y; in this problem, the time since you stopped drinking explains 88% of the variation in BAC (contrary to popular belief, this relationship is in fact very linear; a person cannot speed up a decrease in BAC, and many people drive under the influence as a result).
f) y = -0.032x + 0.114
g) n/a
h) For every additional hour after drinking stops, BAC drops by 0.032.
i) When someone has stopped drinking for 0 hours (the moment they stop), their BAC is 0.114. This is close to, but not exactly, what we know to be true (since the study had people start with exactly 0.12 BAC).
j) n/a
k) n/a
l) n/a
m) -0.042 to -0.025
n) We are 95% confident that a person's BAC drops by 0.025 to 0.042 each hour.
o) Null: slope = 0, Alt: slope ≠ 0
p) 0.00
q) Based on the p-value, it is clear that they are dependent. The p-value is approximately 0, so we can reject the null and conclude that the variables are dependent on each other.
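One way to get a 95% interval for the BAC slope without a t-table is a percentile bootstrap, sketched below. This is an assumed method, not necessarily the one used to produce the answer in m): it resamples the twelve (hours, BAC) pairs with replacement, refits the slope each time, and takes the middle 95% of the resampled slopes. The function names, the 2000 resamples, and the seed are all choices made for this example.

```python
import random

# Table 16: hours since drinking stopped, current BAC
hours = [2, 1, 1, 3, 3, 2, 1, 2, 3, 4, 1, 2]
bac = [0.041, 0.093, 0.083, 0.017, 0.006, 0.023,
       0.092, 0.044, 0.018, 0.002, 0.085, 0.056]

def slope_of(xs, ys):
    """Least-squares slope, or None when every x is identical (slope undefined)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return None
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx

def bootstrap_slope_interval(xs, ys, resamples=2000, seed=0):
    """Percentile bootstrap: middle 95% of slopes from resampled (x, y) pairs."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    pairs = list(zip(xs, ys))
    slopes = []
    while len(slopes) < resamples:
        sample = [rng.choice(pairs) for _ in pairs]
        b = slope_of(*zip(*sample))
        if b is not None:  # skip the rare resample with a single x-value
            slopes.append(b)
    slopes.sort()
    return slopes[int(0.025 * resamples)], slopes[int(0.975 * resamples) - 1]

slope = slope_of(hours, bac)
low, high = bootstrap_slope_interval(hours, bac)

print(f"slope: {slope:.4f}")
print(f"bootstrap 95% interval: {low:.3f} to {high:.3f}")
```

Different seeds or resample counts shift the endpoints slightly, which is why bootstrap intervals are usually reported as approximate.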

17. a. y = -0.406x + 825.384
b. slope = -0.406. This means that each year, the percent of workers in a union drops by about 0.4 percentage points (roughly half a percent).
c. The intercept is 825.384. This means that around the birth of Christ (year 0), 825.384% of the population was in a union. Since that doesn't make a lot of sense, all the y-intercept does in this problem is make sure that the predicted percents for years in the 1900s make sense.
d. Yes, there is a clear linear relationship.
e. r = -0.980. This is very good.
f. r^{2} = 0.960. Thus, the year explains 96.0% of the variation in what percent of workers are in a union.
g. (-0.406)(1990) + 825.384 = 17.4%
h. (-0.406)(2012) + 825.384 = 8.51%. WikiAnswers said 9.4%, so 8.51% is not a bad guess. Note that linear predictions of things like percentages tend to underestimate when they get close to 0%.
i. (-0.406)(2035) + 825.384 = -0.83%. Since you cannot have a negative percent of the population in unions, this estimate is obviously too low. However, the actual answer will probably approach 0%.
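The three predictions in g, h, and i can be checked with a short Python sketch that also labels each one as interpolation or extrapolation, using the definitions from the glossary. It assumes the fitted line y = -0.406x + 825.384 and the years from table 17; the function names are hypothetical.

```python
# Years observed in table 17
years = [1945, 1950, 1960, 1970, 1980, 1986, 1993]

def predict_union_percent(year, slope=-0.406, intercept=825.384):
    """Predicted percent of workers in a union from the fitted line."""
    return slope * year + intercept

def kind_of_prediction(x, observed_xs):
    """Interpolation if x lies within the observed x-values, else extrapolation."""
    if min(observed_xs) <= x <= max(observed_xs):
        return "interpolation"
    return "extrapolation"

for year in (1990, 2012, 2035):
    predicted = predict_union_percent(year)
    print(year, round(predicted, 2), kind_of_prediction(year, years))
```

Only the 1990 prediction falls inside the observed range of years; the 2012 and 2035 predictions are extrapolations, which is one reason the 2035 value can dip below 0%.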
Sources
Table for #15: http://info.cancerresearchuk.org/cancerstats/causes/lifestyle/tobacco/#Lung
Parts of problems 5-14: Dean, S., & Illowsky, B. 2011. Linear Regression and Correlation: Homework. Connexions, August 11, 2011. http://cnx.org/content/m17085/1.12/.
Powerful tool to play with regression: http://www.shodor.org/interactivate/activities/Regression/
More datasets: http://blog.benwildeboer.com/2014/lineardatasetsforyourenjoyment/ 
Andy Pethan, Apr 28, 2015, 12:10 PM
