
1. Correlation and Regression

Mastery Quiz Prep
Vocabulary
      • Scatter plot: a set of (x,y) points graphed to show a relationship between x and y
      • Strong association: the data points form a clear pattern
      • Weak association: the data points do not form a clear pattern
      • Outlier: a point that is far away from the pattern of the other data points
      • Explanatory variable / independent variable: the variable that causes the other variable to change; it is plotted on the x-axis
      • Response variable / dependent variable: the variable that reacts to the explanatory variable; it is plotted on the y-axis
      • Lurking variable: a variable that is not displayed but explains (causes) both of the displayed variables
      • Correlation: linear association (how well the data points form a straight line)
      • Correlation coefficient: (r) a measure of how well the data points form a straight line; its value is between -1 and 1, where -1 is a perfectly straight line with a negative slope, 0 means there is no correlation, and 1 is a perfectly straight line with a positive slope
      • Least-squares regression line / best-fit line: a straight line that best matches the pattern of the data (technically, the line where the sum of the squares of all the residuals is smallest, which is why it is called a "least-squares" regression line)
      • Percent explained: the square of the correlation coefficient (r^2) -- when converted to a percent, it tells you how much of the variation in y is explained by x
      • Residual: the error in the best-fit line's prediction of y (actual y-value - predicted y-value)
      • Interpolation: using a best-fit line to predict a y value given an x-value that falls between the min and max x-values (within the min and max x-values)
      • Extrapolation: using a best-fit line to predict a y value given an x-value that falls outside the min and max x-values (external from the min and max x-values)
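The definitions above can be turned into a few lines of arithmetic. The sketch below is only an illustration in Python: the data set is made up (the xs and ys values are hypothetical and do not come from this page), and every variable and function name is chosen just for this example. It computes r, the least-squares line, r^2, and the residuals, and labels a prediction as interpolation or extrapolation.

```python
# Hypothetical toy data, invented only to illustrate the vocabulary.
xs = [1, 2, 3, 4, 5]   # explanatory variable (x-axis)
ys = [2, 4, 5, 4, 5]   # response variable (y-axis)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sums of squared deviations and cross-deviations from the means.
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Correlation coefficient: always between -1 and 1.
r = sxy / (sxx * syy) ** 0.5

# Least-squares regression line: y = slope * x + intercept.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = actual y - predicted y, one per data point.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def prediction_type(x):
    """Label a prediction by whether x lies inside the observed x-range."""
    return "interpolation" if min(xs) <= x <= max(xs) else "extrapolation"


print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
print(f"best-fit line: y = {slope:.3f}x + {intercept:.3f}")
print("residuals:", [round(e, 3) for e in residuals])
print("x = 3.5 is", prediction_type(3.5))
print("x = 10 is", prediction_type(10))
```

The residuals of a least-squares line always add up to zero (up to rounding), which is a quick sanity check on the arithmetic.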
      Practice

          For each of the graphs above (figures 1-4):
          • a. Is there a strong association to this shape?
          • b. Is the correlation strong or weak?  Positive or negative?  Guess ‘r’, the correlation coefficient.
          Answer the following general questions:
          • 5. If there is an explanatory variable, which axis should it be placed on?  What should you do if one variable does not clearly explain the other?
          • 6. Is time a quantitative variable or a categorical variable?  Based on this answer, can you make a scatter plot of time and another quantitative variable?
          • 7. Imagine that I have a distribution of 40 heights that I collected in a survey.  I also have another distribution of 40 weights that I collected from a different survey.  Are both of these quantitative variables?  Based on this, can I use this data to create a scatter plot?  Explain.
          • 8. What does the value of r^2 tell you?
          For each scenario below, do the following:
          • a. What would each dot on this scatter plot represent?  Be as specific as possible.
          • b. Do you think the two variables are associated?  Is it a positive or negative association?
          • c. Of the two quantitative variables, which is the explanatory (independent) variable and which is the response (dependent) variable?  Explain why.  If neither is the cause, is the association a coincidence or is there a lurking variable?  If a lurking variable, explain your theory.
          9. The weekly grocery bill is associated with the number of family members.
          10. Life insurance companies base their premiums (monthly price) partially on the age of the applicant.
          11. Shoe size is highly correlated with reading scores before age 14.
          12. A study is done to determine if elderly drivers are involved in more motor vehicle fatalities than all other drivers. The number of fatalities per 100,000 drivers is compared to the age of drivers.
          13. On a summer day, the number of people sleeping at a given hour is associated with the temperature outside.
          14. Utility bills vary according to power consumption.

          For each table below (15 and 16), do the following:
          • a) Which variable in the table is the likely cause?  Which axis should it be plotted on?
          • b) For #15, create a scatter plot of the data on paper.  Remember to accurately label the graph.  Don't be a slacker, actually do it.
          • c) Create a scatter plot in StatKey.
          • d) Is there a relationship?  What is the shape?  Take a guess at ‘r’, the correlation coefficient.
          • e) Calculate r and r^2.  Interpret r^2.
          • f) Find the equation of the line of best fit.
          • g) For #15, draw this equation on the graph.  To do this, use the equation with a low x-value (such as 0) and a high x-value (such as 75) and find the predicted y-values.  Use these 2 points to draw your line.  Don't just guess.
          • h) Explain the slope in context (with units).
          • i) Explain the y-intercept in context.  Is this intercept extrapolation?  Why or why not?
          • j) For #15 find the residual when age is 30.  What does this tell you?
          • k) For #15, imagine a new point is added at (45, 2.1).  Is this an outlier?  Is it an influential point?
          • l) For #15, imagine a new point is added at (100, 17).  Is this an outlier?  Is it an influential point?
          15. Use the chart below to compare the age at which someone quit smoking to their cumulative risk of lung cancer.

          Age, CumRisk
          0, 0.2
          30, 1.1
          40, 2.6
          50, 5.6
          60, 11.1
          75, 15.7

          16. Use the chart below to compare the current blood alcohol content (BAC) of different people, each of whom had a BAC of 0.12 a different number of hours ago.

          Hrs, BAC
          2,0.041
          1,0.093
          1,0.083
          3,0.017
          3,0.006
          2,0.023
          1,0.092
          2,0.044
          3,0.018
          4,0.002
          1,0.085
          2,0.056

          17. Percentage of workers in a union over time (see sub-questions below)

          Year Percent
          1945 35.5
          1950 31.5
          1960 31.4
          1970 27.3
          1980 21.9
          1986 17.5
          1993 15.8
          • a) Calculate the least squares line. Write down the equation in the form of y=mx+b.
          • b) Identify the slope of the regression line and explain what it means in this problem.
          • c) Identify the intercept of the regression line.  Does this have any realistic meaning?
          • d) Does it appear that a line is the best way to fit the data?  Why or why not?
          • e) Find the correlation coefficient. Does this imply that the line is a good fit?
          • f) How much (what percent) of the variation in union membership does the year explain?
          • g) Use interpolation to estimate the percent of people in a union in 1990.
          • h) Use extrapolation to predict the percent of people in a union in 2012.  Do you think this is a reasonable prediction?
          • i) Use extrapolation to predict the percent of people in a union in 2035.  Does this prediction make sense?


          Free Response Questions
            • The y-intercept is rarely useful on its own in a regression equation.  However, it is still necessary.  Why?
            • In more advanced regression with multiple explanatory variables, you may end up with an equation like the following: y=26000+4000x1+800x2.
              y is the projected income, x1 is the number of years of education after high school, and x2 is the number of years of experience in a single industry.  How much would you expect your income to change if you go back to school for a 2-year master’s degree without any additional years of experience?  Explain and show your work.
            • Imagine a scatter plot that follows an exponential pattern, or some other non-linear pattern.  Can you still do linear regression on this data?  How?


            Practice solutions
                1. a) yes, strong association
                  b) medium correlation, positive (it is going up), so r ≈ 0.8
                2. a) yes, strong association
                  b) bad correlation, barely negative (goes down AND up), so r ≈ -0.1 [note: the only reason I wouldn't guess 0 is because of the extra point on the top left]
                3. a) yes, strong association
                  b) very strong correlation, negative (it is going down), so r ≈ -0.99
                4. a) yes, strong association
                  b) medium correlation, negative (it is going down), so r ≈ -0.8
                5. Explanatory variables always go on the x-axis.  If neither variable is very explanatory of the other, it doesn't matter which axis you place it on.
                6. Time is quantitative -- you can count it numerically.  Yes, you can plot time against another number (such as weight at different ages).  These are often called time plots.
                7. Both are quantitative variables.  However, this data cannot be used for a scatter plot.  A scatter plot uses two quantitative variables measured on the SAME individual.  Thus, every dot on a scatter plot represents one individual in the study.
                8. r^2 tells you how much x explains (or predicts) the variation in y.  Assuming you plotted your axes correctly with your causation, this is a good measure of how well you can predict your response variable when you know your explanatory variable.  As r^2 gets bigger, the response variable gets more predictable.
                9. a) every dot represents an individual; in this case, a family
                  b) yes -- they seem related, and both should go up together (positive association)
                  c) family size explains the grocery bill because you have more people to feed (and a high grocery bill clearly doesn't cause your family to grow)
                10. a) every dot represents an individual; in this case, a person
                  b) yes -- older people seem like they should get different rates for life insurance than younger people, and older should mean more expensive (so positive association)
                  c) age explains premium because, on average, older people have fewer years left to live than younger people
                11. a) every dot represents an individual; in this case, a kid
                  b) yes, as kids grow they wear bigger shoes and read better (so positive association)
                  c) the lurking variable is age, because older kids read better and grow their feet!
                12. a) every dot represents an age (or age group) of drivers and the fatality rate for that age; this is not as easy as the other questions because we are plotting a rate (fatalities per 100,000 drivers) for people of a given age, instead of just plotting individuals
                  b) yes, I think that older people are less likely to see well or drive the right speed and so are more likely to get in accidents
                  c) age explains driving accidents (see above)
                13. a) every dot represents a new measurement of # of people taken at a different time (when we look at time instead of individuals, it is often called a time plot)
                  b) yes -- more people sleep at night, and the temperature is usually lower at night, so there is an inverse relationship (negative association)
                  c) both are a response to time of day, though temperature may be considered more a cause than # of sleepers
                14. a) every dot represents a new monthly bill -- how much it cost and how much power was used
                  b) yes -- more power = larger bill, so positive correlation
                  c) the power company charges you based on how much you use, so usage causes the bill
                15. a) age when smoking stopped (x-axis) affects the risk of lung cancer (y-axis)
                  b) do what you see below, but do it by hand.
                  [Figure: scatter plot titled "Age when 75-year-olds stopped smoking vs. percent with lung cancer"]
                  c) (StatKey scatter plot; image not included)
                  d) yes, there is a relationship -- it curves upward (it looks exponential); r ≈ 0.9 (the points still form a pretty straight line)
                  e) r = 0.896, r^2 = 0.803
                      r is a measure of how well the data form a straight line; in this problem, the data form a fairly straight line with a positive slope
                      r^2 is how much x explains the variation in y; in this problem: the age at which smoking ceased explains 80% of the variation in cumulative risk of lung cancer
                  f) y = 0.212x - 2.955
                  g) To do the drawing, calculate endpoints of a line:
                      for x=0: y=0.212(0)-2.955=-2.955
                      for x=75: y=0.212(75)-2.955=12.945
                      SO plot the points (0,-2.955) and (75,12.945) and draw a line through them.
                  h) Each additional year that someone keeps smoking is linked to an increase of about 0.212 percentage points in cumulative risk of lung cancer.
                  i) If a person quit smoking at age 0 (i.e. they never started smoking), the line predicts a cumulative risk of lung cancer of -2.955%.  This does not make sense (0% risk is the lowest you can go), but it is NOT extrapolation.  Our lowest x-value is an age of 0, so it is interpolation.  The line simply fits poorly there: the actual value at age 0 is 0.2, so the residual is 0.2 - (-2.955) = 3.155.
                  j) Predicted: y=0.212(30)-2.955=3.405, Actual: 1.1
                      Residual: 1.1 - 3.405 = -2.305 (note that it is negative...residuals do that sometimes)
                  k) It is an outlier because it is a ways off of the pattern (especially the trend line).  It is not super influential because it falls nearby other data points, not off in space alone.
                  l) It is not an outlier -- it is almost on the trend line.  However, it has a lot of influence over the trend line.
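The numbers in solution 15 can be double-checked with a short script. This is a minimal sketch in plain Python, not the StatKey workflow the problem asks for; the helper name least_squares and the variable names are made up for this example. Because it keeps the unrounded slope and intercept, the last digit or two may differ slightly from the hand calculations above, which use the rounded equation.

```python
def least_squares(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / (sxx * syy) ** 0.5


# Table for #15: age when smoking stopped vs. cumulative risk (%).
ages = [0, 30, 40, 50, 60, 75]
risks = [0.2, 1.1, 2.6, 5.6, 11.1, 15.7]

slope, intercept, r = least_squares(ages, risks)
print(f"y = {slope:.3f}x + {intercept:.3f}, r = {r:.3f}, r^2 = {r * r:.3f}")

# Part j: residual at age 30 (actual - predicted).
residual_30 = risks[ages.index(30)] - (slope * 30 + intercept)
print(f"residual at age 30: {residual_30:.3f}")

# Parts k and l: add each new point, refit, and compare with the original line.
refits = {}
for new_x, new_y in [(45, 2.1), (100, 17)]:
    new_slope, new_intercept, _ = least_squares(ages + [new_x], risks + [new_y])
    refits[(new_x, new_y)] = (new_slope, new_intercept)
    distance = new_y - (slope * new_x + intercept)  # residual from the original line
    print(f"({new_x}, {new_y}): residual from original line = {distance:.2f}, "
          f"refit line: y = {new_slope:.3f}x + {new_intercept:.3f}")
```

A large residual from the original line suggests an outlier, while a large shift in the refit line suggests an influential point; how large counts as "large" is a judgment call.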

                16. a) time since drinking ceased (x-axis) affects BAC
                  b) n/a
                  c) do it!
                  d) yes -- linear, perhaps r ≈ -0.9 (the correlation is strong and the slope is negative) 
                  e) r = -0.94, r^2 = 0.88
                      r^2 is how much x explains the variation in y; in this problem: the time since you stopped drinking explains 88% of the variation of BAC
                      (contrary to popular belief, this relationship is in fact very linear -- a person cannot speed up a decrease in BAC; many people drive under the influence as a result)
                  f) y = -0.032x + 0.114
                  g) n/a
                  h) For every additional hour after drinking stops, BAC drops by about 0.032.
                  i) When someone has stopped drinking for 0 hours (the moment they stop), their predicted BAC is 0.114.  This is close, but not exactly what we know to be true (since the study had people start with exactly 0.12 BAC).
                  j) n/a
                  k) n/a
                  l) n/a
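The values in solution 16 can be reproduced the same way. This is a sketch in plain Python under the same assumptions as any hand check: the helper name least_squares is invented for this example, and the data are typed in from the table for #16.

```python
def least_squares(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / (sxx * syy) ** 0.5


# Table for #16: hours since BAC was 0.12 vs. current BAC.
hours = [2, 1, 1, 3, 3, 2, 1, 2, 3, 4, 1, 2]
bac = [0.041, 0.093, 0.083, 0.017, 0.006, 0.023,
       0.092, 0.044, 0.018, 0.002, 0.085, 0.056]

slope, intercept, r = least_squares(hours, bac)
print(f"y = {slope:.3f}x + {intercept:.3f}")
print(f"r = {r:.2f}, r^2 = {r * r:.2f}")

# Part i: compare the intercept (predicted BAC at 0 hours) with the known start of 0.12.
print(f"predicted BAC at 0 hours: {intercept:.3f} (known starting BAC: 0.12)")
```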


                17.    
                  a. y = -0.406x + 825.384
                  b. slope = -0.406.  This means that each year, the percent of workers in a union drops by about 0.4 percentage points.
                  c. The intercept is 825.384.  This means that in year 0 (around the birth of Christ), 825.384% of workers were in a union.  Since that doesn't make sense, all the y-intercept does in this problem is position the line so that the predictions for years in the 1900s make sense.
                  d. Yes – there is a clear linear relationship.
                  e. r = -0.980.  This is very good.
                  f. r^2 = 0.960.  Thus, the year explains 96.0% of the variation in the percent of workers in a union.
                  g. (-0.406)(1990) + 825.384 = 17.4%
                  h. (-.406)(2012) + 825.384 = 8.51%.  WikiAnswers said 9.4%, so 8.51% is not a bad guess.  Note that linear predictions of things like percentages tend to under-estimate when they get close to 0%.
                  i. (-.406)(2035) + 825.384 = -0.83%.  Since you cannot have a negative percent of the population in unions, this estimate is obviously too low.  However, the actual answer will probably approach 0%.
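The line and the three predictions in solution 17 can be checked with the sketch below, written in plain Python with a made-up helper name (least_squares). It uses the unrounded slope and intercept, so the predictions come out slightly different from the hand calculations above, which plug years into the rounded equation.

```python
def least_squares(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / (sxx * syy) ** 0.5


# Table for #17: year vs. percent of workers in a union.
years = [1945, 1950, 1960, 1970, 1980, 1986, 1993]
percents = [35.5, 31.5, 31.4, 27.3, 21.9, 17.5, 15.8]

slope, intercept, r = least_squares(years, percents)
print(f"y = {slope:.3f}x + {intercept:.3f}, r = {r:.3f}, r^2 = {r * r:.3f}")

# Parts g, h, i: predict and label each year as interpolation or extrapolation.
predictions = {}
for year in (1990, 2012, 2035):
    inside = min(years) <= year <= max(years)
    predictions[year] = slope * year + intercept
    label = "interpolation" if inside else "extrapolation"
    print(f"{year}: {predictions[year]:.2f}% ({label})")
```

The negative result for 2035 is the script's way of showing the same problem described in part i: a straight line keeps falling even after the percent would have to level off near 0.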
                Notes
                    Class activity for correlation: 
                    https://teacher.desmos.com/polygraph/custom/55bbb86bf1f5fa59061c3141


                    Sources
                    Table for #15: http://info.cancerresearchuk.org/cancerstats/causes/lifestyle/tobacco/#Lung
                    Parts of problems 5-14: Dean, S., & Illowsky, B. 2011. Linear Regression and Correlation: Homework. Connexions, August 11, 2011. http://cnx.org/content/m17085/1.12/. 

                    Powerful tool to play with regression:
                    http://www.shodor.org/interactivate/activities/Regression/

                    More datasets:
                    http://blog.benwildeboer.com/2014/linear-data-sets-for-your-enjoyment/
                    Andy Pethan,
                    Nov 19, 2014, 8:23 AM