In the last section we learned about the concept of correlation, which we defined as the measure of the linear relationship between two numerical variables. We saw that when the points of a scatterplot formed a clear linear pattern, then the points were said to have a high correlation. Scatterplots can have a strong correlation in either a positive (increasing to the right) or a negative (decreasing to the right) direction. We have also discussed the idea of drawing a line-of-best-fit through the data. In some scatterplots this is easy to do and all of us would end up with our lines in nearly the same place. However, if everyone were to simply draw a line where they think it fits or to select two of the points to calculate a line through, our lines and equations would certainly vary from person to person. Therefore, we will use a specific formula to calculate the equation for the line-of-best-fit.
Linear regression involves using data to calculate a line that best fits the data and then using that line to predict values. We will use the Least-Squares Regression Line (LSRL): the line that makes the sum of the squared vertical distances between the data points and the line as small as possible. This is the standard regression equation and the one used most often. It is the one that your graphing calculator and Excel will calculate for you. The formula and process for calculating it by hand are quite tedious, so we will use technology to find LSRL equations. The regression equation will be in the form:
ŷ = a + bx
where a is the y-intercept, b is the slope, and ŷ is the predicted value of y. Your calculator will calculate the correlation coefficient (r) at the same time as it calculates the LSRL equation. Many will also report a value for r² (which is exactly what it says: r squared). The r² value is called the coefficient of determination; it reports the percent of variation in the response variable that is explained by the LSRL equation. We will not be addressing its importance in this course.
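For readers curious about what the technology is doing, here is a minimal Python sketch of the least-squares calculation for a line of the form ŷ = a + bx, where a is the y-intercept and b is the slope. The function name lsrl and the five data points are made up for illustration; they are not taken from this book, and a calculator or spreadsheet remains the intended tool for this course.

```python
from math import sqrt

def lsrl(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x.

    a is the y-intercept, b is the slope, r is the correlation coefficient.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                 # slope
    a = mean_y - b * mean_x       # y-intercept (line passes through the means)
    r = sxy / sqrt(sxx * syy)     # correlation coefficient
    return a, b, r

# Hypothetical data, made up for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, r = lsrl(xs, ys)
print(f"y-hat = {a:.3f} + {b:.3f}x, r = {r:.3f}, r^2 = {r * r:.3f}")
```

The same three numbers (a, b, and r) are what a graphing calculator reports when it runs a linear regression.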
To calculate the LSRL equation and correlation coefficient, use a graphing calculator or computer program. See the appendix at the end of this book for the steps to calculate the LSRL and correlation on a calculator.
Here's a link to directions for LSRL with TI-Nspire Calculator
Here's a link to directions for LSRL with Google Sheets
Here's a video for LSRL with Google Sheets
(shows how to get the formula on the graph)
As with all of our statistics, these data, graphs and equations are not meaningless. They represent the relationship between two numerical values measured on several specific individuals. Thus the slope and the y-intercept of our newly calculated regression equation mean something as well, so we will interpret both in context. The slope of the regression equation is the average rate of change in the response variable (y) for each increase of one unit in the explanatory variable (x). You will say something like: For each increase of one (unit of the explanatory variable), there will be an average (increase or decrease) of (slope value) in the (response variable).
The interpretation of the y-intercept of the regression equation is the predicted value of the response variable (y) when the explanatory variable (x) is zero. You will say something like: When (explanatory variable) is zero, the (response variable) is predicted to be (y-intercept value). You will discover that the interpretation of the y-intercept often makes absolutely no sense when put into context. This is because actual data rarely involves x-values of zero.
Example 1
Below is data given by a canine expert. It relates a dog's age in years to what they believe the equivalent age in human years to be.
The scatterplot showing this data, using dog age as the explanatory variable, is below.
a) Calculate the Least-Squares Regression Line for the Dog Year Data. Report your equation. Be sure to identify your variables.
b) Calculate the correlation (r). What two things does r tell us about this relationship?
c) Identify and interpret the slope in the context of the problem.
d) Identify and interpret the y-intercept in the context of the problem.
Solution
a) This was done using Excel, but graphing calculators will report the same LSRL.
LSRL is: ŷ = 7.795 + 4.642x
x = Dog age in years
y = equivalent human years (predicted)
b) r will be the square root of r². Because the slope is positive, we take the positive root. (Graphing calculators report both r and r², so you would not need to do any calculating, but Excel only gave r².)
The two things that r tells us are: Because r is positive, this relationship is increasing. Because r is very close to one, this relationship is very strong.
c) The slope is 4.642. It means that for every increase of one year in dog age, there is an average increase of 4.642 years in the equivalent human age.
d) The y-intercept is 7.795. It means that if a dog were 0 years old, it would be predicted to be 7.795 years in human years. (This is clearly nonsense in this case. It would make sense that both start at zero.)
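When software reports only r², as Excel did in part b), r can be recovered by taking the square root and giving it the same sign as the slope. The sketch below illustrates that step. The function name r_from_r_squared is hypothetical, and 0.81 is a made-up r² value, not the one for the dog-age data.

```python
from math import sqrt, copysign

def r_from_r_squared(r_squared, slope):
    """Recover r from r^2; r takes the same sign as the slope of the LSRL."""
    return copysign(sqrt(r_squared), slope)

# 0.81 is a hypothetical r^2 value used only for illustration
print(r_from_r_squared(0.81, 4.642))    # positive slope gives a positive r
print(r_from_r_squared(0.81, -4.642))   # negative slope gives a negative r
```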
The main use of the regression line is to predict values. After calculating this line, we are able to predict values by simply substituting a value for the explanatory variable (x) and solving the equation for the predicted response value (ŷ). In our example above, we can predict that the human year equivalence for a dog that is 6 years old is approximately 35.6 human years:
ŷ = 7.795 + 4.642(6) = 35.647 ≈ 35.6
This prediction is reasonable and it matches our graph. However, this is not always the case.
As you look at the LSRL drawn on the above scatterplot, you can see that the points to the far left do not appear to be very linear. So, using the line to the left of about 1 year will not make much sense. Also, we do not have any idea what will happen to the data beyond the 11 years that we have recorded. An LSRL is very useful in making predictions, but only within the range of the actual data that we have collected and can see. This is called interpolation. We can see that this line is a reasonably good fit between 1 and 11 dog years, but we simply do not know what happens beyond 11 years (and we cannot use negative years for obvious reasons). The prediction line that we have calculated will go on forever in both directions (remember geometry?), but it will not be appropriate to use it to predict for all values of x. Using a regression line to predict values that are outside the range of our actual data is called extrapolation. Extrapolation will often yield ridiculous answers! However, even if the result seems reasonable, we should avoid extrapolating because we simply do not know what happens beyond our actual observations. Making decisions based on extrapolation can be dangerous because we are coming to conclusions that are not backed up by data.
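The difference between interpolation and extrapolation can be sketched in a few lines of Python. The slope (4.642) and y-intercept (7.795) come from Example 1, and the 1-to-11-year range is the range over which the text describes the line as a reasonably good fit. The function name predict_human_years and the test age of 15 are assumptions made for illustration.

```python
def predict_human_years(dog_age, low=1.0, high=11.0):
    """Predict human-equivalent age with the LSRL from Example 1.

    Returns (prediction, within_range), where within_range is True only
    when dog_age lies inside the range of the observed data.
    """
    intercept = 7.795   # y-intercept reported in Example 1
    slope = 4.642       # slope reported in Example 1
    prediction = intercept + slope * dog_age
    within_range = low <= dog_age <= high
    return prediction, within_range

# 6 is inside the observed range; 15 is a hypothetical age outside it
for age in (6, 15):
    y_hat, ok = predict_human_years(age)
    label = "interpolation" if ok else "extrapolation, not backed by data"
    print(f"dog age {age}: predicted {y_hat:.1f} human years ({label})")
```

The line happily produces a number for any input, which is exactly why the range check matters: the equation cannot tell you when you have left the data behind.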
Example 2
The following table lists the GPA and Verbal SAT Score for seven students. Analyze how well Verbal SAT Scores can be used to predict students' GPAs based on this data.
a) Construct a scatterplot on your graphing calculator (or computer). Sketch the graph that the calculator shows. Be sure to label your axes.
b) Calculate the Least-Squares Regression Line (LSRL) using your calculator. Report your equation. Be sure to identify your variables.
c) Calculate the correlation coefficient (r). Report it here. What are the two things that this number tells us about this graph?
d) Identify and interpret the slope in the context of the problem.
e) Using your equation, what is the predicted GPA of a student who has a Verbal SAT Score of 500? Of a student with a score of 600?
Solution
a) Construct a scatterplot on your graphing calculator (or computer). Sketch the graph that the calculator shows. Be sure to label your axes.
Here is the scatterplot from a TI-84 plus: [Figure9][Figure10]
Here are the LSRL, correlation, and the scatterplot with the line added to the graph, from a TI-84 plus:
b) Calculate the Least-Squares Regression Line (LSRL) using your calculator. Report your equation. Be sure to identify your variables.
LSRL is:
x = Verbal SAT Score
y = predicted GPA
c) Calculate the correlation coefficient (r). Report it here. What are the two things that this number tells us about this graph?
The correlation is r = +0.9467. This tells us that the relationship is positive and strong.
d) Identify and interpret the slope in the context of this problem.
The slope is 0.0055. This tells us that for each increase of 1 point on the Verbal SAT Score, there will be an average increase of 0.0055 in a student's GPA.
e) Using your equation, what is the predicted GPA of a student who has a Verbal SAT Score of 500? Of a student with a score of 600?
So, the predicted GPA for a student who scores 500 on the SAT Verbal is approximately 2.8.
And the predicted GPA for a student who scores 600 on the SAT Verbal is approximately 3.4.
An outlier is an extreme observation that does not fit the general pattern of the data (see the example below). Because an outlier is an extreme observation, the inclusion of it may affect the correlation, and the equation for the least-squares regression line. When examining a scatterplot and calculating the regression equation, it is worth considering whether extreme observations should be included or not.
Let's use our GPA example to illustrate the effect of a single outlier. Suppose that we have a student who has scored very high on the SAT Verbal exam but has a lower GPA. We will change Corbin's results to a Verbal SAT Score of 715 and a GPA of 2.2, and see what happens to the LSRL and correlation.
Here are the LSRL equation and the correlation coefficient recalculated with Corbin's results changed: [Figure12][Figure13]
As you can see, this one change turned Corbin into an outlier. This caused the correlation to drop from r = 0.947 all the way down to r = 0.317. This is a huge change: it makes the relationship between the two variables appear weak rather than very strong. It also changed both the slope and the y-intercept of the LSRL equation dramatically. This means that predictions based on this LSRL will differ from those based on the LSRL with Corbin's original results.
There is no set rule when trying to decide how to deal with outliers in regression analysis, but you can now see how an outlier really can change everything when it comes to scatterplots, correlation and least-squares regression. Be sure to mention any potential outliers that you observe in any scatterplot.
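The effect of a single outlier can be reproduced with a short Python sketch. The SAT and GPA values below are hypothetical (they are not the seven students from Example 2), so the correlations it prints will not match the ones reported above. Only the replacement point, a score of 715 with a GPA of 2.2, mirrors the change made to Corbin's results.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores, made up for illustration only
sat = [400, 450, 500, 550, 600, 650, 700]
gpa = [2.2, 2.6, 2.7, 3.1, 3.3, 3.6, 3.9]
r_before = correlation(sat, gpa)

# Replace the last student with a high SAT score but a low GPA
sat_outlier = sat[:-1] + [715]
gpa_outlier = gpa[:-1] + [2.2]
r_after = correlation(sat_outlier, gpa_outlier)

print(f"r without the outlier: {r_before:.3f}")
print(f"r with the outlier:    {r_after:.3f}")
```

Changing one point out of seven is enough to turn a strong positive correlation into a weak one, which is why any potential outlier deserves a mention.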