Examining Residuals
The following diagram shows the important parts of a least-squares regression.
The regression line is fit through the observed data points so that it:
passes through the mean X-variable value and the mean Y-variable value, and
minimizes the overall variation between the line and the observed data points, measured vertically (specifically, it minimizes the sum of the squared vertical distances).
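The two properties listed above can be checked numerically. The following is a minimal sketch, not taken from the original text: the data values and variable names are made up for illustration, and only the Python standard library is used.

```python
# Minimal least-squares fit of a straight line Y = a + b*X.
# The data values are invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope b = Sxy / Sxx; the intercept a is then chosen so that the
# line passes through the point (mean X, mean Y).
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b = sxy / sxx
a = mean_y - b * mean_x


def sum_sq_vertical(a_, b_):
    """Sum of squared vertical distances from the line a_ + b_*X to the points."""
    return sum((yi - (a_ + b_ * xi)) ** 2 for xi, yi in zip(x, y))


best = sum_sq_vertical(a, b)
print("a =", round(a, 4), " b =", round(b, 4))
print("line at mean X:", a + b * mean_x, " mean Y:", mean_y)
# Nudging the intercept or the slope should never give a smaller sum of squares.
print(best <= sum_sq_vertical(a + 0.1, b), best <= sum_sq_vertical(a, b + 0.1))
```

Trying other nudges of a or b in the last line is a quick way to convince yourself that the fitted line really is the one with the smallest sum of squared vertical distances.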
How good is the fit of a line? There are two types of measures.
Probably the most common measure of fit is based on the product-moment correlation coefficient (identified as the r value). The value of r varies from positive one to negative one. When r is negative, the line goes downward to the right. When the correlation coefficient is squared, you get a term called the "coefficient of determination." This value tells you the proportion of the variation in the Y-variable values from their mean that is explained by the regression relationship (multiply it by 100 to express it as a percentage). That is, it is the amount of variation from the mean that is accounted for by the predicted Y values.
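As a rough illustration of how r and the coefficient of determination relate, here is a small sketch with made-up data that trend downward. The names used (`explained_share` and so on) are not from the original text; they are only labels for this example.

```python
import math

# Invented data with a downward trend, so r should come out negative.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.8, 8.1, 6.2, 3.9, 2.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)  # total variation of Y about its mean
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Product-moment correlation coefficient.
r = sxy / math.sqrt(sxx * syy)

# Least-squares line and the variation left over after fitting it.
b = sxy / sxx
a = mean_y - b * mean_x
unexplained = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Share of the variation about the mean that the line accounts for.
explained_share = 1.0 - unexplained / syy

print("r =", round(r, 4))
print("r squared =", round(r * r, 4))
print("explained share =", round(explained_share, 4))
```

The last two printed values should agree, which is the sense in which r squared measures the explained variation.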
The other measure of the quality of the fit line is given by the PROB>F value. This probability helps you determine whether the correlation could have been produced by chance alone. It takes into consideration how many observations have gone into the calculation. If a small probability value is produced (0.05 or smaller), it indicates that the correlation is unlikely to be a chance result. For example, you may have a small correlation value, but if it is based on many observations, it could be a reliable indication of the relationship between the two variables. The small correlation means, in this example, that only part of the variation is explained by the regression relationship.
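The dependence of PROB>F on the number of observations can be sketched as follows. This assumes that PROB>F is the upper-tail probability of the usual F test for a straight-line fit, with 1 and n-2 degrees of freedom; the text does not spell out the formula, so treat that as an assumption. The function names are hypothetical, and the probability is obtained by simple numerical integration rather than from a statistics library.

```python
import math


def t_density(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + t * t / df))


def prob_f(r_squared, n, steps=2000):
    """Return (F, probability) for a straight-line fit with n observations.

    With one X variable, F has 1 and n-2 degrees of freedom and equals
    t squared, so the upper-tail F probability is the two-sided t probability.
    """
    df = n - 2
    f_value = r_squared * df / (1 - r_squared)
    t = math.sqrt(f_value)
    # Simpson's rule (steps must be even) for the area under the
    # t density between 0 and t.
    h = t / steps
    area = t_density(0, df) + t_density(t, df)
    for k in range(1, steps):
        area += (4 if k % 2 else 2) * t_density(k * h, df)
    area *= h / 3
    return f_value, max(0.0, 1 - 2 * area)


# r squared of 0.667 from 11 observations.
f_table, p_table = prob_f(0.667, 11)
# The same small r squared, from few and from many observations.
f_few, p_few = prob_f(0.02, 10)
f_many, p_many = prob_f(0.02, 400)

print("r2=0.667, n=11 :", round(f_table, 2), round(p_table, 4))
print("r2=0.02,  n=10 :", round(f_few, 2), round(p_few, 4))
print("r2=0.02,  n=400:", round(f_many, 2), round(p_many, 4))
```

The second and third calls use the same small r squared. Only the number of observations differs, and that alone moves the probability from well above 0.05 to below it.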
You cannot rely completely on calculated values of r and PROB>F to determine how well the line fits the data. It is equally important to examine a plot of the residual variation. This "residual" variation is the variation from the mean that is not accounted for by the regression; for each observation, the residual is the observed Y value minus the Y value predicted by the line.
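A residual plot does not need special software. The sketch below, which uses made-up data and a crude text display, computes each residual as observed minus predicted and marks it to the left or right of a zero line. It is only an illustration of the idea, not the plotting method used for the figures in this text.

```python
# Invented data for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.3, 3.1, 4.8, 5.2, 6.9, 7.4, 9.2, 9.6]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b = sxy / sxx
a = mean_y - b * mean_x

# Residual = observed Y minus the Y predicted by the fitted line.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Crude text plot: one row per observation, "|" marks zero and "*"
# marks the residual. Assumes at least one residual is not zero.
width = 20
scale = max(abs(e) for e in residuals)
for xi, e in zip(x, residuals):
    row = [" "] * (2 * width + 1)
    row[width] = "|"
    row[width + round(e / scale * width)] = "*"
    print(f"{xi:>4} {''.join(row)} {e:+.3f}")

print("sum of residuals:", sum(residuals))
```

For a least-squares line with an intercept, the residuals always sum to zero apart from rounding error, so what matters in the plot is their pattern, not their total.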
Four cases have been devised (Anscombe, F. J. 1973. Graphs in Statistical Analysis, Am. Statistician 27: 17-21) that show some of the classical problems of fitting data by least-squares techniques. While the problems in these examples will be obvious, don't be misled. The same sorts of situations can occur in your data, but they may not be seen as clearly.
The four sets of Anscombe data are particularly fascinating because all of the sets have identical properties according to their descriptive statistics and measures of fit with least-squares procedures.
ANSCOMBE CASE    MODEL        COEFFICIENT a   COEFFICIENT b   r SQUARE   PROB>F
Residual var.    Y1=a+b*X1    3.0             0.5             0.667      0.0022
Specif. Err.     Y2=a+b*X1    3.0             0.5             0.667      0.0022
Outlier          Y3=a+b*X1    3.0             0.5             0.667      0.0022
High-leverage    Y4=a+b*X4    3.0             0.5             0.667      0.0022
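The agreement shown in the table can be reproduced with a few lines of code. The data values below are the ones commonly reproduced from the Anscombe (1973) paper cited above; they do not appear in this text, so check them against the original paper before relying on them. The function name `fit` is just a label for this sketch.

```python
def fit(x, y):
    """Return intercept a, slope b, and r squared for the line Y = a + b*X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b, sxy * sxy / (sxx * syy)


# Anscombe data as commonly reproduced from the 1973 paper.
# X1 is shared by the first three cases; X4 is used only by the fourth.
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
cases = {
    "Residual var.": (x1, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                           7.24, 4.26, 10.84, 4.82, 5.68]),
    "Specif. Err.": (x1, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                          6.13, 3.10, 9.13, 7.26, 4.74]),
    "Outlier": (x1, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                     6.08, 5.39, 8.15, 6.42, 5.73]),
    "High-leverage": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                           5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: fit(x, y) for name, (x, y) in cases.items()}
for name, (a, b, r2) in results.items():
    print(f"{name:<14} a={a:.2f}  b={b:.2f}  r squared={r2:.3f}")
```

All four rows should print essentially the same a, b, and r squared, even though the four scatters of points look nothing alike.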
The four cases are plotted on the following pages with their line of best fit and residuals. Note the following characteristics of the fits.
Residual variation. This plot is what you would expect. The data points are scattered randomly about the regression line. This is an acceptable least-squares fit.
Specification error. Clearly, a simple smooth curve would fit these data better than the straight line that has been used.
Outlier. If you could ignore the data point at X=13, a straight line would go right through the data points. Note how strongly this one data value influences the trend of the line.
High-leverage point. All of the X values are the same except for one. Yet this one point has completely shifted the direction of the least-squares line.
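One way to put a number on the high-leverage case is the leverage value of each observation, which for a straight-line fit is usually defined as 1/n plus the squared distance of X from its mean divided by the total squared distance. This formula is a standard one but is not given in this text, so treat its use here as an assumption. The X4 and Y4 values are those commonly reproduced from the Anscombe paper.

```python
# Anscombe's fourth case (values as commonly reproduced from the 1973 paper).
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

n = len(x4)
mean_x = sum(x4) / n
mean_y = sum(y4) / n
sxx = sum((xi - mean_x) ** 2 for xi in x4)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x4, y4))
b = sxy / sxx
a = mean_y - b * mean_x

# Leverage of each observation in a straight-line fit (assumed standard formula).
leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x4]

# Position of the observation with the largest leverage, and its residual.
top = max(range(n), key=lambda i: leverage[i])
top_residual = y4[top] - (a + b * x4[top])

for xi, h in zip(x4, leverage):
    print(f"X={xi:>2}  leverage={h:.3f}")
print("residual at the high-leverage point:", top_residual)
```

The point at X=19 has a leverage of essentially 1, the largest value possible, and its residual is essentially zero: the line is forced through that point, so the residual plot alone would not flag it.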
Corrective actions may sometimes be taken in cases where there is a clear distortion of the intent of the fit. Note that the Anscombe cases are extreme examples. The more usual situation is to have some intermediate condition. As a result, the influence of possible distorting conditions still exists, although it will be less apparent than in these examples. It is strongly suggested that you carefully examine the residuals from your regression analyses to see if there are major problems and, sometimes, to get a hint of possible corrective actions. In your work with regressions, make sure that you always plot the residuals and examine their pattern for the following conditions.
If the least-squares fit is "ideal," the residuals are normally distributed across the plot with a mean value of zero. In this case you need not look any further.
If a few of the residual values are much larger in magnitude than all the others, you have an indication of outliers. You must decide whether you want to keep these outliers in your analysis or not. Often, you will want to produce two analyses: one with the outliers and the other without them. This will help you interpret the differences.
A curved relationship may be shown on your residuals plot. This indicates that a non-linear relationship (called a "specification error" in the Anscombe cases) exists in your data. To correct this, you might first try transforming the Y-variable values by taking the log of each Y value. Additionally, you might similarly transform the X-variable values or add an additional X term, such as an X-squared value for each observation. An example of this procedure is given later (page 191).
You might have a progressive change in the variability of the residuals. This would appear as a triangle of values. It may be useful to transform your Y values by taking their log values.
A skewed (or other non-normal) distribution of the residual values may appear in your residual plot. This may also be helped by using a log transformation of your Y-variable values.
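Three of the corrective actions in the list above can be sketched in code: refitting without an outlier, adding an X-squared term, and taking logs of Y when the spread of the residuals grows with X. This is an illustration under stated assumptions, not the procedure used elsewhere in this text. The helper names (`solve`, `polyfit`, `residuals`, `spread_ratio`) are hypothetical, the Anscombe values are those commonly reproduced from the 1973 paper, and the last data set is made up.

```python
import math


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    sol = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(m[row][k] * sol[k] for k in range(row + 1, n))
        sol[row] = (m[row][n] - s) / m[row][row]
    return sol


def polyfit(x, y, degree):
    """Least-squares coefficients [c0, c1, ...] of c0 + c1*X + c2*X**2 + ..."""
    powers = range(degree + 1)
    ata = [[sum(xi ** (i + j) for xi in x) for j in powers] for i in powers]
    aty = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in powers]
    return solve(ata, aty)


def residuals(x, y, coef):
    """Observed Y minus the Y predicted by the fitted polynomial."""
    return [yi - sum(c * xi ** k for k, c in enumerate(coef))
            for xi, yi in zip(x, y)]


x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

# 1. Outlier case: fit with, then without, the largest residual.
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
line_all = polyfit(x1, y3, 1)
res_all = residuals(x1, y3, line_all)
worst = max(range(len(x1)), key=lambda i: abs(res_all[i]))
x_kept = [v for i, v in enumerate(x1) if i != worst]
y_kept = [v for i, v in enumerate(y3) if i != worst]
line_kept = polyfit(x_kept, y_kept, 1)
print("slope with outlier:", round(line_all[1], 3),
      " without:", round(line_kept[1], 3))

# 2. Specification-error case: add an X-squared term.
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
worst_line = max(abs(e) for e in residuals(x1, y2, polyfit(x1, y2, 1)))
worst_quad = max(abs(e) for e in residuals(x1, y2, polyfit(x1, y2, 2)))
print("largest residual, line:", round(worst_line, 3),
      " with X squared:", round(worst_quad, 3))


# 3. Growing variability: compare residual spread before and after log(Y).
def spread_ratio(x, y):
    """Mean |residual| in the upper half of X divided by the lower half."""
    sizes = [abs(e) for e in residuals(x, y, polyfit(x, y, 1))]
    half = len(sizes) // 2
    return (sum(sizes[half:]) / (len(sizes) - half)) / (sum(sizes[:half]) / half)


# Invented data, sorted by X, whose scatter is proportional to the Y value.
x = list(range(1, 21))
y = [math.exp(0.1 * xi) * (1.5 if i % 2 == 0 else 1 / 1.5)
     for i, xi in enumerate(x)]
ratio_raw = spread_ratio(x, y)
ratio_log = spread_ratio(x, [math.log(v) for v in y])
print("spread ratio, raw Y:", round(ratio_raw, 2),
      " log Y:", round(ratio_log, 2))
```

A spread ratio near 1 means the residuals are about equally variable across the plot; a ratio well above 1 corresponds to the triangle of values described above. Whether a log transformation is appropriate for your own data depends on how the scatter arises, so compare the residual plots before and after.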