An example of how using regression improperly, including simply eyeballing the data, can lead to a conclusion that is not supported by the data. Supporting files are attached at the bottom of this page.
Software used for this example: Excel and R
The example below uses least squares regression to compare returns rates between two years. It demonstrates how using a regression model improperly can lead to erroneous conclusions. It then demonstrates how to conduct the analysis properly and which conclusions may be drawn.
Data
Incomplete (half-baked) approach
Scatterplot using Excel
Does NOT include regression confidence bands
Regression equation using Excel
no model F-test
no interval estimates for model coefficients
More rigorous approach
Regression model using R
includes table for testing overall model significance (model F-test)
includes table of coefficients, along with standard error and significance
includes review of model assumptions, residuals (not yet added to this page, but performed)
Scatterplot using R
Does include regression confidence bands
Are the slopes equal?
The incomplete approach just looks at the scatterplots with a straight line drawn through the data.
The more rigorous approach uses the Homogeneity of Slopes model.
See also Regression - Algebra vs Calculus and Probability
Results using Excel
Visually, the two slopes appear to be different. One (year_01) appears to be decreasing over time, while the other (year_02) appears to be increasing over time. But is there really enough evidence to support this claim?
With only the Excel regression equations in hand, it is tempting to compare the point estimates of the slopes of the two lines and stop there.
year_01 slope = -0.9833
year_02 slope = +4.2167
The tempting conclusion is that, because the two point estimates have different signs and values, the two slopes are different. This reasoning ignores the uncertainty in each estimate.
BUT IF WE PERFORM THE REGRESSION PROPERLY, WE FIND THAT THIS CONCLUSION IS NOT SUPPORTED BY THE DATA.
The issue is not necessarily with Excel. If Excel had been used properly to complete the hypothesis test, the unsupported conclusion might have been avoided. However, it is not uncommon for people to use Excel for this type of work without knowing how to complete the model or test the hypotheses. The Excel results shown here are typical of what is produced when a spreadsheet is used as a statistical analysis package. The same unsupported conclusions may also be reached using R (or any statistical software package) if the same incomplete work is performed.
Other examples of incomplete analyses include the over-reliance on the R-square metric, not testing model assumptions or looking at model residuals, and omitting any observations that interfere with a high R-square value (making the data fit the model instead of using the model to describe the data).
Results using appropriate statistical method
Model results for year_01
We find that the slope is not statistically significant. We cannot reject the null hypothesis that the slope is zero.
Model results for year_02
We find that the slope is not statistically significant. We cannot reject the null hypothesis that the slope is zero.
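The slope test described above can be sketched in a few lines. This is only an illustration, written in Python with the standard library rather than the R used for this page; the function name `slope_t_test`, the variable names, and the data are all hypothetical and are not the data from this example. The critical value 2.228 is the two-sided 5% point of Student's t with 10 degrees of freedom, which is valid only because the made-up data set has 12 observations.

```python
import math

def slope_t_test(x, y):
    """Least squares fit of y = b0 + b1*x.

    Returns (slope, standard error of slope, t statistic, residual df).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares; the residual variance uses n - 2 df
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    df = n - 2
    se_slope = math.sqrt(sse / df / sxx)
    return slope, se_slope, slope / se_slope, df

# Hypothetical data for illustration only, NOT the data from this page
month = list(range(1, 13))
rate = [52, 31, 47, 60, 28, 44, 39, 58, 25, 49, 36, 41]

slope, se, t_stat, df = slope_t_test(month, rate)
T_CRIT = 2.228  # two-sided 5% critical value of Student's t with 10 df
lower, upper = slope - T_CRIT * se, slope + T_CRIT * se
print(f"slope = {slope:.4f}, se = {se:.4f}, t = {t_stat:.3f}, df = {df}")
print(f"95% CI for slope: ({lower:.4f}, {upper:.4f})")
if abs(t_stat) > T_CRIT:
    print("reject H0: slope = 0")
else:
    print("cannot reject H0: slope = 0")
```

The point of the sketch is that the slope is reported together with its standard error and an interval estimate. If the interval contains zero, the sign of the point estimate alone says little.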
Graphical comparisons, but with the addition of regression confidence bands.
The true regression line for each year could plausibly lie anywhere within its regression confidence bands, so a range of slopes, including a slope of zero, is consistent with the data.
Placing both year_01 and year_02 on the same graph:
Note the large overlap between the regression confidence bands.
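For readers who want to see where such bands come from, here is a minimal sketch. It assumes the bands are the usual pointwise t-based confidence intervals for the mean response of a simple linear regression; the page does not state exactly how its bands were produced. The function name `confidence_band`, its parameters, and the data are hypothetical, and the code is Python (standard library only) rather than the R used for this page.

```python
import math

def confidence_band(x, y, x0, t_crit):
    """Confidence interval for the mean response at x0 from a simple linear fit.

    t_crit is the two-sided Student's t critical value for n - 2 df,
    supplied by the caller (for example, from a t table).
    Returns (lower, fitted value, upper).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # residual standard error
    fit = intercept + slope * x0
    # The band is narrowest at x_bar and widens toward the ends of the data
    half_width = t_crit * s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    return fit - half_width, fit, fit + half_width

# Hypothetical data for illustration only, NOT the data from this page
month = list(range(1, 13))
rate = [52, 31, 47, 60, 28, 44, 39, 58, 25, 49, 36, 41]

for m in (1, 6.5, 12):
    lower, fit, upper = confidence_band(month, rate, m, t_crit=2.228)  # 10 df
    print(f"month {m}: fit = {fit:.2f}, band = ({lower:.2f}, {upper:.2f})")
```

Evaluating the function over a grid of months for each year and plotting the lower and upper values gives the two bands. Overlapping bands are a visual hint, not a formal test; the formal comparison of slopes is the Homogeneity of Slopes model.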
Homogeneity of Slopes Model
The statistical test of whether the slopes for year_01 and year_02 are equal is the test on the coefficient of the interaction term (month:yearyear_02). This coefficient estimates the difference between the two slopes. Here we find that the coefficient is not statistically significant. Therefore we cannot reject the null hypothesis that the two slopes are equal.
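The test on the interaction coefficient can be reproduced by hand, which may help show what it measures. In a model with a month term, a year indicator, and their interaction, the interaction coefficient equals the difference between the two separately fitted slopes, and its standard error uses the residual variance pooled over both years with n1 + n2 - 4 degrees of freedom. The sketch below uses that equivalent route. It is Python (standard library only) rather than the R used for this page, and the function names and data are hypothetical.

```python
import math

def fit_line(x, y):
    """Simple least squares fit; returns (slope, Sxx, residual sum of squares)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return slope, sxx, sse

def slope_difference_test(x1, y1, x2, y2):
    """t test of H0: slope of group 2 equals slope of group 1.

    Returns (slope difference, its standard error, t statistic, df).
    """
    b1, sxx1, sse1 = fit_line(x1, y1)
    b2, sxx2, sse2 = fit_line(x2, y2)
    df = len(x1) + len(x2) - 4           # two intercepts and two slopes estimated
    pooled_var = (sse1 + sse2) / df      # common residual variance
    diff = b2 - b1                       # the interaction coefficient
    se = math.sqrt(pooled_var * (1 / sxx1 + 1 / sxx2))
    return diff, se, diff / se, df

# Hypothetical data for illustration only, NOT the data from this page
month = list(range(1, 13))
year_01 = [52, 31, 47, 60, 28, 44, 39, 58, 25, 49, 36, 41]
year_02 = [30, 55, 42, 27, 61, 38, 50, 33, 64, 45, 59, 48]

diff, se, t_stat, df = slope_difference_test(month, year_01, month, year_02)
T_CRIT = 2.086  # two-sided 5% critical value of Student's t with 20 df
print(f"slope difference = {diff:.4f}, se = {se:.4f}, t = {t_stat:.3f}, df = {df}")
if abs(t_stat) > T_CRIT:
    print("reject H0: slopes are equal")
else:
    print("cannot reject H0: slopes are equal")
```

Like the interaction model itself, this calculation assumes the same residual variance in both years.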
TBD - might later add information related to model residuals and testing model assumptions