We will use an example data set from Regression Analysis by Example (4th ed.) by Chatterjee and Hadi (Wiley, New York, 2006). The web site for this book is at http://www.ilr.cornell.edu/~hadi/rabe4/. We will use the computer repair data. In this study a random sample of service call records for a computer repair operation was examined, and the length of each call (in minutes) and the number of components repaired or replaced were recorded. The data are in the file P027.txt. It may be loaded from RStudio's Workspace tab
> Import Dataset > From Web URL
or downloaded, saved as a .txt file, and imported from there. After loading, name your data frame "repairs".
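Equivalently, once the file has been downloaded, it can be read from the console. A minimal sketch, assuming P027.txt sits in your working directory and that its first line names the columns (check your copy with str()):

```r
# Read the downloaded, whitespace-delimited data file.
# header = TRUE assumes the first line holds the column names
# (Minutes, Units) -- verify this against your copy of the file.
repairs <- read.table("P027.txt", header = TRUE)
str(repairs)   # should show 14 observations of Minutes and Units
```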
> attach(repairs)
> repairs
   Minutes Units
1       23     1
2       29     2
3       49     3
4       64     4
5       74     4
6       87     5
7       96     6
8       97     6
9      109     7
10     119     8
11     149     9
12     145     9
13     154    10
14     166    10
As always, the first step is to look at your data.
> hist(Minutes)
> summary(Minutes)
> hist(Units)
> summary(Units)
We could have used boxplots instead of histograms. We simply want to see whether there are any peculiarities in the data for each variable by itself before we look into relationships between variables. We see none here.
> plot(Units,Minutes)
Note that the first variable in the plot command is plotted on the horizontal axis. We are not surprised to see that the length of a service call increases with the number of components repaired or replaced.
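plot() also accepts optional labeling arguments, which make the graph easier to read. A sketch (the label text here is our own choice, not part of the data set):

```r
# Same scatterplot, with descriptive axis labels and a title
plot(Units, Minutes,
     xlab = "Components repaired or replaced",
     ylab = "Length of call (minutes)",
     main = "Computer repair data")
```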
> cor(Units,Minutes)
The regression command is lm, for linear model. We will store the fitted model in a variable called model. In the model formula, the dependent variable comes first, followed by a tilde "~", followed by the list of independent variables.
> model = lm(Minutes ~ Units)
> model

Call:
lm(formula = Minutes ~ Units)

Coefficients:
(Intercept)        Units
      4.162       15.509

> summary(model)

Call:
lm(formula = Minutes ~ Units)

Residuals:
    Min      1Q  Median      3Q     Max
-9.2318 -3.3415 -0.7143  4.7769  7.8033

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.162      3.355    1.24    0.239
Units         15.509      0.505   30.71 8.92e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.392 on 12 degrees of freedom
Multiple R-Squared: 0.9874, Adjusted R-squared: 0.9864
F-statistic: 943.2 on 1 and 12 DF, p-value: 8.916e-13
The regression equation is minutes = 4.162 + 15.509*units. The "t values" test the hypotheses that the corresponding population parameters are 0. Usually we test whether the slope is zero, because if it is then the model is not much use. Here the p-value for that test is "8.92e-13", which is to say 8.92×10⁻¹³ or 0.000000000000892, so we reject the hypothesis that the slope is zero.
If you wish to test a nonzero value, subtract it from the coefficient in the regression output (15.509) and divide the result by the coefficient's standard error (0.505). Similarly, if you want a confidence interval, take the coefficient plus or minus the product of its standard error and the t-value for the desired confidence level with 12 degrees of freedom. This also works for the intercept (4.162), using its standard error (3.355).
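R can carry out these computations for you. A sketch, assuming the model object fitted above (the hypothesized slope of 15 is just an illustrative value):

```r
# 95% confidence intervals for the intercept and slope
confint(model, level = 0.95)

# Test H0: slope = 15 (a hypothetical nonzero value) by hand
b  <- coef(summary(model))["Units", "Estimate"]    # 15.509
se <- coef(summary(model))["Units", "Std. Error"]  # 0.505
t  <- (b - 15) / se
2 * pt(-abs(t), df = 12)   # two-sided p-value
```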
To plot the regression line on the scatterplot, type the following (the scatterplot must be the active plot, because abline() draws onto it):
> abline(model)
Regression through the Origin
To fit a regression line through the origin (i.e., intercept=0) redo the regression but this time include that 0 in the model specification.
> model2 = lm(Minutes ~ 0 + Units)
> summary(model2)
This model is minutes = 16.0744*units. We can add this line to the graph to see how different it is.
> abline(model2, lty = "dotted")
Not much.
We now predict the length of a service call in which four components are repaired or replaced, then get interval estimates both for a single observation and for the mean of all observations with Units=4 (based on our first model).
> predict(model, newdata=data.frame(Units=4))
[1] 66.19674
> predict(model, newdata=data.frame(Units=4), interval = "pred")
          fit      lwr      upr
[1,] 66.19674 53.83936 78.55413
> predict(model, newdata=data.frame(Units=4), interval = "confidence")
          fit      lwr      upr
[1,] 66.19674 62.36271 70.03077
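predict() also accepts several new values at once. A sketch, assuming the model object from above:

```r
# Predictions, with prediction intervals, for calls involving
# one through ten repaired or replaced components
newvals <- data.frame(Units = 1:10)
predict(model, newdata = newvals, interval = "prediction")
```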
If we want the entire confidence and prediction bands, we may use
> library(fastR) #If not installed, go to Tools > Install Packages, 'fastR'
> xyplot(Minutes ~ Units, panel=panel.lmbands)
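If you prefer to stay in base graphics rather than install fastR, the bands can be built from predict() directly. A sketch, assuming the model and the attached repairs data from above:

```r
# Evaluate both bands on a fine grid of Units values
grid <- data.frame(Units = seq(min(Units), max(Units), length.out = 100))
cb <- predict(model, newdata = grid, interval = "confidence")
pb <- predict(model, newdata = grid, interval = "prediction")

# Redraw the scatterplot with the fitted line and both bands
plot(Units, Minutes)
abline(model)
matlines(grid$Units, cb[, c("lwr", "upr")], lty = "dashed", col = "blue")
matlines(grid$Units, pb[, c("lwr", "upr")], lty = "dotted", col = "red")
```

The prediction band (for a single new observation) is always wider than the confidence band (for the mean response), as the intervals at Units=4 above already showed.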
> rm(model, model2)
> detach(repairs)
Exercises
This exercise uses the States95 dataset. For directions on how to access it, see the Getting Started with R tutorial.
1. Try to find the best simple linear regression model for explaining "sat" scores. Make three simple linear regression models with "sat" as the response variable. Which of the three is the best? Defend your answer. (Hint: You only need to use the lm(), plot(), and abline() commands. Judge which is best using the scatterplot. We will discuss more technical ways of assessing "best" in future lessons.)
2. If it has been covered in class, check that the assumptions of linear regression have been satisfied for your best model (i.e., you only need to check one model). (Note: I do not cover this in Math 210 IntroStats, but I do in Math 318 Biostats.)