In giving a numerical example to illustrate a statistical technique, it is nice to use real data. In many situations, the reader can see how the technique can be used to answer questions of real interest. Our example here, however, uses real data to illustrate a number of regression pitfalls. Some of the data come from Yule and Kendall, An Introduction to the Theory of Statistics (14th ed.), Charles Griffin, London, 1950. Another variable was added in Tufte, Data Analysis for Politics and Policy, Prentice Hall, Englewood Cliffs, NJ, 1974. (Click on the title of the second book to get a free copy on-line.) Yule and Kendall had pointed out a nonsense correlation between the number of radio licenses (in millions) and the number of mental defectives (per 10,000 of estimated population) in Great Britain during the years 1924-1937. (In Great Britain, when you buy a radio you buy a license to use it, with the revenue going to support public broadcasting.) You can find the data at our site as a plain text file, or as an Excel spreadsheet. Download the text file now and save it in R's working directory (type getwd() in R to see which directory that is). To use the file in R, you must define a variable to hold the contents of this file.
> radios <- read.table("radios.txt",header=TRUE)
Here radios <- can be read as "create a variable radios and in it store the results of...". The argument header=TRUE tells R that the first row of the file should be interpreted as variable names. You can then display what you have created by typing its name:
> radios
   Year licenses defects lengths
1  1924    1.350       8       6
2  1925    1.960       8       6
3  1926    2.270       9       6
4  1927    2.483      10       6
5  1928    2.730      11       6
6  1929    3.091      11       7
7  1930    3.647      12       7
8  1931    4.620      16       7
9  1932    5.497      18       7
10 1933    6.260      19       8
11 1934    7.012      20       8
12 1935    7.618      21       8
13 1936    8.131      22       8
14 1937    8.593      23       8
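If you just want to confirm that the file was read correctly, without printing every row, the str function gives a compact summary of the object's structure (a quick check we find handy; nothing below depends on it):

> str(radios)

It should report a data frame of 14 observations on 4 variables, along with the first few values of each variable.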
First, are there any correlations here?
> cor(radios)
              Year  licenses   defects   lengths
Year     1.0000000 0.9862573 0.9815811 0.9434564
licenses 0.9862573 1.0000000 0.9920525 0.9425656
defects  0.9815811 0.9920525 1.0000000 0.9332053
lengths  0.9434564 0.9425656 0.9332053 1.0000000
We found it easiest to just apply the cor function to the entire data set. Here we can see, for example, that the number of licenses is highly correlated with Year (r=0.9862573), as is the number of defectives (r=0.9815811). This is very common with time series data. Any two variables that move together in a time series will show correlation, but there may be no real relationship between them other than that they happened to both be changing steadily during that time period. To sharpen the point, Tufte added a third variable: the number of letters in the first name of the person who was President of the United States during (most of) that year. As it happened, that variable was steadily increasing during this time period (r=0.9434564 with Year). As a result, we can predict the number of mental defectives using licenses or lengths.
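As an aside before we fit any models: if you want a single correlation rather than the whole matrix, you can compute it directly. For example, the correlation between licenses and defects shown in the matrix above is

> cor(radios$licenses, radios$defects)
[1] 0.9920525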
> lm1 <- lm(defects ~ licenses, data=radios)
> summary(lm1)

Call:
lm(formula = defects ~ licenses, data = radios)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9024 -0.5181 -0.2144  0.4317  1.3014

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.5822     0.4233   10.82 1.51e-07 ***
licenses      2.2042     0.0807   27.31 3.58e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7262 on 12 degrees of freedom
Multiple R-Squared: 0.9842,     Adjusted R-squared: 0.9828
F-statistic:   746 on 1 and 12 DF,  p-value: 3.577e-12
Here lm1 is a variable in which to store the results of fitting the model. The lm function (linear model) fits the model. Its first argument is a formula: the dependent variable, followed by "~", followed by the independent variable(s). The additional (optional) argument data=radios tells R which data set to look in to find these variables.
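Once the fitted model is stored, you can also extract pieces of it without rerunning anything. For example, the coef function pulls out just the estimated coefficients (rounded here to match the summary above):

> round(coef(lm1), 4)
(Intercept)    licenses
     4.5822      2.2042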
> lm2 <- lm(defects ~ lengths, data=radios)
> summary(lm2)

Call:
lm(formula = defects ~ lengths, data = radios)

Residuals:
    Min      1Q  Median      3Q     Max
-3.8571 -0.9571  0.1429  1.2179  3.1429

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -26.4429     4.6242  -5.718 9.63e-05 ***
lengths       5.9000     0.6558   8.996 1.11e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.074 on 12 degrees of freedom
Multiple R-Squared: 0.8709,     Adjusted R-squared: 0.8601
F-statistic: 80.93 on 1 and 12 DF,  p-value: 1.109e-06
We see, for example, that 87.09% of the variability in the number of mental defectives is "explained by" the number of letters in the President's first name. Because this "explanation" lacks credibility, some prefer to refer to the variability as "shared" between the two variables. During this period, the two variables showed pretty much the same trend. (We leave it to the reader to explore the impact of Harry S. Truman becoming President in 1945.) If this all seems a little silly, just remember that the next time you see a high correlation, it could be equally silly.
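That 87.09% is nothing more than the square of the correlation between defects and lengths from the matrix above, as you can verify directly:

> cor(radios$defects, radios$lengths)^2
[1] 0.8708721

which rounds to the Multiple R-squared of 0.8709 reported by summary(lm2).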
For our purposes today, though, we will continue and illustrate the use of two independent variables in a single model.
> lm3 <- lm(defects ~ lengths + licenses, data=radios)
> summary(lm3)

Call:
lm(formula = defects ~ lengths + licenses, data = radios)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9135 -0.5461 -0.2124  0.4299  1.2721

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.1601     3.9387   1.310    0.217
lengths      -0.1059     0.7174  -0.148    0.885
licenses      2.2393     0.2521   8.882 2.38e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7577 on 11 degrees of freedom
Multiple R-Squared: 0.9842,     Adjusted R-squared: 0.9813
F-statistic: 342.6 on 2 and 11 DF,  p-value: 1.238e-10
The F-test (sample F=342.6, p=1.238e-10 = 1.238 × 10^-10 = 0.0000000001238) indicates that the model as a whole reflects a real association between the dependent variable and the independent variables as a group. The t-tests for the individual coefficients indicate that licenses makes a strong contribution to the model (t=8.882, p=0.00000238) while lengths does not (t=-0.148, p=0.885). Note that when we fit a model with lengths alone as the independent variable we got a very different result (t=8.996, p=0.00000111). This is due to the conditional nature of the coefficients in multiple regression. The variable lengths by itself shares much of its variability with defects, but once you include licenses there is very little additional shared variability. This is a common problem when we have lots of variability shared among the independent variables, something called "multicollinearity". Here we have r=0.9425656 between lengths and licenses. When we have multicollinearity, it is common for individual variables to change in significance and for their coefficients to change and even reverse sign from one model to the next, as happened here.
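A standard way to quantify multicollinearity is the variance inflation factor, VIF = 1/(1 - R^2), where R^2 comes from regressing each independent variable on the others. With only two independent variables this reduces to 1/(1 - r^2), so we can compute it by hand (the vif function in the add-on car package automates this for larger models):

> r <- cor(radios$lengths, radios$licenses)
> round(1/(1 - r^2), 2)
[1] 8.96

Values much above 5 or 10 are commonly taken as a warning sign, so a VIF near 9 confirms what the correlation matrix already told us.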
Regression analysis involves a number of assumptions. Checking those here is especially enlightening because doing so shows how to detect nonsense regressions that are not so obviously nonsense.
First, let's talk about the design of the study and the nature of the variables. This is not a random sample of years; it is fourteen consecutive years. Time series data such as these rarely constitute a random sample. We also rarely have independence of observations for time series data. Instead, a given time period is likely to give results similar to those of the preceding and following time periods -- at least more similar than the results for distant time periods. This problem is called "autocorrelation". Nor is this any kind of experiment. We cannot control the number of radios licensed nor the number of letters in the President's name.
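If you want to look for autocorrelation directly, the acf function in base R plots the correlation of a series with itself at various lags. Applied to the residuals of our third model (a quick check, not part of the original example):

> acf(residuals(lm3))

Spikes extending beyond the dashed bands at small lags suggest autocorrelation, though with only 14 observations the bands are wide and the check is weak.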
There are also some standard checks we can make of the data once they have been collected. The models above all assume that we have constant variance in the residuals. This is checked by plotting the residuals against the predicted values of y. First, extract both from the results of the fitted model and store them in variables.
> res <- residuals(lm3)
> fits <- fitted(lm3)
Then make a plot.
> plot(res ~ fits)
We'd like to see random scatter, and maybe we do on the left, but to the right the points make a wandering path -- probably because this is time series data.
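One way to make that wandering easier to see is to plot the residuals in time order and connect consecutive years (our own addition to the standard checks):

> plot(res ~ radios$Year, type="o")

A residual series that drifts in long runs above and below zero, rather than jumping about randomly, is typical of autocorrelated time series data.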
Another assumption is that the relationship between the dependent variable and each independent variable is linear. We usually plot residuals against each independent variable to see if there are signs of curvature.
> plot(res ~ licenses, data=radios)
> plot(res ~ lengths, data=radios)
The first plot looks about the same as the previous one, but here it suggests possible curvature in the relationship with licenses. The second plot looks very different because there are only three distinct values of lengths during these years. There are very slight signs of curvature here (higher in the middle) and very slight signs of different variances at the three levels. However, with only four or five observations in each group, we should not take these too seriously.
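With only three distinct values of lengths, side-by-side boxplots can be easier to read than the scatterplot (an alternative display, not a required step):

> boxplot(res ~ lengths, data=radios)

This shows the spread of the residuals at each of the three values of lengths directly.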
Normality of the residuals can be assessed informally with simple univariate displays.
> stem(res)

  The decimal point is at the |

  -0 | 96665
  -0 | 4310
   0 | 4
   0 | 57
   1 | 23
This shows that the residuals are skewed toward high values, suggesting some sort of transformation of the dependent variable. The skewness also shows up as curvature in a normal probability plot.
> qqnorm(res)
> qqline(res)
The first command creates the plot; the second adds a line to guide the eye in judging whether the points follow a straight line.
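If you prefer a formal test to judging the plot by eye, base R also provides the Shapiro-Wilk test of normality, though with only 14 residuals it has little power:

> shapiro.test(res)

A small p-value would be evidence against normality; a large one means only that the test could not detect a departure.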
The skewness could also be related to curvature in the relationship with licenses. Note that in that residual plot we had clumps of low values at each end and a few high values in the middle. Those high values form the upper tail of the stem-and-leaf display.
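As one sketch of the kind of transformation the skewness suggests (lm4 is a name of our own choosing, and we are not claiming this is the right model for these data), you could refit with the logarithm of the dependent variable and repeat the residual checks:

> lm4 <- lm(log(defects) ~ licenses, data=radios)   # illustrative refit, not part of the original analysis
> plot(residuals(lm4) ~ fitted(lm4))

If the transformation helps, the skewness in the stem-and-leaf display and the curvature in the normal probability plot should both be reduced.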
Exercises
1. Conduct a one-way ANOVA with "sat" as the response variable and "region" as the explanatory variable. What do you discover?
2. Check that the assumptions of the one-way ANOVA are satisfied for question 1.