Post date: Nov 11, 2015 1:03:07 AM
From the files.
Anne.Terakshun@multipleregress.fun wrote:
I'm having trouble creating a two-factor interaction versus residuals plot in R. I've tried a bunch of techniques from Google and none of them work. Can you help?
With quantitative predictors, a second-order (product) interaction term for x1 and x2 is the created variable x1 x2 (the product of x1 and x2). The idea is to accommodate situations where the change in E{Y} as x1 increases by 1 depends on the value (level) of x2. (With no interaction this change does not depend on x2, which may in practice be unrealistic.) So, if you plot the residuals versus a potential interaction and see a pattern, you have reason to think your model should include that interaction.
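To make that concrete, here's a self-contained sketch (simulated data, not Anne's attached workspace - the coefficients and sample size are made up): the true slope of y in x1 changes with x2, so the no-interaction fit has a badly inflated residual SD while the interaction fit recovers it.

```r
# Hypothetical illustration: the slope of y in x1 depends on x2
set.seed(1)
x1 <- runif(200, 0, 10)
x2 <- runif(200, 20, 100)
y  <- 5 - 2*x1 + 4*x2 + 2*x1*x2 + rnorm(200, sd = 2)

no_int   <- lm(y ~ x1 + x2)             # assumes one common slope in x1
with_int <- lm(y ~ x1 + x2 + I(x1*x2))  # slope in x1 is beta1 + beta3*x2

summary(no_int)$sigma    # large residual SD
summary(with_int)$sigma  # close to the true sd of 2
```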
Here's an example, with a plot. It's quite easy. The data frame containing these variables is attached below. Load the workspace - the data frame is ex_interaction.
> head(data.frame(x1,x2,y),3)
        x1       x2        y
1 1.404058 76.75276 527.9944
2 7.388037 41.73577 774.2387
3 5.307568 56.75758 824.5427
> summary(x1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1766  1.4340  2.7170  3.6860  5.4400  8.6670
> summary(x2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20.00   39.10   50.80   53.34   68.78   93.36
> r <- lm(y~x1+x2)
> summary(r)
Call:
lm(formula = y ~ x1 + x2)
Residuals:
     Min       1Q   Median       3Q      Max
-171.651  -55.748    9.594   43.380  246.349
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -311.6688    58.8482  -5.296 8.39e-06 ***
x1            85.4943     6.5394  13.074 2.22e-14 ***
x2            10.7111     0.8428  12.709 4.76e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 99.48 on 32 degrees of freedom
Multiple R-squared: 0.8913, Adjusted R-squared: 0.8845
F-statistic: 131.2 on 2 and 32 DF, p-value: 3.794e-16
> anova(r)
Analysis of Variance Table
Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
x1         1  998302  998302  100.89 2.035e-11 ***
x2         1 1598377 1598377  161.53 4.756e-14 ***
Residuals 32  316651    9895
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> plot(x1*x2,resid(r))
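If the pattern in that plot is subtle, a lowess smooth on top of it can help bring it out. Same plot as above, just decorated (shown here with simulated stand-in data so the snippet runs on its own - with the attached workspace you'd skip the simulation lines):

```r
# Stand-in for the attached ex_interaction data, just so this runs alone
set.seed(1)
x1 <- runif(35, 0, 9); x2 <- runif(35, 20, 95)
y  <- 5 - 2*x1 + 4*x2 + 2*x1*x2 + rnorm(35, sd = 2)
r  <- lm(y ~ x1 + x2)

# Residuals vs. the candidate interaction, smooth overlaid to show the trend
plot(x1*x2, resid(r), xlab = "x1 * x2", ylab = "residual")
lines(lowess(x1*x2, resid(r)), col = "red", lwd = 2)
abline(h = 0, lty = 2)
```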
Now, at this point - even though R² is "high" - you might suspect that x1 x2 (or at least something involving both) ought to be included in the model. See how the residuals "depend" on x1 x2? Here's the refit, with the residual plot redone.
> ri <- lm(y~x1+x2+I(x1*x2))
> summary(ri)
Call:
lm(formula = y ~ x1 + x2 + I(x1 * x2))
Residuals:
    Min      1Q  Median      3Q     Max
-3.3803 -0.9208 -0.1004  1.4465  2.4664
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.085185   1.495939   3.399  0.00187 **
x1          -1.975003   0.309193  -6.388 4.09e-07 ***
x2           4.017668   0.026753 150.174  < 2e-16 ***
I(x1 * x2)   1.994334   0.006493 307.147  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.832 on 31 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.894e+05 on 3 and 31 DF, p-value: < 2.2e-16
> anova(ri)
Analysis of Variance Table
Response: y
           Df  Sum Sq Mean Sq F value    Pr(>F)
x1          1  998302  998302  297521 < 2.2e-16 ***
x2          1 1598377 1598377  476360 < 2.2e-16 ***
I(x1 * x2)  1  316547  316547   94340 < 2.2e-16 ***
Residuals  31     104       3
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> plot(x1*x2,resid(ri))
Now - no residual pattern and an almost perfect fit. (Just look at the change in the magnitude of the residuals - the estimated standard deviation of the residuals drops from 99.5 to 1.83.) The partial R² is pretty much 100%. Granted, it's a somewhat cooked example, but there you go.
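For the record, that "pretty much 100%" partial R² can be read straight off the anova(ri) table: the interaction sum of squares over (interaction SS + residual SS).

```r
# Partial R^2 for x1*x2, given x1 and x2, from the anova(ri) table above
ss_int   <- 316547  # Sum Sq for I(x1 * x2)
ss_resid <- 104     # residual Sum Sq in the interaction model
ss_int / (ss_int + ss_resid)  # about 0.9997
```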
Perhaps a 3d plot gives evidence of this - so you'd "see it" first? (Of course this method fails if there's an x3 and you need a 4d plot.)
> library(scatterplot3d)
> scatterplot3d(x1,x2,y,highlight.3d=TRUE,col.grid="lightblue")
Can you tell from this plot that a straight "linear in both variables only" model
E{Y} = beta0 + beta1 x1 + beta2 x2
is deficient? Maybe? I'm not so sure about that. And maybe if you "spin it". But certainly it's very clear from the residuals vs. x1 x2 plot that some form of interaction is present.
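Incidentally, R's formula language has a shorthand for all this: y ~ x1*x2 expands to both main effects plus the product, so it fits exactly the same model as y ~ x1 + x2 + I(x1*x2) (the term just gets labeled x1:x2). Demonstrated here on simulated stand-in data so the snippet runs on its own:

```r
# Stand-in data (not the attached workspace)
set.seed(1)
x1 <- runif(35, 0, 9); x2 <- runif(35, 20, 95)
y  <- 5 - 2*x1 + 4*x2 + 2*x1*x2 + rnorm(35, sd = 2)

fit_a <- lm(y ~ x1 + x2 + I(x1*x2))
fit_b <- lm(y ~ x1*x2)  # same model; interaction term is labeled x1:x2
all.equal(unname(coef(fit_a)), unname(coef(fit_b)))  # TRUE
```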
So here's the fit
Y-hat = 5.085 - 1.975 x1 + 4.018 x2 + 1.994 x1 x2
Alpha is pretty good for plotting this sort of thing (and the Pro version lets you spin it and change the perspective / angles / scales, etc.)
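If you'd rather stay in R, base graphics can draw the fitted surface too. A sketch with the rounded estimates from the fit hard-coded (you could build the same thing from coef(ri) and predict()):

```r
# Fitted surface: y-hat = 5.085 - 1.975*x1 + 4.018*x2 + 1.994*x1*x2
g1 <- seq(0, 9, length.out = 30)    # grid over the observed x1 range
g2 <- seq(20, 95, length.out = 30)  # grid over the observed x2 range
z  <- outer(g1, g2, function(a, b) 5.085 - 1.975*a + 4.018*b + 1.994*a*b)
persp(g1, g2, z, theta = 30, phi = 25,
      xlab = "x1", ylab = "x2", zlab = "y-hat",
      col = "lightblue", ticktype = "detailed")
```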
What you have here is a "skew plane". (Contours for an ordinary plane would be linear.) See how crosscuts are all linear, but the slopes are different? For x2 (y in the plot) = 20 the slope, as x1 (x in the plot) increases, is moderate. For x2 = 100 it's much higher. To go a little further:
Where x2 = 20: Y-hat = 5.085 - 1.975 x1 + 4.018(20) + 1.994(20) x1 = 85.44 + 37.91 x1
Where x2 = 100: Y-hat = 5.085 - 1.975 x1 + 4.018(100) + 1.994(100) x1 = 406.85 + 197.46 x1
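Those crosscut lines can also be pulled straight from the coefficient vector. Here with the full-precision estimates typed in from the summary above (with the fitted object in hand you'd use coef(ri) instead); the hypothetical helper just collects the conditional intercept and slope:

```r
b <- c(5.085185, -1.975003, 4.017668, 1.994334)  # (Intercept), x1, x2, x1:x2

# Intercept and slope of y-hat in x1, holding x2 fixed
crosscut <- function(x2) {
  c(intercept = b[1] + b[3]*x2,  # constant part at this x2
    slope     = b[2] + b[4]*x2)  # slope in x1 at this x2
}
round(crosscut(20), 2)   # intercept 85.44, slope 37.91
round(crosscut(100), 2)  # intercept 406.85, slope 197.46
```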
Good question. Thanks for asking, Anne!