Script for the examples below
Linear models include many familiar tests and analyses as special cases. The main ones are linear regression, analysis of variance (ANOVA), and analysis of covariance (ANCOVA).
In a bit more detail:
Linear regression
Linear regression involves a model of the form
Y^ = b0 + b1X1 + ... + bkXk
where Y^ is the prediction for a continuous response Y, and X1, ..., Xk are predictors or independent variables (set by observation or design) that are assumed to be fixed/known. The standard method of fitting a regression model is least squares estimation, which simply involves finding values of the model parameters b0, ..., bk that minimize the sum of squared differences between the observed (Y) and predicted (Y^) values. In the special case where the residual errors (Yi - Y^i) are normally distributed, the least squares estimates are also the maximum likelihood estimates, with the attendant validity of confidence intervals, hypothesis tests, AIC comparisons, etc. This case is sometimes referred to as normal regression, to distinguish it from cases we will consider later involving other error distributions.
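As a quick illustration (a minimal sketch with simulated data; the names x, y, and n are made up for this example and are not part of the chick analysis below), the least squares estimates returned by lm() for a single predictor match the familiar closed-form solutions b1 = cov(X,Y)/var(X) and b0 = mean(Y) - b1*mean(X):
> # Simulated data for illustration only
> set.seed(1)
> n <- 50
> x <- runif(n, 0, 10)
> y <- 2 + 0.5 * x + rnorm(n, sd = 1)
> b1 <- cov(x, y) / var(x)       # closed-form least squares slope
> b0 <- mean(y) - b1 * mean(x)   # closed-form least squares intercept
> c(b0, b1)
> coef(lm(y ~ x))                # same estimates from lm()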
Analysis of variance (ANOVA)
Under ANOVA the "predictor variable" involves discrete groupings (factor levels). However, we still have a linear model, and computations are performed by methods very akin to those used in regression (again, minimizing the error sum of squares). As with regression, if we invoke additional normality assumptions, distribution-based tests such as the F-test provide inference about hypotheses of interest, as well as confidence intervals for estimated treatment effects, contrasts, etc.
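To make the least squares connection concrete, here is a minimal sketch (simulated data; y, group, fit0, and fit1 are made-up names) showing that the one-way ANOVA F statistic is just a comparison of residual sums of squares between the intercept-only (grand mean) model and the model with separate group means:
> # Simulated one-way layout for illustration only
> set.seed(2)
> group <- factor(rep(c("A", "B", "C"), each = 20))
> y <- rnorm(60, mean = rep(c(10, 12, 11), each = 20), sd = 2)
> fit0 <- lm(y ~ 1)       # grand-mean (null) model
> fit1 <- lm(y ~ group)   # separate group means
> rss0 <- sum(resid(fit0)^2)
> rss1 <- sum(resid(fit1)^2)
> Fstat <- ((rss0 - rss1) / 2) / (rss1 / df.residual(fit1))
> Fstat                   # same F as in anova(fit1)
> anova(fit1)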
Analysis of covariance (ANCOVA)
Analysis of covariance is a sort of blend of linear regression and ANOVA, in which we are interested in testing and estimating group effects (as in ANOVA) while taking into account the linear effect of one or more covariates. The close connection between regression, ANOVA, and ANCOVA is emphasized by the fact that R uses the same function, lm(), for all three. This is aptly illustrated by the chick data example.
> data(ChickWeight)
> chicks <- ChickWeight
First, the ANOVA of weight by diet treatment is constructed with lm() and summarized, along with an ANOVA table:
> # ANOVA: weight as a function of diet (a factor)
> model1 <- lm(weight ~ Diet, data = chicks)
> summary(model1)   # coefficient estimates, standard errors, t tests
> aov(model1)       # sums of squares and degrees of freedom by term
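If you want the ANOVA table with the F test and p value in a single step, anova() applied to the lm fit (or summary() applied to the aov fit) does the job; this is standard R usage, not part of the accompanying script:
> anova(model1)          # ANOVA table with the F test for Diet
> summary(aov(model1))   # same table via aov()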
The regression of weight on time is likewise built with lm(). It too has an ANOVA table (1 df for the slope of the regression):
> ## REGRESSION: weight as a linear function of Time
> model2 <- lm(weight ~ Time, data = chicks)
> summary(model2)   # slope and intercept estimates
> aov(model2)       # sums of squares; 1 df for the Time slope
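A quick base-R plot of the raw data with the fitted line (a small addition, not part of the accompanying script) helps visualize this regression:
> plot(weight ~ Time, data = chicks)   # scatterplot of weight against Time
> abline(model2)                       # add the fitted least squares line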
ANCOVA puts these together. Note that there are 2 possible models: additive Time and Diet effects (the same slope of weight vs. Time across all diets) or interactive effects (slopes differ among diets).
> # ANCOVA
> model3 <- lm(weight ~ Time + Diet, data = chicks)   # additive: common slope across diets
> summary(model3)
> aov(model3)
> model4 <- lm(weight ~ Time * Diet, data = chicks)   # interaction: diet-specific slopes
> summary(model4)
> aov(model4)
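Because the additive model is nested within the interaction model, the two can be compared directly with an F test (again resting on the normality assumptions); this is standard lm()/anova() usage rather than part of the script above:
> anova(model3, model4)   # F test for the Time:Diet interaction (diet-specific slopes)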
The accompanying script file adds two features that will often be useful.
Note: lm() and related procedures in R automatically and without warning produce p values, confidence intervals, and AIC statistics. Again, these are based on MLE and normality assumptions. The least squares estimates are unbiased without these assumptions, but understand that if you report p values and CIs or use AIC, you are implicitly assuming normality, unless you make a different distributional assumption, which we're not allowed to do in the lm() function but can in the glm() function, coming up next.
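For example (a small sketch using the models fitted above), the following standard calls produce confidence intervals and AIC values, and all of them carry the normality assumption just described:
> confint(model4)                       # normal-theory confidence intervals for the coefficients
> AIC(model1, model2, model3, model4)   # AIC comparison across the four models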
Next: Assignment