Residuals
The main idea...
Residuals (~ "leftovers") represent the variation that a given model, uni- or multivariate, cannot explain (Figure 1). In other words, a residual is the difference between the observed value of a response variable and the value predicted for it by some model. Far from being uninteresting, the variation left over after an analysis can reveal patterns and structures that indicate whether the analysis method used was appropriate or whether the data need to be transformed to suit the assumptions of a given method (e.g. Figure 2). When using multivariate methods, a matrix of residual values is typically produced after fitting a model.
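As a minimal sketch of this definition, the Python snippet below fits a straight line to a handful of made-up values with NumPy and computes each residual as the observed value minus the fitted value. The data and variable names are illustrative assumptions, not taken from any real study.

```python
import numpy as np

# Hypothetical example data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Least-squares fit of y = slope * x + intercept
# (np.polyfit returns the highest-degree coefficient first)
slope, intercept = np.polyfit(x, y, deg=1)

fitted = slope * x + intercept  # values predicted by the model
residuals = y - fitted          # observed minus predicted

print(residuals)
```

For a least-squares fit that includes an intercept, the residuals sum to zero (up to floating-point error), so it is their pattern and spread that carry diagnostic information rather than their mean.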
Residuals in diagnostics
Residuals can be analysed as a means of diagnosing a model fit, that is, examining whether a model has satisfactorily summarised the variation in a data set or suffers from some shortcoming. Some basic strategies to examine residuals include:
Plotting residual values against the corresponding fitted values of a model. If there is a structured relationship between the two, the residuals may be heteroscedastic or there may be more complex relationships that violate many model assumptions. The magnitude of a residual should be independent of the corresponding fitted value.
Testing residual values for normality can help assess whether the errors associated with a model are Gaussian (an assumption of many linear methods). The Shapiro-Wilk test or the D'Agostino-Pearson test can be used here.
Testing residual values for correlation with (a matrix of) independent variables (including space and time) can reveal if there are structured relationships that have not been accounted for in the explanatory variables of a model.
Data points associated with large residuals, which may appear as outliers in the distribution, should be examined more closely.
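The strategies listed above can be sketched in Python roughly as follows. This is only an illustration on simulated data, assuming NumPy and SciPy are available. The covariate `unmodelled`, the rank-correlation check on residual magnitude, and the cut-off of 2 standard deviations for "large" residuals are assumptions made for the example, not recommendations from the text. In practice, a plot of residuals against fitted values should accompany any such numerical check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed for reproducibility

# Simulated data: linear signal plus Gaussian noise (illustrative assumption)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

# 1. Residual magnitude against fitted values: a strong monotonic
#    relationship would hint at heteroscedasticity
rho, p_spread = stats.spearmanr(np.abs(residuals), fitted)

# 2. Normality of the residuals
w_stat, p_shapiro = stats.shapiro(residuals)        # Shapiro-Wilk
k2_stat, p_dagostino = stats.normaltest(residuals)  # D'Agostino-Pearson

# 3. Correlation with a variable left out of the model
#    (hypothetical covariate, simulated here as pure noise)
unmodelled = rng.normal(size=x.size)
r_cov, p_cov = stats.pearsonr(residuals, unmodelled)

# 4. Flag data points with large residuals for closer inspection
#    (the threshold of 2 is an arbitrary choice for this sketch)
z_scores = residuals / residuals.std(ddof=1)
flagged = np.flatnonzero(np.abs(z_scores) > 2)

print(f"spread vs fitted: rho={rho:.2f}, p={p_spread:.3f}")
print(f"Shapiro-Wilk p={p_shapiro:.3f}, D'Agostino-Pearson p={p_dagostino:.3f}")
print(f"correlation with unmodelled covariate: r={r_cov:.2f}, p={p_cov:.3f}")
print("indices of large residuals:", flagged)
```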
As noted by Garcia-Berthou (2001) and Freckleton (2002), the analysis of residuals from multivariable data as a diagnostic tool is justified; however, using such residuals as a target of analysis (i.e. as 'data') when data sets have intercorrelated explanatory variables is generally not valid and will generate parameter estimates biased by the degree of intercorrelation. This also holds true when attempting to "control for" the effects of potentially confounding variables using residual analysis. See Koper et al. (2007) for an illustration of how residuals may lead to biased conclusions in ecological research. Standard multiple linear regression (MLR) approaches do not produce biased estimates and take the effects of correlated explanatory variables into account simultaneously. However, multicollinearity is still an issue when using MLR approaches: while the estimates may be unbiased, their variance increases as the correlation between explanatory variables increases. See the endpoint for variation partitioning for descriptions of partial and semi-partial correlation approaches that can address shared variance between explanatory variables.
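The bias described above can be illustrated with a small simulation. The sketch below assumes two correlated explanatory variables, `x1` and `x2`, with true slopes of 1; all names and values are invented for the example. It compares "residual regression" (regress `y` on `x1`, then regress the residuals on `x2`) with a multiple regression that includes both variables at once. For ordinary least squares with two explanatory variables, the residual-regression slope equals the multiple-regression slope shrunk by a factor of 1 - r², where r is the sample correlation between `x1` and `x2`.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Two correlated explanatory variables (simulated; population r = 0.8)
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)

# Response with true slopes of 1 for both variables
y = 1.0 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=n)

# Residual regression: fit y on x1, then regress the residuals on x2
b1, a1 = np.polyfit(x1, y, deg=1)
resid_y = y - (b1 * x1 + a1)
slope_resid, _ = np.polyfit(x2, resid_y, deg=1)

# Multiple regression: both variables in one model
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
slope_mlr = coef[2]

r = np.corrcoef(x1, x2)[0, 1]
print("slope of x2 from residual regression:", slope_resid)
print("slope of x2 from multiple regression:", slope_mlr)
print("multiple-regression slope * (1 - r^2):", slope_mlr * (1 - r**2))
```

The stronger the correlation between the explanatory variables, the more the residual-regression estimate is pulled towards zero, whereas the multiple-regression estimate stays centred on the true value (at the cost of a larger variance).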
Warnings
Analysis software may deliver residuals as "raw" values or as standardised, scaled, or otherwise transformed values. Other types of residuals, such as "leave-one-out" residuals, also exist. Be aware of what kind of values your software delivers and of any consequences and assumptions associated with them.
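To make the distinction concrete, the following sketch computes three kinds of residual for an ordinary least-squares fit using NumPy. It assumes a simple linear model on simulated data. The naming used here ("standardised" for residuals scaled by their estimated standard error, "leave-one-out" for residuals predicted from a model fitted without the point in question) is one common convention, and software packages may use these terms differently.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30

# Simulated data (illustrative only)
x = rng.uniform(0, 10, size=n)
y = 3.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])  # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

raw = y - X @ beta  # raw residuals: observed minus fitted

# Leverages: diagonal of the hat matrix H = X (X'X)^-1 X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

p = X.shape[1]                    # number of fitted parameters
s = np.sqrt(raw @ raw / (n - p))  # residual standard error

# Raw residuals scaled by their estimated standard error
standardised = raw / (s * np.sqrt(1.0 - h))

# Leave-one-out residuals: what each residual would be if the model
# were refitted without that observation (no refitting needed for OLS)
leave_one_out = raw / (1.0 - h)

print(np.column_stack([raw, standardised, leave_one_out])[:5])
```

Leave-one-out residuals are never smaller in magnitude than the corresponding raw residuals, and the difference is largest for high-leverage points, which is one reason the type of residual reported matters when screening for outliers.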
Residuals are sometimes used as a variable in further analyses. The validity of this approach must be considered carefully, as residuals have particular statistical properties. For example, see Garcia-Berthou (2001) and Freckleton (2002).
Figure 1: Bivariate plots with a simple linear model. Residuals (dashed lines) are the 'leftover' variation that the model cannot explain. a) The residuals appear to be randomly distributed, suggesting the model is appropriate; however, data points with large residuals (asterisks) should be examined more closely. b) The residuals appear to be structured, and a linear model is not appropriate to describe these data.
Figure 2: Residual plots showing a) apparently random and b) non-random distributions of residuals. Random distributions indicate that the residual variation is due to random effects in the experiment or observation and that the model has performed reasonably well. If there is structure in the residuals, then the model has failed to capture some sort of variation in the data. An alternative model should be attempted. Autocorrelation in time or space is often a cause of non-random residual distribution.
References
Freckleton RP (2002) On the misuse of residuals in ecology: regression of residuals vs. multiple regression. J Anim Ecol. 71: 542–545.
Garcia-Berthou E (2001) On the misuse of residuals in ecology: testing regression residuals vs. the analysis of covariance. J Anim Ecol. 70: 708–711.
Koper N, Schmiegelow FKA, Merrill EH (2007) Residuals cannot distinguish between ecological effects of habitat amount and fragmentation: implications for the debate. Landsc Ecol. 22: 811–820.