Residuals

The main idea...

Residuals (~ "leftovers") represent the variation that a given model, uni- or multivariate, cannot explain (Figure 1). In other words, a residual is the difference between the observed value of a response variable and the value predicted for it by some model. Far from being uninteresting, the variation left over after an analysis can reveal patterns and structures that indicate whether the analysis method used was appropriate or whether the data need transformation to suit the assumptions of that method (e.g. Figure 2). When using multivariate methods, a matrix of residual values is typically produced after fitting a model.
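The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy and simulated data; the variable names (x, y, Y, residual_matrix) and the simulated coefficients are hypothetical and chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated bivariate data (hypothetical values, for illustration only)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# Fit a simple linear model; polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, deg=1)
predicted = intercept + slope * x

# Residual = observed value minus predicted value
residuals = y - predicted

# Multivariate case: two response variables fitted at once give a residual matrix
Y = np.column_stack([y, 1.0 - 0.3 * x + rng.normal(size=x.size)])
X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept column
B, *_ = np.linalg.lstsq(X, Y, rcond=None)      # one column of coefficients per response
residual_matrix = Y - X @ B                    # same shape as Y (objects x responses)
```

For a least-squares fit with an intercept, the residuals sum to (numerically) zero, so it is their pattern, not their total, that carries diagnostic information.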

Residuals in diagnostics

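Residuals can be analysed as a means of diagnosing a model fit, that is, examining whether a model has satisfactorily summarised the variation in a data set or suffers from some shortcoming. Basic strategies include plotting the residuals to check whether they appear randomly distributed or structured (Figures 1 and 2) and examining data points with large residuals more closely.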

Warnings

As noted by Garcia-Berthou (2001) and Freckleton (2002), the analysis of residuals from multivariable data as a diagnostic tool is justified; however, using such residuals as a target of analysis (i.e. as 'data') when data sets have intercorrelated explanatory variables is generally not valid and will generate parameter estimates biased by the degree of intercorrelation. This also holds true when attempting to "control for" the effects of potentially confounding variables using residual analysis. See Koper et al. (2007) for an illustration of how residuals may lead to biased conclusions in ecological research. Standard multiple linear regression (MLR) approaches do not produce biased estimates because they take the effects of correlated explanatory variables into account simultaneously. However, multicollinearity is still an issue when using MLR approaches: while the estimates may be unbiased, their variance increases with increasing correlation between explanatory variables. See the endpoint for variation partitioning for descriptions of partial and semi-partial correlation approaches that can address shared variance between explanatory variables.
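The bias that arises when residuals are used as 'data' can be illustrated with a small simulation. The sketch below is not taken from the cited papers; it assumes NumPy, two simulated explanatory variables with unit variance and a correlation of about 0.7, and true slopes of 1.0 for both. The helper ols and all variable names are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (X must already hold an intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(42)
n, r = 5000, 0.7                      # sample size and target correlation (assumed)

# Two intercorrelated explanatory variables, both with true slope 1.0
x1 = rng.normal(size=n)
x2 = r * x1 + np.sqrt(1 - r**2) * rng.normal(size=n)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
ones = np.ones(n)

# Residual regression: "control for" x1 first, then regress the residuals on x2
X_step1 = np.column_stack([ones, x1])
resid = y - X_step1 @ ols(X_step1, y)
b2_residual = ols(np.column_stack([ones, x2]), resid)[1]

# Multiple regression: both explanatory variables fitted simultaneously
b2_multiple = ols(np.column_stack([ones, x1, x2]), y)[2]

# Variance inflation factor for two predictors: 1 / (1 - r^2)
r_sample = np.corrcoef(x1, x2)[0, 1]
vif = 1.0 / (1.0 - r_sample**2)

print(f"residual regression slope for x2: {b2_residual:.2f}")
print(f"multiple regression slope for x2: {b2_multiple:.2f}")
print(f"variance inflation factor:        {vif:.2f}")
```

Under these assumptions the residual-regression slope is expected to fall near 1 - r^2 times the true slope, well below 1.0, while the multiple-regression slope should stay close to 1.0. The variance inflation factor shows the cost that remains for MLR: the estimate is unbiased, but its variance grows as the correlation between explanatory variables increases.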

Figure 1: Bivariate plots with a simple linear model. Residuals (dashed lines) are the 'leftover' variation that the model cannot explain. a) The residuals appear to be randomly distributed, suggesting the model is appropriate; however, data points with large residuals (asterisks) should be examined more closely. b) The residuals appear to be structured, and a linear model is not appropriate to describe these data.

Figure 2: Residual plots showing a) apparently random and b) non-random distributions of residuals. Random distributions indicate that the residual variation is due to random effects in the experiment or observation and that the model has performed reasonably well. If there is structure in the residuals, the model has failed to capture some systematic variation in the data and an alternative model should be attempted. Autocorrelation in time or space is often a cause of non-random residual distributions.
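The contrast between random and structured residuals can also be checked numerically. The sketch below assumes NumPy and simulated data, and uses the lag-1 autocorrelation of residuals (ordered along x) as one simple, assumed summary of structure; it does not replace visual inspection of a residual plot. The function names are hypothetical.

```python
import numpy as np

def linear_residuals(x, y):
    """Residuals from a straight-line least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (intercept + slope * x)

def lag1_autocorrelation(values):
    """Correlation between each value and its successor in the given order."""
    return np.corrcoef(values[:-1], values[1:])[0, 1]

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 100)

# a) Data that really are linear: residuals should look random
y_linear = 1.0 + 2.0 * x + rng.normal(size=x.size)

# b) Curved data forced through a linear model: residuals should be structured
y_curved = 1.0 + 0.5 * x**2 + rng.normal(size=x.size)

ac_linear = lag1_autocorrelation(linear_residuals(x, y_linear))
ac_curved = lag1_autocorrelation(linear_residuals(x, y_curved))

print(f"lag-1 autocorrelation, linear data: {ac_linear:.2f}")
print(f"lag-1 autocorrelation, curved data: {ac_curved:.2f}")
```

A value near zero is consistent with randomly distributed residuals, whereas a value close to one indicates that neighbouring residuals share a sign and magnitude, which is the kind of structure that suggests an alternative model is needed.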