Multicollinearity and confounding variables
The main idea...
If a set of variables is multicollinear, they are strongly intercorrelated and, from an information content point of view, largely redundant. This presents issues during both analysis and interpretation, particularly for explanatory variables. Including a set of highly inter-correlated explanatory variables in an analysis is often not advisable. For example, in regression analyses, the information gain associated with including all the variables of a multicollinear (and hence redundant) set is usually not worth the associated inflation of R2 values, which increase with the number of variables regardless of whether those variables are good predictors of the response variable. During interpretation, it is generally not possible to say which of a set of multicollinear explanatory variables explains the variation observed in the response variable.
Detecting multicollinearity
Detecting multicollinearity is typically straightforward. Bivariate correlations between variables reveal strongly inter-correlated groups of variables which should be examined further and handled appropriately (see below). Examining scatter plots is also recommended to ensure that relationships are indeed linear and that correlation coefficients are reasonable.
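The bivariate screening described above can be sketched as follows. This is a minimal illustration on simulated data; the variable names and the 0.7 threshold are assumptions for the example, not fixed rules.

```python
# Flag strongly inter-correlated variable pairs with a pandas correlation matrix.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 2 + rng.normal(scale=0.1, size=100),  # nearly collinear with x1
    "x3": rng.normal(size=100),                      # independent variable
})

corr = df.corr()  # pairwise Pearson correlations

# Report pairs whose absolute correlation exceeds an (assumed) threshold of 0.7
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.7
]
print(pairs)
```

Scatter plots of the flagged pairs (e.g. with `df.plot.scatter`) should then confirm that the relationships are linear before the coefficients are taken at face value.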
In some cases, bivariate correlations may fail to detect multicollinearity involving combinations of several explanatory variables. The absence of strong pairwise correlations thus does not mean that there is no significant multicollinearity. Performing a series of multiple linear regressions (MLRs), using each explanatory variable in turn as the response variable and the rest as explanatory variables in the MLR model, may reveal these subtler forms of multicollinearity. An alternative approach, especially in the context of linear modelling, is to examine variance inflation factors (VIFs). VIFs are based on the proportion of variance in one explanatory variable that can be 'explained' by the remaining explanatory variables in a model. A VIF of "1" suggests no multicollinearity while higher values suggest larger degrees of multicollinearity.
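A VIF calculation can be sketched directly from its definition, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing explanatory variable j on all the others. The simulated data below are an illustrative assumption; in practice, packages such as statsmodels (`variance_inflation_factor`) provide this calculation.

```python
# Minimal VIF sketch using ordinary least squares via numpy.
import numpy as np

def vif(X):
    """Return one VIF per column of X: VIF_j = 1 / (1 - R2_j), where R2_j
    is from regressing column j on the remaining columns plus an intercept."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.2, size=200)  # collinear with x1
x3 = rng.normal(size=200)                  # unrelated variable
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # x1 and x2 show inflated VIFs; x3 stays near 1
```

Rules of thumb for "how high is too high" vary (thresholds of 5 or 10 are often quoted), so VIFs are best read alongside the correlation and scatter plot checks above.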
Figure 1: A principal components analysis (PCA) suggesting some variables are highly correlated (*). The information these variables contain is thus likely to be redundant, and their joint presence in many analyses may lead to distorted results. The multicollinearity observed here may be confirmed by examining bivariate plots and correlation statistics.
Dealing with multicollinearity
There are several ways to deal with multicollinear variables:
Rescaling: rescaling variables through a standardising data transformation may reduce or eliminate scale-dependent multicollinearity (i.e. correlation created by the variation introduced by the scale of the variables rather than by the variation of the variable values themselves).
Deletion: any variables which are deemed less informative or important than the others in a multicollinear group may be deleted. This is usually acceptable during an exploratory exercise with "bulk data". If analysing data from a sampling campaign designed to test specific explanatory variables, deletion should be done with caution and not simply for statistical convenience.
Aggregation: a new variable can be declared which somehow aggregates or combines the multicollinear variables. This may be done by simply declaring a "representative" variable of a group or by using mathematical procedures such as factor analysis. Regardless of which method is used, there should be sound ecological or biological (rather than 'only' statistical) reasoning behind it.
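The aggregation option above can be sketched by collapsing a multicollinear group into its first principal component, used as a single composite variable. The data and variable names here are illustrative assumptions; as noted above, whether such a composite is ecologically meaningful must be judged case by case.

```python
# Aggregate a multicollinear pair into one composite score via PCA (SVD).
import numpy as np

rng = np.random.default_rng(2)
depth = rng.normal(size=50)
group = np.column_stack([
    depth,
    depth * 1.5 + rng.normal(scale=0.1, size=50),  # collinear with depth
])

# Standardise the group, then take scores on the leading principal axis
z = (group - group.mean(axis=0)) / group.std(axis=0)
_, _, vt = np.linalg.svd(z, full_matrices=False)
composite = z @ vt[0]  # one composite variable replaces the collinear pair
```

The composite preserves almost all of the shared variation in the group (its sign is arbitrary), so it can stand in for the original variables in a subsequent regression.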
The (possible) benefits of multicollinearity
While their inclusion in analysis is often suspect, multicollinear variables can be useful when dealing with missing data. Multicollinear variables can act as auxiliary variables. That is, if one is missing a value, the values of its multicollinear partners may be used to predict and impute the missing value using, for example, regression-based single imputation. See the missing data page for more information.
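Regression-based single imputation using a multicollinear partner can be sketched as follows. The simulated variables, the missing-value position, and the simple linear fit are all assumptions made for this illustration.

```python
# Impute a missing value of y from its (collinear) partner x via linear regression.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=60)
y = 3.0 * x + rng.normal(scale=0.1, size=60)  # y is strongly collinear with x
y_obs = y.copy()
y_obs[10] = np.nan                            # simulate one missing value

# Fit y ~ x on the complete cases, then predict the missing entry
mask = ~np.isnan(y_obs)
slope, intercept = np.polyfit(x[mask], y_obs[mask], 1)
y_obs[10] = slope * x[10] + intercept
```

Note that single imputation understates uncertainty in later analyses; the missing data page discusses alternatives such as multiple imputation.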
Implementations
MASAME multicollinearity detection app