Multicollinearity and confounding variables

The main idea...

If a set of variables is multicollinear, the variables are strongly inter-correlated and, from an information-content point of view, largely redundant. This presents issues during both analysis and interpretation, particularly for explanatory variables, and including a set of highly inter-correlated explanatory variables in an analysis is often inadvisable. In regression analyses, for example, the information gain associated with including an entire set of multicollinear (and hence redundant) variables is usually not worth the associated inflation of R2 values, which increase with the number of variables regardless of whether those variables are good predictors of the response variable. During interpretation, it is generally not possible to say which of a set of multicollinear explanatory variables explains any variation observed in a response variable: the data cannot distinguish their individual contributions.
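
To make the R2 inflation concrete, below is a minimal sketch in Python, assuming simulated data, illustrative variable names, and the statsmodels library (all assumptions of this example, not part of the text above). A near-copy of an informative predictor is added to a regression; R2 cannot decrease, even though the new variable carries no new information, while the adjusted R2 penalises the redundancy:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 200
    x1 = rng.normal(size=n)                    # an informative predictor
    x2 = x1 + rng.normal(scale=0.05, size=n)   # a near-copy of x1 (multicollinear, redundant)
    y = 2.0 * x1 + rng.normal(size=n)          # the response depends on x1 only

    m1 = sm.OLS(y, sm.add_constant(x1)).fit()
    m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

    print(f"R2 with x1 only:            {m1.rsquared:.4f}")
    print(f"R2 with x1 and x2:          {m2.rsquared:.4f}  (never lower)")
    print(f"Adjusted R2 with x1 only:   {m1.rsquared_adj:.4f}")
    print(f"Adjusted R2 with x1 and x2: {m2.rsquared_adj:.4f}")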

Detecting multicollinearity

Detecting multicollinearity is typically straightforward. Bivariate correlations between variables reveal strongly inter-correlated groups, which should then be examined further and handled appropriately (see below). Examining scatter plots is also recommended, to ensure that the relationships are indeed linear and that the correlation coefficients are therefore reasonable summaries of them.
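
A minimal sketch of such a bivariate screening step, assuming a pandas DataFrame of explanatory variables and an illustrative correlation threshold of 0.7 (both assumptions of this example):

    import numpy as np
    import pandas as pd

    def flag_correlated_pairs(X, threshold=0.7):
        """Return (name, name, |r|) for column pairs whose absolute Pearson
        correlation exceeds the threshold."""
        corr = X.corr().abs()
        cols = corr.columns
        return [(cols[i], cols[j], corr.iloc[i, j])
                for i in range(len(cols))
                for j in range(i + 1, len(cols))
                if corr.iloc[i, j] > threshold]

    # Illustrative data: x2 is a noisy copy of x1, x3 is independent
    rng = np.random.default_rng(1)
    X = pd.DataFrame({"x1": rng.normal(size=100), "x3": rng.normal(size=100)})
    X["x2"] = X["x1"] + rng.normal(scale=0.1, size=100)

    for a, b, r in flag_correlated_pairs(X):
        print(f"|r({a}, {b})| = {r:.2f}")   # flags the (x1, x2) pair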

In some cases, bivariate screening will detect no significant pairwise correlations between explanatory variables. This does not mean that there is no significant multicollinearity: a variable may be well predicted by a combination of several other variables without being strongly correlated with any single one of them. Performing a series of multiple linear regressions (MLRs), using each explanatory variable in turn as the response variable and the rest as explanatory variables in the MLR model, may reveal these subtler forms of multicollinearity. An alternative approach, especially in the context of linear modelling, is to examine variance inflation factors (VIFs). The VIF of an explanatory variable is derived from the proportion of its variance that can be 'explained' by all the other explanatory variables in a model: if R2 is the coefficient of determination from regressing that variable on the others, its VIF is 1 / (1 − R2). A VIF of 1 indicates no multicollinearity, while progressively higher values indicate stronger multicollinearity; a common rule of thumb treats VIFs above 5 or 10 as signs of problematic multicollinearity.
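
A minimal sketch of a VIF calculation, using the variance_inflation_factor function from the statsmodels library on an illustrative, simulated DataFrame of explanatory variables (the data and variable names are assumptions of this example):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(2)
    X = pd.DataFrame({"x1": rng.normal(size=100), "x3": rng.normal(size=100)})
    X["x2"] = X["x1"] + rng.normal(scale=0.1, size=100)   # multicollinear with x1

    # Include an intercept so each auxiliary regression has a constant term
    exog = sm.add_constant(X)
    for i, name in enumerate(exog.columns):
        if name != "const":
            print(f"VIF({name}) = {variance_inflation_factor(exog.values, i):.1f}")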

Figure 1: A principal components analysis (PCA) suggesting that some variables are highly correlated (*). The information these variables contain is thus likely to be redundant, and their joint presence in many analyses may lead to distorted results. The multicollinearity suggested here should be confirmed by examining bivariate plots and correlation statistics.

Dealing with multicollinearity

There are several ways to deal with multicollinear variables; the most common are listed below, and a brief sketch of the first two follows the list:

- Retain only one variable from each multicollinear set (typically the one that is easiest to measure, most complete, or most interpretable) and exclude the rest from the analysis.
- Combine each multicollinear set into a single composite variable, for example by replacing the set with its first principal component.
- Use analysis methods that tolerate multicollinearity, such as regularised regression (e.g. ridge regression).
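
A minimal sketch of the first two remedies, assuming simulated data, illustrative variable names, and the scikit-learn library for the PCA step (all assumptions of this example):

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(3)
    X = pd.DataFrame({"x1": rng.normal(size=100), "x3": rng.normal(size=100)})
    X["x2"] = X["x1"] + rng.normal(scale=0.1, size=100)   # multicollinear with x1

    # Remedy 1: keep one representative of the multicollinear set
    X_reduced = X.drop(columns=["x2"])

    # Remedy 2: replace the set with its first principal component
    # (standardise the variables first if they are on different scales)
    pc1 = PCA(n_components=1).fit_transform(X[["x1", "x2"]])
    X_combined = X.drop(columns=["x1", "x2"]).assign(x1_x2_pc1=pc1.ravel())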

The (possible) benefits of multicollinearity

While their inclusion in an analysis is often suspect, multicollinear variables can be useful when dealing with missing data by acting as auxiliary variables. That is, if a value of one variable is missing, the corresponding values of the variables it is collinear with may be used to predict and impute the missing value using, for example, regression-based single imputation, sketched below. See the missing data page for more information.
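
A minimal sketch of regression-based single imputation with a multicollinear auxiliary variable, assuming simulated data and illustrative variable names, with statsmodels used to fit the imputation regression (all assumptions of this example):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(scale=0.1, size=100)   # auxiliary variable, collinear with x1
    x1[::10] = np.nan                           # some x1 values are missing

    df = pd.DataFrame({"x1": x1, "x2": x2})
    observed = df.dropna()

    # Regress the gappy variable on its auxiliary, then predict the gaps
    model = sm.OLS(observed["x1"], sm.add_constant(observed["x2"])).fit()
    missing = df["x1"].isna()
    df.loc[missing, "x1"] = model.predict(sm.add_constant(df.loc[missing, "x2"]))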

Implementations

MASAME multicollinearity detection app

    Click here to launch...