Multiple linear regression

The main idea...

Multiple linear regression (MLR) aims to quantify the degree of linear association between one response variable and several explanatory variables (Equation 1; Figure 1). Unlike correlation, regression assumes a directional relationship (often interpreted causally) between the response and explanatory variables. Thus, changes in the explanatory variables can be used to predict changes in the response variable, but not vice versa. Attempting to perform reverse regression is likely to be problematic (see Greene, 1984 for an illustration).

Alternatively, MLR may be used to answer general questions of the kind:

    "Is there a significant, linear relationship between my response variable and one or more of my explanatory variables?"

If you wish to perform MLR on (dis)similarity matrices, consider multiple regression on (dis)similarity matrices (MRM).

Equation 1: The general MLR equation, y = β0 + β1x1 + β2x2 + ... + βnxn + ε. A response variable (y; known as the regressand) is predicted by a number of explanatory variables (x1, x2 ... xn; the regressors). The strength of each regressor's effect on the response variable is determined by the regression coefficients β1 ... βn, while the intercept, β0, is the expected value of y when all regressors are equal to zero. Together, the linear combination of the intercept, the regression coefficients, and the explanatory variables predicts the values of y with some error, ε, which captures the variation in y that the regressors do not explain.

Results and interpretation

Most MLR implementations will deliver coefficient estimates for the intercept and each explanatory variable (along with their p-values), overall goodness-of-fit measures such as the adjusted R2 and the p-value of the F statistic, and diagnostic information about the residuals.

Values of immediate interest are typically the adjusted R2, the p-value of the F statistic, and information about the distribution of residuals. If a significant linear fit has been detected and the residuals are normally distributed and centred on zero, one may then examine the individual coefficients associated with each explanatory variable along with their p-values. Assuming there are significant explanatory variables in the MLR model, values of the response variable may be estimated by multiplying the values of these explanatory variables by their coefficients and summing the results together with the model's intercept value (as in Equation 1). Individual coefficients indicate the expected change in the response variable per unit change in the associated explanatory variable. Standardised coefficients express this relationship in terms of standard deviations.
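To make this concrete, below is a minimal sketch (assuming Python with the pandas and statsmodels libraries; the data set and column names y, x1, and x2 are hypothetical) of how these values might be extracted from a fitted model:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Small, hypothetical data set: one response (y), two regressors.
    df = pd.DataFrame({"y": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2],
                       "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                       "x2": [0.5, 0.4, 0.7, 0.6, 0.9, 0.8]})

    model = smf.ols("y ~ x1 + x2", data=df).fit()

    print(model.rsquared_adj)  # adjusted R2
    print(model.f_pvalue)      # p-value of the overall F statistic
    print(model.params)        # intercept and coefficients (the betas)
    print(model.pvalues)       # p-values of the individual coefficients
    print(model.resid)         # residuals: check they centre on zero

    # Predictions: the sum of coefficient-by-regressor products plus
    # the intercept, as in Equation 1.
    print(model.predict(df))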

Figure 1: General concepts of MLR. a) MLR regresses a response variable (y) on multiple explanatory variables (x1 ... xn). That is, values of x1...xn will be used to predict values of y through a linear function. b) A schematic representation of an MLR involving one response variable (y) and two explanatory variables (x1, x2). As y (the vertical axis) is being regressed on x1 and x2, the vertical distances (dashed lines) between objects (red filled circles; diameter is inversely proportional to distance from the viewer) and a plane-of-best-fit are the residuals of the regression depicted here. "Best fit" is determined by a loss function, usually attempting to minimise the sum of squared residuals. In cases where more explanatory variables are analysed, a hyperplane of best fit is needed.
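The least-squares fit depicted in Figure 1b can also be computed directly. A bare-bones sketch (assuming Python with numpy; the observations are hypothetical) of the coefficients that minimise the sum of squared residuals:

    import numpy as np

    # Hypothetical observations of x1, x2, and y.
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    x2 = np.array([0.5, 0.4, 0.7, 0.6, 0.9, 0.8])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

    # Design matrix with a leading column of ones for the intercept.
    X = np.column_stack([np.ones_like(x1), x1, x2])

    # Coefficients minimising sum((y - X @ beta) ** 2).
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)

    # The residuals are the vertical distances to the fitted plane.
    residuals = y - X @ beta
    print(beta, (residuals ** 2).sum())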

Interactions between explanatory variables

In addition to simple relationships between the response and explanatory variables (Equation 1), the response variable's linear relation to interactions between explanatory variables may also be assessed. An interaction term between explanatory variables x1 and x2, for example, would appear as "β3x1x2" in Equation 1 and is referred to as a second-order interaction term, as the sum of the interacting variables' exponents is two. Specifying interactions may be useful if there is reason to believe that a set of explanatory variables influence one another and hence the fit to the response variable. For example, if the abundance of a zooplankton taxon (y) is thought to be influenced by the abundance of a phytoplankton taxon that is its primary prey (x1), the concentration of nitrate in the water column (x2), and the response of the phytoplankton to the nitrate concentration (x1 × x2), then the MLR model would be: y = β0 + β1x1 + β2x2 + β3x1x2 + ε.

If an interaction term is found to be significant, is deemed substantively meaningful, and you wish to retain it in a model, include all lower-order terms that can be formed from its constituent variables, even if they are not themselves significant. For example, retaining a third-order term x1 × x2 × x3 would require you to retain the second-order terms x1 × x2, x2 × x3, and x1 × x3, and the constituent first-order terms x1, x2, and x3.
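In formula-based implementations this bookkeeping is often handled for you. As a sketch (again assuming Python with statsmodels and the hypothetical data from above), the "*" operator in a model formula expands to the interaction plus all of its lower-order terms:

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({"y": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2],
                       "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                       "x2": [0.5, 0.4, 0.7, 0.6, 0.9, 0.8]})

    # "x1 * x2" expands to x1 + x2 + x1:x2, so the first-order terms are
    # retained alongside the second-order interaction.
    model = smf.ols("y ~ x1 * x2", data=df).fit()
    print(model.params)  # Intercept, x1, x2, x1:x2

    # Fully explicit equivalent:
    same = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()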

Variable selection in MLR

Once an MLR model has been generated and the results examined, explanatory variables may be added or removed to fine-tune the model for a better fit. The addition or removal of variables should be guided by domain knowledge rather than the results of statistical tests, for a range of reasons described in the references below. Users are encouraged to read Whittingham et al. (2006) for an approachable summary. Selection based on the results of statistical model evaluation should be handled with great care: even if a model produces more significant coefficients and higher R2 values, there is no guarantee that it is the best model (or even a realistic model) for your system. The resulting model may simply be very well tuned to your data set, with no generalisability to other, related ecological systems.

Domain knowledge must guide decisions to add or remove variables. However, when there are many variables available or no a priori reasons to believe the inclusion or removal of one explanatory variable is more sensible than that of another (as in data-mining exercises), automated approaches are available to assess which combination of variables offers the best fit (which is not necessarily the best explanation of the response data). Below, forward, backward, and bidirectional selection methods are described.

Warning: As discussed and demonstrated by Derksen & Keselman (1992), Harrell (2001), Whittingham et al. (2006), and many others, stepwise procedures have multiple weaknesses, including biased parameter estimates, over-fitting, and inherent multiple testing. Further, forward selection approaches suitable for redundancy analysis (RDA) are described by Blanchet et al. (2008). Please consult these references before considering stepwise methods.

The following automated procedures involve either adding or removing variables from a model based on the effect those variables have on the overall performance of the model. Performance may be evaluated in several ways, including F-tests, the Akaike or Bayesian information criteria (AIC and BIC, respectively), or Mallows's Cp. Each of these criteria has different properties, so the choice among them should be well informed.
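As an illustration (assuming Python with statsmodels and the hypothetical data above), the AIC and BIC are reported directly by fitted models, while Mallows's Cp is not built in but can be computed from its definition, Cp = SSEp / MSEfull − n + 2p:

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({"y": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2],
                       "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                       "x2": [0.5, 0.4, 0.7, 0.6, 0.9, 0.8]})

    full = smf.ols("y ~ x1 + x2", data=df).fit()
    sub = smf.ols("y ~ x1", data=df).fit()

    # AIC and BIC are attributes of any fitted statsmodels result.
    print(full.aic, full.bic)

    # Mallows's Cp for the sub-model: Cp = SSEp / MSEfull - n + 2p, where
    # MSEfull comes from the model with all explanatory variables and p
    # counts the sub-model's parameters (intercept included).
    n = int(full.nobs)
    p = int(sub.df_model) + 1
    cp = sub.ssr / full.mse_resid - n + 2 * p
    print(cp)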

Forward selection

The forward selection algorithm begins with an empty model and then adds explanatory variables based on the values the chosen criterion would take if each were included in the model. The variable with the 'best' effect on the criterion value is added to the model and the process is repeated. Note that each time a variable is added to the model, the potential criterion values of the remaining variables change. The process terminates when the addition of any further variable would not lead to a sufficiently 'better' criterion value. The sufficiency of improvement is defined by a threshold, which may be set by the user in many implementations.
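A minimal sketch of this algorithm (assuming Python with statsmodels, the AIC as the criterion, and any AIC improvement as the threshold; the function name is hypothetical):

    import statsmodels.formula.api as smf

    def forward_select(df, response, candidates):
        """Greedy forward selection by AIC (lower AIC is better)."""
        remaining = list(candidates)
        selected = []
        # Start from the intercept-only model.
        current_aic = smf.ols(f"{response} ~ 1", data=df).fit().aic
        while remaining:
            # Score each remaining variable as if it were added.
            scores = [(smf.ols(f"{response} ~ {' + '.join(selected + [var])}",
                               data=df).fit().aic, var)
                      for var in remaining]
            best_aic, best_var = min(scores)
            if best_aic >= current_aic:  # no addition improves the criterion
                break
            selected.append(best_var)
            remaining.remove(best_var)
            current_aic = best_aic
        return selected

    # Hypothetical usage: chosen = forward_select(df, "y", ["x1", "x2"])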

Backward selection

The backward selection algorithm begins with a model built from all the available explanatory variables. It then removes the explanatory variable whose exclusion would most improve the value of the chosen criterion, and repeats. The process terminates when further removal would not sufficiently improve the criterion, with the sufficiency of improvement again defined by a threshold that may be set by the user in many implementations.
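The backward analogue, under the same assumptions (statsmodels, AIC as the criterion, any improvement as the threshold, hypothetical function name):

    import statsmodels.formula.api as smf

    def backward_select(df, response, variables):
        """Greedy backward elimination by AIC (lower AIC is better)."""
        selected = list(variables)
        current_aic = smf.ols(f"{response} ~ {' + '.join(selected)}",
                              data=df).fit().aic
        while selected:
            # Score each retained variable as if it were removed.
            scores = []
            for var in selected:
                rest = [v for v in selected if v != var]
                rhs = " + ".join(rest) if rest else "1"
                scores.append((smf.ols(f"{response} ~ {rhs}",
                                       data=df).fit().aic, var))
            best_aic, worst_var = min(scores)
            if best_aic >= current_aic:  # no removal improves the criterion
                break
            selected.remove(worst_var)
            current_aic = best_aic
        return selected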

Bidirectional selection

This approach is a combination of forward and backward selection: at each step, a variable is either added or removed depending on which action would yield the greater improvement in the evaluation criterion.
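A compact sketch combining the two moves (same assumptions as the sketches above): at each step, every single addition and every single removal is scored, and whichever move improves the AIC most is taken:

    import statsmodels.formula.api as smf

    def stepwise_select(df, response, candidates):
        """Bidirectional stepwise selection by AIC (lower AIC is better)."""
        def aic_of(terms):
            rhs = " + ".join(terms) if terms else "1"
            return smf.ols(f"{response} ~ {rhs}", data=df).fit().aic

        remaining, selected = list(candidates), []
        current_aic = aic_of(selected)
        while True:
            # Consider every single addition and every single removal.
            moves = [(aic_of(selected + [v]), "add", v) for v in remaining]
            moves += [(aic_of([s for s in selected if s != v]), "drop", v)
                      for v in selected]
            if not moves:
                break
            best_aic, action, var = min(moves)
            if best_aic >= current_aic:  # no move improves the criterion
                break
            if action == "add":
                remaining.remove(var)
                selected.append(var)
            else:
                selected.remove(var)
                remaining.append(var)
            current_aic = best_aic
        return selected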

Once again, automated selection procedures for adding or removing variables from an MLR model should be treated with caution, and the resulting models should always be earmarked for validation on new data sets.

Key assumptions

Warnings

Implementations

References

Blanchet, F. G., Legendre, P. & Borcard, D. (2008) Forward selection of explanatory variables. Ecology, 89(9), 2623-2632.

Derksen, S. & Keselman, H. J. (1992) Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45(2), 265-282.

Greene, W. H. (1984) Reverse regression: The algebra of discrimination. Journal of Business & Economic Statistics, 2(2), 117-120.

Harrell, F. E. (2001) Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer, New York.

Whittingham, M. J., Stephens, P. A., Bradbury, R. B. & Freckleton, R. P. (2006) Why do we still use stepwise modelling in ecology and behaviour? Journal of Animal Ecology, 75(5), 1182-1189.