Multiple Linear Regression

[under construction]

The notes below are preliminary thoughts on multiple linear regression, based on a note written to a friend on the topic. They are by no means comprehensive, but might provide some initial direction. More will be added as time permits.

A note on terminology: multiple linear regression (one response variable, several explanatory variables) is sometimes loosely called multivariate regression, but it is a bit different from true multivariate methods, where there is more than one response variable (e.g., factor analysis, principal components analysis, repeated measures ANOVA, multivariate ANOVA, canonical correlation, etc.).

Here are some thoughts:

First, be aware that linear regression (here, multiple linear regression) assumes the relationship between the response and the explanatory variables is additive. This is not an unusual assumption, but there are other types of models that are sometimes more appropriate, depending on what is being modeled.

Note: "linear" in the term linear regression has little to do with straight lines. Instead, it means that the model describing the expected value of the response variable is "linear in the betas". For more info, see:

www.statwiki.net/main_page/methods/regression/regression-supporting-concepts/definitions-of-linear
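As a quick illustration (a sketch, not taken from the linked page): the first two models below are linear in the betas even though their graphs are curved; the third is not, because a beta appears inside the exponential.

```latex
% Linear in the betas: each beta enters the mean function additively
% and to the first power, even though the x's may be transformed.
\[
E(Y) = \beta_0 + \beta_1 x + \beta_2 x^2
\qquad\text{and}\qquad
E(Y) = \beta_0 + \beta_1 \log x
\]
% NOT linear in the betas: \beta_1 sits inside the exponential.
\[
E(Y) = \beta_0 \, e^{\beta_1 x}
\]
```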

When using regression methods, checking for relationships between explanatory variables is a good thing. Including two or more highly correlated explanatory variables in the same regression model is known as 'multicollinearity'. It can cause the matrix math in the fitting algorithm to break down: with perfectly correlated variables the matrix to be inverted is singular (its determinant is zero), and with highly but not perfectly correlated variables the coefficient estimates become unstable, with inflated standard errors.

A diagnostic check for multicollinearity is VIF (Variance Inflation Factor). Look it up and use it.
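Here is a minimal sketch of a VIF check using Python's statsmodels; the data and the column names P1, P2, P3 are hypothetical. A common rule of thumb flags VIF values above about 5 or 10.

```python
# Minimal VIF check with statsmodels; the data and the column
# names P1, P2, P3 are hypothetical placeholders.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

rng = np.random.default_rng(0)
n = 200
P1 = rng.normal(size=n)
P2 = P1 + rng.normal(scale=0.1, size=n)  # nearly collinear with P1
P3 = rng.normal(size=n)
X = add_constant(pd.DataFrame({"P1": P1, "P2": P2, "P3": P3}))

# VIF for each explanatory variable (skip the constant column).
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    print(name, variance_inflation_factor(X.values, i))
# P1 and P2 should show very large VIFs here; P3 should be near 1.
```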

Selecting explanatory variables:

There are model selection routines that can be helpful, but they are also easily misused, so be careful. These include stepwise selection, forward selection, backward elimination, best subsets, and some others. Most statistical packages offer these.
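As one illustration, here is a rough sketch of a best-subsets search scored by AIC using Python's statsmodels. The data frame and column names are hypothetical, and the brute-force loop is only practical for a small number of candidate variables; it is not a substitute for thinking about which variables belong in the model.

```python
# Rough best-subsets sketch scored by AIC (hypothetical data/columns).
# Brute force over all subsets -- only practical for few variables.
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["P1", "P2", "P3", "P4"])
df["y"] = 2.0 * df["P1"] - 1.5 * df["P3"] + rng.normal(size=100)

candidates = ["P1", "P2", "P3", "P4"]
results = []
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        formula = "y ~ " + " + ".join(subset)
        fit = smf.ols(formula, data=df).fit()
        results.append((fit.aic, formula))

results.sort()  # lowest AIC first
for aic, formula in results[:3]:
    print(f"AIC={aic:.1f}  {formula}")
```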

There are several metrics for model fit. Most people check only R-squared, but this is just one indicator, and it is not always the best indicator of model fit. The best way to check model fit is to check the model assumptions, which means examining the model residuals for several things.

For ANY statistical model, including regression models, you NEED TO CHECK model assumptions. This means several things, but a big part of this is checking model residuals. For regression see:

https://sites.google.com/a/crlstatistics.net/crlstatwiki/main_page/methods/regression/linear-regression

and

https://sites.google.com/a/crlstatistics.net/crlstatwiki/main_page/methods/regression/linear-regression/regression-diagnostics---residuals
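As a starting point, here is a minimal sketch of two standard residual checks (hypothetical data; the linked pages go into more depth): a residuals-versus-fitted plot for non-linearity and non-constant variance, and a normal Q-Q plot of the residuals.

```python
# Minimal residual-diagnostics sketch with statsmodels and matplotlib
# (hypothetical data; see the linked pages for fuller guidance).
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
df = pd.DataFrame({"P1": rng.normal(size=100), "P2": rng.normal(size=100)})
df["y"] = 1.0 + 0.5 * df["P1"] - 0.8 * df["P2"] + rng.normal(size=100)

fit = smf.ols("y ~ P1 + P2", data=df).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs fitted: look for curvature (non-linearity) and
# funnel shapes (non-constant variance).
ax1.scatter(fit.fittedvalues, fit.resid)
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")

# Normal Q-Q plot: points should fall near the reference line
# if the residuals are approximately normal.
sm.qqplot(fit.resid, line="45", fit=True, ax=ax2)

plt.tight_layout()
plt.show()
```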

Also, remember to check the residuals for autocorrelation, which is common when observations are ordered in time. If autocorrelation is found, consider using a time series method instead.
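A common quick check is the Durbin-Watson statistic on the residuals; values near 2 suggest little first-order autocorrelation, while values well below 2 suggest positive autocorrelation. A minimal sketch with hypothetical data:

```python
# Durbin-Watson check for residual autocorrelation (hypothetical data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
t = np.arange(120)
# Random-walk errors give strongly autocorrelated residuals.
df = pd.DataFrame({"t": t, "y": 0.1 * t + np.cumsum(rng.normal(size=120))})

fit = smf.ols("y ~ t", data=df).fit()
print(durbin_watson(fit.resid))  # well below 2 here: positive autocorrelation
```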

Also keep in mind that interaction terms are sometimes important. If the explanatory variables are P1, P2, P3, P4, P5 (etc.), then an interaction term could be the product P2*P5.
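In the formula notation used by Python's statsmodels (and R), P2*P5 expands to the two main effects plus their product, while P2:P5 denotes the product term alone. A sketch with hypothetical data:

```python
# Interaction terms in statsmodels formula notation (hypothetical data).
# "P2*P5" expands to P2 + P5 + P2:P5; "P2:P5" is the product term alone.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(100, 2)), columns=["P2", "P5"])
df["y"] = 1.0 + df["P2"] + df["P5"] + 2.0 * df["P2"] * df["P5"] + rng.normal(size=100)

fit = smf.ols("y ~ P2 * P5", data=df).fit()
print(fit.params)  # includes an estimate for the P2:P5 interaction
```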

Explanatory variables can be continuous or categorical. If you have a mix, you may want to read this:

https://sites.google.com/a/crlstatistics.net/crlstatwiki/main_page/methods/regression/regression-supporting-concepts/qualitative-variables

and this

https://sites.google.com/a/crlstatistics.net/crlstatwiki/main_page/methods/regression/linear-regression/regression-equality-of-slopes-model
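As a sketch of mixing continuous and categorical explanatory variables (hypothetical data and names): in statsmodels formula notation, wrapping a variable in C() dummy-codes it automatically. Adding an interaction between the continuous variable and the categorical one would allow separate slopes per group, as in the equality-of-slopes model covered by the second link.

```python
# Mixing continuous and categorical explanatory variables (hypothetical
# data). C(group) tells the formula machinery to dummy-code 'group'.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 150
df = pd.DataFrame({
    "P1": rng.normal(size=n),
    "group": rng.choice(["a", "b", "c"], size=n),
})
offsets = {"a": 0.0, "b": 1.0, "c": -1.0}
df["y"] = 0.7 * df["P1"] + df["group"].map(offsets) + rng.normal(size=n)

# Common slope, separate intercepts per group; adding P1:C(group)
# would allow separate slopes as well.
fit = smf.ols("y ~ P1 + C(group)", data=df).fit()
print(fit.params)
```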