Why is the expected value of whatever you've measured the mean of your linear model? Elementary assumptions, my dear Watson! An illustration of why checking that you meet the assumptions of a linear model is important...
For a model where you have factors x_i to explain what you measured, y_i (Google Sites, please incorporate LaTeX or HTML editing...), i.e. y_i given x_i:
E[y_i | X_i = x_i] = E[ beta_0 + beta_1 * x_i + e_i ] substituting the linear model in
= E[beta_0] + E[beta_1 * x_i] + E[e_i]
= beta_0 + beta_1 * x_i + E[e_i] the betas and x_i are not random
= beta_0 + beta_1 * x_i + 0 we assumed e ~ N(0, sigma^2), so E[e_i] = 0
= mu_i i.e. the mean!
Similar reasoning for why Var(y_i | x_i) = sigma^2:
Var[y_i | X_i = x_i] = Var[ beta_0 + beta_1 * x_i + e_i ] substituting the linear model in
= 0 + Var[e_i] beta_0 + beta_1 * x_i is a constant, and the variance of a constant is 0
= sigma^2 we assumed e ~ N(0, sigma^2)
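A minimal simulation sketch of both results (the "true" beta_0 = 1, beta_1 = 2, sigma = 3, and the fixed x are made-up values for illustration): the sample mean of y at a fixed x approaches beta_0 + beta_1 * x, and its variance approaches sigma^2.
set.seed(1)
beta0 <- 1; beta1 <- 2; sigma <- 3          # assumed "true" parameter values
x <- 5                                      # condition on a single fixed x
y <- beta0 + beta1 * x + rnorm(1e5, mean = 0, sd = sigma)   # draws of y_i | x_i = 5
mean(y)   # ~ beta0 + beta1 * x = 11, the conditional mean
var(y)    # ~ sigma^2 = 9, the conditional variance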
Centering or normalizing predictors does not change the interpretation of regression models, nor the RMSE, R-squared, adjusted R-squared, or coefficient p-values -- but it naturally affects estimation (the coefficient values themselves).
NOTE: Scaling or normalizing the response/dependent variable is unnecessary because we are interested in comparing the weights of the independent variables.
You should center or normalize
for interpretability of model estimates/predictions: when 0 has no meaning for a predictor, or when you want predictions expressed in terms of distance from a meaningful reference (e.g. no one weighs 0 kg)
when the independent continuous variables are on different scales, which would bias certain variables toward contributing more to the analysis; rescaling gives them comparable ranges and variances
to reduce collinearity introduced by higher-order terms (centering alone is enough) -- check that the VIFs drop below 5; see the sketch after this list
why this works... (to do: find a geometric explanation for centering and orthogonalizing, and for how centering has the same effect as normalizing)
example of when multicollinearity is induced
when your response is nonlinear/you're modeling with quadratic terms
illustration in this forum post of how centering can affect estimation in linear regression with R's lm even when there is no interaction
when you have interaction terms
when you use gradient descent to fit your model
standardizing covariates may speed up convergence (otherwise certain variables may inappropriately dominate the gradient).
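A minimal sketch of the structural-collinearity point above (simulated data; all names and values are made up): adding a squared term makes x and x^2 highly collinear, and centering x first brings the VIFs back down near 1.
library(car)                       # for vif()
set.seed(42)
x <- runif(100, 10, 20)            # a predictor far from 0
y <- 3 + 0.5 * x + 0.2 * x^2 + rnorm(100)
vif(lm(y ~ x + I(x^2)))            # raw x: huge VIFs (structural collinearity)
xc <- x - mean(x)                  # center the predictor first
vif(lm(y ~ xc + I(xc^2)))          # centered x: VIFs near 1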
Check out Camelia Simoiu and Jim Savage's informative explanation of scaling and its effect on estimation (and what to do when you have scaling probs) on RPubs or AFNI's wonderfully explanatory section on centering variables in their documentation.
Fun fact: X'X of the centered and standardized X is the correlation matrix (up to a factor of n - 1: X'X / (n - 1) = cor(X)).
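A quick check of this in R, using a few mtcars columns as an assumed example:
X <- as.matrix(mtcars[, c("mpg", "hp", "wt")])
Xs <- scale(X)                     # center and standardize each column
crossprod(Xs) / (nrow(Xs) - 1)     # X'X / (n - 1) ...
cor(X)                             # ... matches the correlation matrix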
How do you center? Up to you! Well, it has to make sense. You can choose the grand mean across everyone and all the levels of a factor, or you can choose the mean of a reference group. Some options, as listed on Listen Data:
z-score = (x - mean) / stdDev
rescale an original variable to have a mean of zero and standard deviation of one
R:
X.scaled <- scale(X, center= TRUE, scale=TRUE)
colMeans(X.scaled) ## should all be 0
var(X.scaled) ## diagonal should all be 1
Min-Max Scaling / 0-1 scaling = (x - min(x)) / (max(x) - min(x))
scaled values lie between 0 and 1
R:
library(dplyr)
mins <- as.numeric(summarise_all(X, min))
rng <- as.numeric(summarise_all(X, function(x) diff(range(x))))
X.scaled <- data.frame(scale(X, center = mins, scale = rng))
summarise_all(X.scaled, list(min = min, max = max)) ## check -- min = 0, max = 1
Standard Deviation Method = x/stdev(x)
results in equal variance but different means and ranges.
R:
X.scaled <- data.frame(scale(X, center = FALSE, scale = apply(X, 2, sd, na.rm = TRUE)))
summarise_all(X.scaled, var) ## check: all variances should equal 1
Range Method = x /(max(x) - min(x))
ranges more similar but means, variances, and ranges of the variables are different
R:
library(dplyr)
rng <- as.numeric(summarise_all(X, function(x) diff(range(x))))
X.scaled <- data.frame(scale(X, center = FALSE, scale = rng))
summarise_all(X.scaled, var) ## ranges become comparable, but means and variances still differ
summarized from responses to this stackexchange Q.
When it's helpful to transform (usually with some nonlinear function like taking the log) the dependent variable:
residuals have a skewed distribution and transforming makes the residuals more symmetrically distributed about zero
transforming removes systematic changes in the spread of the residuals with the dependent variable (i.e. turns heteroscedastic residuals into homoscedastic ones)
linearize a relationship to do linear modeling
to simplify the model, e.g. simplify interaction terms or how many there are
for interpretation/"convenience"
e.g. if you log both your dependent (Y) and independent (X) variable(s), your regression coefficients (b) become elasticities, and the interpretations are (also see UCLA's FAQ on this and their more detailed page on interpreting log transforms):
Y and X: a one unit increase in X leads to a b unit increase/decrease in Y
log(Y) and log(X): a 1% increase in X leads to roughly a b% increase/decrease in Y
log(Y) and X: a one unit increase in X leads to roughly a b*100% increase/decrease in Y
Y and log(X): a 1% increase in X leads to roughly a b/100 unit increase/decrease in Y
theoretical reasons to do so, e.g.
pH concentration is on a logarithmic scale
the Cobb-Douglas production function which explains how inputs turn into outputs:
Y = A * L^alpha * K ^beta, where
Y = total production of some entity
A = total factor productivity
L = labor input
K = capital input
alpha, beta = output elasticities
Taking the log of this makes the function easy to estimate using OLS linear regr:
log(Y) = log(A) + alpha * log(L) + beta * log(K)
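A minimal simulation sketch of this (the values for A, alpha, and beta are made up): after logging both sides, lm() recovers the elasticities as ordinary regression coefficients.
set.seed(7)
n <- 500
L <- runif(n, 1, 100)                               # labor input
K <- runif(n, 1, 100)                               # capital input
A <- 2; alpha <- 0.6; beta <- 0.3                   # assumed "true" parameters
Y <- A * L^alpha * K^beta * exp(rnorm(n, 0, 0.1))   # multiplicative error
fit <- lm(log(Y) ~ log(L) + log(K))
coef(fit)   # intercept ~ log(A); slopes ~ alpha and beta (the output elasticities)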
What sort of transform? Check out John Tukey's book Exploratory Data Analysis where he gives some quantitative ways to estimate the transformation based on rank statistics on the residuals
use logarithmic when
residuals are strongly positively skewed (taking the log can symmetrize them)
when the SD of the residuals is directly proportional to the fitted values and not to some power of the fitted values
when the relationship is close to exponential
when errors accumulate multiplicatively
you want a model in which marginal changes in the explanatory variables are interpreted as multiplicative (percentage) changes in the dependent variable
use inversion when
variable has a negative skew (like Likert scales)
Bad reasons to transform:
to make bad data well-behaved, e.g. to make outliers not look like outliers
First get a scientifically valid and statistically sound description of the data -- then look at the outliers. Otherwise you are letting the occasional outlier determine how to look at the rest of the data.
because all the data are positive (for counted data, a square-root transform actually works best)
to be able to plot the data
https://www.researchgate.net/post/When-is-it-correct-to-omit-a-random-factor
https://stats.stackexchange.com/questions/58220/what-distribution-does-my-data-follow
fitdistrplus::descdist() to graphically help choose likely distribution candidates
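A minimal usage sketch of descdist() (the vector y here is simulated stand-in data):
library(fitdistrplus)
set.seed(3)
y <- rgamma(200, shape = 2, rate = 1)   # stand-in data for illustration
descdist(y, boot = 100)                 # Cullen & Frey graph: skewness vs. kurtosis of y,
                                        # bootstrapped values, and common candidate distributions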
Default R regression output isn't good at naming the levels of your factors. If you used the default dummy/treatment coding scheme, the intercept corresponds to the reference level of factorX -- which by default is the first level alphabetically -- with all other factors at their reference levels, too. factorX1 denotes the next level (usually the next alphabetically), factorX2 the level after that, etc. For example, in the iris dataset, iris$Species = {setosa, versicolor, virginica}, so with R's default dummy coding setosa is the reference level absorbed into the intercept.
Package car does a better job. When you set the contrast coding scheme, use car's functions: contr.Treatment() instead of the base function contr.treatment(), contr.Sum() instead of contr.sum(), e.g.
library("car")
contrasts(iris$Species) <- contr.Treatment(levels(iris$Species))
This will label the parameter estimates in the results table with text-based level names rather than 1, 2, ....
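A small illustration of the difference with sum-to-zero coding on the iris data (the exact coefficient labels may vary by car version):
library(car)
ir <- iris                                                   # work on a copy
contrasts(ir$Species) <- contr.sum(3)                        # base R: coefficients come out as Species1, Species2
coef(lm(Sepal.Length ~ Species, data = ir))
contrasts(ir$Species) <- contr.Sum(levels(ir$Species))       # car: coefficients labeled with the level names
coef(lm(Sepal.Length ~ Species, data = ir))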
These terms refer to different things depending on the field, as summarized on Stephanie Glen's StatisticsHowTo.com.
Summary of the differing definitions: https://statmodeling.stat.columbia.edu/2005/01/25/why_i_dont_use/
Notes from Jonathan Templin's U of Georgia ERSH 8310 class and forum posts on stats.stackexchange.com
ANOVA is a regression model in which the dependent variable is quantitative but all the explanatory variables are dummies (qualitative):
simple linear regression vs. multiple linear regression vs. mixed regression models vs. generalized linear models
Zuur
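A minimal sketch of the ANOVA-as-regression point above, using the built-in iris data: the classical one-way ANOVA and the dummy-coded regression give the same F test.
fit.aov <- aov(Sepal.Length ~ Species, data = iris)   # classical one-way ANOVA
fit.reg <- lm(Sepal.Length ~ Species, data = iris)    # regression on dummy-coded Species
summary(fit.aov)   # F statistic and p-value ...
anova(fit.reg)     # ... identical to the regression ANOVA table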
Increasing the strictness of assumptions leads to stronger results -- check out Ben's response to this CrossValidated post Where does the misconception that Y must be normally distributed come from?
Y values can be expressed as a LINEAR FUNCTION of the x var
to test: residuals vs. fitted plot -- look for systematic curvature (a P-P plot checks distributional fit, not linearity)
Y-values (or errors) are INDEPENDENT (no autocorrelation)
to test:
library(car)
fit.lm <- lm(y ~ x, data = data)
durbinWatsonTest(fit.lm, alternative = "two.sided") # H0: errors are not autocorrelated (i.e. independent)
the predictor variables are not correlated (not multicollinear)
Two types of multicollinearity:
data collinearity: the multicollinearity is in the data itself
structural collinearity: multicollinearity due to how the model is specified (e.g. if you square a term, X and X^2 will be correlated)
Why it's a problem
coefficient estimates become oversensitive to small changes in the data and to which other predictors are included
model results are less reliable: multicollinearity reduces the precision of the estimated coefficients and makes the p-values untrustworthy
makes it hard to interpret coefficients
tests
variance inflation factors (VIF; quantifies how strongly each independent variable is correlated with the other predictors and how much that inflates the variance of its coefficient; VIF ranges over [1, inf)) -- see the sketch after this list
VIF > 5 = critical
to fix:
center the independent variables
remove some of the highly correlated indep variables
partial least squares regression (builds a small set of uncorrelated components to regress on)
LASSO or ridge regression (forms of regression analysis that can handle multicollinearity)
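A minimal sketch of checking VIFs (simulated data with two deliberately correlated predictors; the variable names are made up):
library(car)                      # for vif()
set.seed(123)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)     # x2 is nearly a copy of x1 (data collinearity)
x3 <- rnorm(n)
y <- 1 + 2 * x1 + 0.5 * x3 + rnorm(n)
vif(lm(y ~ x1 + x2 + x3))         # x1 and x2 have VIFs far above 5
vif(lm(y ~ x1 + x3))              # dropping x2 brings the VIFs back near 1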
HOMOSKEDASTICITY: Variation of observations around the regression line (the residual SE) is constant, i.e. variance is not a function of x
Skedasticity characterizes the relationship between residual size and the predicted values
to test:
scatter plot
residuals (errors) vs. fitted plot
plot(fit.lm, which = 1)   # residuals vs. fitted
# alternatively
plot(x, resid(fit.lm))
abline(h = 0)
interpretation:
good: should look like a cloud of points
bad: looks like a megaphone (variance changes with x), is curved (indicates nonlinearity), or shows any other systematic pattern
Levene test for equal variance (H0 = equal variances)
leveneTest(y ~ x1 * x2, data = data)   # from package car; x1 and x2 must be factors
Shapiro-Wilk normality test of the residuals (H0 = normally distributed)
shapiro.test(residuals(fit.lm))
note: There can be heteroskedasticity on predicted values even if the residuals are normally distributed.
CONDITIONAL NORMALITY: For a given x, y values (or the error) are normally distributed
i.e. Y | X ~ N(beta_0 + beta_1 * X, sigma_e), or in Stan:
model {
  vector[n_obs] yhat;
  for (i in 1:n_obs) {
    yhat[i] = beta[1] + beta[2] * x1[i] + beta[3] * x2[i];
  }
  y ~ normal(yhat, sigma);
}
This does not
characterize how the residuals relate to anything else
mean that the marginal distribution of Y must be normally distributed
to test:
histogram
normal Q-Q plot: standardized residuals (ordered, observed residuals) vs. theoretical quantiles (ordered, theoretical residuals)
par( mfrow = c(2,2) )
plot(fit.lm)
Kolmogorov-Smirnov test
Shapiro-Wilk test
kurtosis
distribution skew
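A minimal sketch of these checks on the residuals of the fit.lm from above (skewness() and kurtosis() here come from the e1071 package, chosen as one option among several):
library(e1071)                          # for skewness() and kurtosis()
r <- residuals(fit.lm)
hist(r)                                 # histogram of residuals
qqnorm(r); qqline(r)                    # normal Q-Q plot
shapiro.test(r)                         # Shapiro-Wilk (H0: normally distributed)
ks.test(as.numeric(scale(r)), "pnorm")  # Kolmogorov-Smirnov vs. standard normal (approximate, since mean/sd are estimated)
skewness(r)                             # ~0 for symmetric residuals
kurtosis(r)                             # excess kurtosis, ~0 for normal residuals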
Check assumptions
Check for bias in coefficient estimates/predictions
Residual vs. predicted/fitted plot
plot(fitted(lmMod), resid(lmMod)); abline(h = 0)
# or plot the raw data with the fitted line
plot(data); abline(lmMod)  # add the regression line
Check goodness-of-fit statistics
R-squared = explained_variation / total_variation
[0%, 100%]
measure of the spread of points around a regression line estimated from a given sample, i.e. how much variability in the response variable has been explained by the regression equation fitted from a given sample
does not indicate whether model is adequate
low R-squared does NOT automatically mean bad model
can get high R-squared for bad model fit
but it is often true that a high R-squared goes along with coefficient estimates that are large relative to their standard errors
useful when comparing two different models with the same response variable but different predictor variables
Check for outliers
For example, plot Cook's distances (large values indicate outliers)
plot(cooks.distance(lmMod))
Check that standard errors (estimates of variance of regression coefficients) across a sample are small relative to the coefficients
Nice explanation of diagnosing models from residual plots and problem fixes from qualtrics.XM
Create productivity-increasing, fancy residual vs. fitted plots following blogR's Simon Jackson (@drsimonj)
Interpreting
summary(lmMod)
R regression outputs are divided into
Call (what formula was used to fit model)
Residuals
can be manually calculated with
summary(data$depVar - lmMod$fitted.values)
check for
Median ~= 0 => residuals are symmetric / the model predicts the high and low ends of the data range equally well
if median < 0 => residuals are right-skewed, so the model does not predict the higher end of the data range as well as the lower end / the Q-Q plot points for the higher end will not fall on the diagonal
Coefficients
for simple regression outputs, the table reports, for each independent variable, the estimate, standard error, t-statistic (= estimate / std. error), and p-value (Pr(>|t|), based on the t-distribution)
intercept = predicted mean when all factors are at their reference levels (and any continuous predictors are at 0)
each coefficient is the change in the dependent variable per unit change of that independent variable (or per change of category relative to the reference level, for a dummy-coded categorical variable)
if a categorical variable, read the coefficient name by prepending "is" (e.g. "is versicolor")
Residual standard error
how well the model fits the data (how much estimates deviate from the real data points)
Multiple R-squared and Adjusted R-squared
what percentage of the variation in the dependent variable the independent variables explain
adjusted R-squared adds a penalty for additional predictors and is favored in multiple regression models because plain R-squared always increases as more variables are added to the model
PITFALL: a high R-squared does not mean the finding is more important (nor does a low R-squared mean it is unimportant)
F-statistic and p-value
for the overall model -- the null hypothesis is that the coefficients of all of the variables in the model are zero
So far from Statistics by Jim
Coefficient values are tied to the units of their variables and to whether those variables are centered/standardized, so a small coefficient value says nothing about how important the variable is.
However, coefficients of standardized variables, which are unitless, can be used like standardized effect sizes because they indicate the strength of the relationship between variables. When all continuous variables are standardized, the one with the largest absolute coefficient is the variable that explains the most variance in the response (see the sketch below).
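A minimal sketch of comparing standardized coefficients, using mtcars as an assumed example:
d <- data.frame(scale(mtcars[, c("mpg", "wt", "hp", "qsec")]))   # standardize response and predictors
fit <- lm(mpg ~ wt + hp + qsec, data = d)
sort(abs(coef(fit)[-1]), decreasing = TRUE)   # largest |standardized coefficient| = strongest relationship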
The heading says it all.
R-squared ( = explained variation of the model / total variation of the model) reflects the goodness-of-fit of the model. Therefore, the change in R-squared when a given independent variable is added represents the unique portion of the goodness-of-fit attributable to that variable, given the others already in the model (see the sketch below).
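A minimal sketch of this, again using mtcars as an assumed example:
r2_base <- summary(lm(mpg ~ wt, data = mtcars))$r.squared
r2_full <- summary(lm(mpg ~ wt + hp, data = mtcars))$r.squared
r2_full - r2_base   # R-squared uniquely added by hp, given that wt is already in the model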
contrast types = coding schemes = parameterization styles
Likert scale data are ordinal, discrete data with a limited range and cannot necessarily be treated as interval data, which would require the points to be equally spaced. There is no consensus on how to analyze them: some say ANOVA is robust enough to handle them, while others say that's a big no-no and that a non-parametric equivalent like Mann-Whitney or Kruskal-Wallis is more appropriate (although these also assume the variables have continuous distribution functions, like ANOVA). Similarly, some say the data can be treated as normally distributed and a standard linear regression is fine, whereas others advocate an ordered logistic regression or multinomial logistic regression as more appropriate.
de Winter and Dodou (2010, Practical Assessment, Research & Evaluation) used simulations to compare the 2-sample t-test and Mann-Whitney on Likert data and found that they produce comparable false positive rates.
Some discussion on P.Mean, a sample comparison of approaches on Salvatore Mangiafico's rcompanion.org, numerous discussions on StackExchange including this one, and using ANOVA with Likert scale data
Andrew Lapointe's authoR Reproducible Science for Busy Researchers, with loads of experience-laden tips about data organization and analysis, maximizing R, data exploration, and mixed regression modeling.