Why is the expected value of whatever you've measured the mean of your linear model? Elementary assumptions, my dear Watson! An illustration of why checking that you meet the assumptions of a linear model is important...
For a model where you have factors x_i to explain what you measured, y_i (Google Sites, please incorporate LaTeX or HTML editing...), i.e. y_i given x_i:
E[y_i | X_i = x_i] = E[ beta_0 + beta_1 * x_i + e_i ] substituting the linear model in
= E[beta_0] + E[beta_1 * x_i] + E[e_i]
= beta_0 + beta_1 * x_i + E[e_i] the betas and x_i are not random
= beta_0 + beta_1 * x_i + 0 we assumed e ~ N(0, sigma^2), so E[e_i] = 0
= mu_i i.e. the mean!
Similar reasoning for why Var(y_i | x_i) = sigma^2:
Var[y_i | X_i = x_i] = Var[ beta_0 + beta_1 * x_i + e_i ] substituting the linear model in
= 0 + Var[e_i] beta_0 + beta_1 * x_i is a constant, and the variance of a constant is 0
= sigma^2 we assumed e ~ N(0, sigma^2)
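A minimal simulation sketch of both results (the "true" beta_0 = 1, beta_1 = 2, sigma = 3, and the fixed x are made-up values for illustration): the sample mean of y at a fixed x approaches beta_0 + beta_1 * x, and its variance approaches sigma^2.
set.seed(1)
beta0 <- 1; beta1 <- 2; sigma <- 3          # assumed "true" parameter values
x <- 5                                      # condition on a single fixed x
y <- beta0 + beta1 * x + rnorm(1e5, mean = 0, sd = sigma)   # draws of y_i | x_i = 5
mean(y)   # ~ beta0 + beta1 * x = 11, the conditional mean
var(y)    # ~ sigma^2 = 9, the conditional variance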
Centering or normalizing predictors does not change the interpretation of regression models, nor the RMSE, R-squared, adjusted R-squared, or coefficient p-values -- but it naturally affects estimation (the coefficient values themselves).
NOTE: Scaling or normalizing the response/dependent variable is unnecessary because we are interested in comparing the weights of the independent variables.
You should center or normalize
for interpretability of model estimates/predictions: when 0 has no meaning for a predictor, or when you want predictions expressed in terms of distance from a meaningful reference (e.g. no one weighs 0 kg)
when the independent continuous variables are on different scales, which would bias certain variables toward contributing more to the analysis; rescaling gives them comparable ranges and variances
to reduce collinearity introduced by higher-order terms (centering alone is enough) -- check that the VIFs drop below 5; see the sketch after this list
why this works... (to do: find a geometric explanation for centering and orthogonalizing, and for how centering has the same effect as normalizing)
example of when multicollinearity is induced
when your response is nonlinear/you're modeling with quadratic terms
illustration in this forum post of how centering can affect estimation in linear regression with R's lm even when there is no interaction
when you have interaction terms
when you use gradient descent to fit your model
standardizing covariates may speed up convergence (otherwise certain variables may inappropriately dominate the gradient).
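A minimal sketch of the structural-collinearity point above (simulated data; all names and values are made up): adding a squared term makes x and x^2 highly collinear, and centering x first brings the VIFs back down near 1.
library(car)                       # for vif()
set.seed(42)
x <- runif(100, 10, 20)            # a predictor far from 0
y <- 3 + 0.5 * x + 0.2 * x^2 + rnorm(100)
vif(lm(y ~ x + I(x^2)))            # raw x: huge VIFs (structural collinearity)
xc <- x - mean(x)                  # center the predictor first
vif(lm(y ~ xc + I(xc^2)))          # centered x: VIFs near 1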
Check out Camelia Simoiu and Jim Savage's informative explanation of scaling and its effect on estimation (and what to do when you have scaling probs) on RPubs or AFNI's wonderfully explanatory section on centering variables in their documentation.
Fun fact: X'X of the centered and standardized X is the correlation matrix (up to a factor of n - 1: X'X / (n - 1) = cor(X)).
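A quick check of this in R, using a few mtcars columns as an assumed example:
X <- as.matrix(mtcars[, c("mpg", "hp", "wt")])
Xs <- scale(X)                     # center and standardize each column
crossprod(Xs) / (nrow(Xs) - 1)     # X'X / (n - 1) ...
cor(X)                             # ... matches the correlation matrix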
How do you center? Up to you! Well, it has to make sense. You can choose the grand mean across everyone and all the levels of a factor, or you can choose the mean of a reference group. Some options, as listed on Listen Data:
z-score = (x - mean) / stdDev
rescale an original variable to have a mean of zero and standard deviation of one
R:
X.scaled <- scale(X, center= TRUE, scale=TRUE)
colMeans(X.scaled) ## should all be 0
var(X.scaled) ## diagonal should all be 1
Min-Max Scaling / 0-1 scaling = (x - min(x)) / (max(x) - min(x))
scaled values lie between 0 and 1
R:
library(dplyr)
mins <- as.numeric(summarise_all(X, min))
rng <- as.numeric(summarise_all(X, function(x) diff(range(x))))
X.scaled <- data.frame(scale(X, center = mins, scale = rng))
summarise_all(X.scaled, list(min = min, max = max)) ## check -- min = 0, max = 1
Standard Deviation Method = x/stdev(x)
results in equal variance but different means and ranges.
R:
X.scaled <- data.frame(scale(X, center = FALSE, scale = apply(X, 2, sd, na.rm = TRUE)))
summarise_all(X.scaled, var) ## check: all variances should equal 1
Range Method = x /(max(x) - min(x))
ranges more similar but means, variances, and ranges of the variables are different
R:
library(dplyr)
rng <- as.numeric(summarise_all(X, function(x) diff(range(x))))
X.scaled <- data.frame(scale(X, center = FALSE, scale = rng))
summarise_all(X.scaled, var) ## ranges become comparable, but means and variances still differ
summarized from responses to this stackexchange Q.
When it's helpful to transform (usually with some nonlinear function like taking the log) the dependent variable:
residuals have a skewed distribution and transforming makes the residuals more symmetrically distributed about zero
transforming removes systematic changes in the spread of the residuals with the dependent variable (i.e. turns heteroscedastic residuals into homoscedastic ones)
linearize a relationship to do linear modeling
to simplify the model, e.g. simplify interaction terms or how many there are
for interpretation/"convenience"
e.g. if you log both your dependent (Y) and independent (X) variable(s), your regression coefficients (b) become elasticities, and the interpretations are (also see UCLA's FAQ on this and their more detailed page on interpreting log transforms):
Y and X: a one unit increase in X leads to a b unit increase/decrease in Y
log(Y) and log(X): a 1% increase in X leads to roughly a b% increase/decrease in Y
log(Y) and X: a one unit increase in X leads to roughly a b*100% increase/decrease in Y
Y and log(X): a 1% increase in X leads to roughly a b/100 unit increase/decrease in Y
theoretical reasons to do so, e.g.
pH concentration is on a logarithmic scale
the Cobb-Douglas production function which explains how inputs turn into outputs:
Y = A * L^alpha * K ^beta, where
Y = total production of some entity
A = total factor productivity
L = labor input
K = capital input
alpha, beta = output elasticities
Taking the log of this makes the function easy to estimate using OLS linear regr:
log(Y) = log(A) + alpha * log(L) + beta * log(K)
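A minimal simulation sketch of this (the values for A, alpha, and beta are made up): after logging both sides, lm() recovers the elasticities as ordinary regression coefficients.
set.seed(7)
n <- 500
L <- runif(n, 1, 100)                               # labor input
K <- runif(n, 1, 100)                               # capital input
A <- 2; alpha <- 0.6; beta <- 0.3                   # assumed "true" parameters
Y <- A * L^alpha * K^beta * exp(rnorm(n, 0, 0.1))   # multiplicative error
fit <- lm(log(Y) ~ log(L) + log(K))
coef(fit)   # intercept ~ log(A); slopes ~ alpha and beta (the output elasticities)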
What sort of transform? Check out John Tukey's book Exploratory Data Analysis where he gives some quantitative ways to estimate the transformation based on rank statistics on the residuals
use logarithmic when
residuals are strongly positively skewed (taking the log can symmetrize them)
when the SD of the residuals is directly proportional to the fitted values and not to some power of the fitted values
when the relationship is close to exponential
when errors accumulate multiplicatively
you want a model in which marginal changes in the explanatory variables are interpreted as multiplicative (percentage) changes in the dependent variable
use inversion when
variable has a negative skew (like Likert scales)
Bad reasons to transform:
to make bad data well-behaved, e.g. to make outliers not look like outliers
First get a scientifically valid and statistically sound description of the data -- then look at the outliers. Otherwise you are letting the occasional outlier determine how to look at the rest of the data.
because all the data are positive (for counted data, a square-root transform actually works best)
to be able to plot the data
https://www.researchgate.net/post/When-is-it-correct-to-omit-a-random-factor
https://stats.stackexchange.com/questions/58220/what-distribution-does-my-data-follow
fitdistrplus::descdist() to graphically help choose likely distribution candidates
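A minimal usage sketch of descdist() (the vector y here is simulated stand-in data):
library(fitdistrplus)
set.seed(3)
y <- rgamma(200, shape = 2, rate = 1)   # stand-in data for illustration
descdist(y, boot = 100)                 # Cullen & Frey graph: skewness vs. kurtosis of y,
                                        # bootstrapped values, and common candidate distributions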
Default R regression output isn't good at naming the levels of your factors. If you used the default dummy/treatment coding scheme, the intercept corresponds to the reference level of factorX -- which by default is the first level alphabetically -- with all other factors at their reference levels, too. factorX1 denotes the next level (usually the next alphabetically), factorX2 the level after that, etc. For example, in the iris dataset, iris$Species = {setosa, versicolor, virginica}, so with R's default dummy coding setosa is the reference level absorbed into the intercept.
Package car does a better job. When you set the contrast coding scheme, use car's functions: contr.Treatment() instead of the base function contr.treatment(), contr.Sum() instead of contr.sum(), e.g.
library("car")
contrasts(iris$Species) <- contr.Treatment(levels(iris$Species))
This will label the parameter estimates in the results table with text-based level names rather than 1, 2, ....
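A small illustration of the difference with sum-to-zero coding on the iris data (the exact coefficient labels may vary by car version):
library(car)
ir <- iris                                                   # work on a copy
contrasts(ir$Species) <- contr.sum(3)                        # base R: coefficients come out as Species1, Species2
coef(lm(Sepal.Length ~ Species, data = ir))
contrasts(ir$Species) <- contr.Sum(levels(ir$Species))       # car: coefficients labeled with the level names
coef(lm(Sepal.Length ~ Species, data = ir))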
These terms refer to different things depending on the field, as summarized on Stephanie Glen's StatisticsHowTo.com.
Summary of the differing definitions: https://statmodeling.stat.columbia.edu/2005/01/25/why_i_dont_use/
Notes from Jonathan Templin's U of Georgia ERSH 8310 class and forum posts on stats.stackexchange.com
ANOVA is a regression model in which the dependent variable is quantitative but all the explanatory variables are dummies (qualitative):
simple linear regression vs. multiple linear regression vs. mixed regression models vs. generalized linear models
Zuur
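A minimal sketch of the ANOVA-as-regression point above, using the built-in iris data: the classical one-way ANOVA and the dummy-coded regression give the same F test.
fit.aov <- aov(Sepal.Length ~ Species, data = iris)   # classical one-way ANOVA
fit.reg <- lm(Sepal.Length ~ Species, data = iris)    # regression on dummy-coded Species
summary(fit.aov)   # F statistic and p-value ...
anova(fit.reg)     # ... identical to the regression ANOVA table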
Increasing the strictness of assumptions leads to stronger results -- check out Ben's response to this CrossValidated post Where does the misconception that Y must be normally distributed come from?
Y values can be expressed as a LINEAR FUNCTION of the x var
to test: residuals vs. fitted plot -- look for systematic curvature (a P-P plot checks distributional fit, not linearity)
Y-values (or errors) are INDEPENDENT (no autocorrelation)
to test:
library(car)
fit.lm <- lm(y ~ x, data = data)
durbinWatsonTest(fit.lm, alternative = "two.sided") # H0: errors are not autocorrelated (i.e. independent)
the predictor variables are not correlated (not multicollinear)
Two types of multicollinearity:
data collinearity: the multicollinearity is in the data itself
structural collinearity: multicollinearity due to how the model is specified (e.g. if you square a term, X and X^2 will be correlated)
Why it's a problem
coefficient estimates become oversensitive to small changes in the data and to which other predictors are included
model results are less reliable: multicollinearity reduces the precision of the estimated coefficients and makes the p-values untrustworthy
makes it hard to interpret coefficients
tests
variance inflation factors (VIF; quantifies how strongly each independent variable is correlated with the other predictors and how much that inflates the variance of its coefficient; VIF ranges over [1, inf)) -- see the sketch after this list
VIF > 5 = critical
to fix:
center the independent variables
remove some of the highly correlated indep variables
partial least squares regression (builds a small set of uncorrelated components to regress on)
LASSO or ridge regression (forms of regression analysis that can handle multicollinearity)
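A minimal sketch of checking VIFs (simulated data with two deliberately correlated predictors; the variable names are made up):
library(car)                      # for vif()
set.seed(123)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)     # x2 is nearly a copy of x1 (data collinearity)
x3 <- rnorm(n)
y <- 1 + 2 * x1 + 0.5 * x3 + rnorm(n)
vif(lm(y ~ x1 + x2 + x3))         # x1 and x2 have VIFs far above 5
vif(lm(y ~ x1 + x3))              # dropping x2 brings the VIFs back near 1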
HOMOSKEDASTICITY: Variation of observations around the regression line (the residual SE) is constant, i.e. variance is not a function of x
Skedasticity characterizes the relationship between residual size and the predicted values
to test:
scatter plot
residuals (errors) vs. fitted plot
plot(fit.lm, which = 1)   # residuals vs. fitted
# alternatively
plot(x, resid(fit.lm))
abline(h = 0)
interpretation:
good: should look like a cloud of points
bad: looks like a megaphone (variance changes with x), is curved (indicates nonlinearity), or shows any other systematic pattern
Levene test for equal variance (H0 = equal variances)
leveneTest(y ~ x1 * x2, data = data)   # from package car; x1 and x2 must be factors
Shapiro-Wilk normality test of the residuals (H0 = normally distributed)
shapiro.test(residuals(fit.lm))
note: There can be heteroskedasticity on predicted values even if the residuals are normally distributed.
CONDITIONAL NORMALITY: For a given x, y values (or the error) are normally distributed
i.e. Y | X ~ N(beta_0 + beta_1 * X, sigma_e), or in Stan:
model {
  vector[n_obs] yhat;
  for (i in 1:n_obs) {
    yhat[i] = beta[1] + beta[2] * x1[i] + beta[3] * x2[i];
  }
  y ~ normal(yhat, sigma);
}
This does not
characterize how the residuals relate to anything else
mean that the marginal distribution of Y must be normally distributed
to test:
histogram
normal Q-Q plot: standardized residuals (ordered, observed residuals) vs. theoretical quantiles (ordered, theoretical residuals)
par( mfrow = c(2,2) )
plot(fit.lm)
Kolmogorov-Smirnov test
Shapiro-Wilk test
kurtosis
distribution skew
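A minimal sketch of these checks on the residuals of the fit.lm from above (skewness() and kurtosis() here come from the e1071 package, chosen as one option among several):
library(e1071)                          # for skewness() and kurtosis()
r <- residuals(fit.lm)
hist(r)                                 # histogram of residuals
qqnorm(r); qqline(r)                    # normal Q-Q plot
shapiro.test(r)                         # Shapiro-Wilk (H0: normally distributed)
ks.test(as.numeric(scale(r)), "pnorm")  # Kolmogorov-Smirnov vs. standard normal (approximate, since mean/sd are estimated)
skewness(r)                             # ~0 for symmetric residuals
kurtosis(r)                             # excess kurtosis, ~0 for normal residuals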
Check assumptions
Check for bias in coefficient estimates/predictions
Residual vs. predicted/fitted plot
plot(fitted(lmMod), resid(lmMod)); abline(h = 0)
# or plot the raw data with the fitted line
plot(data); abline(lmMod)  # add the regression line
Check goodness-of-fit statistics
R-squared = explained_variation / total_variation
[0%, 100%]
measure of the spread of points around a regression line estimated from a given sample, i.e. how much variability in the response variable has been explained by the regression equation fitted from a given sample
does not indicate whether model is adequate
low R-squared does NOT automatically mean bad model
can get high R-squared for bad model fit
but it is often true that a high R-squared goes along with coefficient estimates that are large relative to their standard errors
useful when comparing two different models with the same response variable but different predictor variables
Check for outliers
For example, plot Cook's distances (large values indicate outliers)
plot(cooks.distance(lmMod))
Check that standard errors (estimates of variance of regression coefficients) across a sample are small relative to the coefficients
Nice explanation of diagnosing models from residual plots and problem fixes from qualtrics.XM
Create productivity-increasing, fancy residual vs. fitted plots following blogR's Simon Jackson (@drsimonj)
Interpreting
summary(lmMod)
R regression outputs are divided into
Call (what formula was used to fit model)
Residuals
can be manually calculated with
summary(data$depVar - lmMod$fitted.values)
check for
Median ~= 0 => residuals are symmetric / the model predicts the high and low ends of the data range equally well
if median < 0 => residuals are right-skewed, so the model does not predict the higher end of the data range as well as the lower end / the Q-Q plot points for the higher end will not fall on the diagonal
Coefficients
for simple regression outputs, the table reports, for each independent variable, the estimate, standard error, t-statistic (= estimate / std. error), and p-value (Pr(>|t|), based on the t-distribution)
intercept = predicted mean when all factors are at their reference levels (and any continuous predictors are at 0)
each coefficient is the change in the dependent variable per unit change of that independent variable (or per change of category relative to the reference level, for a dummy-coded categorical variable)
if a categorical variable, read the coefficient name by prepending "is" (e.g. "is versicolor")
Residual standard error
how well the model fits the data (how much estimates deviate from the real data points)
Multiple R-squared and Adjusted R-squared
what percentage of the variation in the dependent variable the independent variables explain
adjusted R-squared adds a penalty for additional predictors and is favored in multiple regression models because plain R-squared always increases as more variables are added to the model
PITFALL: a high R-squared does not mean the finding is more important (nor does a low R-squared mean it is unimportant)
F-statistic and p-value
for the overall model -- the null hypothesis is that the coefficients of all of the variables in the model are zero
So far from Statistics by Jim
Coefficient values are tied to the units of their variables and to whether those variables are centered/standardized, so a small coefficient value says nothing about how important the variable is.
However, coefficients of standardized variables, which are unitless, can be used like standardized effect sizes because they indicate the strength of the relationship between variables. When all continuous variables are standardized, the one with the largest absolute coefficient is the variable that explains the most variance in the response (see the sketch below).
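A minimal sketch of comparing standardized coefficients, using mtcars as an assumed example:
d <- data.frame(scale(mtcars[, c("mpg", "wt", "hp", "qsec")]))   # standardize response and predictors
fit <- lm(mpg ~ wt + hp + qsec, data = d)
sort(abs(coef(fit)[-1]), decreasing = TRUE)   # largest |standardized coefficient| = strongest relationship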
The heading says it all.
R-squared ( = explained variation of the model / total variation of the model) reflects the goodness-of-fit of the model. Therefore, the change in R-squared when a given independent variable is added represents the unique portion of the goodness-of-fit attributable to that variable, given the others already in the model (see the sketch below).
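A minimal sketch of this, again using mtcars as an assumed example:
r2_base <- summary(lm(mpg ~ wt, data = mtcars))$r.squared
r2_full <- summary(lm(mpg ~ wt + hp, data = mtcars))$r.squared
r2_full - r2_base   # R-squared uniquely added by hp, given that wt is already in the model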
contrast types = coding schemes = parameterization styles
Likert scale data are ordinal, discrete data with a limited range and cannot necessarily be treated as interval data, which would require the points to be equally spaced. There is no consensus on how to analyze them: some say ANOVA is robust enough to handle them, while others say that's a big no-no and that a non-parametric equivalent like Mann-Whitney or Kruskal-Wallis is more appropriate (although these also assume the variables have continuous distribution functions, like ANOVA). Similarly, some say the data can be treated as normally distributed and a standard linear regression is fine, whereas others advocate an ordered logistic regression or multinomial logistic regression as more appropriate.
de Winter and Dodou (2010, Practical Assessment, Research & Evaluation) used simulations to compare the 2-sample t-test and Mann-Whitney on Likert data and found that they produce comparable false positive rates.
Some discussion on P.Mean, a sample comparison of approaches on Salvatore Mangiafico's rcompanion.org, numerous discussions on StackExchange including this one, and using ANOVA with Likert scale data
Andrew Lapointe's authoR Reproducible Science for Busy Researchers, with loads of experience-laden tips about data organization and analysis, maximizing R, data exploration, and mixed regression modeling.