Regression and ANOVA are different ways of formatting output from the same model!

The total length of the videos in this section is approximately 72 minutes. Feel free to do this in multiple sittings! You will also spend time answering short questions while completing this section.

You can also view all the videos in this section at the YouTube playlist linked here.

The videos below use R examples to illustrate that regression and ANOVA are the same method, with different conventions for formatting the results. Though these are screencapture videos of me running code, the takeaway is about conceptual understanding rather than how to code.

Please download this code file and follow along with the videos.

One predictor, categorical or continuous, part 1

RegressionANOVA.1.OnePredictor.mp4

Question 1: How did we handle the missingness on the variable called "vore"?

  • mean imputation

  • creating a factor level called "missing"

  • imputing by drawing from the observed distribution

  • imputing by drawing from the observed distribution, for a subset of rows

Show answer

Creating a factor level called "missing." Especially when the variable is already a categorical variable, creating a category for missing values is a straightforward, assumption-free way to move forward with the analysis without dropping any rows that are missing on that variable. This method ignores any information in the rest of the data about best guesses at those missing values, though, as well as what you know from the context about how the pattern of missingness relates to the missing values.

One predictor, categorical or continuous, part 2

RegressionANOVA.2.OnePredictorP2.mp4

Question 2: If you run an ANOVA, what is the fitted value of logsleep for the group herbivore?

  • The mean value of logsleep observed for the herbivore group.

  • The mean value of logsleep observed for all animals except herbivores.

  • The coefficient indicating the presence of the herbivore group.

Show answer

The mean value of logsleep observed for the herbivore group.

Question 3: How do you get the SSR from an ANOVA output if the ANOVA model is named m1?

  • m1$fitted.values

  • sum(m1$resid^2)

  • sq(sum(m1$aov$resid^2))

Show answer

sum(m1$resid^2)

Question 4: Which of the following is the Mean Square Error in ANOVA?

  • Sum of Squared Residuals / Residual degrees of freedom

  • Sum of Squared Residuals

Show answer

Sum of Squared Residuals / Residual degrees of freedom

Question 5: For a linear model with more than one coefficient specified, what does the p-value for each coefficient show?

Show answer

The p-value for a particular coefficient is the result of a test comparing (i) the actual model that you ran with (ii) the same model with that particular term omitted.

One predictor, categorical or continuous, part 3

RegressionANOVA.3.OnePredictorP3.mp4

Question 6: What does the bottom row output of the summary for the linear model "lm" show?

Show answer

The bottom row summarizes an F-test comparing the entire model you specified to the equal means model.

Question 7: What is the residual standard error?

  • SSR (aka Sum of Squared Residuals)

  • SSR/degrees of freedom (aka, Mean Squared Error of Residuals)

  • SquareRoot(SSR/degrees of freedom)

Show answer

SquareRoot(SSR/degrees of freedom)
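Although the videos use R, these quantities are simple arithmetic on the residuals. Here is a minimal Python sketch with toy numbers (not the msleep data), assuming a one-predictor least-squares fit:

```python
import numpy as np

# Toy data for a simple one-predictor linear fit; a Python stand-in
# for the R models in the videos (hypothetical numbers).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit y = b0 + b1*x by least squares.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

ssr = np.sum(resid**2)          # Sum of Squared Residuals (R: sum(m1$resid^2))
df_resid = len(y) - X.shape[1]  # n minus number of estimated coefficients
mse = ssr / df_resid            # Mean Square Error
rse = np.sqrt(mse)              # Residual standard error
print(ssr, mse, rse)
```

In R, the same chain appears as sum(m1$resid^2), the mean square in the ANOVA table, and the "Residual standard error" line of summary(lm(...)).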

Question 8: Check all of the following that are equal to SSM/SST:

  • R-Squared

  • Sum of Squared Residuals

  • Mean Squared Error

  • 1-SSR/SST

  • Residual Standard Error

Show answer

The first and fourth options: R-Squared and 1-SSR/SST.
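You can verify numerically that the two expressions agree. A short Python sketch (toy numbers, assuming an ordinary least-squares fit with an intercept, for which SST = SSM + SSR holds exactly):

```python
import numpy as np

# Toy check that R-squared equals both SSM/SST and 1 - SSR/SST.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.9])

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

sst = np.sum((y - y.mean())**2)       # total sum of squares
ssm = np.sum((fitted - y.mean())**2)  # model sum of squares
ssr = np.sum((y - fitted)**2)         # residual sum of squares

r2_a = ssm / sst
r2_b = 1 - ssr / sst
print(r2_a, r2_b)  # the two expressions agree
```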

Question 9: Suppose that the rows in your data set are US counties, and your outcome variable is the total number of people in each county who vote in an election. If your predictor variable is the state in which each county is located, and your research question is whether voting varies across states, which type of output would you rather see?

  • Regression output

  • ANOVA output

Show answer

ANOVA. The ANOVA table will provide one p-value for the entire categorical state variable, reporting the results of an F-test that compares the equal means model to the model that predicts a separate mean for each state. The default regression output might report the results of the same F-test as an afterthought at the bottom of the output, but the regression output focuses on the separate coefficients for each state.

Question 10: In the same context as the previous problem, if you are interested in which states have the highest and lowest numbers of voters, which type of output would you rather see?

  • Regression output

  • ANOVA output

Show answer

Regression output. The ANOVA output will evaluate whether the state variable as a whole is useful for predicting vote counts, but the ANOVA table will not report which states have higher or lower vote counts. The regression output, though, includes the estimated coefficients for each state (except for one baseline state), so that you can assess which states have higher or lower predicted values. Each p-value in the regression output compares the model where every state has a different mean vote count to the model where every state has a different mean vote count, except that the state associated with that p-value has the same mean vote count as the baseline state. This may not be a particularly interesting test!

Multiple predictors, part 1

RegressionANOVA.4.MultiplePredictors.mp4

Question 11: When there are multiple predictors in the model, can you use the residuals to directly calculate the sum of squared residuals?

Show answer

Yes. The sum of squared residuals is always literally the sum of the squared residuals. Note that the residuals themselves change when we add terms to the model, because the fitted values change.

Multiple predictors, part 2

RegressionANOVA.5.MultiplePredictorsP2.mp4

Question 12: If there are 5 categories for a categorical variable, why are only 4 coefficients shown?

  • The category not shown is determined to be statistically insignificant.

  • The remaining category that is not shown is the baseline.

  • The category not shown is combined with another category.

Show answer

The remaining category that is not shown is the baseline. If you want to switch the baseline, you can either rename the labels so that a different category comes first alphabetically, or use relevel, an R function that lets you set the baseline level.

Question 13: You are trying to predict a student's GPA using the sports that a student plays as a categorical variable. You think that some sports have a greater impact than others on a student's GPA. For example, you think that playing basketball has a larger impact on a student's GPA than playing field hockey. If you want to test this assumption, and if your research question is about trying to come up with a model that might include some of the possible sports but not all of them in one categorical variable, which model output should you look at?

  • ANOVA

  • Linear regression

Show answer

Linear regression. ANOVA will not show the coefficients or significance of individual sports, only a test of whether the categorical sport variable as a whole is associated with GPA.

Question 14: If you are trying to decide whether you should include the categorical variable of a Wellesley student's major in a model determining the student's future median income, which output's default would be most useful for testing whether the student's major is an important predictor?

  • ANOVA

  • Linear regression

Show answer

ANOVA. In this case, we are interested in whether the variable as a whole matters, not in comparing specific majors to each other.

Question 15: When you have multiple predictors, is the F-test shown at the bottom of the Linear Regression output also in the output of the ANOVA?

Show answer

No. The F-test in the Linear Regression output compares the entire model to the equal means model, while the tests shown in the ANOVA table examine each variable one at a time.

Categorical v. continuous

RegressionANOVA.6.Cat vs. cont.mp4

Question 16: What is one way to tell that we incorrectly handled a continuous variable as a factor or categorical in an ANOVA?

Show answer

The degrees of freedom is not 1. A continuous variable always has one degree of freedom, while a categorical variable with k categories has k-1 degrees of freedom. Conversely, if your categorical variable has more than two categories but the output shows one degree of freedom for it, then you know the variable was incorrectly treated as continuous.
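One way to see why the degrees of freedom come out this way: a categorical variable expands into one indicator column per non-baseline level. A pandas analog of R's factor coding (the category labels here are hypothetical):

```python
import pandas as pd

# A factor with k levels contributes k-1 model columns, and hence k-1
# degrees of freedom, because one level serves as the baseline.
vore = pd.Series(["carni", "herbi", "omni", "herbi", "carni"],
                 dtype="category")
X = pd.get_dummies(vore, drop_first=True)  # drop the baseline level
print(X.columns.tolist())  # 2 indicator columns for 3 categories
```

A continuous predictor, by contrast, stays a single column, which is why it always shows one degree of freedom.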

Question 17: How do you tell if R handled a categorical variable with three categories as continuous variable in a regression?

Show answer

We see only one coefficient estimated for the categorical variable. We should see a separate coefficient for each level of the category except the baseline (so, with three categories, we should be estimating two coefficients). If there is only one coefficient for the variable, it has been incorrectly treated as continuous, unless the categorical variable has only two categories.

Multicollinearity, part 1

RegressionANOVA.7.Multicollinearity.mp4

Question 18: Suppose you are trying to predict height and your model includes a variable for the length of an individual's left foot and a variable for the length of an individual's right foot. When you run a linear regression on this model, which of the following tests will likely be significant? Check all that apply.

  • F-test at the bottom of the linear regression output.

  • The t-test for the coefficient for left foot.

  • The t-test for the coefficient for the right foot.

Show answer

This is one of my favorite examples. The F-test at the bottom of the linear regression output will be significant because the model you are testing is better than the equal means model. However, the t-test associated with the left foot coefficient compares a model with only the right foot to the model with both feet; the feet are highly correlated, so the right foot alone is good enough, and the p-value is large. Likewise, the t-test associated with the right foot coefficient compares a model with only the left foot to the model with both feet; the left foot alone is good enough, so that p-value is also large. When you put two highly correlated variables in a model as predictors, the individual p-values for their coefficients will not be significant, because a model that includes only the other predictor is fine. These large p-values do not imply that the information in these variables is unimportant, though: either left foot or right foot would be significant if only one of them were included in the model.

Question 19: Would it be sufficient to run a linear regression with several predictors and then omit the variables that are not significant?

Show answer

No. The variables that are not significant could still be important; perhaps they are correlated with other predictors that you included in the model.

Question 20: Consider two predictor variables that, individually, would be significant for predicting an outcome variable. These predictor variables are highly correlated with one another. What will we see from an ANOVA output when we run, in one model, these two variables as predictors for the outcome variable?

  • Both variables are not significant.

  • Both variables are significant.

  • The variable specified first would be significant and the variable specified second would not be significant.

  • The variable specified second would be significant and the variable specified first would not be significant.

Show answer

The variable specified first would be significant and the variable specified second would not be significant. By default, order matters in R's ANOVA output: the first variable is tested against the equal means model, so it will be significant, while the second variable is tested for what it adds beyond the first, which is very little when the two are highly correlated.

The outputs for a regression v. an ANOVA report the results of different tests, by default. This question asks you to carefully identify which tests are reported by which default outputs. The task is not easy!

Feel free to actually run these models as you think about the questions (though, that is not required).

After recording these videos, I learned something useful: if you have two models that have each been created with either lm or aov, you can conduct the F-test comparing the two models with the "anova" function. For example, if the output from the first model is "lm1", and the output from the second model is "lm2", then "anova(lm1,lm2)" conducts the F-test that compares them. This approach is helpful for confirming which comparisons are displayed in the output from a regression or ANOVA.
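The same comparison can also be computed by hand. This Python sketch (toy data, numpy/scipy in place of R) shows the F statistic that anova(lm1, lm2) reports for two nested models:

```python
import numpy as np
from scipy import stats

# F-test comparing two nested OLS models, as R's anova(lm1, lm2) does.
# Toy simulated data; names are illustrative.
rng = np.random.default_rng(1)
n = 80
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1 + 2 * x1 + 1.0 * x2 + rng.normal(size=n)

def ssr_and_df(X, y):
    """Residual sum of squares and residual df for an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return np.sum(resid**2), len(y) - X.shape[1]

ones = np.ones(n)
ssr_small, df_small = ssr_and_df(np.column_stack([ones, x1]), y)      # reduced
ssr_big, df_big = ssr_and_df(np.column_stack([ones, x1, x2]), y)      # full

f_stat = ((ssr_small - ssr_big) / (df_small - df_big)) / (ssr_big / df_big)
p_value = stats.f.sf(f_stat, df_small - df_big, df_big)
print(f_stat, p_value)
```

Every p-value discussed in this section is an F-test of this form: a reduced model against a fuller model, differing only in which pair of models is compared.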

Once you've attempted to answer these questions, please look at the explanations, and then try again, using the msleep data to convince yourself.

Question 21: Given the msleep data, suppose that you run a linear model to predict logsleep. You specify the predictors logwt and vore, in that order, with no interaction.

Which of the following tests will be reported as part of the default regression output, in R?

  • Comparison between equal means model and model that includes logwt and vore

  • Comparison between model that includes both logwt and vore, and model that includes only vore

  • Comparison between model that includes both logwt and vore, and model that includes only logwt

  • Comparison between model that includes both logwt and vore, and model that includes both logwt and vore but assumes that herbivores have the same mean log sleep as the baseline vore (say, carnivores)

Show answer

Options 1, 2, and 4. See the answer to the next question for explanations of the four options. Remember that the p-value for a coefficient in the regression output is always for the null hypothesis that that coefficient is equal to zero. When a particular coefficient is equal to zero, then you have the same model you ran, except without that particular term. So, the p-values for the coefficients each compare the model you ran to the same model with that one term omitted. Many of these comparisons are not interesting or useful! The goal is for you to be aware of what comparisons those p-values reflect.

Question 22: Continuing the example in the previous question, which of the following tests will be reported as part of the default ANOVA output, in R?

  • Comparison between equal means model and model that includes logwt and vore

  • Comparison between model that includes both logwt and vore, and model that includes only vore

  • Comparison between model that includes both logwt and vore, and model that includes only logwt

  • Comparison between model that includes both logwt and vore, and model that includes both logwt and vore but assumes that herbivores have the same mean log sleep as the baseline vore (say, carnivores)

Show answer

Comparison between model that includes both logwt and vore, and model that includes only logwt.

Option 1 is the F-test shown at the bottom row of the regression output. If there is only one predictor, the bottom row of the regression output shows the same F-test that is shown in the corresponding ANOVA table. If there are multiple predictors, the bottom row of the regression output shows an F-test that is not equivalent to anything shown in the default ANOVA table.

Option 2 is the test in the row labeled "logwt" in the regression output. This test actually does not appear in the ANOVA if the predictors are specified in the order given here, but it would have if the predictors had been specified in the opposite order - see next video!

Option 3 is the test in the row labeled "vore" in the ANOVA output. There is no such row in the regression output, because separate tests for the coefficients of each level of the categorical variable (except for one baseline level) are shown on different rows.

Option 4 is the test in the row labeled "herbi" in the regression output. There is no such row in the ANOVA table, because the ANOVA reports tests of the entire categorical variable rather than coefficients for specific levels of the categorical variable.

Multicollinearity, part 2

RegressionANOVA.8.MulticollinearityP2.mp4

Question 23: For which of the types of R output does the order of the predictors determine the tests that are reported?

  • Regression

  • ANOVA

Show answer

ANOVA. The regression output will always show the results of tests comparing the entire model to the entire model without each predictor. R's default ANOVA output builds up the predictors one by one, testing whether the first predictor alone is better than the equal means model; whether the first two predictors together are better than the first predictor alone; whether the first three predictors together are better than the first two; and so on. Though it seems strange that the output depends on the order of the variables, the tests shown in this ANOVA output are more likely to reveal whether we have important predictors in the data set. If two predictors are highly correlated with each other and with the outcome (aka "multicollinearity"), the tests shown in the linear model output for each predictor may not be significant, because including just one of the two predictors is sufficient.
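You can see this order dependence in the sums of squares themselves. A Python sketch with two correlated toy predictors (numpy in place of R; the sequential sums of squares are computed by hand):

```python
import numpy as np

# Sketch of R's sequential (Type I) sums of squares: the SS credited to
# a predictor depends on which predictors entered before it.
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)  # x2 highly correlated with x1
y = 2 * x1 + rng.normal(size=n)

def ssr(X, y):
    """Residual sum of squares for an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta)**2)

ones = np.ones(n)
sst = np.sum((y - y.mean())**2)
ssr_1 = ssr(np.column_stack([ones, x1]), y)       # x1 only
ssr_2 = ssr(np.column_stack([ones, x2]), y)       # x2 only
ssr_12 = ssr(np.column_stack([ones, x1, x2]), y)  # both

ss_x2_after_x1 = ssr_1 - ssr_12  # SS credited to x2 when it enters second
ss_x2_first = sst - ssr_2        # SS credited to x2 when it enters first
print(ss_x2_after_x1, ss_x2_first)
```

The SS credited to x2 is large when it enters first (it absorbs x1's signal through their correlation) and nearly zero when it enters after x1, which is exactly the pattern in Question 20.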

Did I convince you, even a little? I also encourage you to experiment with ANOVA and regression output in R.