Model selection 1

The total length of the videos in this section is approximately 54 minutes. Feel free to do this in multiple sittings! You will also spend time answering short questions while completing this section.

You can also view all the videos in this section at the YouTube playlist linked here.


Intro to model selection


Model Selection.1.Intro.mp4

Question 1: Suppose we fit a regression model to a particular data set. Will the residuals shrink (on average) if we add more predictors or take away some of the predictors?

Answer:

The residuals will shrink if we add more predictors. Every time we add a predictor, the sum of squared residuals will shrink (or at least not grow), because the model describes the pattern in the data set more closely. This is true even if the apparent relationship between the added predictors and the outcome occurred by chance.
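
Here is a minimal sketch of this fact in Python (the simulated data, sample size, and noise level are assumptions for illustration, not from the video): appending a pure-noise predictor never increases the sum of squared residuals of a least-squares fit.

```python
import numpy as np

# Simulated data from a simple linear model (assumed coefficients and noise).
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(size=n)

def ssr(design, y):
    """Sum of squared residuals from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ beta) ** 2)

X_small = np.column_stack([np.ones(n), x])                # intercept + X
X_big = np.column_stack([X_small, rng.normal(size=n)])    # add a useless noise predictor

print(ssr(X_small, y))   # SSR for the smaller model
print(ssr(X_big, y))     # never larger than the value above
```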

Overfitting examples


Model Selection.2.Overfitting.mp4

Question 2: Why is overfitting undesirable?

Answer:

Overfitting occurs when we choose a model that describes our data set too closely, even reflecting the patterns in the data set that occurred by chance rather than due to real relationships between the variables. If you draw conclusions about the relationships between variables using an overfitted model, or if you attempt to make predictions using an overfitted model, you will likely be wrong.
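
As a rough illustration (the simulated data and polynomial degrees below are assumptions chosen for the demonstration), a very flexible model can reproduce the training points almost exactly while predicting new points from the same process worse than a simple straight line does.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(-1, 1, 12)
y_train = 1 + 2 * x_train + rng.normal(scale=0.3, size=12)   # the truth is a straight line

coef_line = np.polyfit(x_train, y_train, deg=1)     # sensible model
coef_wiggle = np.polyfit(x_train, y_train, deg=9)   # far too flexible for 12 points

# New data drawn from the same process.
x_new = rng.uniform(-1, 1, size=500)
y_new = 1 + 2 * x_new + rng.normal(scale=0.3, size=500)

for name, coef in [("straight line", coef_line), ("degree-9 polynomial", coef_wiggle)]:
    train_mse = np.mean((y_train - np.polyval(coef, x_train)) ** 2)
    new_mse = np.mean((y_new - np.polyval(coef, x_new)) ** 2)
    print(f"{name}: training MSE {train_mse:.3f}, new-data MSE {new_mse:.3f}")
```

The overfitted polynomial typically wins on the training points and loses on the new ones.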

R-squared


Model Selection.3.R-Squared.mp4

Question 3: Why is it a bad idea to choose the model that has the highest R-squared?

Answer:

R-squared never decreases when you add predictors, even if the added predictors are not actually helpful for predicting the outcome. Maximizing R-squared leads to severe overfitting.
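
A small sketch of this point (simulated data; the sample size and number of added noise predictors are assumptions): R-squared only moves up as useless predictors are appended.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x = rng.normal(size=n)
y = 1 + 0.5 * x + rng.normal(size=n)

def r_squared(design, y):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

X = np.column_stack([np.ones(n), x])
print("R^2 with X only:", round(r_squared(X, y), 3))
for k in range(5):
    X = np.column_stack([X, rng.normal(size=n)])   # append a pure-noise predictor
    print(f"R^2 after adding {k + 1} noise predictor(s):", round(r_squared(X, y), 3))
```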

R-squared is equal to 1 when the number of parameters equals the number of data points.


Model Selection.4.R-SquaredP2.mp4

Question 4: Suppose that I have three unknown variables. How many equations relating these variables do I need in order to exactly solve for each of the three variables?

Answer:

Three. If you have k unknown variables, you need k equations to solve for them exactly (technically, k linearly independent equations, meaning that they are genuinely k different equations; for example, x + y = 1 and 2x + 2y = 2 are really the same equation). If you have more unknowns than equations, then the equations can hold for more than one set of values for the variables. If you have more equations than unknowns, then the system generally cannot be solved exactly.
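
To tie the heading and this answer together, here is a small sketch (the three data points are arbitrary assumptions): with three parameters and three data points, the fitted model passes through every point, the residuals are zero, and R-squared is exactly 1.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])
y = np.array([2.0, 3.0, 7.0])            # any three points with distinct x values

design = np.column_stack([np.ones(3), x, x ** 2])   # intercept, X, X^2: 3 parameters
beta = np.linalg.solve(design, y)                   # exact solution of the 3x3 system
fitted = design @ beta

r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
print(beta)   # the unique coefficients that hit all three points
print(r2)     # 1.0 (up to floating-point rounding)
```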

More on R-squared


Model Selection.5.R-SquaredP3.mp4

Question 5: What will happen to R-squared when we zoom in on a subset of the data, defined by a smaller range of X?

Answer:

If we define a subset of the data based on a smaller range of X and recalculate R-squared using this subset, then R-squared will decrease. Explanation in next video.
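
Here is a quick numerical sketch (simulated straight-line data with an assumed slope and noise level): the same fitted relationship yields a smaller R-squared on a narrow slice of X, because the predictor explains less of the (now smaller) total variation in Y.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=500)
y = 1 + 2 * x + rng.normal(scale=2, size=500)

def r_squared(x, y):
    coef = np.polyfit(x, y, deg=1)
    resid = y - np.polyval(coef, x)
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

print("full range of X:", round(r_squared(x, y), 3))
keep = (x > 4) & (x < 6)                    # zoom in on a narrow slice of X
print("zoomed-in subset:", round(r_squared(x[keep], y[keep]), 3))
```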

R-squared: zooming in explanation; what about linearity?


Model Selection.6.Linearity.mp4

Question 6: Does a high value of R-squared imply that the equation relating the mean of Y to the predictors is likely correct?

Answer:

No. R-squared is not a check on the linearity assumption, as it can be high even when the model equation is not correct. This fact is related to the idea that R-squared increases even when you add useless predictors. A model equation with many useless predictors is certainly not correct, but it could lead to a high R-squared.
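
A rough sketch of this answer (the curved mean function and noise level are assumptions): Y follows a clearly non-linear mean function, yet a straight-line fit still produces a high R-squared, so R-squared cannot serve as a linearity check.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 200)
y = x ** 2 + rng.normal(scale=5, size=200)   # curved truth, modest noise

coef = np.polyfit(x, y, deg=1)               # fit the *wrong* straight-line model
resid = y - np.polyval(coef, x)
r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
print(round(r2, 3))   # typically above 0.9 despite the non-linear mean
```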

Illustrating R-squared


Model Selection.7.Illustrating R^2.mp4

Question 7: Which of the three pictures has the highest R-squared?

Answer:

The picture on the right (the third one). Explanation in next video.

Visualizing within and between variance


Model Selection.8.Visualizing Within Variance.mp4

Question 8: Imagine two graphics, each showing Y values grouped by a categorical variable. The means of the Y values within each level of the categorical variable are the same in the two graphics. However, the residual variance is bigger in graphic 1 than in graphic 2. Which graphic has a higher R-squared?

Answer:

This is a written version of the previous question. If the group means stay the same, then R-squared will be bigger when the points are closer to these means. So, graphic 2 will have a higher R-squared, due to smaller residual variance.
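
A numerical version of the two graphics (the group means, group sizes, and spreads are assumed for illustration): with the same underlying group means, R-squared is larger when the points sit closer to those means.

```python
import numpy as np

rng = np.random.default_rng(5)
group_means = [10.0, 14.0, 18.0]

def grouped_r_squared(noise_sd):
    y_parts, fit_parts = [], []
    for m in group_means:
        vals = m + rng.normal(scale=noise_sd, size=30)
        y_parts.append(vals)
        fit_parts.append(np.full(30, vals.mean()))   # fitted value = that group's mean
    y = np.concatenate(y_parts)
    fitted = np.concatenate(fit_parts)
    return 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

print("graphic 1 (large residual variance):", round(grouped_r_squared(4.0), 3))
print("graphic 2 (small residual variance):", round(grouped_r_squared(1.0), 3))
```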

R-squared with polynomial terms


Model Selection.9.R-Squared Polynomial.mp4

Question 9: Suppose I know that Y is generated according to a simple linear regression model with predictor X. I have a sample of 100 data points from this model. If I try fitting a model that includes an intercept, X, and also X-squared, what will happen to R-squared?

Answer:

I realize that the focus on this idea has been repetitive, but it's important: even though that squared term is not part of the true model, including it when we fit a regression model to a sample of data points will increase R-squared. The reason is that, after we fit a model with just an intercept and X, the sample correlation between X^2 and the residuals is not exactly zero in the data set, even though the true correlation between X^2 and the residuals from the correct, simple model is zero.
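
One last sketch of the same idea, following the setup in the question (the coefficients and noise level are assumed): Y is simulated from a simple linear model with 100 data points, and adding an X^2 column still nudges R-squared upward.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)           # true model: intercept + X only

def r_squared(design, y):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

X_lin = np.column_stack([np.ones(n), x])
X_quad = np.column_stack([np.ones(n), x, x ** 2])
print("intercept + X:      ", round(r_squared(X_lin, y), 4))
print("intercept + X + X^2:", round(r_squared(X_quad, y), 4))   # slightly larger
```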

I'm putting the rest of the model selection material in a separate module.

During this tutorial you learned:

  • About parametric model selection and selection criteria

  • What makes a model a ‘better fit’

  • More about overfitting, why it is undesirable, and a visualization illustrating overfitting

  • About R-squared and its interpretation, when it is related to the correlation (r), what happens to R-squared when you add more terms to your model, when R-squared is equal to 1, the relationship between R-squared and linearity, and what happens to R-squared when the range of the model predictors changes

  • How to relate regression to solving a system of equations

  • How to visually understand the interpretation of R-squared as the proportion of variability in the outcome variable that is explained by the model


Terms and concepts:

Model selection, overfitting, R-squared, correlation (r), linearity