Model selection 2

The total length of the videos in this section is approximately 58 minutes. Feel free to do this in multiple sittings! You will also spend time answering short questions while completing this section.

You can also view all the videos in this section at the YouTube playlist linked here.


Adjusted R-squared


Model Selection2.1.Adjusted R^2.mp4

Question 1: What is an advantage of adjusted R-squared over R-squared?

Show answer

R-squared increases (or at least never decreases) every time you add a term, even if that term is useless for predicting the outcome. Adjusted R-squared, however, takes into account the number of parameters in the model and does not necessarily increase just because a new predictor was added.
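A quick way to see this in R. This is a minimal sketch with simulated data; all of the variable names below are made up for illustration. The adjusted value is computed from R^2 as 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the number of predictors.

```r
# Compare R^2 and adjusted R^2 after adding a useless predictor.
set.seed(1)
n     <- 50
x     <- rnorm(n)
noise <- rnorm(n)              # pure noise, unrelated to y
y     <- 2 * x + rnorm(n)

fit1 <- lm(y ~ x)
fit2 <- lm(y ~ x + noise)      # add the junk term

s1 <- summary(fit1)
s2 <- summary(fit2)
s2$r.squared     >= s1$r.squared      # TRUE: R^2 never goes down
s2$adj.r.squared <  s1$adj.r.squared  # often TRUE: adjusted R^2 penalizes the extra term
```

The first comparison is guaranteed by the math; the second usually holds for a pure-noise predictor, though not on every random draw.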

Information criteria


Model Selection2.2.Information Criteria.mp4

Question 2: What are the two aspects of a model that are reflected by the BIC and AIC?

Show answer

One is the sum of squared residuals, which measures whether the model fits well. The other is the number of parameters, which guards against over-fitting.
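R computes both criteria for a fitted `lm` object via the built-in `AIC()` and `BIC()` functions. A hedged sketch with simulated data (the variable names are made up): for a Gaussian linear model the criteria are, up to constants, n·log(SSR/n) + 2k for AIC and n·log(SSR/n) + log(n)·k for BIC, where k counts parameters.

```r
# AIC and BIC both reward fit and penalize parameters; smaller is better.
set.seed(2)
n <- 100
x <- rnorm(n)
y <- 1 + 3 * x + rnorm(n)

small <- lm(y ~ x)
big   <- lm(y ~ poly(x, 5))   # extra polynomial terms it doesn't need

AIC(small); AIC(big)          # AIC adds a 2-per-parameter penalty
BIC(small); BIC(big)          # BIC's log(n) penalty punishes extra terms more

deviance(big) <= deviance(small)  # TRUE: adding terms never raises the SSR
```

The last line is the key tension: raw fit (SSR) always improves as terms are added, so the criteria's parameter penalties are what let them prefer the smaller model.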

Cross-validation and root mean square error


Model Selection2.3.CrossValidation.mp4

Question 3: What is the goal of cross-validation?

Show answer

The goal of cross-validation is to estimate how well your model predicts data it was not fit to, and in particular to check whether you are over-fitting.

Question 4: Suppose that you have selected a model that seems to fit the train data very well. You estimate the parameters of the model by running the R function lm on the train data. Do you run the lm function on the test data?

Show answer

No!! Don't do this! Running lm on the test data would re-estimate the coefficients. The whole point is that you estimate the coefficients by fitting the model to the train data, and then use those exact coefficient estimates to make predictions for the test data and see how close your predictions come to the true outcome values.

Question 5: If you are over-fitting, will the root mean square error be higher when calculated on the train data set or the test data set?

Show answer

If you are over-fitting, the model will fit the train data much better than it will fit the test data. So, the RMSE will be much bigger for the test data than it is for the train data.
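The workflow from Questions 4 and 5 can be sketched in a few lines of R. This uses simulated data, and the split fraction, polynomial degree, and variable names are all choices made up for illustration. Note that `lm` is called on the train data only; `predict` is what touches the test data.

```r
# Train/test split, fit on train only, compare RMSE on both sets.
set.seed(3)
n <- 200
d <- data.frame(x = runif(n, 0, 10))
d$y <- sin(d$x) + rnorm(n, sd = 0.3)

train_rows <- sample(n, size = 0.7 * n)
train <- d[train_rows, ]
test  <- d[-train_rows, ]

# Deliberately over-fit with a high-degree polynomial, fit on TRAIN ONLY.
fit <- lm(y ~ poly(x, 15), data = train)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_train <- rmse(train$y, predict(fit, newdata = train))
rmse_test  <- rmse(test$y,  predict(fit, newdata = test))
# Over-fitting typically shows up as rmse_test noticeably larger than rmse_train.
```

With a model this flexible, the gap between the two RMSEs is usually clear, though the exact numbers depend on the random split.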

Here is an article about the Netflix Contest, which was the first of its kind. Here is another article. Please read or skim. This was an important moment in the evolution of data science.

Question 6: Briefly describe one aspect of the competition or its aftermath that did not go as anticipated.

Show answer

Among other things: although the contest lasted three years, the winner was decided by a margin of 20 minutes. Netflix never ended up using the winning algorithm. It tried to set up a second contest later, but that one was cancelled when Netflix faced lawsuits over privacy. And now that Netflix centers on streaming video instead of mailing DVDs, the need to predict which movies people would enjoy has decreased.

Take a look at Kaggle. It includes both competitions and data sets posted for general exploration.


Question 7: What is the largest prize currently offered through one of Kaggle's contests?

Show answer

Last time I checked, a total of $150,000 was available for a data competition about relating code to comments in Python notebooks, with a first prize of $50,000.

Now for some R examples and slides

Adjusted R^2, when adding polynomial terms to a model


Model Selection2.4.Adjusted R^2.mp4

Question 8: Can adjusted R^2 be negative?

Show answer

Yes. Unlike R^2, which can be read as the proportion of variance explained, adjusted R^2 has no such nice interpretation, and it drops below zero when the penalty for extra parameters outweighs the model's fit.
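A tiny sketch of how this can happen, with made-up simulated data: regress pure noise on an unrelated predictor. Since adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), a small R^2 gets pulled below zero by the adjustment.

```r
# Regress noise on an unrelated predictor; adjusted R^2 can come out negative.
set.seed(4)
y <- rnorm(20)
x <- rnorm(20)   # unrelated to y
fit <- lm(y ~ x)
summary(fit)$r.squared       # small but positive
summary(fit)$adj.r.squared   # can be negative when x explains less than chance
```

Whether the adjusted value actually lands below zero depends on the draw, but it is always strictly less than R^2 whenever R^2 < 1.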

Kentucky Derby example


ModelSelection2.5.KentuckyDerby.mp4

Question 9: Either in words or by writing down an equation with betas in it, tell me what terms belong in this model. What goes on the right side of "Mean (Speed) = ..."?

Show answer

See next video.

Fitting a model for Kentucky Derby data


ModelSelection2.6.ModelKentuckyDerby.mp4

Question 10: What are some of the things you should do when trying to decide which terms to include in a linear regression model?

Show answer

Perhaps save some of the data for cross-validation. Visualize the data in lots of ways. Consider what you know about the context. Consider transforming one or both variables. Take a look at residual vs. fitted value plots. Assess whether the model's assumptions are satisfied. You might run an ANOVA or look at a coefficient p-value to compare two models that you are interested in. Perhaps calculate measures of model fit like R-squared, adjusted R-squared, BIC, or AIC.
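Two of those checks can be sketched in R: a nested-model ANOVA and a residual vs. fitted plot. The data here are simulated for illustration (the true relationship involves a log, which the plot should reveal as curvature).

```r
# Compare a linear fit to one that adds a log term, then inspect residuals.
set.seed(5)
x <- runif(100, 1, 10)
y <- 5 + 2 * log(x) + rnorm(100, sd = 0.5)

linear <- lm(y ~ x)
logfit <- lm(y ~ x + I(log(x)))   # nests the linear model

anova(linear, logfit)             # F-test: does the log term improve the fit?

plot(fitted(linear), resid(linear),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Curvature here suggests trying a transformation")
```

The ANOVA is valid here because the second model contains every term of the first; for non-nested comparisons you would lean on AIC/BIC or cross-validation instead.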

Question 11: Should you run an algorithm to consider all possible models and choose the one that is best on some criterion, like adjusted R-squared, BIC, or AIC?

Show answer

Searching over every possible model is called "best subset selection" (the greedy version, adding or dropping one term at a time, is "stepwise selection"). This is generally not a good idea, because the search ignores the context in a way that can make the model less interpretable. For example, if your data set includes the two highly correlated variables "gender" and "hair length" as possible predictors for income, an automated search might choose to include "hair length" instead of "gender" if that reduces the residuals a tiny, tiny bit.
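For reference, R's built-in `step()` function performs exactly this kind of automated search (greedy stepwise, scored by AIC). The sketch below uses simulated data and is shown only to illustrate the mechanism, not to endorse it.

```r
# Backward stepwise selection by AIC with step(); data simulated.
set.seed(6)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 * d$x1 + rnorm(n)     # only x1 truly matters

full   <- lm(y ~ x1 + x2 + x3, data = d)
chosen <- step(full, direction = "backward", trace = 0)
formula(chosen)   # keeps whatever minimizes AIC,
                  # with no regard for subject-matter context
```

Here the search will almost certainly keep `x1` and drop the noise terms, but with correlated predictors (as in the gender/hair-length example above) it can just as mechanically keep the wrong one.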

You did it! Hooray!