Let's play Moneyball and use some statistics and regression to see what best predicts how many runs a team scores. You will make a model based on the 2023 season and then test that on the data from the 2024 season to see if you picked the right predictors.
The goal of this exercise is to learn how to make a more complicated model than a simple linear regression.
Load this dataset, which has stats for all 30 MLB teams from the 2023 season. Statistics included are runs, at bats, hits, homeruns, batting average, strikeouts, stolen bases, wins, OBP, slugging percentage and on base slugging. The last three are the main ones pointed out by the book/movie Moneyball as supposedly better predictors of success than the traditional variables.
Let's test the relationship between at bats and runs. Fit a linear model, here are two different syntaxes for it - one we haven't seen yet. Going forward, we will use the second one - basically, it helps us avoid having to write mlb23$ at the start of each variable name.
model1 <- lm(mlb23$runs ~ mlb23$atBats)
model1 <- lm(runs ~ atBats, data=mlb23)
Remember that you read this as "runs BASED ON ~ atBats, so y=runs, x=atBats. Call up summary() on your model to check it out. Look at the slope, r^2 value. Plot() the variables and the abline(). Comment on how appropriate and strong you think this model is.
Also call plot() on the model to check the first two graphs given. The residuals should be randomly distributed with no apparent pattern, and should be normally distributed in terms of their distance away. Comment on the residuals.
What if, instead of just including one of these predictors, we used multiple variables? So, instead of predicting it just based off of at bats, we did it off of at bats AND homeruns.
We can do that, and R can do that! It will take each variable and weight it appropriately to find the best combination model! Here is an example of one of those models.
model2 <- lm(runs ~ atBats + homeruns, data=mlb23)
summary(model2)
Can you write out the equation that you're given? A clue is that it should be in the form of y = m1*x1 + m2*x2 + b. Check with someone before you continue.
Use your equation to predict the runs for the Nationals by plugging in their at bats and homeruns (just go look in the dataset). The actual value is 700 and the answer you should get from the model the number 705.4 (this means that the residual is -5.4 right?).
*CONFIRM THIS NUMBER FOR YOURSELF BEFORE YOU MOVE ON*
Okay, so we can't look at a scatterplot of these variables because there are 3!! So we have to look at three things to help us.
Look at the summary again. Based on the r^2, this model of the two variables is better than either on its own. The r^2 value for atBats was 0.3791 and for homeruns was 0.5657. The combined model r^2 is 0.7266.
Plot the model and make sure the residuals are good! If they aren't, even the best r^2 isn't all that helpful.
plot(model2)
Another way to see that it's pretty good is to plot the predicted values against the actual values, and then plot y=x (y=0+1*x, which is in abline below). This helps us see where the values should be if the prediction was perfect.
plot(mlb23$runs,predict(model2))
abline(0,1, col="blue")
Okay you've seen one better model that uses two variables... Find a combined model that's even better than the one I showed you. Make sure for each one that you:
Check diagnostics by running plot(model#)
Plot the predictions against the actual values to see how close they are.
Save each model that you use with info about it in this file so you can look at your work and compare.
By the way, you can include as many variables as you want... just add more + signs! e.g. lm(thing ~ thing1 + thing2 + thing3 + thing4, data=mlb23)
We will test your model against data from the 2024 season to see who made the best predictor!
# base code for making models with multiple linear regression
model% <- lm(runs ~ thing1 + thing2 + ..., data=mlb23)
summary(model%)
plot(mlb23$runs,predict(model%))
plot(model%)
abline(0,1, col="blue")