With the new data I had collected, I needed a way to use all of it to determine which variables have the biggest impact on predicting the winners of NFL games. The best way to do this was through a statistical technique called multiple regression analysis.
Multiple regression analysis is a statistical technique used to examine the relationship between a dependent variable and two or more independent variables. My goal in using it was to estimate the strength and direction of the relationship between the dependent variable (win percentage) and each independent variable (the many variables I thought affected winning), while controlling for the effects of all the other independent variables. This is useful for understanding how multiple variables jointly influence the outcome of interest. The output of a multiple regression analysis typically includes a coefficient for each independent variable, which represents the degree of influence that variable has on the dependent variable, as well as statistical tests of whether each coefficient is significant. Out of around 45 variables, the 10 below turned out to be the most significant in predicting the winner of an NFL game. Seven of them are significant at the 5% level (marked * or **), and two more at the 10% level (marked with a period).
Coefficients:
                          Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)             -29.610043   10.885056   -2.720   0.00652  **
away_moneyline           -0.007687    0.003247   -2.367   0.01791  *
home_moneyline           -0.009711    0.003453   -2.813   0.00492  **
over_odds                -0.003332    0.002271   -1.467   0.14242
away_yds                  0.011401    0.005037    2.263   0.02362  *
away_yds_allowed          0.012542    0.005416    2.316   0.02058  *
home_pts                  0.676400    0.384368    1.760   0.07845  .
home_yds                 -0.083224    0.043007   -1.935   0.05298  .
home_yds_allowed          0.042849    0.020484    2.092   0.03645  *
home_yds_per_pt           2.191215    0.993454    2.206   0.02741  *
home_yds_allowed_per_pt  -0.924352    0.448847   -2.059   0.03946  *
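The last two columns of the table follow directly from the first two: the z value is the estimate divided by its standard error, and Pr(>|z|) is the two-sided tail probability of a standard normal distribution at that z. As a rough check, the sketch below recomputes both for three rows of the table using only Python's standard library. The helper name `wald_test` is made up for this illustration and does not come from any statistics package.

```python
import math

def wald_test(estimate, std_error):
    """Return (z, two-sided p-value) for one coefficient."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# (estimate, standard error) pairs copied from the coefficient table.
rows = {
    "(Intercept)": (-29.610043, 10.885056),
    "away_moneyline": (-0.007687, 0.003247),
    "home_yds_per_pt": (2.191215, 0.993454),
}

for name, (est, se) in rows.items():
    z, p = wald_test(est, se)
    print(f"{name:<16} z = {z:6.3f}  p = {p:.5f}")
```

The printed values should agree with the table up to rounding, which is a simple way to confirm that a pasted table has not been garbled.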
In multiple regression analysis, it is common to divide the dataset into two subsets: a training set and a testing set. The training set is used to estimate the regression coefficients: the model is fit by finding the coefficient values that minimize the error between the predicted and actual values of the dependent variable in the training set. The goal, however, is a model that generalizes well to new data, which is where the testing set comes in. The testing set consists of data that was not used to develop the model. The model predicts the dependent variable for the testing set, and those predictions are compared with the actual values. Performance is then summarized with a metric such as R-squared or, when the prediction is a winner as it is here, the share of games predicted correctly.
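One common way to make such a split is to shuffle the games and hold some of them out. The sketch below is illustrative only: it assumes a random split with a fixed seed and uses placeholder game IDs, which may differ from how the games in this project were actually divided. The function name `split_games` is hypothetical.

```python
import random

def split_games(game_ids, n_train, seed=0):
    """Shuffle the game IDs and return (training IDs, testing IDs)."""
    shuffled = list(game_ids)
    # A seeded generator makes the split reproducible (the seed is arbitrary).
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# 285 placeholder game IDs: 204 for training plus 81 for testing.
game_ids = range(285)
train_ids, test_ids = split_games(game_ids, n_train=204)
print(len(train_ids), len(test_ids))
```

Because every game lands in exactly one of the two lists, the model is never scored on a game it was fit to.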
Training Set:
             Away Wins Predicted   Home Wins Predicted   Total
Away Wins                     66                    27      93
Home Wins                     31                    80     111
Total                         97                   107     204

Testing Set:
             Away Wins Predicted   Home Wins Predicted   Total
Away Wins                     33                    12      45
Home Wins                     17                    19      36
Total                         50                    31      81
According to my model, I would have correctly predicted around 64% of the games it had not seen from the 2021-2022 season, including the postseason (52 of the 81 testing-set games).
In the training sample of 204 games from the 2021-2022 season, away teams won 45.59% (93/204) of the time. Of those 93 away wins, my model correctly predicted 66, or 70.97%. Across the 204 games, my model predicted that the away team would win 97 times and was correct 66 times, which is 68.04%. It predicted that the home team would win 107 times and was correct 80 times, which is 74.77%.
Next, I used a testing data set to assess the performance of the model built on the training set. The testing set consists of the other 81 games, which were not used in the training set. This allows me to understand how well the model performs on new data.
Out of the 81 new games, my model predicted the home team would win 31 times and was correct 19 times, which is 61.29%.
The model predicted the away team would win 50 times and was correct 33 times, which is 66%.
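The percentages above all come from the four cells of each table, so they can be recomputed in a few lines. The sketch below does that for both tables; the function name `summarize` and its argument names are made up for this illustration, with each argument named as actual result followed by predicted result.

```python
def summarize(away_as_away, away_as_home, home_as_away, home_as_home):
    """Return accuracy and per-pick hit rates from a 2x2 table of counts."""
    total = away_as_away + away_as_home + home_as_away + home_as_home
    return {
        # Share of all games where the predicted winner was the actual winner.
        "accuracy": (away_as_away + home_as_home) / total,
        # Share of "away team wins" picks that were correct.
        "away_pick_hit_rate": away_as_away / (away_as_away + home_as_away),
        # Share of "home team wins" picks that were correct.
        "home_pick_hit_rate": home_as_home / (away_as_home + home_as_home),
    }

# Counts copied from the training and testing tables.
training = summarize(66, 27, 31, 80)
testing = summarize(33, 12, 17, 19)

for label, result in (("training", training), ("testing", testing)):
    print(label, {k: round(v, 4) for k, v in result.items()})
```

Overall accuracy works out to 146/204 (about 71.6%) on the training set and 52/81 (about 64.2%) on the testing set. A drop like this is expected, since the coefficients were chosen to fit the training games.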
In total, my model predicted the winners of the testing-set games around 64% of the time. Given that I had only around 7 months to research, collect data, and learn how to build a model that could represent all of it, the model performed well. For comparison, the two professional models I wanted to pattern mine after have been in development for over 10 years and are constantly being updated. NFELO and 538 have prediction percentages of 65.5% and 66.2%, respectively.