Another aspect of our project was analyzing which statistics were relevant to the NFL and which were relevant to college football. Using machine learning and regression models, we were able to determine which statistics were significant in each year. With four different model selection methods, we identified the most important predictors of the outcome of both NFL and college games. From the 2000-2001 season through the 2013-2014 season, some statistics impacted the game much more than others, and some were irrelevant. For example, in the NFL, time of possession, pass completions, and rush attempts mattered in some or most years. However, there was no season in which every single statistic was an important factor in a team winning. For college football, we were given only 7 statistics: fumbles, passing attempts, passing completions, passing interceptions, passing yards, rushing attempts, and rushing yards. All of these measures mattered in a team’s success.
An example of how the models were constructed, using the year 2000, is provided below. All other models were constructed in a similar fashion.
For the year 2000:
Using an all-subsets method of model selection, we were able to find the predictors in the 2000 data set that were most important in determining the outcome of a game. We compared candidate models using Mallows' Cp and adjusted R-squared, the latter of which measures how much of the variance in the data a model explains. To confirm our model, we also ran backward elimination, forward selection, and stepwise selection, all of which produced the same result. The all-subsets method runs through every possible model that can be built from the 12 NFL predictors provided and compares the strength of each model. After obtaining a model, we also performed a residual analysis to verify that the model satisfies the conditions for a linear model.
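To make the all-subsets idea concrete, here is a minimal sketch in plain Python. It is not the code used for this project: the data are synthetic, only four of the predictor names from our model are reused as labels, and the helper functions `ols_rss` and `all_subsets` are hypothetical names introduced for illustration. It fits every non-empty subset of predictors and scores each one with Mallows' Cp and adjusted R-squared.

```python
import random
from itertools import combinations


def ols_rss(rows, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, x))) ** 2
               for x, yi in zip(X, y))


def all_subsets(data, y, names):
    """Score every non-empty predictor subset: (names, Cp, adjusted R^2)."""
    n, m = len(y), len(names)
    ybar = sum(y) / n
    tss = sum((v - ybar) ** 2 for v in y)
    # Error variance is estimated from the full model, as Mallows' Cp requires.
    sigma2 = ols_rss(data, y) / (n - m - 1)
    results = []
    for size in range(1, m + 1):
        for subset in combinations(range(m), size):
            rows = [[r[j] for j in subset] for r in data]
            rss = ols_rss(rows, y)
            p = size + 1  # number of coefficients, including the intercept
            cp = rss / sigma2 - n + 2 * p
            adj_r2 = 1 - (rss / (n - p)) / (tss / (n - 1))
            results.append(([names[j] for j in subset], cp, adj_r2))
    return results


# Synthetic data (an assumption for illustration, not the NFL data set):
# only the first and third predictors actually drive the response.
rng = random.Random(0)
names = ["FirstDownDiff", "RushYdsDiff", "PassYdsDiff", "PenYdsDiff"]
data = [[rng.gauss(0, 1) for _ in names] for _ in range(60)]
score_diff = [3 * r[0] - 2 * r[2] + rng.gauss(0, 0.5) for r in data]

results = all_subsets(data, score_diff, names)
best = min(results, key=lambda r: r[1])  # lowest Mallows' Cp
print(best)
```

With 12 predictors the same loop would visit 4,095 subsets, which is still cheap. In practice a statistics package would do this search far more efficiently than the hand-rolled solver above.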
All Subsets Method Result:
As we can see, for the year 2000, the best model had a Mallows' Cp of 13 and an adjusted R-squared of 83%. A residual analysis of the model is provided below to support the strength of the model.
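For reference, the two criteria are usually defined as follows. The notation is ours and is not taken from the software output: $n$ is the number of games, $p$ is the number of coefficients in the candidate model including the intercept, $\mathrm{RSS}_p$ is its residual sum of squares, $\mathrm{TSS}$ is the total sum of squares, and $\hat{\sigma}^2$ is the mean squared error of the full model with all predictors.

```latex
C_p = \frac{\mathrm{RSS}_p}{\hat{\sigma}^2} - n + 2p,
\qquad
R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{RSS}_p / (n - p)}{\mathrm{TSS} / (n - 1)}
```

A model with little bias tends to have a $C_p$ close to $p$, and a higher adjusted R-squared means more variance explained after penalizing for the number of predictors.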
The model that was produced is below:
ScoreDiff ~ FirstDownDiff + RushYdsDiff + PassAttDiff + PassYdsDiff + PassIntDiff + FumblesDiff + SackYdsDiff + PenYdsDiff + ThirdDownPctDiff + TimePossDiff
Residual Analysis:
As we can see, the residual analysis supports the conclusion that the conditions for a linear model are reasonably met and that this model explains the variance in score differential well.
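The kind of check a residual analysis performs can be sketched as follows. This is a simplified, single-predictor illustration on synthetic data, not the analysis run on our model. The function names `fit_line` and `residual_checks` are hypothetical. The sketch fits a line, computes the residuals, and compares their spread for low and high fitted values, where a ratio far from 1 would hint at non-constant variance.

```python
import random
from statistics import mean, pstdev


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    xbar, ybar = mean(x), mean(y)
    slope = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
             / sum((a - xbar) ** 2 for a in x))
    return ybar - slope * xbar, slope


def residual_checks(x, y):
    """Return residuals plus two simple numeric diagnostics."""
    intercept, slope = fit_line(x, y)
    fitted = [intercept + slope * a for a in x]
    resid = [b - f for b, f in zip(y, fitted)]
    # Split residuals by low vs. high fitted value to compare their spread.
    order = sorted(range(len(x)), key=lambda i: fitted[i])
    half = len(order) // 2
    low = [resid[i] for i in order[:half]]
    high = [resid[i] for i in order[half:]]
    return {
        "resid": resid,
        "fitted": fitted,
        "mean_residual": mean(resid),            # sanity check: about zero
        "spread_ratio": pstdev(high) / pstdev(low),  # near 1 if variance is constant
    }


# Synthetic data (an assumption for illustration): a yardage differential
# with a linear effect on score differential and constant-variance noise.
rng = random.Random(1)
yds_diff = [rng.gauss(0, 100) for _ in range(200)]
score_diff = [0.05 * v + rng.gauss(0, 7) for v in yds_diff]

checks = residual_checks(yds_diff, score_diff)
print(round(checks["spread_ratio"], 2))
```

Numeric summaries like these complement, rather than replace, the usual residual-versus-fitted and normal quantile plots.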
To explore further, let's look at each of the 12 individual predictors plotted against score differential to gauge their importance.
As we can see, certain predictors are strongly correlated with score differential. Pass completion differential, interestingly, does not appear to be linearly correlated with winning, and as a result it has not been a very significant predictor of success over the years.
Note: While some predictors that were left out of the model may appear to be linearly correlated with winning, they were left out because they are highly collinear with the other predictors.
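One common way to quantify multicollinearity is the variance inflation factor (VIF), which regresses each predictor on all the others and reports 1 / (1 - R-squared). The sketch below is an illustration under stated assumptions, not the procedure used for our models: the data are synthetic and deliberately built so that two predictors move together, `PassCompDiff` is a hypothetical column name, and `r_squared` and `vif` are helper names introduced here.

```python
import random


def r_squared(rows, y):
    """R-squared of a least-squares fit of y on rows, with an intercept."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, x))) ** 2
              for x, yi in zip(X, y))
    ybar = sum(y) / len(y)
    tss = sum((v - ybar) ** 2 for v in y)
    return 1 - rss / tss


def vif(data, names):
    """Variance inflation factor for each column of data."""
    out = {}
    for j, name in enumerate(names):
        target = [row[j] for row in data]
        others = [[v for i, v in enumerate(row) if i != j] for row in data]
        out[name] = 1.0 / (1.0 - r_squared(others, target))
    return out


# Synthetic data (an assumption for illustration): completions are built
# to track attempts closely, while fumbles are independent of both.
rng = random.Random(2)
names = ["PassAttDiff", "PassCompDiff", "FumblesDiff"]
data = []
for _ in range(200):
    att = rng.gauss(0, 8)
    comp = 0.6 * att + rng.gauss(0, 1)
    data.append([att, comp, rng.gauss(0, 1)])

vifs = vif(data, names)
for name in names:
    print(name, round(vifs[name], 1))
```

A frequently cited rule of thumb treats a VIF above roughly 5 to 10 as a sign that a predictor is largely redundant given the others, which is the situation in which a selection procedure may drop it even though it looks correlated with the response on its own.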
Below is a data graphic showing which predictors were significant in each year from 2000 to 2013. As we can see, almost every predictor was significant in every year, with the exceptions of pass completions, penalty yards, and rush attempts. This suggests that football is a complementary game in which nearly every aspect of play contributes to winning.
The same analysis that was done on the NFL teams was also done on the college data sets. For every year from 2000 to 2013, however, every predictor in college football was significant in predicting the success of a team. This is shown below: