Develop a model to predict the outcome of the NFL playoffs based solely on each team's regular season on-field play without simulation.
Opponent Adjusted Win-Loss Record: My personal approach for crediting teams with win value based on the quality of the opponent at the time they played and the quality of the win. I've found this to be more stable and predictive than raw wins, adjusted wins that count one-score games as 0.5 wins, and Pythagorean wins. You can find these values in my Roster Strength shiny app. I also passed in features for the rolling average of each team's previous 5 games.
Opponent Adjusted EPA Values: These are EPA per play values for offense and defense, adjusted for the opponent faced and for game script, scaling down plays at very high or very low win probability since I've found those to be less predictive. Additionally, these values weight recent games more heavily when projecting a team forward, since Week 1's results obviously matter, but not as much as Week 17's when projecting the playoffs. You can find these values and further explanation on my Season EPA Metrics page. I also passed in features for the rolling average of each team's previous 5 games.
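To make the rolling and recency-weighted features concrete, here is a minimal sketch in pandas. The file, column names, and decay factor are hypothetical stand-ins, not my actual pipeline.

```python
# Minimal sketch of the rolling 5-game and recency-weighted features.
# File name, column names, and the decay factor are hypothetical.
import pandas as pd

games = pd.read_csv("team_game_metrics.csv")  # one row per team-game (hypothetical file)
games = games.sort_values(["season", "team", "week"])
grouped = games.groupby(["season", "team"])

# Rolling average of each team's previous 5 games (current game excluded)
games["adj_off_epa_roll5"] = grouped["adj_off_epa"].transform(
    lambda s: s.shift(1).rolling(5, min_periods=1).mean()
)

# Recency weighting: an exponentially weighted mean up-weights later weeks
games["adj_off_epa_recency"] = grouped["adj_off_epa"].transform(
    lambda s: s.ewm(halflife=4).mean()
)
```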
Roster Strength: These are Wins Above Replacement values generated from Pro Football Focus grading, based on a linear regression of team-level grading facets against Pythagorean wins. The facet weights are determined by the beta coefficients of that model, and each player's grades are then scaled by the logarithmic difference from a replacement-level player and by how many snaps that player had in a game. For this analysis I utilized the aggregated values for the entire non-quarterback roster, overall defense, overall offense, coverage, pass rush, run defense, pass blocking, run blocking, offensive line, receiving, and rushing. Additionally, I converted these to a per-play level so that teams that ran fewer plays wouldn't be penalized, and so the model wouldn't unintentionally receive extra information about the number of snaps played. You can find these values on my Roster Strength Metrics page, and you can find more about PFF's WAR metrics on their page.
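As a concrete example of the per-play conversion, here is a minimal sketch; the file and column names are hypothetical placeholders.

```python
# Minimal sketch of converting roster WAR facet totals to a per-play basis.
# File and column names are hypothetical placeholders.
import pandas as pd

rosters = pd.read_csv("team_war_facets.csv")  # one row per team-season
facet_cols = [
    "non_qb_war", "defense_war", "offense_war", "coverage_war", "pass_rush_war",
    "run_defense_war", "pass_block_war", "run_block_war", "oline_war",
    "receiving_war", "rushing_war",
]

for col in facet_cols:
    # Divide by total plays so snap volume doesn't leak into the features
    rosters[f"{col}_per_play"] = rosters[col] / rosters["total_plays"]
```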
Bayesian Quarterback Metrics: These are my composite metrics for quarterbacks based on PFF grading and efficiency in different facets of play. They break down into in-structure (Floor) and out-of-structure (Ceiling) components, as well as PFF WAR for quarterbacks and my own combination of these metrics into a single QB value metric. You can find more about these values on my Quarterback Bayesian Updating page.
I needed to ensure there was no data leakage from any actual playoff values, so I recreated all of the above features from regular season values only. Obviously data leakage directly erodes any confidence you should have in a model's ability, since the model effectively has answers to a test it shouldn't have seen.
I didn't want the raw magnitude of the values to affect the model's ability to predict the outcomes, since each team only has to play teams within its own season, so I scaled each value within each season to be between 0 and 1, meaning the best team in a given metric each season gets a 1. I also wasn't interested at this point in directly comparing across seasons to see which teams were rated better; predicting each season was more important to me.
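A minimal sketch of that within-season scaling is below, assuming a data frame with one row per team-season and a season column; the function and column names are mine for illustration, not from the actual codebase.

```python
# Minimal sketch of scaling each feature to 0-1 within each season.
import pandas as pd

def scale_within_season(df: pd.DataFrame, feature_cols: list[str]) -> pd.DataFrame:
    scaled = df.copy()
    for col in feature_cols:
        season_min = scaled.groupby("season")[col].transform("min")
        season_max = scaled.groupby("season")[col].transform("max")
        # Best team in a given season and metric gets 1, worst gets 0
        scaled[col] = (scaled[col] - season_min) / (season_max - season_min)
    return scaled
```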
An issue with trying to train a model to predict the playoff outcomes is that only one team gets to win the Superbowl each season, and I only have all of the above features from 2011 to the present. To counteract this class imbalance issue I utilized SMOTE (Synthetic Minority Oversampling Technique) to generate synthetic data representative of the minority classes and increase the sample size. I did this until each of the minority playoff classes (Championship Loser, Superbowl Loser, Superbowl Winner) had the same number of observations as the Divisional Round Losers, which ended up being 52 observations per class. There were already plenty of non-playoff observations since most teams don't make the playoffs each season.
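A minimal sketch of that SMOTE step with imbalanced-learn might look like the following, assuming X (features) and y (the ordinal playoff outcome described below) are already built; the random seed and variable names are placeholders.

```python
# Minimal sketch of oversampling the rare playoff outcomes with SMOTE.
# Assumes X (features) and y (ordinal playoff outcome) already exist.
from imblearn.over_sampling import SMOTE

# Bring the three rarest classes up to the Divisional Round Loser count (52);
# classes not listed keep their original number of observations.
target_counts = {3: 52, 4: 52, 5: 52}  # Championship Loser, Superbowl Loser, Superbowl Winner
smote = SMOTE(sampling_strategy=target_counts, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
```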
Another aspect of this model is that it doesn't know anything about the individual teams or the structure of the playoffs itself. For example, some teams might be predicted as a team that should make and lose in the Divisional Round when in reality they didn't make the playoffs at all. It also doesn't know that only one team per conference can make the Superbowl. This was intentional, so that the predicted outcomes are a more accurate representation of team strength going into the playoffs. There are plenty of really good simulations out there that can tell you those probabilities, but that wasn't what I was interested in measuring here.
I utilized an XGBoost regression model with an objective that minimizes the RMSE between the prediction and the actual playoff result encoded as an ordinal value (SB Win = 5, SB Loss = 4, Champ Loss = 3, etc.). I also used a grid search for the best hyperparameters with leave-one-season-out cross validation and implemented monotonic constraints on the features to avoid strange over-fitting issues. The final values were scaled to be between 0 and 1 for plotting purposes.
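For reference, a minimal sketch of that setup with xgboost and scikit-learn could look like the following; X_train, y_train, and the seasons array (one season label per row) are assumed to exist, and the parameter grid is illustrative rather than my exact search space.

```python
# Minimal sketch: XGBoost regression on the ordinal playoff outcome with
# monotone constraints and a leave-one-season-out grid search.
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut
from xgboost import XGBRegressor

n_features = X_train.shape[1]
model = XGBRegressor(
    objective="reg:squarederror",  # squared error, i.e. minimizing RMSE
    monotone_constraints="(" + ",".join(["1"] * n_features) + ")",  # all features increasing
)

param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 300, 500],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}

# Each cross validation fold holds out one full season via the seasons labels
search = GridSearchCV(
    model, param_grid,
    cv=LeaveOneGroupOut(),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train, groups=seasons)
print(search.best_params_)
```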
An issue with utilizing such a model on data like this is that it still has a tendency to overfit to the given data. To counteract this I used a variety of techniques, including cross validation, subsampling of both rows and features for each tree, regularization, and random dropout, until I had a model that was directionally accurate, better than a simple linear combination of the features, and able to generalize to new values.
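To illustrate, the grid above could be extended with XGBoost's regularization terms, and random dropout of trees is available through the DART booster; the specific values here are placeholders rather than my tuned settings.

```python
# Illustrative regularization settings (values are placeholders)
param_grid.update({
    "reg_alpha": [0, 0.1, 1.0],   # L1 penalty on leaf weights
    "reg_lambda": [1.0, 5.0],     # L2 penalty on leaf weights
})

# Random dropout of trees during boosting via XGBoost's DART booster
dart_model = XGBRegressor(booster="dart", rate_drop=0.1, skip_drop=0.5)
```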
In the plots below you can see a sample of the model output with the 2022 season predictions, the feature importance and SHAP plots, and the predicted values versus the actual outcomes.
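For anyone wanting to reproduce similar plots, here is a minimal sketch using the shap package and xgboost's built-in importance plot, assuming best_model is the fitted regressor from the grid search and X is the feature matrix.

```python
# Minimal sketch of the feature importance and SHAP summary plots.
# Assumes best_model is the fitted XGBRegressor and X is the feature matrix.
import shap
from xgboost import plot_importance

plot_importance(best_model)            # split-frequency feature importance

explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)      # SHAP value distribution per feature
```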
Best Hyperparameters: {'colsample_bytree': 1.0, 'learning_rate': 0.01, 'max_depth': 7, 'n_estimators': 300, 'subsample': 0.8, 'monotone_constraints': '(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)'}
Error metrics: Test MSE: 1.06, Test RMSE: 1.03, Test MAE: 0.71, Test R²: 0.60
Overall I am very pleased with the outcome of this model. I feel that it does a good job of representing team strength in a way that matches the eye test.
I want to note that I used a regression model instead of a classification model because I found it more intuitive to see teams represented on a continuous scale and to use the relative predictions to gauge how confident the model is about a specific team. This makes it easier to display teams in relative tiers, and it was especially useful since the model doesn't know anything about the structure of the playoffs, so it matters less whether the model predicts a specific team one way or the other for a specific playoff spot.
Biggest surprises with this model: The 2021 Cincinnati Bengals had a very surprising playoff run all the way to the Superbowl, given that their regular season results produced a lower projection. The 2018 Pittsburgh Steelers were one of the aforementioned teams that received a relatively high projection yet didn't make the playoffs at all.
Outside of the top two features (Offensive EPA and Adjusted Win-Loss Record), the features the model found most useful surprised me. The fact that quarterback PFF WAR was the third-most-used feature, and ranked that much higher than my composite metrics, is fascinating to me, since by the eye test I've found it to be less stable and accurate; it appears to do a better job of identifying who is playing better in a small sample. Finally, the fact that defensive roster talent, overall roster talent, defensive EPA, and run-blocking talent were the next most used features suggests the old adage of "play defense and run the ball" may be more accurate than I would've expected. Admittedly, those features are still well behind offensive EPA, so offense is clearly still more valuable and predictive.