Adam Slivinsky - Dynamics in NCAAB Data

Exploring the Dynamics of College Basketball

Post Date: 11/18/2024

During my undergraduate education, I had the opportunity to work on many projects, both for classes and outside of them. This work was the capstone project for my Bachelor's degree in Applied Mathematics; we were given 5 weeks to complete a machine learning project of our choosing. Two of my close friends, Joey Halloran and Henry Waterhouse, and I teamed up to explore how the importance of variables for predicting end-of-season success changes across seasons, using NCAA Basketball box score data.

To summarize, our data set contains college basketball game data, which we clean and reshape into a data frame of season-cumulative team box score statistics. PCA and clustering methods reveal which features separate teams of similar or differing performance. PCA shows that team performance is centrally distributed, so we perform outlier detection to identify teams with high variation. To see how the importance of box score statistics changes over time, we perform season-by-season regression on various subsets of features and track feature importance values per season. The visualizations of these feature coefficient values per season reveal that the statistics most important to winning games have remained relatively constant over the last 21 years.

The paper goes into detail about the data engineering, exploratory data analysis, and machine learning used to answer the question of variable importance dynamics, and shows the dedication and careful thought our group invested in it. The code for this group project can be found here.

I must note that since getting my M.S. degree in Statistics, I've learned there are a couple of things done here that could be improved! Notably, XGBoost would be a great model to compare against the linear models discussed below, and it can produce more reliable estimates of feature importance.

Exploring the Dynamics of College Basketball.pdf

Next, I include some of the data exploration and results from this work. First, we worked with data from the 2003-2023 NCAAB seasons (including the COVID season). For each team in each season, we calculated the team's cumulative and average box score statistics throughout the season. Initially, one may think to just use a regression model on the home and away teams' game-prior box score averages separately, but doing so ignores the obvious interaction present within the game. So, we calculate the ratio between these two, for example:
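As a rough illustration of the ratio idea, here is a minimal sketch. It is not the project's actual code: the stat names (`fgm`, `ast`, `tov`), the helper functions, the numbers, and the choice of home divided by away are all assumptions made for the example.

```python
from statistics import mean

def prior_averages(games):
    """Average each box score stat over the games a team has played so far."""
    stats = games[0].keys()
    return {s: mean(g[s] for g in games) for s in stats}

def ratio_stats(home_avgs, away_avgs):
    """Divide each home-team average by the matching away-team average."""
    return {s: home_avgs[s] / away_avgs[s] for s in home_avgs}

# Made-up box scores for two hypothetical teams before they meet.
home_games = [{"fgm": 28, "ast": 15, "tov": 12}, {"fgm": 32, "ast": 17, "tov": 10}]
away_games = [{"fgm": 24, "ast": 12, "tov": 11}, {"fgm": 26, "ast": 12, "tov": 11}]

ratios = ratio_stats(prior_averages(home_games), prior_averages(away_games))
print(ratios)
```

A ratio above 1 means the home team has averaged more of that statistic than its opponent going into the game. A real pipeline would also need to handle an opponent average of zero, which this sketch does not.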

We then use these ratio statistics for the rest of the machine learning in the project. We perform PCA on the season-averaged ratio statistics, showing that the first principal component captures the variance between teams of differing performance, while the second principal component captures the variance between teams of similar performance. Overall, the first two principal components capture ~60% of the variance within the data.

We show the loadings of the first two principal components to see which variables contribute to each of them. In the first principal component, all of the statistics that relate to a beneficial outcome for a team, such as field goal makes and assists, have positive loadings; the only negative loadings belong to turnovers and personal fouls, which are detrimental to a team. Principal component 2 is a little harder to interpret: its positive loadings are the two free-throw-related statistics, as well as defensive rebounds, turnovers, and blocks.
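For readers who want to see how explained variance and loadings come out of PCA, here is a minimal NumPy sketch. The data is synthetic and the five feature names are placeholders, so the printed numbers will not match the ~60% figure or the loadings described above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the team-season table: 200 rows, 5 ratio statistics.
feature_names = ["fgm", "ast", "dreb", "tov", "pf"]
X = rng.normal(loc=1.0, scale=0.1, size=(200, 5))

# Standardize each column so no statistic dominates because of its scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: the rows of Vt are the principal directions (loadings).
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)   # share of variance per component
scores = Z @ Vt.T                 # team-season coordinates in PC space

print("variance captured by PC1 + PC2:", explained[:2].sum())
for name, l1, l2 in zip(feature_names, Vt[0], Vt[1]):
    print(f"{name:>5}  PC1={l1:+.2f}  PC2={l2:+.2f}")
```

The overall sign of each component is arbitrary, so only the relative signs of the loadings within a component carry meaning (for example, turnovers pointing the opposite way from field goal makes).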

The PCA plot shows teams binned by number of wins (0-5 wins, 6-10 wins, ...).

Although reducing the dimensionality of the data with PCA results in one centrally distributed clump, we can use the ratio statistics (RATStats) to predict end-of-season outcomes quite well. We develop a few models that predict the number of games won at the end of the season by regressing on all of the RATStat features: vanilla Linear Regression, LASSO, Ridge, and Elastic Net Regression.

We train each of the four models on the seasons individually, instead of jointly. Since each of the covariates is standardized, we can interpret the magnitudes of the fitted model coefficients as relative feature importance. To restrict feature importance to between 0 and 1, we divide each season's feature importance values by their sum. This allows us to compare feature importance values between seasons, as the method and scale are the same for all seasons of data.
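The per-season procedure described above can be sketched as follows. This is an illustration only: it uses a closed-form ridge fit in NumPy rather than whatever library the project used, the penalty `alpha=1.0` is arbitrary, and the seasons are filled with synthetic data.

```python
import numpy as np

def ridge_coefficients(X, y, alpha=1.0):
    """Closed-form ridge fit on standardized covariates and a centered target."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + alpha * np.eye(p), Z.T @ yc)

def relative_importance(coefs):
    """Scale absolute coefficients so they sum to 1 within a season."""
    magnitudes = np.abs(coefs)
    return magnitudes / magnitudes.sum()

rng = np.random.default_rng(0)
importance_by_season = {}
for season in (2021, 2022, 2023):
    # Synthetic stand-in: 300 teams, 4 ratio statistics, wins driven by the first two.
    X = rng.normal(size=(300, 4))
    wins = 15 + 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=300)
    importance_by_season[season] = relative_importance(ridge_coefficients(X, wins))

for season, imp in importance_by_season.items():
    print(season, np.round(imp, 3))
```

Because every season goes through the same standardization and normalization, the resulting rows can be plotted against each other as one line per feature.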

This plot shows us that the ratio of home and away average score is one of the best predictors when large fitted coefficients are not penalized. Ridge and Elastic Net, by comparison, show a more consistent, recognizable trend in how feature importance changes season by season. These results tell us that field goals made, defensive rebounds, turnovers, and free throws made are some of the most important predictors of end-of-season success.

We can also repeat this method while performing feature selection. We found that the best-performing model with only three covariates included free throws made, field goals made, and 3pt field goals made (all scoring statistics), and that the importance of these three stayed relatively constant over the 21-year span. This model had only slightly greater error than the model that included all covariates.
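The post does not say which feature selection procedure was used, so the following is only one possible approach: an exhaustive search over every three-covariate subset, scored by in-sample error of an ordinary least squares fit. The data is synthetic and the column names are placeholders.

```python
from itertools import combinations
import numpy as np

def ols_rmse(X, y):
    """Root-mean-square error of an ordinary least squares fit with intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((y - A @ beta) ** 2)))

def best_subset(X, y, names, k=3):
    """Try every k-column subset and return the one with the lowest RMSE."""
    best = min(combinations(range(X.shape[1]), k),
               key=lambda cols: ols_rmse(X[:, list(cols)], y))
    return [names[c] for c in best], ols_rmse(X[:, list(best)], y)

rng = np.random.default_rng(0)
names = ["ftm", "fgm", "fg3m", "ast", "tov", "pf"]
X = rng.normal(size=(300, len(names)))
# Synthetic target that depends only on the first three columns.
y = 3 * X[:, 0] + 5 * X[:, 1] + 2 * X[:, 2] + rng.normal(size=300)

subset, subset_rmse = best_subset(X, y, names)
full_rmse = ols_rmse(X, y)
print(subset, round(subset_rmse, 3), round(full_rmse, 3))
```

In-sample error can never be lower for a subset than for the full model, so a fair comparison between the two would score them on held-out games or seasons instead.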

My group started with a hypothesis: Steph Curry changed the game in the 2010s by revolutionizing 3pt shooting. Can we see Steph Curry's effect on 3pt shooting within this set of NCAAB box score data?

From this plot, the evidence points to the conclusion that there is no such effect present. The importance of 3pt shooting relative to both 2pt field goals and free throws has not changed significantly, which was surprising.

As basketball fans, we expected to see dramatic changes in the importance of different stats over time, given how different modern-day basketball appears to be from basketball even 10 years ago. Our linear models fit to box score statistics told a different story: the importance of certain fundamental stats has remained relatively constant over the years.