Exploring the Dynamics of College Basketball
Post Date: 11/18/2024
During my undergraduate education, I had the opportunity to work on many projects, both for classes and outside of them. This work was the capstone project for my Bachelor's degree in Applied Mathematics, for which we were given 5 weeks to complete a machine learning project of our choosing. Two of my close friends, Joey Halloran and Henry Waterhouse, and I teamed up to explore how the importance of each variable in predicting end-of-season success changes across seasons, using NCAA Basketball box score data.
To summarize, our data set contains college basketball game data, which we clean and reshape into a data frame of season-cumulative team box score statistics. PCA and clustering methods reveal which features separate teams of similar or differing performance. PCA shows that team performance is centrally distributed, so we perform outlier detection to identify teams with high variation. To see how the importance of box score statistics changes over time, we perform season-by-season regression on various subsets of features and track the feature importance values for each season. Visualizing these per-season feature coefficients reveals that the statistics most important to winning games have remained relatively constant over the last 21 years.
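To make the season-by-season regression idea concrete, here is a minimal sketch, not the code from our project. It assumes synthetic data, three unnamed placeholder features, and a plain least-squares fit on standardized features so that coefficient magnitudes are comparable across seasons. The function name `seasonal_coefficients` and the standardization step are assumptions made for this illustration only.

```python
import numpy as np


def seasonal_coefficients(X, y, seasons):
    """Fit one least-squares regression per season on standardized features.

    X: (n_rows, n_features) team statistics, y: (n_rows,) success measure,
    seasons: (n_rows,) season label for each row.
    Returns a dict mapping season -> coefficient vector.
    """
    coefs = {}
    for season in np.unique(seasons):
        mask = seasons == season
        Xs, ys = X[mask], y[mask]
        # Standardize within the season so coefficients share a common scale.
        Xz = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0)
        # Centering y removes the need for an explicit intercept column.
        yz = ys - ys.mean()
        beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
        coefs[int(season)] = beta
    return coefs


# Synthetic stand-in data: 3 seasons, 50 teams each, 3 hypothetical features.
rng = np.random.default_rng(0)
seasons = np.repeat([2003, 2004, 2005], 50)
X = rng.normal(size=(150, 3))
true_beta = np.array([2.0, -1.0, 0.0])  # assumed effect sizes, illustration only
y = X @ true_beta + 0.1 * rng.normal(size=150)

coefs = seasonal_coefficients(X, y, seasons)
for season, beta in coefs.items():
    print(season, np.round(beta, 2))
```

Plotting each coefficient against the season then gives the kind of trend line described above: a flat line suggests a statistic whose importance is stable over time.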
The paper goes into detail about the data engineering, exploratory data analysis, and machine learning used to answer the question of variable importance dynamics, and shows the dedication and careful thought our group invested in it. The code for this group project can be found here.
I must note that since earning my M.S. degree in Statistics, I've learned that there are a couple of things done here that could be improved! Notably, XGBoost would be a great model to compare against the linear models discussed below, and it can give more reliable estimates of feature importance.