What We Learned

DSP Techniques from In and Out of Class

From inside of class:

  • Cosine Similarity - A method used to see how similar two vectors are
  • SVD (singular value decomposition) - The core computation in supervised PCA, used to find the principal components of the data
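
As a rough illustration of the cosine similarity measure listed above, here is a minimal Python sketch using only the standard library. The function name and the example stat vectors are hypothetical and are not taken from our project code.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical, made-up stat vectors for two teams (illustration only).
team_a = [0.71, 0.36, 0.48]
team_b = [0.68, 0.39, 0.45]
similarity = cosine_similarity(team_a, team_b)
```

A value near 1 means the two vectors point in nearly the same direction, while a value near 0 means they are close to orthogonal.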

From outside of class:

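  • Binomial Logistic Regression - Estimating the parameters of a logistic model for a two-outcome response
  • Lasso Logistic Regression - A regularized form of binomial logistic regression that uses an L1 penalty to shrink coefficients, setting some of them to exactly zero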
  • Feature Selection - A method used to find out which stats are relevant in predicting future success
  • Supervised PCA - A method for finding correlations within the data set
  • Computer Vision Toolbox - insertText command to write text onto an image
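
To make the lasso logistic regression entry above more concrete, here is a small Python sketch that fits an L1-penalized logistic model by proximal gradient descent (soft-thresholding after each gradient step). This is not the implementation used in our project. The function names, the penalty `lam`, the step size `lr`, the iteration count, and the toy data are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(x, t):
    """Shrink x toward zero by t; values within [-t, t] become exactly 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def fit_lasso_logistic(X, y, lam=0.05, lr=0.5, n_iter=500):
    """Fit weights w and intercept b; the L1 penalty applies to w only."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # Gradient step on the logistic loss, then the L1 proximal step.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy data (made up): feature 0 separates the classes, feature 1 is
# a +1/-1 value that carries no information about the label.
X = [[0.1, 1], [0.1, -1], [0.2, 1], [0.2, -1],
     [0.8, 1], [0.8, -1], [0.9, 1], [0.9, -1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_lasso_logistic(X, y)
```

On this toy data the weight on the uninformative feature stays at exactly zero while the weight on the informative feature becomes positive, which is the coefficient-zeroing behavior that makes the lasso useful alongside feature selection.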

Lessons Learned

Overall, we learned that it is very hard to predict the NCAA tournament. There is a reason it is called March Madness. Our best predictor would have placed only in the 97th percentile of all brackets submitted on ESPN this year. We recommend using our project as a starting point for your future March Madness brackets and adjusting the resulting picks as you see fit. After trying various statistical analysis techniques, we determined that lasso logistic regression was the best at predicting the outcomes of the tournament. Even an ultimate predictor that combined many statistical analysis techniques did not outperform lasso logistic regression. We believe that there are likely better methods for predicting March Madness (and better ways to create a combined predictor), but that such methods would not do much better than what we have found here, because there is a lot of randomness inherent in the outcomes of single basketball games.

Along the way, we learned many things about statistical analysis, including how to normalize our data, how to use different models, and how some models are not ideal for the task at hand. When normalizing our data, we chose to divide each value by the maximum value in order to get a value between zero and one. We also looked into normalizing the data by taking the norm of each statistic vector and dividing every value by that norm. We recommend that anyone trying to improve upon our project try this normalization technique and see how it fares. In our analysis, we used only the last five years of data, as we believed this is when the game of basketball underwent its "three-point revolution". We recognize this may not have been the best solution, as the game is constantly changing. We recommend that anyone trying to improve upon our project consider using more years of data while weighting the newest data most heavily. The justification is that, as the game changes, the year whose style of basketball is closest to that of the current year is the previous year. To implement this, we suggest something like a hyperbolic discounting model (while this model is typically a behavioral economics analysis tool, the way it discounts past years could be very relevant here).
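
The two normalization options and the recency weighting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, "the norm" is assumed to be the Euclidean (L2) norm, max normalization assumes non-negative statistics, and the weight uses the standard hyperbolic discount form 1 / (1 + k * age) with a discount rate `k` that would need to be tuned.

```python
import math

def max_normalize(values):
    """Divide by the maximum; gives values in [0, 1] for non-negative data."""
    peak = max(values)
    if peak == 0:
        raise ValueError("cannot normalize when the maximum is zero")
    return [v / peak for v in values]

def l2_normalize(values):
    """Divide by the Euclidean norm (an assumed choice of norm)."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in values]

def hyperbolic_weights(seasons, current_season, k=1.0):
    """Weight each past season by 1 / (1 + k * age), scaled to sum to 1.

    age is the number of years between a season and current_season,
    and k is an assumed discount rate (larger k favors recent seasons).
    """
    raw = [1.0 / (1.0 + k * (current_season - s)) for s in seasons]
    total = sum(raw)
    return [r / total for r in raw]

# Made-up example: one statistic across three teams, and three past seasons.
scaled_by_max = max_normalize([2.0, 4.0, 8.0])
scaled_by_norm = l2_normalize([3.0, 4.0])
weights = hyperbolic_weights([2015, 2016, 2017], current_season=2018)
```

The resulting weights could then be used as per-sample weights when fitting a model, so that games from the most recent season count the most.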