Methods

Cosine Similarity

One of the first things we did with the data described below was to compute a cosine similarity, similar to what was done in Homework 2. The data was first normalized so that every statistic was weighted equally and none had an outsized influence. We then took each team in this year's tournament and, using the cosine similarity, computed the angle between that team's statistics and those of each champion from the past five years. After comparing a team to each of the five champions, we averaged those five angles. Plotting the average angles produced the figure below; a sketch of the computation appears after the list.

The ten teams most similar to the previous champions were:

  1. North Carolina
  2. Kansas
  3. Houston
  4. Florida State
  5. Buffalo
  6. Iowa State
  7. Tennessee
  8. Mississippi St
  9. Purdue
  10. LSU
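
Below is a minimal MATLAB sketch of this computation. The variable names (stats for this year's tournament teams, champs for the past five champions) and the matrix layout are assumptions for illustration, not our exact code.

    % Normalize every stat (column) so no stat dominates the angle.
    allData = [stats; champs];              % teams-by-stats matrices
    mu = mean(allData);  sigma = std(allData);
    statsN  = (stats  - mu) ./ sigma;
    champsN = (champs - mu) ./ sigma;

    nTeams = size(statsN, 1);
    avgAngle = zeros(nTeams, 1);
    for i = 1:nTeams
        angles = zeros(5, 1);
        for j = 1:5
            c = dot(statsN(i,:), champsN(j,:)) / ...
                (norm(statsN(i,:)) * norm(champsN(j,:)));
            angles(j) = acosd(c);           % angle in degrees
        end
        avgAngle(i) = mean(angles);         % average over the 5 champions
    end
    % Smaller average angles mean a team looks more like past champions.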

Feature Selection

The purpose of feature selection was to see which statistics are most relevant in predicting future success. We started by measuring the success rate of each statistic over the last five years: for a given statistic, whichever team had the better value was predicted to win, and an entire bracket was filled out using that statistic alone. Comparing each of those single-statistic brackets to the actual results of the past five brackets gave the success rate for that statistic. The success rates varied significantly, from about 40% to almost 80%. In doing this we noticed an odd occurrence: minutes played and field goals were by far the best statistics at predicting success. At this point we realized the data we had gathered from the past five years was biased. It was collected after the tournament, so the further a team went in the tournament, the higher its minutes played. To correct for this, minutes played and field goals were ignored, since the result was clearly a data-bias issue rather than evidence that those statistics were truly that important.
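
As a rough illustration, the per-game core of this test looks like the MATLAB snippet below; games and winnerIsFirst are placeholder names, and for brevity this scores each actual matchup directly rather than propagating a full bracket.

    % games:         nGames-by-2, each team's value of one stat
    % winnerIsFirst: logical vector, true if the first team actually won
    predictFirst = games(:,1) > games(:,2);            % higher stat wins
    successRate  = mean(predictFirst == winnerIsFirst);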

The next step in feature selection was to apply it round by round. We wanted to see whether some statistics were more successful in earlier rounds and whether others were better at predicting the later rounds. The method for determining each statistic's round-by-round success rate was exactly the same as the overall feature selection, just applied one round at a time. The results showed that after the third round there was a significant drop-off in success rate for all statistics compared to the earlier rounds. Seed, for example, was about 74% successful in the earlier rounds, but after the third round it dropped to 50%. This drop was common to most statistics, meaning that after the third round you had about the same chance of predicting the correct outcome as you would flipping a coin.

The third step was using pairs of statistics to predict the outcomes of the games. The method was similar to single-statistic feature selection: if a team was higher in both statistics, it was predicted to win; if each team was higher in one statistic, the game was a wash and counted as a loss. Using this method on the past five years gave a success rate for each pair. The results resembled the single-statistic selection in that no pair stood out significantly. The highest success rate was 80%, which came from minutes played and field goals; since that data was biased, those pairs were ignored.
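
A sketch of the pair rule in the same style (again with placeholder names); note that a split pair counts against the pick:

    % games1, games2: nGames-by-2 values of the two stats in the pair
    t1Both   = games1(:,1) > games1(:,2) & games2(:,1) > games2(:,2);
    t2Both   = games1(:,2) > games1(:,1) & games2(:,2) > games2(:,1);
    correct  = (t1Both & winnerIsFirst) | (t2Both & ~winnerIsFirst);
    pairRate = mean(correct);   % splits are never correct (counted as losses)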

The final step was pair feature selection round by round, using the same method as the overall pair feature selection but applied to one round at a time. The results were encouraging at first: multiple pairs were in the 80% range during the first round. The pairs fell off each round, however, and by the fourth round the results were poor. Overall, round-by-round pair feature selection behaved much like single-statistic feature selection: both started off very well but fell off drastically as the rounds went on.

Supervised PCA

Supervised principal component analysis (PCA) is a method for quantifying how the variables in a data set correlate with an outcome. It transforms possibly correlated data into linearly uncorrelated components, using SVD to find the principal components of the data. To run supervised PCA, we used a MATLAB function posted on the MathWorks File Exchange by Gen Li.

We used supervised PCA to create a continuous set of weights for all of the statistics we collected for each team. We ran a normalized set of the data from the last five years through the supervised PCA function; the data was normalized in the way explained on the Data page and then centered on 0, which the supervised PCA function requires. After some post-processing, the output showed which statistics are correlated with going far in the NCAA tournament. This correlated output was then combined into a single weight for how important each statistic is. The weights are shown in the heatmap below.
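
A hedged sketch of this pipeline follows. The function name SupPCA and its signature merely stand in for Gen Li's File Exchange code, and the final combination rule is illustrative rather than our exact post-processing.

    % X: teams-by-stats data, y: how far each team advanced, r: number
    % of components to keep (all names are placeholders).
    Xn = (X - mean(X)) ./ std(X);   % normalize each stat and center on 0
    [B, V] = SupPCA(y, Xn, r);      % hypothetical call; V holds the loadings
    scores = Xn * V;                % component scores
    varExp = var(scores) / sum(var(scores));
    w = V * varExp';                % combine loadings into one weight per stat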

Across all of the years, seed is the most important statistic in predicting how far a team will go in the tournament. The other statistics vary significantly from year to year in how important they are. Since we ran and processed five years of data through the supervised PCA function, we had five sets of weights. These were then averaged to create the one set of weights for the predictor to use, which is shown below.

After averaging the weights, seed is by far the most important statistic, with adjusted offensive rating (Adj ORtg) and adjusted defensive rating (Adj DRtg) the next two most important. The rest of the statistics are all slightly above or below zero. Statistics that are below zero take away from a team's score in our predictor. This only means that these statistics are bad predictors of a team's success, since the statistics are normalized to account for whether more or less of each is better.
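
Concretely, the averaged weights turn into a predictor score as a weighted sum (our notation for the description above):

    score_team = sum over statistics j of w_j * x_team,j

where w_j is the averaged weight for statistic j and x_team,j is the team's normalized value of that statistic, so a negative w_j subtracts from a team's score.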

Binomial Logistic Regression

Before exploring lasso logistic regression, we first explored binomial logistic regression. We would later find that this method was not best suited for our project, although it seemed promising at first. In binomial logistic regression, the goal is to find predictors that have statistical significance in predicting the outcome of a binary situation. This seems like an ideal statistical analysis technique for us, as basketball games have inherently binary outcomes (there are no ties in Division I college basketball). An example of binomial logistic regression can be found below.

In binomial logistic regression, we are attempting to fit a curve to the graph shown here. We could use a simple y = mx + b equation (a straight line) to describe the data, but that will not do the job sufficiently. Instead, we want to fit an "s" curve to the data, as it more accurately represents the data-predictor relationship. This gives rise to binomial logistic regression, more commonly known as logistic regression. To create this s-curve, we use the following equation: p(x) = 1 / (1 + e^-(B_0 + B_1*x)). We then convert this nonlinear s-curve relationship into a linear one so we can estimate it much like linear regression, using what in statistics is called the logit: logit(p) = ln(p / (1 - p)). A predictor with a low p-value is likely to be a meaningful addition to the model, because small changes in that predictor's value are related to significant changes in the response variable (winning or losing). Using the MATLAB function for this model (glmfit with distribution 'binomial'), you can obtain a vector of p-values, which you can then use to weight the data. In our test run to see whether this model would work, we used (1 - p) for each predictor as its weight. To run the model in MATLAB, you must separate each round into its own choice problem, as the rounds are not enough alike to group them together and the MATLAB function only works with one vector of outcomes.
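
A minimal sketch of this fit in MATLAB follows. The design matrix here (one row per game, holding the stat differences between the two teams) is an assumed layout for illustration; glmfit and its outputs are the standard Statistics Toolbox interface.

    % X:   nGames-by-nStats matrix of predictors (assumed layout)
    % won: 0/1 vector of outcomes for one round
    [b, dev, stats] = glmfit(X, won, 'binomial');  % logit link by default
    pvals   = stats.p(2:end);  % p-value per predictor (first entry is the intercept)
    weights = 1 - pvals;       % the (1 - p) weighting we tried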

We had some trouble implementing this model in the later rounds: there were not enough successes (games), and some coefficients became infinite. To work around this, we used the p-values from the first round on the first round and the p-values from the second round on all remaining rounds. This resulted in a very sub-optimal predictor, as you can see on the predictors page. Because we could not obtain p-values for the later rounds, we needed a slightly different method that would not have this problem: lasso logistic regression.

Lasso Logistic Regression

Lasso regression is typically used to combat over-fitting of the training data. This is particularly helpful for us, as over-fitting led to binomial logistic regression not being a workable solution: it would fit infinite beta values in the later rounds because there was not enough data (successes) on which to evaluate the model. In lasso regression, we solve a problem very similar to logistic regression, except that a regularization term is added to keep the parameter values small (in an effort to keep the model from over-fitting). The optimization problem for lasso logistic regression is below.
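
Written out in the notation of the logistic regression equations above, this problem takes the standard penalized form for a binomial model (as lassoglm solves it):

    minimize over B_0, B:
        (1/N) * sum_i [ -y_i*(B_0 + x_i . B) + ln(1 + e^(B_0 + x_i . B)) ]
        + lambda * sum_j |B_j|

where the first term is the usual logistic regression error and the second is the penalty that shrinks the B_j weights toward 0.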

In lasso regression, lambda can theoretically be any value from 0 to positive infinity, though we hope the chosen value is very small. The lambda value is chosen using a process called cross-validation, which has its own problem (how many cross-validation folds to include). Lambda in the above equation is a penalty that pushes the beta weights toward 0: lasso tries to keep the model's prediction error as small as possible while still keeping the parameter values small.

To solve this problem, the correct lambda value needs to be found, and for that we use k-fold cross-validation. This method randomly splits the data set into K partitions. For each k = 1, 2, ..., K, we fit the model with a given lambda to the other K - 1 parts and compute its error in predicting the kth part. This gives the cross-validation error, which is computed for many values of lambda; the lambda that makes the CV error smallest is chosen. Below are some pictures that may help explain k-fold cross-validation. Note that changing the number of folds K can give you a different lambda, so the model needs to be run several times with different K values to find the lambda that best describes your data set. Currently we are using K = 10; we chose this value by running the model various times, and a very reasonable set of prediction values came out of the 10-fold cross-validation. Once the correct lambda value is found, the model finds the beta value for each statistic that predicts the outcomes best.

To implement this in MATLAB, we use the function lassoglm, passing in the data, the successes, the distribution ('binomial'), the number of lambda values to try, and the k for k-fold cross-validation. Successes have to be supplied round by round, since there is only one outcome vector and the rounds are different enough to be treated as their own problems. The biggest constraint on the number of lambdas is run time: at 25 lambdas, the simulation currently takes about 25 minutes. After running, lassoglm outputs the regularized coefficients, and we use these coefficients as the weights for our predictor. You can see which teams the model predicts to win by using the glmval MATLAB function; we did this in our code, but it is not useful for weighting the statistics, only for seeing how well the predicted weights do on the model.
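
A sketch of that call (variable names are placeholders; one fit is run per round):

    [B, FitInfo] = lassoglm(X, won, 'binomial', ...
                            'NumLambda', 25, ...  % size of the lambda grid
                            'CV', 10);            % 10-fold cross-validation
    idx = FitInfo.IndexMinDeviance;    % lambda with the smallest CV error
    w   = B(:, idx);                   % regularized coefficients -> weights
    b0  = FitInfo.Intercept(idx);
    % Optional check: predicted win probabilities on a hypothetical
    % matrix Xnew of new matchups.
    p = glmval([b0; w], Xnew, 'logit');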

Using lasso regression, we were able to come up with a good predictor from the coefficients. We did not include the counting stats (minutes played, total field goals, etc.) in this model, since we believe our data from previous years was biased by including tournament games: across the board, the previous year's champion would have higher counting stats, which would bias the predictor toward it. To avoid this bias, the counting stats were omitted.

Here, we can see the visualization of splitting the data into k parts. In this figure, k = 5: one subset is held out and a lambda is fitted to the other four. That process is repeated so that each of the five subsets is held out once.

This is an intermediary equation that calculates the E_k error. This error is used to determine the cross-validation error below.

The equation for cross-validation error. The model chooses the lambda that minimizes this value.
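
In the plain notation used earlier, the two quantities in these figures take the standard form (squared error shown for illustration; for a logistic model, lassoglm minimizes the binomial deviance instead):

    E_k(lambda) = sum over games i in fold k of ( y_i - f_(-k)(x_i; lambda) )^2

    CV(lambda) = (1/K) * [ E_1(lambda) + E_2(lambda) + ... + E_K(lambda) ]

where f_(-k) is the model fit with penalty lambda on the other K - 1 folds.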

Putting it all together

For this predictor, we tried to combine all of the previous methods into one method that we hoped would be the best. To do this, we looked at the scores of the three best brackets: lasso, supervised PCA, and single-stat feature selection.

For the combined predictor, we took the scores that each method gave individual teams within its bracket. Those scores were normalized for each method so that no method would be weighted more than we intended. In some of the predictors the scores were both positive and negative, so we divided each round's scores for a method by the largest absolute score in that round. For instance, if round 3 of lasso had a -50 term as its largest magnitude, we would divide all of the round-3 lasso scores by 50. This process resulted in scores between -1 and 1 for all methods. We also weighted the scores based on how each individual bracket did, meaning the lasso scores were weighted higher than the supervised PCA scores; supervised PCA and single-stat feature selection were weighted about the same, based on their own brackets' success.
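
A sketch of the combination in MATLAB (all names and the method weights shown are placeholders, not our exact values):

    % Each method's scores: teams-by-rounds matrices.
    methods  = {lassoScores, spcaScores, featSelScores};
    mWeights = [0.5, 0.25, 0.25];          % illustrative bracket-based weights
    combined = zeros(size(lassoScores));
    for m = 1:numel(methods)
        S = methods{m};
        for r = 1:size(S, 2)
            S(:, r) = S(:, r) / max(abs(S(:, r)));  % scale round r to [-1, 1]
        end
        combined = combined + mWeights(m) * S;
    end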

To ensure that the predictor didn't simply pick all of the favorites in the first round, we looked at seed results from the last five years and found the percentage of first-round games that each seed won. That percentage was multiplied by four, since there are four of each matchup in the first round, and we made sure each seed matchup's predicted wins were within one of this historical expectation. If they were not within one, the closest-scoring game in that matchup would be found and its outcome changed, since that game is the most likely to be an upset.
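
A rough sketch of that check (hypothetical names throughout; a real pass might loop until every matchup is within one of its expectation):

    % histRate(s): fraction of first-round games the higher seed won in
    %   seed matchup s (1v16, 2v15, ..., 8v9) over the last five years.
    % margin(s,g): predictor's score margin for game g of matchup s
    %   (positive means the higher seed is picked).
    expected = histRate * 4;                % 4 games per matchup
    for s = 1:8
        picks = sum(margin(s, :) > 0);      % higher-seed wins we predicted
        if abs(picks - expected(s)) > 1
            [~, g] = min(abs(margin(s, :)));   % closest game = likeliest upset
            margin(s, g) = -margin(s, g);      % flip its outcome
        end
    end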