Data

To collect data for our predictor, we primarily used sports-reference.com, which has player statistics, game outcomes, and more, and also offers data sets for download as CSV files. CSV files are easy to load in Matlab, and they let us run our predictor on a different year simply by loading a different data set.
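As a rough illustration of how one season's CSV could be loaded and swapped for another, here is a minimal sketch. It is written in Python rather than Matlab, and the column name `School`, the stat columns, and the sample rows are all made-up assumptions, not the actual layout of the sports-reference.com files.

```python
import csv
import io

def load_team_stats(csv_file):
    """Read one season's CSV into {team name: {stat name: float}}.

    Assumes a header row with a 'School' column (hypothetical name);
    every other column is treated as a numeric stat.
    """
    teams = {}
    for row in csv.DictReader(csv_file):
        name = row.pop("School")
        teams[name] = {stat: float(value) for stat, value in row.items()}
    return teams

# Made-up sample standing in for a downloaded season file.
sample_2019 = io.StringIO(
    "School,FT%,PPG\n"
    "Team A,0.75,80.5\n"
    "Team B,0.68,72.1\n"
)
stats_2019 = load_team_stats(sample_2019)
```

Running the same function on a different season would only require passing a different file object, for example one opened from a hypothetical `2018.csv`.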

www.sports-reference.com/cbb/

One of the first things we did with the data below was compute a cosine similarity, similar to what was done in Homework 2. For each team in this year's tournament, we used the cosine similarity to find the angle between that team's stats and the stats of each champion from the past 5 years, then averaged the 5 angles; a smaller average angle means the team is more similar to past champions. Because the stats are on different scales (some are percentages and some are raw counts), we found the max of each stat and divided the stat by that value before comparing. This way all the values were weighted about the same and no single stat was given a clear advantage. This was the plot we got when plotting the angles.

The ten teams most similar to previous champions were:

  1. North Carolina
  2. Kansas
  3. Houston
  4. Florida State
  5. Buffalo
  6. Iowa State
  7. Tennessee
  8. Mississippi St
  9. Purdue
  10. LSU
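The cosine-similarity comparison described above can be sketched as follows. This is a Python illustration, not our Matlab code. The stat values and team names are made up, and taking each stat's max over teams and champions together is an assumption about how the normalization was applied.

```python
import math

def normalize_by_max(rows):
    """Divide each stat column by its maximum so all stats share a similar scale."""
    n_stats = len(rows[0])
    maxima = [max(row[j] for row in rows) for j in range(n_stats)]
    return [[row[j] / maxima[j] for j in range(n_stats)] for row in rows]

def angle_between(u, v):
    """Angle in degrees between two stat vectors, from their cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.degrees(math.acos(cosine))

def mean_angle_to_champions(team, champions):
    """Average of the angles between one team and each past champion."""
    return sum(angle_between(team, c) for c in champions) / len(champions)

# Made-up stats: [free throw %, points per game, rebounds per game]
champions = [[0.74, 82.0, 38.0], [0.71, 79.0, 40.0]]
teams = {"Team A": [0.75, 81.0, 39.0], "Team B": [0.50, 80.0, 25.0]}

# Assumption: normalize teams and champions together so they share column maxima.
all_rows = normalize_by_max(champions + list(teams.values()))
champ_rows = all_rows[:len(champions)]
team_rows = dict(zip(teams, all_rows[len(champions):]))

# Smaller mean angle = more similar to past champions.
ranking = sorted(
    team_rows,
    key=lambda name: mean_angle_to_champions(team_rows[name], champ_rows),
)
```

Sorting by the mean angle in ascending order gives a ranking like the top-ten list above, with the most champion-like team first.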

We have also started doing logistic regression on individual stats. We are looking at individual stats of teams from the 2018 bracket and seeing how those teams fared. We plan on using the last 5 years of data, but started with just 2018 for now. We plotted each team's value for a given stat on the x-axis and, on the y-axis, whether the team made it to the second round (1) or lost its first game (0). The two plots below are of free throw percentage, which did not seem to have an obvious trend, and Strength of Record, which did show some pattern: teams with a lower Strength of Record, which is considered better, generally advanced to the second round. There were some outliers; if you look at the graph, you can probably guess which dot is Virginia (who became the first one seed to lose to a sixteen seed). As we continue, we plan on comparing specific matchups. For example, we will check how much of a difference there was between UMBC and Virginia in free throw percentage and see if there appears to be a relationship. For instance, if a team had a free throw percentage more than 10% better than its opponent's, did it generally win?
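A single-stat logistic regression of the kind described above can be sketched as follows. This is a Python illustration using plain gradient descent, which is not necessarily the fitting method used in our Matlab work. The Strength of Record values and outcomes are made up, including one top-ranked team that lost its first game.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit P(advance) = sigmoid(w * x + b) by gradient descent on the log loss.

    Returns a function mapping a raw stat value to a predicted probability.
    """
    mean = sum(xs) / len(xs)
    spread = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    zs = [(x - mean) / spread for x in xs]  # standardize for stable step sizes
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = [sigmoid(w * z + b) - y for z, y in zip(zs, ys)]
        w -= learning_rate * sum(e * z for e, z in zip(errors, zs)) / len(zs)
        b -= learning_rate * sum(errors) / len(zs)
    return lambda x: sigmoid(w * (x - mean) / spread + b)

# Made-up data: Strength of Record rank (lower is better) and
# whether the team reached the second round (1) or lost its first game (0).
sor_rank = [1, 5, 10, 15, 20, 30, 40, 50]
advanced = [0, 1, 1, 1, 1, 0, 0, 0]

predict = fit_logistic(sor_rank, advanced)
p_strong = predict(10)  # estimated chance of advancing for a well-ranked team
p_weak = predict(45)    # estimated chance for a poorly ranked team
```

With data shaped like this, the fitted curve slopes downward, so a better (lower) Strength of Record gets a higher estimated chance of advancing, while the outlier at rank 1 pulls the curve only slightly.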

Some other data sources that we found, but are not using as heavily, are listed below.

hoop-math.com/index.php

https://kenpom.com/

http://www.espn.com/mens-college-basketball/

Using the sources above, we compiled the following spreadsheet of data for teams in the 2019 March Madness tournament.

Preliminary Stats

The preliminary stats are the stats of the teams we thought would be in the tournament, gathered before Selection Sunday.

Data after Selection Sunday

These are the stats of all 68 teams in the NCAA tournament after Selection Sunday.

Data From 2018

This is the data for the teams in last year's (2018) tournament.

Past 5 Champions

This is the data for the past five NCAA tournament champions.

Seed Facts last 5 years

Here is data about how seeds have done against other seeds over the past five years.