Current Status

The first step in this project was collecting the data. We waited until Selection Sunday so that we would know the 68 teams in the tournament. Once the field was finalized, we recorded all the data we could find on each team in large spreadsheets, covering everything from wins and losses to pace of play. After collecting the data on this year's teams, we collected the same data on the past 5 tournament champions. From there, we gathered the data on all the teams from the 2018 NCAA tournament. The last large chunk of data we collected was how seeds fared against other seeds, taken from the previous 5 tournaments.

The next step was creating and setting up the website. We divided the website into tabs so that it is easy to navigate. From there, we added pictures and backgrounds to the website to make it appealing, and then transferred the data from the spreadsheets into the website. The website is still a work in progress, as the methods and MATLAB code still need to be added. The final edits and touch-ups will happen after the data analysis methods are complete.

Methods and data analysis done:

  1. Made a cosine similarity function to compare this year's teams to the past 5 champions. It is described further in the data section
  2. Created MATLAB code to convert a csv file to a tournament bracket image
  3. Created a way to represent the tournament bracket in a csv file
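
As a rough illustration of the cosine similarity comparison in item 1, here is a minimal Python sketch (our project code is in MATLAB, so this is only a stand-in). The team names, the stat values, and the helper `rank_by_similarity` are all made up for the example, and it assumes every team's stats are listed in the same order and on comparable scales.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length stat vectors."""
    if len(a) != len(b):
        raise ValueError("stat vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def rank_by_similarity(teams, champion):
    """Return (team name, similarity) pairs, most similar first."""
    scores = [(name, cosine_similarity(stats, champion))
              for name, stats in teams.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up placeholder stats (not real data), already scaled to 0..1.
teams = {
    "Team A": [0.80, 0.72, 0.68],
    "Team B": [0.55, 0.70, 0.90],
}
champion = [0.85, 0.75, 0.70]

ranking = rank_by_similarity(teams, champion)
for name, score in ranking:
    print(name, round(score, 4))
```

Because cosine similarity only looks at the direction of the vectors, a stat with much larger raw values (such as total points) would dominate the result unless the stats are rescaled first.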

Challenges:

  1. Finding which stats from the large spreadsheets are the most relevant in predicting who wins
  2. Data collection and conversion from websites to csv files for previous years' tournaments
  3. Removing all biases from what is currently happening in the tournament

To overcome challenge 1, we intend to combine several approaches: logistic regression (more on this in the data and plan sections), more rigorous statistical analysis, comparing the stats of winners and losers in specific matchups (e.g. Michigan vs. Montana in 2018) to see which stats might be significant in a 1v1 setting, and perhaps methods we have not thought of yet.
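
The winner-versus-loser comparison could look something like the Python sketch below. It covers only the matchup-difference idea, not the logistic regression, and the stat names (`pace`, `fg_pct`) and all numbers are made-up placeholders rather than our collected data.

```python
from statistics import mean

def mean_stat_differences(matchups, stat_names):
    """Average (winner - loser) difference for each stat across games."""
    diffs = {}
    for stat in stat_names:
        diffs[stat] = mean(
            game["winner"][stat] - game["loser"][stat] for game in matchups
        )
    return diffs

# Made-up placeholder games (not real results).
matchups = [
    {"winner": {"pace": 68.0, "fg_pct": 0.47},
     "loser": {"pace": 70.0, "fg_pct": 0.44}},
    {"winner": {"pace": 66.0, "fg_pct": 0.49},
     "loser": {"pace": 66.0, "fg_pct": 0.45}},
]

diffs = mean_stat_differences(matchups, ["pace", "fg_pct"])
for stat, value in diffs.items():
    print(stat, round(value, 3))
```

A stat whose average difference stays close to zero over many games is probably a weak predictor on its own. The per-game differences could also serve as inputs to the logistic regression, though that is one possible design and not something we have settled on.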

To overcome challenge 2, we intend to write MATLAB code (or find another good program or method) that takes the relevant columns from the csv files downloaded from Sports Reference and combines them into one csv file. We will then input this correctly formatted csv file into our general project MATLAB code. Keeping every csv file in the same format keeps everything organized, and comparing team data is much easier when all of the stats are present and in the same order.
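
One way the column selection and merging could work is sketched below in Python, as a stand-in for the MATLAB code we plan to write. The column names (`School`, `W`, `L`, `Pace`, `SRS`), the added `Season` column, and the sample rows are assumptions for the example and may not match the real downloads.

```python
import csv
import io

def select_columns(csv_text, columns):
    """Read CSV text and keep only the named columns, in the given order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in columns if c not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return [{c: row[c] for c in columns} for row in reader]

def combine(sources, columns):
    """Merge several CSV texts into one CSV string with a shared header."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["Season"] + columns,
                            lineterminator="\n")
    writer.writeheader()
    for season, text in sources.items():
        for row in select_columns(text, columns):
            writer.writerow({"Season": season, **row})
    return out.getvalue()

# Made-up placeholder files: note the different column order and extra column.
csv_2017 = "School,W,L,Pace\nTeam A,30,5,68.1\n"
csv_2018 = "School,Pace,L,W,SRS\nTeam B,70.2,7,28,15.3\n"

combined = combine({"2017": csv_2017, "2018": csv_2018},
                   ["School", "W", "L", "Pace"])
print(combined)
```

Looking columns up by header name rather than by position means the files can arrive with columns in any order, and a file that lacks a required stat fails loudly instead of silently shifting values.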

To overcome challenge 3, we plan to build our predictor mostly from previous tournament data. We will then test the predictor on other previous years' tournaments and optimize it there. Only when our project is essentially finished do we intend to use it to try to predict this year's results, hopefully eliminating the effect of bias from this year's tournament on our predictor.
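
The testing step could be organized roughly as in the Python sketch below. The predictor shown (`pick_better_seed`, which always picks the better seed) is only a placeholder for whatever predictor we end up with, and the games are made up.

```python
def accuracy(predict, games):
    """Fraction of games where predict(a, b) names the actual winner."""
    correct = sum(1 for g in games if predict(g["a"], g["b"]) == g["winner"])
    return correct / len(games)

def backtest(predict, games_by_season, holdout_seasons):
    """Score a predictor only on seasons that were held out from tuning."""
    return {s: accuracy(predict, games_by_season[s]) for s in holdout_seasons}

def pick_better_seed(a, b):
    """Placeholder predictor: choose the team with the lower seed number."""
    return a["name"] if a["seed"] <= b["seed"] else b["name"]

# Made-up placeholder games (not real results).
games_by_season = {
    "2017": [
        {"a": {"name": "Team A", "seed": 1},
         "b": {"name": "Team B", "seed": 16}, "winner": "Team A"},
    ],
    "2018": [
        {"a": {"name": "Team C", "seed": 1},
         "b": {"name": "Team D", "seed": 16}, "winner": "Team D"},
        {"a": {"name": "Team E", "seed": 2},
         "b": {"name": "Team F", "seed": 15}, "winner": "Team E"},
    ],
}

results = backtest(pick_better_seed, games_by_season, holdout_seasons=["2018"])
print(results)
```

Keeping the seasons used for tuning separate from the seasons used for scoring gives a fairer estimate of how the predictor would do on a tournament it has never seen, which is the situation it will face this year.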