Data

To collect data for our predictor, we primarily used sports-reference.com, which provides player statistics, game outcomes, and more. Its data sets can be downloaded as csv files, which are easy to load into MATLAB. Because the predictor reads csv files, it can also be run on a different year simply by loading a different data set. For these reasons, all of the data we used for this project is in csv format.


To find the basketball data we used from sports-reference.com, go to the following link:

www.sports-reference.com/cbb/

We also found some relevant data (BPI RANK, BPI OFF, BPI DEF) on espn.com. To find this data, go to the following link:

http://www.espn.com/mens-college-basketball/


We also looked at data from other sources. We did not use any data from these sources in our final project, but we think they are good resources for anyone looking for college basketball statistics. The links are below:

hoop-math.com/index.php

https://kenpom.com/


Using the sports-reference.com and espn.com databases, we collected the data described below.

How the Data Was Collected

The data was collected from the websites above. For our code to work, the stats have to be in the same order as in the spreadsheets below, because our code was written around this format. Using a consistent format makes it easy to run data from different years through our predictor. In this format, the teams are also listed so that each consecutive pair of teams plays each other in the first round. We wrote the following MATLAB code to read in the data from sports-reference.com and add it to an input spreadsheet that already has the team names in the correct order.

readSRData.m
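The script linked above is not reproduced here. As a rough illustration of the ordering idea only, the sketch below reorders downloaded stats so that the rows follow a bracket-ordered list of team names, then lists the first-round pairs. All team names, values, and variable names (bracketTeams, csvTeams, csvStats) are made up for the example and are not taken from readSRData.m.

```matlab
% Team names in bracket order (hypothetical). Rows 1-2 play each other,
% rows 3-4 play each other, and so on.
bracketTeams = {'Team A'; 'Team D'; 'Team B'; 'Team C'};

% Names and stats as they might appear in a downloaded csv file
% (alphabetical order, made-up values; one row per team).
csvTeams = {'Team A'; 'Team B'; 'Team C'; 'Team D'};
csvStats = [30 0.45;
            25 0.48;
            28 0.44;
            22 0.41];

% For each bracket team, find its row in the csv data.
[found, idx] = ismember(bracketTeams, csvTeams);
assert(all(found), 'Every bracket team must appear in the csv data.');

% Stats reordered to match the bracket order.
orderedStats = csvStats(idx, :);

% Each row of firstRound holds the row indices of one first-round match-up.
firstRound = reshape(1:numel(bracketTeams), 2, []).';
```

Matching on team names rather than on row position means the csv file can be sorted in any order, as long as the spelling of each name agrees between the two sources.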

Normalizing our Data

One problem we noticed early on was that some statistics, like total points scored, were in the thousands, while others, like the percentages, were between zero and one. This made our early predictors heavily biased towards the larger-magnitude statistics, so we normalized the statistics. In our methods and predictors, we found the maximum value of each stat across all teams and divided each team's value for that stat by that maximum. This puts every stat on a scale from zero to one, with the best-ranked team in that stat at exactly one.
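The normalization described above can be sketched in a few lines of MATLAB. This is a minimal example with made-up numbers, assuming the stats are stored with one row per team and one column per statistic and that every statistic is nonnegative with a positive maximum; the variable names are illustrative and not taken from our project code.

```matlab
% Rows are teams, columns are statistics (made-up values).
% Column 1: total points scored, column 2: field goal percentage.
stats = [2800 0.47;
         2500 0.50;
         2100 0.42];

% Maximum of each column, i.e. of each statistic across all teams.
colMax = max(stats, [], 1);

% Divide every column by its own maximum
% (uses implicit expansion, available in MATLAB R2016b or later).
normStats = stats ./ colMax;
```

After this step, the largest value in every column is one, so a statistic measured in thousands no longer outweighs a percentage.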

High Low stats

When we were collecting data, we realized that for some stats higher is better and for others lower is better. To deal with this, we created a binary vector that records whether each stat is better high or low. This vector was used in all the methods and predictors to ensure stats were not misrepresented. We normalized all of the statistics as explained above. For the low stats, we then subtracted the normalized values from one, so that a team with a lower raw value ends up with a higher score.
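A minimal sketch of the high/low adjustment follows. The values and the names normStats, higherIsBetter, and adjStats are made up for illustration; the stats are assumed to be already normalized to the zero-to-one range.

```matlab
% Already-normalized stats (made-up values), one row per team.
% Column 1: points per game (higher is better).
% Column 2: turnovers per game (lower is better).
normStats = [1.00 0.50;
             0.80 1.00;
             0.60 0.75];

% Binary vector: true where a higher value is better, false where lower is better.
higherIsBetter = logical([1 0]);

% Flip only the "lower is better" columns by subtracting them from one.
adjStats = normStats;
adjStats(:, ~higherIsBetter) = 1 - normStats(:, ~higherIsBetter);
```

In this example the team with the fewest turnovers (row 1) ends up with the highest adjusted turnover score, and the team with the most turnovers (row 2) ends up at zero.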

2019 Data

The spreadsheet below has the stats of all 68 teams in the NCAA tournament after Selection Sunday. It uses the spreadsheet format that our MATLAB code was written to work with.

Data after Selection Sunday

Training Data

As training data sets, we collected the same data for the 2014 to 2018 seasons. To assist with this, we created a MATLAB script that automatically converts the sports-reference.com csv files into spreadsheets in our standard format. The BPI statistics had to be entered manually for each year.

2014 Data
2015 Data
2016 Data
2017 Data
2018 Data

Specialized Data

For specific tasks, we compiled the following specialized data. The spreadsheet below shows the champions from the past five years singled out.

Past 5 Champions

Here is data about how seeds have done against other seeds. This data covers the past five years, so some match-ups never occurred (e.g., 16 vs. 2), and others occurred too infrequently for us to incorporate them into our final predictor. We only used the stats below for the match-ups that occur in the first round.

Seed Facts last 5 years
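As an illustration of how seed-versus-seed records like these could be tallied, here is a small sketch. The game results are made up and the variable names are hypothetical; this is not the code or the data behind the spreadsheet above.

```matlab
% Each row is one game: [winning seed, losing seed] (made-up results).
games = [ 1 16;
          8  9;
          9  8;
          5 12;
         12  5;
          1 16];

% wins(i, j) counts how many times seed i beat seed j.
wins = zeros(16, 16);
for k = 1:size(games, 1)
    w = games(k, 1);
    l = games(k, 2);
    wins(w, l) = wins(w, l) + 1;
end

% Total meetings of each seed pair, in either direction.
played = wins + wins.';

% Fraction of meetings won by seed i over seed j.
% Pairs that never met give 0/0, which MATLAB evaluates to NaN.
winRate = wins ./ played;
```

The NaN entries mark match-ups with no games at all, and the played matrix makes it easy to ignore match-ups with too few games to be meaningful.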