The NBAGOP

By Reece Clark, Tyler Dearman, & Will Jeffries


Image Source: https://www.montanasports.com/high-school-sports/boys-basketball/2019-20-basketball-all-conference-all-state-selections

Is it possible to determine the outcome of a game before it happens? As it turns out, sometimes! Using data and statistics for every player in the NBA from the last 24 seasons, we trained a neural network whose model predicted game outcomes with 59% accuracy, statistically distinct from the 50% accuracy of random guessing. But to know how we did it, you'll need to know about neural networks first.

Why is this important?

This program has uses not only for predicting the outcome of NBA games but also for predicting the outcome of games in other sports, provided one is willing to collect the data for those sports. In addition, the NBAGOP has applications in recruitment and general management: it lets a coach run simulations based on the addition of a player or the improvement of a stat for an existing player, and then focus on the strategies that will best help a player contribute to the team. Fans can use the model to gain a slight edge over others in betting. Perhaps most importantly, this is a useful machine learning tool for students learning the subject.

The Data

We collected our data from the NBA website into a series of text files, which are read into the program. Specifically, we used data under the columns marked "AGE," "GP," "W," "L," "MIN," "OFFRTG," "DEFRTG," "NETRTG," "AST%," "AST/TO," "AST Ratio," "OREB%," "DREB%," "REB%," "TO Ratio," "eFG%," "TS%," "USG%," "PACE," and "PI." The beauty of neural networks is that we (and you) don't need to know what these data are -- the computer figures out which of them will matter! We also collected the outcome of each basketball game for the past 24 seasons from the NBA. After reading them into the program, we arranged the stats into a 4-dimensional array, called a tensor: one dimension representing a season, one representing a distinct pair of teams, one representing a player, and one representing a player's statistics. The game outcomes went into a 2-dimensional tensor, with one dimension representing information such as teams, season, and winner, and the other representing the game in question. We then sliced the tensors up into a training set, a validation set, and a test set. The first 11 seasons' worth of data and games went into the training set, the next 5 went toward validation, and the remaining 8 went into the test set. You'll see below what these sets mean.
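To make the split concrete, here is a minimal sketch of how the slicing might look in NumPy. The file names, array shapes, and the assumption that the season index sits in the first column of the outcomes tensor are illustrative, not taken from our actual code.

import numpy as np

# Hypothetical files and shapes, for illustration only.
# stats:    (seasons, games, players, stats_per_player) -- the 4-D tensor
# outcomes: (games, info) -- season, teams, winner for each game
stats = np.load('stats.npy')
outcomes = np.load('outcomes.npy')

# Slice along the season axis: 11 seasons for training, 5 for
# validation, and the remaining 8 for testing.
train_stats = stats[:11]
val_stats = stats[11:16]
test_stats = stats[16:24]

# Outcomes are filtered by their season column (assumed here to be column 0).
season = outcomes[:, 0]
train_games = outcomes[season < 11]
val_games = outcomes[(season >= 11) & (season < 16)]
test_games = outcomes[season >= 16]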

To do math on our data, everything had to be numerical, so we produced lists of every player and every team and replaced those elements of the tensors with integers, meaning a row like

'Steve Henson' 'DET' 30. 23. 9.4 14.3 ...

had to be converted into something like

15. 6. 30. 23. 9.4 14.3 ...

for every player and every team. It is also important to give the data to the neural network in a form that will allow it to find patterns most easily, so we rescale the data. After rescaling, we can feed it into the network and let the computer do the thinking.
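As a hypothetical sketch of that conversion (the short name lists below stand in for the full lists built from all 24 seasons, so the indices differ from the real ones):

# Small stand-in lists; the real ones cover every player and team.
all_players = ['Allan Houston', 'Grant Hill', 'Steve Henson']
all_teams = ['ATL', 'BOS', 'CHI', 'CLE', 'DAL', 'DEN', 'DET']

# Lookup tables mapping each name to an integer index.
player_ids = {name: i for i, name in enumerate(all_players)}
team_ids = {abbr: i for i, abbr in enumerate(all_teams)}

row = ['Steve Henson', 'DET', 30., 23., 9.4, 14.3]
numeric_row = [float(player_ids[row[0]]), float(team_ids[row[1]])] + row[2:]
print(numeric_row)  # [2.0, 6.0, 30.0, 23.0, 9.4, 14.3]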

Figure 1

A generic histogram

Rescaling

A histogram is an easy way to represent how frequently each range of numbers appears in a set compared to the others. Let the histogram of Figure 1 represent a set of similar but high numbers. Taking the average (arithmetic mean) of this set gives us a similarly high number, and by subtracting it from every value in the set, we shift the histogram so that its values are centered around a mean of 0. If we then take the standard deviation of the set, we can divide each element by it and change the width of the histogram. The effect is that the set now has an average of 0 and a standard deviation of 1. When we give data to the computer, we want each feature to be on equal footing with every other feature we give it, so the computer doesn't a) think the high numbers are important, or b) have to spend computing power figuring out that the high numbers are not that important.
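A minimal NumPy sketch of this rescaling, using made-up numbers rather than our real stats:

import numpy as np

# Made-up data: rows are players, columns are statistics.
data = np.array([[30., 23., 9.4],
                 [27., 41., 12.1],
                 [33., 15., 7.8]])

# Shift each column so its mean is 0, then divide by its standard
# deviation so its spread is 1.
mean = data.mean(axis=0)
std = data.std(axis=0)
rescaled = (data - mean) / std

print(rescaled.mean(axis=0))  # approximately [0, 0, 0]
print(rescaled.std(axis=0))   # [1, 1, 1]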

Intro to Deep Learning

Deep learning, in layman's terms, is a multistage way for a program to learn representations of data. The program accomplishes this by training itself on the data, making the neural network more efficient as well as more accurate over several "epochs," or iterations of the model. This is the function of the training and validation sets: the training set is the data the program can look at to devise models that fit the information given. Each model is then tested against an independent set, called the "validation" set. The efficacy of the model on the validation set is recorded, the program is told which validation points it got wrong, and the program returns to training in an attempt to find the common link between the training and validation sets. When it returns to training, a new epoch has begun.

Over the epochs, we can graph how the loss and accuracy change, as seen below in Figures 2 and 3. Training loss and accuracy describe how poorly and how well the program fares at producing a model that fits a given set. The red data in the figures give the loss and accuracy of the model on the training set, and the blue data give those of the validation set. The accuracy of the program is the percentage of games for which the program correctly predicts who wins the NBA match-up, while the loss represents how far off the model's predictions are. If the validation loss gets lower and the validation accuracy gets higher over the epochs, the program is improving. In our Figure 2, the training accuracy gets better as the program recognizes patterns in the training data, but unfortunately most of these patterns are not common to all NBA games, so when the program is given new data from our validation set it can no longer use those patterns effectively, which leads to worse accuracy. The same idea shows up in Figure 3: once the patterns the program recognizes can no longer be harnessed, the validation loss gets worse while the training loss continues to get better. A program adapting too much to patterns seen only in the training data is known as "overfitting"; it clouds the program's predictions, leading to less accurate results on new data, as seen in Figure 2's validation accuracy.

Figure 2

Accuracy for the model over 20 epochs

Figure 3

Loss for the model over 20 epochs
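For readers who want to see how curves like these are produced, here is a self-contained sketch using Keras with random toy data in place of our NBA tensors (the network, shapes, and resulting numbers are all illustrative):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

# Toy stand-in data: 200 "games" with 20 features each and 0/1 outcomes.
x = np.random.rand(200, 20)
y = np.random.randint(0, 2, size=(200, 1))

model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(20,)),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='rmsprop', loss='binary_crossentropy',
              metrics=['accuracy'])

# Train for 20 epochs, holding out the last 50 samples as a validation set.
history = model.fit(x[:150], y[:150], epochs=20, batch_size=16,
                    validation_data=(x[150:], y[150:]))

# history.history records per-epoch loss and accuracy for both sets,
# which is what the figures plot (red = training, blue = validation).
plt.plot(history.history['accuracy'], 'r', label='training accuracy')
plt.plot(history.history['val_accuracy'], 'b', label='validation accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()

(In older Keras versions the history keys are 'acc' and 'val_acc' rather than 'accuracy' and 'val_accuracy'.)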

Layers

We produced several different models by changing two aspects of the layers we used. A generic layer would look like

model.add(layers.Dense(64, activation='relu', input_shape=(blah,blah,blah)))

where the 64 factor and the activation='relu' were the components we changed.

The number represents the number of dimensions of what is called the "hidden layer." We have a bunch of numbers in the data and we want to figure out a way to turn those numbers into a single number: either a 1 for a winning prediction or a 0 for a losing prediction. To do that, we perform transformations on the tensors, but we want transformations that emphasize important features and reduce the unimportant numbers. The best way to do this is to use "weights," which are just numbers that signify the importance of an element of a tensor: high-value weights mean the element is important, and low-value weights mean it is unimportant. The number of dimensions of the hidden layer determines how the program will look at the data and what connections it will be able to make. A low value, say 16, will be easy on computation time, but the program may overlook more complex patterns. A high value, say 256, will be a much greater computational load, but the model will reflect more complex patterns. The risk with a high dimension is that the program may pick up on patterns in the training data that are not really patterns in the general case. Also, we tend to keep the number of hidden units a power of 2 because of the way space is reserved in RAM.
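As an illustrative sketch (with a made-up input shape), changing the width of the hidden layers is a one-number change:

from tensorflow import keras
from tensorflow.keras import layers

def build_model(hidden_units):
    # Two hidden layers of the given width, ending in a single
    # win/loss probability. The input shape here is illustrative.
    model = keras.Sequential([
        layers.Dense(hidden_units, activation='relu', input_shape=(20,)),
        layers.Dense(hidden_units, activation='relu'),
        layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

small_model = build_model(16)   # cheap, but may miss complex patterns
large_model = build_model(256)  # more capacity, more risk of overfitting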

The 'activation' argument represents the means by which unimportant data is discarded and important data is emphasized between layers. In terms of spaces, we start out with what we can call a "hypothesis space," defined by the shapes our data can be transformed into using linear operations. Often linear transformations are not good enough, because the outcome we want is not found in that slice of the space. Thus we must perform nonlinear transformations to get out of the hypothesis space and into a larger space, and the best way to do that is to get rid of the unimportant and unhelpful data. To do this, we run the output through a function that spits out a reduced form of the data. The popular ReLU (Rectified Linear Unit) function returns the input for positive values and returns 0 for negative values (see Figure 4). Another function is the tanh (hyperbolic tangent) function, which returns values near -1 for low inputs, near 1 for high inputs, and behaves roughly like the identity (outputting its input) for values near 0 (see Figure 5). The sigmoid activation outputs values near 0 for very low inputs and near 1 for high inputs, leaving it particularly well suited for binary classification as it pushes values toward either a 1 or a 0 (see Figure 6).
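For reference, the three activations can be written in a few lines of NumPy (purely illustrative; Keras provides them by name):

import numpy as np

def relu(x):
    # Passes positive values through unchanged, zeroes out negative ones.
    return np.maximum(0, x)

def tanh(x):
    # Squashes values into (-1, 1); roughly the identity near 0.
    return np.tanh(x)

def sigmoid(x):
    # Squashes values into (0, 1): near 0 for very negative inputs,
    # near 1 for very positive inputs.
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))     # [0.  0.  0.  0.5 2. ]
print(tanh(x))     # roughly [-0.96 -0.46  0.    0.46  0.96]
print(sigmoid(x))  # roughly [0.12 0.38 0.5  0.62 0.88]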

These dimensions and functions, as well as others, have their realms of usefulness, but we chose to try several of them to see how the models they produced improved. In total we compared 6 separate models; the best models produced an accuracy between 59% and 60% at one epoch or another, with the worst accuracies being between 48% and 49%. The image carousel below shows the loss and accuracy graphs of the 6 models as they evolved over 20 epochs.

When the model is complete, we check how it fares on the test set, a third set completely isolated from the production and evolution of the model, to see how it would perform against the real world.
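In Keras terms, that final check is a single call; the names below are placeholders for a trained model and for the stats and outcomes of the held-out 8 seasons, continuing the sketches above:

# test_x, test_y: the held-out test data (placeholder names).
test_loss, test_accuracy = model.evaluate(test_x, test_y)
print('test accuracy: {:.1%}'.format(test_accuracy))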

Figure 4

A ReLU function

Figure 5

A tanh function

Figure 6

A sigmoid curve

Results

We found that every model suffered from severe overfitting after the first epoch. At its best, the model produced accurate results 59% of the time, a value statistically distinct from 50%, meaning the model performed better than uninformed guesswork.

Conclusion

Overall, this project was a lot of fun to work on, but there are still things that could be added to improve it, such as new statistics. These new statistics should include physical factors like height, weight, jump height, and so on, so our program could become more well-rounded and identify new patterns that could be applied to games. Some of these factors are also much more important than factors we did use. For example, height is more important than age for most players: height can give a player an advantage, while age does not affect a player too much since most players are around the same age anyway. Adding new statistics would allow the program to improve upon its current state and reach a higher accuracy than it has now.

In the future, our program could also be harnessed to predict football or soccer games if the data we give the program is altered. It would be interesting to see if the program can achieve higher accuracy in games other than basketball; it may be able to recognize different patterns in soccer or another team sport that allow it to perform better than it does for basketball. The downside of using basketball was primarily how many factors are really at play outside the scope of the factors we had available. Physical factors such as height, weight, and jump height, as well as external factors like whether a player is feeling lazy or ambitious at the time, may have a much greater effect on the outcome of a game; and while emotions are likely impossible to incorporate, important physical attributes are not. Another major factor in basketball is how likely upsets are in NBA games, as most teams are evenly matched besides the few teams that can defeat the rest. Most fans would give most games 50/50 odds before the start of a game, because a sport where the outcome is known from the start is a boring sport. Because of this, we believe our 59% accuracy is fairly good and would give users an edge over even informed fans. In conclusion, basketball did not yield outstanding results from our program, but it did yield above-average results. It may be that this program yields better results with other sports.

INSTRUCTIONS

Now that you understand a little bit of what's going on behind the scenes and what to expect, you can look at the code and data to play with the model for yourself:

To use the sports data, click the "Data" link, then right-click "sports" at the top of the screen and select "Add shortcut to Drive" so the program can access the data later on.

To run the code, click "Runtime" in Google Colab and select "Run All." The first cell will prompt you to grant the program permission to access the files in your Google Drive so it can reach the sports folder. Adjust the path to the sports folder in the last line of the second cell as appropriate for your setup.
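For reference, the Drive-permission step typically looks something like this in a Colab cell (a hedged sketch; the exact path in your notebook may differ):

# Mount Google Drive so the notebook can see the "sports" shortcut.
from google.colab import drive
drive.mount('/content/drive')

# Later cells read from the shortcut; adjust to match your Drive layout.
data_path = '/content/drive/MyDrive/sports'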

Data

The Program