MLS Project

By Brooks Stemple, Gio Santibanez, and Will Nowatka

Can we use a Neural Network to predict a team's total wins in an MLS season?

Can a computer predict the number of overall MLS club wins in a season? According to our model, it is possible! Using statistics from each team in the league over the last 10 seasons, we trained a neural network that achieved a mean absolute error (MAE) of 2.11. This means that, on average, the model's predictions are about 2 games away from the actual number of wins a team will have in a season, which is pretty impressive. Down below you'll learn how neural networks work and how we obtained these results.

Why is this important?

Instead of predicting individual games, our model focuses on how seasonal statistics can predict overall seasonal records for teams in the MLS. ESPN hosts, sports bloggers, beat writers, and fans are all trying to predict whether their team will have a good enough record to make the playoffs. Fans and sports media also try to predict draft position, which is likewise influenced by the number of wins in a season. A successful prediction model could use seasonal statistics to give users an accurate read on club performance. Additionally, sports gambling has grown in popularity and is now legal in 23 states and D.C.; a successful model could be used as a tool for sports betting, which creates an ethical dilemma surrounding machine learning and gambling. Lastly, this approach could be applied to other leagues and sports, using sport-specific statistics to predict overall records.

Background on the Data

We collected data from the MLS website because it is well organized and easy to retrieve; however, some data before 2010 was missing (columns of stats were all 0s). We started by creating separate spreadsheets for each season from 2010 to 2019, then converted the data into CSV files for the program to read. Eventually, all of the data was pasted into a Google Sheets document because it was easier to import into the code. We divided our data into four parts: general, passing, attacking, and defending, to illustrate different aspects needed to win a soccer match. Specifically, we used data under the columns marked “PTS”, “GD”, “Pass%”, “A”, “G”, “SHT%”, “GA”, “INT”, “FC”, “W”, and “L” on mlssoccer.com/stats/clubs. The neural network figured out which statistics were the most important in determining a team's wins over a season.


Soccer Term Glossary:

  • Goal difference: The total difference between goals for and goals against a team over a season.

  • Goals: Total number of goals scored for the team over a season. A goal is when the ball passes over the goal line.

  • Goals against: Total number of goals scored against a team over the season.

  • Passing percentage: Passes completed to teammates over the number of passes attempted.

  • Assists: A pass leading directly to a goal.

  • Accurate shooting percentage: Percentage of total shots placed on target (the goal).

  • Interceptions: A pass by the opposing team that is stolen by the team in question.

  • W: Total number of games won in the season.

  • L: Total number of games lost in the season.

  • Shot percentage: The percentage of shots attempted that are on goal.

  • Fouls committed: The number of fouls a team committed over the season

  • Points: The total number of points earned by each team. (3 points for a win, 1 point for a tie, and 0 for a loss)

Intro to Machine Learning

Network Architecture:

In machine learning, a network is typically made up of an input layer, hidden layers, and an output layer. The connections between the layers have weights and biases, which cause certain nodes in the network to become more active when a given sample of the data is fed into the network. This affects the output of the network, which is usually a number, a probability, or a set of probabilities. It is almost like different parts of your brain lighting up in response to different stimuli, and you then making a decision based on those stimuli.
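The weights, biases, and "lighting up" of nodes described above can be sketched in a few lines of NumPy. The layer sizes here are illustrative toy values, not our actual model:

```python
import numpy as np

def relu(x):
    # Activation: keeps positive values, zeroes out negatives
    return np.maximum(0, x)

rng = np.random.default_rng(0)

# A toy network: 3 inputs -> 4 hidden units -> 1 output
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # weights and biases, layer 1
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # weights and biases, layer 2

x = np.array([1.0, 0.5, -0.2])      # one input sample
hidden = relu(x @ W1 + b1)          # hidden nodes "light up" (or stay at 0)
output = hidden @ W2 + b2           # a single numeric output
print(output.shape)                 # (1,)
```

Training is then just the process of adjusting W1, b1, W2, and b2 so the output gets closer to the target.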

Loss Function

The loss function measures how far the network's predictions are from the targets; training adjusts the weights and biases to drive the loss as low as possible. This can be visualized as a ball rolling to the lowest point of a surface as epochs pass. However, sometimes the ball gets stuck in a local minimum: the network could still perform better, but the optimization has stalled. The graphic below is a good visualization.
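The "ball rolling downhill" picture corresponds to gradient descent. Here is a minimal one-dimensional sketch (a toy loss with its minimum at 3, not our actual training loop):

```python
def loss(w):
    # A simple bowl-shaped loss with its minimum at w = 3
    return (w - 3) ** 2

def grad(w):
    # Derivative of the loss: the slope of the hill at w
    return 2 * (w - 3)

w = 0.0    # starting weight
lr = 0.1   # learning rate: how big each downhill step is
for epoch in range(100):
    w -= lr * grad(w)   # step in the direction that lowers the loss

print(round(w, 4))  # the ball has rolled to the bottom, near w = 3
```

On a bumpy loss surface with several dips, the same update rule can settle into a local minimum instead of the global one, which is exactly the failure mode described above.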

https://www.pyimagesearch.com/2019/10/14/why-is-my-validation-loss-lower-than-my-training-loss/

Overfitting

One thing to be very careful of in machine learning is overfitting. Overfitting is when a network gets too good at identifying the training data and no longer performs as well on test data. This is like memorizing the test answer key but then forgetting how to answer the questions: if a detail is different, the network will struggle with that example.

The Data

As mentioned before, data was collected from the MLS website and categorized into four parts: general, passing, attacking, and defending. To measure performance under general stats we used total goal difference and fouls committed. To measure a club's ability to pass we used passing percentage and assists. For attacking performance we used goals and accurate shooting percentage. Lastly, to measure defensive performance we used goals against and interceptions. We also collected total wins and losses for each season to observe how accurately the computer could predict wins.

These statistics were collected over the 2010-2019 seasons; we could not go back before 2010 because the MLS website had insufficient data and filled in zeroes for many statistical categories in earlier years. The model uses the 2010-2017 seasons and their corresponding statistics as training data, and the 2018 and 2019 seasons as test data to compare the computer's predicted wins against the actual numbers.

When formatting our data we had to replace the team names with numbers to get the code to work in Python. Here are examples of our training and testing datasets. The two images show the eight parameters used to predict overall wins in a season, with numbers representing the different teams.
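The team-name-to-number step can be sketched like this; the club names and stat values below are made up for illustration, not rows from our actual dataset:

```python
# Hypothetical rows: (team, goal_diff, pass_pct, wins)
rows = [
    ("LA Galaxy",        12, 78.2, 16),
    ("Seattle Sounders",  8, 74.5, 15),
    ("LA Galaxy",         5, 76.0, 13),   # same club in another season
]

# Build a stable mapping from club name to an integer code
teams = sorted({team for team, *_ in rows})
team_id = {name: i for i, name in enumerate(teams)}

# Replace each name with its code so the network sees only numbers
numeric = [[team_id[team], *stats] for team, *stats in rows]
print(numeric)  # [[0, 12, 78.2, 16], [1, 8, 74.5, 15], [0, 5, 76.0, 13]]
```

The same mapping must be reused for the test seasons so that each club keeps the same code across the training and testing datasets.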


Our Network

Since the code in our model was adapted from the housing price exercise (homework 3) in class, it has a similar network as well.

We used a small network with two hidden dense layers of 64 units each. In a dense layer, every unit is connected to every unit in the previous layer. The network ends with a single unit and no activation; it is a linear layer. This means it can generate any number as an output prediction, as opposed to networks whose outputs are constrained to between 0 and 1. This is helpful because we are not predicting the probability of a win but instead the total number of wins in a season.
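Under the description above (eight input features, two hidden dense layers of 64 units, a single linear output unit), the network can be sketched in Keras roughly as follows; exact layer names and the feature count are our reading of the text, not a verbatim copy of the project code:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features=8):
    # Two hidden dense layers of 64 ReLU units, ending in one linear unit
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(1),  # no activation: an unconstrained win prediction
    ])
    return model

model = build_model()
model.summary()
```

Because the final layer has no activation, the model can output 3.7 or 21.4 wins just as easily as a value between 0 and 1.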

We built the network with the MSE loss function. MSE stands for Mean Squared Error: the mean of the squared differences between the predictions and the targets, a popular loss function for regression problems like ours. During training we also monitor MAE, the Mean Absolute Error: the mean of the absolute differences between the predictions and the targets (actual seasonal wins, in this case).
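As a concrete check on the two metrics, here is how MSE and MAE work out for a handful of hypothetical win predictions (the numbers are invented for illustration):

```python
import numpy as np

actual    = np.array([14, 9, 17, 12])   # hypothetical true win totals
predicted = np.array([12, 11, 16, 15])  # hypothetical model predictions

errors = predicted - actual
mse = np.mean(errors ** 2)      # Mean Squared Error: punishes big misses more
mae = np.mean(np.abs(errors))   # Mean Absolute Error: average wins off

print(mse, mae)  # 4.5 2.0
```

Note that MAE stays in the original units (wins), which is why it is the easier number to interpret when judging the model.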



This is a visual of the simple neural network we used in our model: an input layer followed by two hidden layers with 64 units each.

https://miro.medium.com/max/791/0*hzIQ5Fs-g8iBpVWq.jpg

Deep Learning and the Model

Deep learning uses a computer to learn data representations in a multistage process. This model does that by using the training data from 2010-2017 to make itself more accurate over "epochs", or iterations of the model. The purpose of the training set is to teach the computer patterns in the data so that it can accurately produce an output on the validation set (or test set). Figure 1 shows how the validation mean absolute error drops significantly in the first 25 epochs. The model reaches its lowest MAE around the 75th epoch before staying consistently steady. This indicates that the model becomes quite accurate after the first 25 epochs, with an MAE of about 2.5; the model finished with an MAE of 2.26.

In figure 2, further evidence of the mean absolute error is shown by the histogram of the differences between the predicted and actual number of wins for each team. Most of the predictions fall within 4 wins, lower or higher, of the actual win total for a season. In 2019, each team played a total of 34 games. Relative to the number of games in a season, the MAE and the histogram in figure 2 indicate that the computer was fairly accurate. In a close playoff race, two to four games can make a big difference, which shows some of the limitations of the model; however, for predicting overall records it is a good result.

We did see an interesting outlier in the data. Figure 3 shows a plot of the distribution of pass percentage among clubs. The team that performed poorly compared to the rest of the league was the New York Red Bulls, with a passing percentage of 68.6% in 2019. Interestingly enough, the Red Bulls were not one of the worst teams in the league and won a mediocre 14 games in 2019.

Lastly, figure 4 examines how impactful the parameters are in our model. The numbers on the X axis indicate the eight parameters in the model, starting with goal difference as "1". The plot indicates that removing goal differential or passing percentage would increase the mean absolute error by 0.5. Oddly enough, removing "goals against" from our model would lower the mean absolute error by 0.2.
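The figure 4 experiment (drop one parameter, retrain, compare MAE) can be sketched as a loop. To keep the sketch self-contained and fast, we stand in an ordinary least-squares fit for the neural network and use synthetic random data rather than our MLS data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))                      # 100 team-seasons, 8 features
true_w = np.array([3.0, 2.0, 0.5, 0.1, 0.1, 0.1, 0.1, 0.1])
y = X @ true_w + rng.normal(scale=0.5, size=100)   # synthetic "win totals"

def fit_and_mae(X, y):
    # Least-squares fit as a stand-in for retraining the network
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean(np.abs(X @ w - y))

baseline = fit_and_mae(X, y)
for i in range(X.shape[1]):
    X_drop = np.delete(X, i, axis=1)           # remove feature i
    delta = fit_and_mae(X_drop, y) - baseline  # MAE change without feature i
    print(f"feature {i + 1}: MAE change {delta:+.2f}")
```

In this synthetic setup, dropping a heavily weighted feature raises the MAE the most, which is the same logic behind reading goal differential and pass percentage as our most important parameters.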




Figure 1. Validation MAE over 200 epochs.

Figure 2. Differences between actual and predicted win totals.

Figure 2 cont. Plot of the computer's predictions and the actual win totals for each club over the 2018 and 2019 seasons.

Figure 3. NY Red Bulls outlier on passing percentage.

Figure 4. Parameters effects on mean absolute error.

Implications

Represented below are the plots of our most important statistics: goal differential and pass percentage. It is no surprise that these have the greatest effect on our model. Figure 5 shows the pass percentage plot, with a cluster of points on the right side of the graph, meaning most teams have a high passing percentage. The points on the left are from the 2010 season, which has 0s recorded as the pass percentage for every team in the data. A possible explanation for a high pass percentage is that these teams are made up of top-tier players who rarely make mistakes, making it easier for them to keep possession of the ball. Another reason pass percentage is so important is that a team that does not dominate possession will most likely not win, because you need the ball to score goals. Figure 6 is the goal differential graph, and there is a clear linear relationship between wins and goal difference. This makes complete sense because a team can't win games without scoring goals, and the best teams in the league can consistently score many goals. This is why, as you move farther to the right on the graph, teams with a larger goal difference tend to have more wins.

Figure 5. Wins and Pass percentage.

Figure 6. Wins and Goal Difference.

Results

We had an MAE of 2.11. This means our model could predict the number of wins within about a 2-win margin of error on average. This is impressive given that we only had data from 2010-2019.

Conclusion

Overall, we really enjoyed using this example to further our understanding of machine learning. For those of us new to this subject, it was helpful to use a simple example like sports and see the capabilities of machine learning. As this project comes to a close, we believe there are a few things we could add to improve our model. One factor we would like to input into our model is the payroll of each team. Gio mentioned in our presentation that there has been an increase in foreign players in the MLS, and these players are often paid more. It would be interesting to see whether higher spending on players leads to more wins or whether it is a nonfactor relative to the stats observed. Additionally, this model could be extended to other sports and leagues, such as the NBA and NFL, to predict win totals. Another statistic we could input into our model is penalty kicks, or percentages related to them. Penalty kicks represent big scoring opportunities in soccer and could lead to more wins.

One thing to acknowledge is the result in figure 4, which indicates that removing "goals against" would improve the mean absolute error. Since this is not what our group expected, we may want to check whether our parameters are highly correlated with each other; multicollinearity in regressions can skew results. Lastly, our model is limited to what is measurable in statistics; it does not take into account factors such as motivation and rivalry games, where previous stats may not matter as much.

Our Code:

Sources and Acknowledgements:

AmericanGaming. “Interactive Map: Sports Betting in the U.S.” American Gaming Association, September 2, 2021. https://www.americangaming.org/research/state-gaming-map/.

LoRé, Michael. "Soccer's Growth in U.S. Has International Legends Buzzing." Forbes. Forbes Magazine, April 30, 2019. https://www.forbes.com/sites/michaellore/2019/04/26/soccers-growth-in-u-s-has-international-legends-buzzing/?sh=1cc5198917f1.

Mlssoccer. “MLS Club Stats.” mlssoccer, 2021. https://www.mlssoccer.com/stats/clubs/.

Smith, Don. “Google Colaboratory.” Google Colab. Google, August 1, 2021. https://colab.research.google.com/drive/1QFB-gH8FGGqBUGJpOMdabMYgCcRC6Wg1.


Images: