My project is about football, specifically betting on football. Football betting is well suited to machine learning algorithms. Combine more data than a human mind could possibly analyze with a discrete goal and you have an ideal machine learning problem.
Sports betting may seem complicated, but it is actually quite simple. In football betting there is what’s referred to as a three-way bet. You can bet on the home team to win, the away team to win, or a draw. That is the type of outcome I will attempt to predict. There is, however, another component to the bet, the most central component: odds. Odds are the means by which a sports bettor can calculate the take and risk a corresponding amount. A heavy favorite will pay out considerably less for the same risk than an underdog, a team that is not expected to win. Therefore the confidence level needs to be much higher when taking a bet on a heavy favorite.
I chose this project because I am passionate about sports. Sports are the initial reason I became interested in data science. Data and sports are a marriage made in heaven. Data affects everything in sports, from game theory to roster construction strategy and even handicapping and betting on sports. The best part about this marriage is that the data is publicly available and mostly free.
The research question is: what inputs need to be combined to create a model, using machine learning or conventional programming, that produces a profit? My hypothesis is that efficiency over average is a key predictor of outcomes and can therefore be leveraged to create a profitable model. Efficiency over average is a term coined by Football Outsiders, a group so named because they took an approach to analyzing the game without the bias an insider would have. They state concerning their method: “[It] is a method of evaluating teams, units, or players. It takes every single play during the NFL season and compares each one to a league-average baseline based on situation.” [1] Applying their approach to Premier League football, I could take two goals: one from Team A against Team B, another from Team C against Team D. Which was the better goal? That is the question they answer. Let’s assume Team D had a significantly better defense than Team B. Then Team C’s goal was better than Team A’s in that specific instance.
State of the Art
I researched the state of the art, then analyzed two approaches to the problem: one using conventional programming and one using machine learning. The articles can be found in the links section of this page.
The best article I came across in my research for summarizing the state of the art is titled A machine learning framework for sport result prediction. [2] In section 2 of the article, 'Literature review and critical analysis', the authors, Bunker and Thabtah, discuss different studies that approach sports prediction using classification through Artificial Neural Networks (ANNs). The oldest approach they discuss is Purucker's from 1996. He used five features from 8 rounds of games and achieved 61% accuracy in predicting outcomes, compared to 72% accuracy from subject matter experts. Purucker used an ANN with a back-propagation (BP) algorithm to create his model. In 2003 Kahn beat Purucker’s result with a 75% accuracy score, which also beat the subject matter experts' score of 63%. Kahn used 6 features and a longer training period, 14 rounds instead of 8. McCabe and Trevathan built on that work. In the authors' words, “A multi-layer perceptron, trained with BP and conjugative-gradient algorithms was used. The ANN had 20 nodes in the input layer, 10 nodes in the hidden layer, and 1 node in the output layer (20-10-1).”
The most novel approach that Bunker and Thabtah discuss is one performed by Tax and Joustra. They used betting odds alone to predict outcomes, rather than betting odds and past performance. They tried a few different types of classifiers. The naive Bayes and ANN classifiers performed best, correctly predicting outcomes at a clip of 54.7%. So using only betting odds, one can predict the outcome of an event at a better-than-random rate, but not as well as the previously discussed approaches.
My approach will be to use ANN classification with BP and see if the model I create can match the performance of the models previously discussed.
I went on to analyze the code for two approaches:
The conventional programming approach was described thusly: “They propose that you can look at past results and scores between different teams within the same league system and from these past results be able to predict future scores and results.” [3]
They built a basic version of the model using csv, math, ast, and numpy in Python. The model produced a 5% return on investment in the first season tested, but when applied to snippets of a season or to other seasons the return can be negative.
This is likely because the approach is a bit too simple.
The machine learning approach leveraged ANNs with Keras. That approach returned 5% on training sets and 8% on testing sets. This approach works, but there is always room for other profitable models, especially in a landscape like handicapping where sharp bettors and oddsmakers constantly have to adjust to minimize inefficiencies. A model that works for odds now may not work next year if the oddsmakers change their approach.
The data I will use to train and test my model is game results from the last four seasons of English Premier League soccer. The data includes date, teams, scores, results, odds, and other match statistics. Each season's set contains information for all 380 games, making the total set over 1,500 rows of data. The data, as well as a notes document defining the values, can be found here in my GitHub repository.
Initially I did some simple exploration of the data, which can be found in my GitHub repository. As a proof of concept and a practice exercise, I also took a snippet of code from the conventional programming approach and replicated it using pandas. This allowed me to test a very simple theory of taking home dogs, that is, taking bets where the home team has higher-paying odds than the away team. The theory did pay off with the test set that I used, but it was very high in variance, so it is not a suitable long-term approach.
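The home-dog test can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the repository: the three rows are made up, the column names (FTR, B365H, B365A) are assumed to follow the football-data.co.uk convention, and the odds are assumed to be decimal odds bet with a flat one-unit stake.

```python
def home_dog_profit(matches, stake=1.0):
    """Bet a flat stake on the home team whenever it pays more than the away team."""
    risked = 0.0
    profit = 0.0
    for m in matches:
        if m["B365H"] > m["B365A"]:      # home team is the underdog
            risked += stake
            if m["FTR"] == "H":          # home win: collect decimal odds minus the stake
                profit += stake * (m["B365H"] - 1.0)
            else:                        # draw or away win: the stake is lost
                profit -= stake
    return risked, profit


# Hypothetical rows, for illustration only.
sample = [
    {"FTR": "H", "B365H": 3.50, "B365A": 2.10},  # home dog wins
    {"FTR": "A", "B365H": 4.00, "B365A": 1.90},  # home dog loses
    {"FTR": "H", "B365H": 1.60, "B365A": 5.50},  # home favorite, no bet placed
]
risked, profit = home_dog_profit(sample)
print(risked, profit)
```

With only a handful of bets, a single long-odds win or loss swings the total, which is one way to see the high variance mentioned above.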
Data Prep:
My hypothesis is that efficiency over average is a key predictor of football outcomes. To test that, I needed to calculate the statistic from the match data. To do this I averaged home goals, home goals conceded, away goals, and away goals conceded per match, per team, up to each game date. The next step was to take every line item and find the difference between the team's score and the opposing team's average goals conceded and goals scored. Finally, I averaged those differentials per team up to each game date. This left me with four features I felt confident using to initially train my model. The prep can be found in the Phase 2 folder of my GitHub repository.
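One possible reading of these steps is sketched below in plain Python. It is a simplified assumption, not the repository code: the function name, the tuple layout, and the choice to pool home and away matches into one running average per team are all hypothetical, and a team with no prior matches is given an average of 0.

```python
from collections import defaultdict


def running_mean(values):
    """Average of the values seen so far, or 0.0 when there are none yet."""
    return sum(values) / len(values) if values else 0.0


def efficiency_features(matches):
    """matches: (home, away, home_goals, away_goals) tuples in date order."""
    scored = defaultdict(list)    # goals scored by each team to date
    conceded = defaultdict(list)  # goals conceded by each team to date
    off_diff = defaultdict(list)  # goals scored minus opponent's average conceded
    def_diff = defaultdict(list)  # goals conceded minus opponent's average scored
    rows = []
    for home, away, hg, ag in matches:
        # The four features use only matches played before this one.
        rows.append({
            "home_off": running_mean(off_diff[home]),
            "home_def": running_mean(def_diff[home]),
            "away_off": running_mean(off_diff[away]),
            "away_def": running_mean(def_diff[away]),
        })
        # Compare this result with each opponent's averages to date.
        h_off = hg - running_mean(conceded[away])
        h_def = ag - running_mean(scored[away])
        a_off = ag - running_mean(conceded[home])
        a_def = hg - running_mean(scored[home])
        off_diff[home].append(h_off)
        def_diff[home].append(h_def)
        off_diff[away].append(a_off)
        def_diff[away].append(a_def)
        # Only now update the raw goal histories.
        scored[home].append(hg)
        conceded[home].append(ag)
        scored[away].append(ag)
        conceded[away].append(hg)
    return rows


# Hypothetical two-match history.
features = efficiency_features([("A", "B", 2, 0), ("B", "A", 1, 1)])
print(features[1])
```

Building each row before updating the histories keeps the result of a match out of its own features.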
Exploration
My initial exploration showed that the home team won about 48% of the time.
I also used a pandas scatter matrix, which plots each of the specified columns against every other column.
Team-specific stats were negatively correlated, i.e. teams that scored many goals often gave up very few goals. Home and away team stats were not correlated with each other, but I predicted they would have a strong influence on the target variable.
Initial Training
I separated the features from the target, then shuffled and split the data into training and testing sets.
I then used scale from sklearn to make all the variables the same order of magnitude.
I used three classifiers on my data set:
Logistic regression: predicts the chance of an event by fitting a logistic curve to the features
SVC (a support vector machine classifier): calculates the maximum margin between the classes of the target data
XGBoost: a gradient-boosted ensemble of classification and regression trees, essentially many decision trees combined
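The split, scale, and fit steps could look roughly like the sketch below. It runs on random placeholder data rather than the match features, so the scores it prints mean nothing. The 75/25 split, the random seeds, and max_iter are assumptions. XGBoost is left out because it is a separate package; its XGBClassifier follows the same fit and score interface.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
from sklearn.svm import SVC

# Placeholder data: four features and a three-class target (random, illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 3, size=300)  # 0 = home win, 1 = draw, 2 = away win

# Standardize each column to zero mean and unit variance.
X_scaled = scale(X)

# Shuffle and split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, shuffle=True, random_state=0
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Store accuracy on the training set and on the held-out test set.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(name, scores[name])
```

A common variation is to fit a StandardScaler on the training rows only and apply it to the test rows, so that no test-set statistics reach the model.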
Results
When predicting soccer outcomes, random chance is 1/3, as a soccer match has three possible results: win, lose, or draw.
My models scored:
Logistic regression: Train 65%, Test 70%
SVC: Train 67%, Test 70%
XGBoost: Train 86%, Test 70%
All three models performed well above the one-in-three random baseline on the test set.
Next Steps
Focus on XGBoost and tune parameters
Consider adding features
Test on more seasons
Use odds to see whether the model is not only predicting winners but is also profitable
In my initial training, my most successful classifier was XGBoost, so I will focus on tuning that classifier to improve accuracy on the test set.
scikit-learn hides many parameters for ease of use, but a grid search function lets us try to improve our accuracy score by testing every combination of parameters in a specified grid.
It can be computationally expensive, so working with larger datasets can be prohibitive.
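A grid search of this kind can be sketched with scikit-learn's GridSearchCV. To keep the example dependent on scikit-learn alone, it swaps in GradientBoostingClassifier as a stand-in for XGBoost. The grid values, the 3-fold cross-validation, and the random placeholder data are assumptions, not the settings used in this project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data: four features, three outcome classes (random, illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = rng.integers(0, 3, size=150)

# Every combination in this grid is tried: 2 * 2 * 2 = 8 candidates.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for an XGBoost classifier
    param_grid,
    cv=3,                 # each candidate is fit on 3 folds, 24 fits in total
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The number of fits is the product of the grid sizes times the number of folds, which is why the cost grows quickly as parameters or data are added.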
2017: Scored 62% on test set
2018: Scored 72% on test set
2019: Scored 50% on test set
The 2019 test data was split by Project Restart, the resumption of the 2019-20 season after its suspension.
To calculate the best bets, we need to find the difference between the odds our model produces and the implied odds from the betting line.
To calculate our odds, train the model using the same process as in previous seasons, then use predict_proba to go a layer deeper than just the predicted target.
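As a rough sketch of what predict_proba returns, the example below fits a scikit-learn LogisticRegression on random placeholder data. The classifier choice, the H/D/A labels, and the data are assumptions for illustration; any scikit-learn style classifier that implements predict_proba behaves the same way.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: four features and H/D/A labels (random, illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = rng.choice(["A", "D", "H"], size=200)

model = LogisticRegression(max_iter=1000).fit(X, y)

# predict() returns one label per match; predict_proba() returns one
# probability per class, in the column order given by model.classes_.
proba = model.predict_proba(X[:3])
for row in proba:
    print(dict(zip(model.classes_, row.round(3))))
```

Each row sums to 1, so the three values can be read as the model's own odds for an away win, a draw, and a home win.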
American bookmakers express payouts as a ratio to 100, and the payouts imply odds of the outcome.
To calculate the implied odds from American odds n:
If American odds are less than 0, then abs(n)/(abs(n)+100)
If American odds are greater than 0, then 100/(abs(n)+100)
So -110 has implied odds of about 52%.
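The two rules translate directly into a small function. The function names, the error raised for a line of 0, and the 60% model probability in the last line are assumptions added for illustration.

```python
def implied_probability(american_odds):
    """Convert American odds to the probability implied by the payout."""
    if american_odds < 0:
        return abs(american_odds) / (abs(american_odds) + 100)
    if american_odds > 0:
        return 100 / (american_odds + 100)
    # The rules above do not cover 0, so it is rejected here (an assumption).
    raise ValueError("American odds of 0 are not covered by the rules above")


def edge(model_probability, american_odds):
    """Difference between the model's probability and the line's implied probability."""
    return model_probability - implied_probability(american_odds)


print(round(implied_probability(-110), 4))  # 0.5238
print(round(implied_probability(150), 4))   # 0.4
print(round(edge(0.60, -110), 4))           # 0.0762
```

A positive edge means the model rates the outcome as more likely than the betting line implies.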
0 represents the total risk baseline
Total risk for Outcome Prediction: 199.50, with a take of 186.75, for a 93.6% ROI
Total risk for Implied Odds Differential: 150.00, with a take of 36.75, for a 24.5% ROI
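The ROI figures follow from dividing the take by the total risk. A minimal check of that arithmetic, assuming "take" means net profit on the amount risked:

```python
def roi(total_risk, take):
    """Return on investment: take divided by the total amount risked."""
    return take / total_risk


print(f"{roi(199.50, 186.75):.1%}")  # 93.6%
print(f"{roi(150.00, 36.75):.1%}")   # 24.5%
```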
In test data over 3 seasons, we scored above or significantly above random chance; therefore, the chosen features had solid predictive value.
In go-live, we were profitable both with the implied odds differential and with the outcome prediction only.
[1] Football Outsiders. (n.d.). Methods To Our Madness. Retrieved October 27, 2020, from https://www.footballoutsiders.com/info/methods
[2] Bunker, R. P., & Thabtah, F. (2019). A machine learning framework for sport result prediction. Applied Computing and Informatics, 15(1), 27-33. doi:10.1016/j.aci.2017.09.005
[3] How to Create a Football Betting Model using Python and Poisson. (n.d.). Retrieved October 27, 2020, from http://www.bestbettingonline.com/strategy/create-model/
[4] Malafosse, C. (2019, October 11). Machine Learning for Sports Betting: Not a Basic Classification Problem. Retrieved October 27, 2020, from https://towardsdatascience.com/machine-learning-for-sports-betting-not-a-basic-classification-problem-b42ae4900782
[5] England Football Results Betting Odds: Premiership Results & Betting Odds. (2020, November 26). Retrieved October 27, 2020, from https://www.football-data.co.uk/englandm.php
[6] Raval, S. (2017, August 23). Predicting_Winning_Teams. Retrieved October 27, 2020, from https://github.com/llSourcell/Predicting_Winning_Teams/blob/master/Prediction.ipynb
[7] M.C. Purucker, Neural network quarterbacking, IEEE Potentials, 15 (1996), pp. 9-15
[8] R. Mohammad, F. Thabtah, L. McCluskey, An improved self-structuring neural network, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Auckland, New Zealand (2016), pp. 35-47
[9] N. Tax, Y.P. Joustra, Predicting the Dutch football competition using public data: A machine learning approach, Trans. Knowl. Data Eng., 10 (10) (2015), pp. 1-13
[10] A. McCabe, J. Trevathan, Artificial intelligence in sports prediction, in: Information Technology: New Generations, ITNG 2008. Fifth International Conference on, IEEE, 2008, pp. 1194–1197.
[11] J. Kahn, Neural network prediction of NFL football games World Wide Web Electronic Publication, 2003 (2003), pp. 9-15