Soccer Match Result Prediction and its Application

Motivation

The sports betting market has grown steadily over the past ten years, and the trend is ongoing. The international sports betting market is currently estimated to be worth about $250 billion. In a market this large, we believed that machine learning models could help us earn a greater profit. We therefore set out to build several machine learning models from scratch, compare their performance in terms of F1 score and return on investment (ROI), and assess their profitability in a simulated betting environment.

Datasets

We used the English Premier League (EPL) datasets publicly available on Football-Data.co.uk. The datasets contain information on each EPL match from 2008 to 2018. We largely followed Andrew Carter’s blog post when preprocessing the data.

To use both classifiers and regression models, we created two types of target variables: the match score and the match outcome. That is, regression models are trained to predict the actual scores, whereas classifiers are trained to predict whether the game ends in a home win, an away win, or a draw.
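As a minimal sketch of how the two target types could be derived from one match record: the field names FTHG and FTAG (full-time home and away goals) are assumed here, the helper make_targets is hypothetical, and the example rows are made up for illustration. This is not the exact preprocessing code of the project.

```python
def make_targets(home_goals, away_goals):
    """Return (regression target, classification target) for one match.

    The regression target is the score pair itself; the classification
    target is "H" (home win), "A" (away win), or "D" (draw).
    """
    score = (home_goals, away_goals)
    if home_goals > away_goals:
        label = "H"
    elif home_goals < away_goals:
        label = "A"
    else:
        label = "D"
    return score, label


# Made-up example rows; the field names are an assumption.
matches = [
    {"HomeTeam": "Team A", "AwayTeam": "Team B", "FTHG": 3, "FTAG": 1},
    {"HomeTeam": "Team C", "AwayTeam": "Team D", "FTHG": 0, "FTAG": 2},
    {"HomeTeam": "Team E", "AwayTeam": "Team F", "FTHG": 1, "FTAG": 1},
]

targets = [make_targets(m["FTHG"], m["FTAG"]) for m in matches]
for match, (score, label) in zip(matches, targets):
    print(match["HomeTeam"], "vs", match["AwayTeam"], score, label)
```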

Models

The list of models we built from scratch is:

  • Linear Regression (Vanilla & Ridge)

  • Logistic Regression

  • Gaussian Kernel Regression

  • K-Nearest Neighbors Regressor & Classifier

  • Linear SVM Classifier

  • MLP Classifier
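To give a flavor of what "from scratch" can look like, here is a minimal sketch of one model from the list above, a k-nearest-neighbors classifier, written with the Python standard library only. The function name knn_predict, the two features, and the toy data are all hypothetical; the actual implementation and feature set may well differ.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict a label for `query` by majority vote among the k nearest
    training points, using Euclidean distance."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # most_common keeps first-seen order for equal counts, so a tie in
    # votes is resolved in favor of the closer neighbor.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: each row is (home team strength, away team strength).
# Both features are made up for illustration.
train_X = [(2.0, 0.5), (1.8, 0.7), (0.4, 2.1), (0.6, 1.9), (1.0, 1.0), (1.1, 0.9)]
train_y = ["H", "H", "A", "A", "D", "D"]

print(knn_predict(train_X, train_y, (1.9, 0.6), k=3))
```

The regressor variant would average the neighbors' scores instead of taking a vote.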

The regression models’ outputs were converted to class labels for convenience in evaluation. For instance, if the model predicted a score of 3:1 for Home versus Away, this was converted to “Home Wins”.
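The conversion described above might be sketched as follows. Rounding each predicted score to the nearest integer before comparing is an assumption made here so that draws can occur with real-valued predictions; the original text does not say how non-integer scores were handled. The function name score_to_outcome is hypothetical.

```python
def score_to_outcome(pred_home, pred_away):
    """Map a predicted (home, away) score to a match outcome label.

    Assumption: scores are rounded to the nearest integer first.
    Note that Python's round() rounds exact halves to the nearest even
    integer, for example round(2.5) == 2.
    """
    home = round(pred_home)
    away = round(pred_away)
    if home > away:
        return "Home Wins"
    if home < away:
        return "Away Wins"
    return "Draw"


# Made-up regression outputs for three matches.
predictions = [(2.7, 1.2), (1.4, 1.1), (0.3, 1.8)]
labels = [score_to_outcome(h, a) for h, a in predictions]
print(labels)
```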

Results

Note that BL-SVM and BL-MLP stand for Baseline-SVM and Baseline-MLP. These are the baseline models from our reference, G. Kumar’s master’s thesis “Machine Learning for Soccer Analytics”. Also note that they may NOT be proper baselines: the data, experimental environment, and model hyperparameters were different, and we did not re-implement the models (i.e. the raw performance figures were taken from the thesis).

We also carried out a short backtest with the best model, Ridge Regression, to see whether it would actually make money :)

Here, the Random model randomly selects a winning team to bet on. The Copycat model looks at the betting odds and selects the team with the lowest odds, that is, the team that most people bet on. The Ridge Regression model performed much better than both the Random model and the Copycat model in terms of return on investment. We later learned that bets are usually made in bundles (that is, you need to correctly predict the results of multiple games to win the bet), but we were still pleased to see that our model outperformed our benchmarks.
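The comparison above can be illustrated with a small sketch of a flat-stake backtest. Several things here are assumptions, not the project's actual setup: decimal odds, one unit staked on every match, a Random model that may also pick a draw, and the made-up odds and results below. The names roi, copycat_pick, and random_pick are hypothetical.

```python
import random

OUTCOMES = ("H", "D", "A")  # home win, draw, away win


def roi(picks, results, odds, stake=1.0):
    """Return on investment for single bets with a flat stake.

    `odds` holds one dict per match mapping each outcome to its decimal
    odds; a winning bet pays back stake * odds, a losing bet pays nothing.
    """
    total_staked = stake * len(picks)
    total_returned = sum(
        stake * match_odds[pick]
        for pick, result, match_odds in zip(picks, results, odds)
        if pick == result
    )
    return (total_returned - total_staked) / total_staked


def copycat_pick(match_odds):
    """Pick the outcome with the lowest odds (the favorite)."""
    return min(match_odds, key=match_odds.get)


def random_pick(rng):
    """Pick an outcome uniformly at random."""
    return rng.choice(OUTCOMES)


# Made-up odds and results for three matches.
odds = [
    {"H": 1.5, "D": 4.0, "A": 6.0},
    {"H": 2.8, "D": 3.2, "A": 2.5},
    {"H": 3.0, "D": 3.1, "A": 2.4},
]
results = ["H", "H", "A"]

rng = random.Random(0)  # fixed seed so the run is repeatable
copycat_picks = [copycat_pick(o) for o in odds]
random_picks = [random_pick(rng) for _ in odds]

print("Copycat ROI:", roi(copycat_picks, results, odds))
print("Random ROI:", roi(random_picks, results, odds))
```

A model-based strategy would plug in the same way: pass the model's predicted outcomes as picks and compare the resulting ROI against the two benchmarks.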