Michael Schlitzer
Youtube Overview, Phase 1
The video contains links to help you understand AFL a little better too.
Background and Literature Review
Australian Rules Football is a data-rich sport that combines elements of American football, soccer, basketball, rugby, and even hockey. In summary, the game is played on a cricket oval by two teams of 18 players each. The game begins with a "bounce", akin to a jump ball in basketball. Apart from scoring, the main object of the game is to keep the ball in motion; if the ball comes to rest, the umpire throws it up (again like a jump ball in basketball) to restart play. Tackles and kicks out of bounds can result in turnovers.
My initial literature review led to these two papers.
Fahey-Gilmour, J., Dawson, B., Peeling, P., Heasman, J., & Rogalski, B. (2019). "Multifactorial analysis of factors influencing elite Australian football match outcomes: a machine learning approach." International Journal of Computer Science in Sport, 18(3).
Young, C. M., Luo, W., Gastin, P., Tran, J., & Dwyer, D. B. (2018). "The relationship between match performance indicators and outcome in Australian football." Journal of Science and Medicine in Sport. https://doi.org/10.1016/j.jsams.2018.09.235
Young et al. examine the impact of Performance Indicators (PI) on match outcome over a 16-year period, first using a Decision Tree model to maximize the interpretability of the results, and then a Generalized Linear Model (GLM). Fahey-Gilmour et al. examined computed categorical and biometric data over the 2013 - 2018 period and applied 8 different models, including a single-layer neural network.
Further research turned up a robust set of inquiries into a variety of footy-related subjects from the Australian data science community. Many of the investigations concern player position (positions are nominally assigned, but play remains very free-flowing) and player "value" for fantasy competitions.
This research has, I believe, provided a direction for a slightly different investigation of the data.
Capstone Proposal
My data:
Publicly available Performance Indicator (PI) data for every game and player over 9 AFL seasons (2012 - 2020), scraped from afltables.com. There are approximately 23 PI for every player in every match for every season.
Biometric data for each player (height, weight, and age), gathered manually and posted to github as a csv file.
Game schedule data - teams, venue, and final score
Player position data - gathered manually
My Hypothesis:
Breaking the PI into per-game, per-PI aggregates by position group, applying a multi-layer neural network to each position group, and then feeding the outputs of those networks into a final network that predicts game outcome (and, separately, the continuous score differential) will exceed the bookmaker benchmark of approximately 71% accuracy and provide insight into the most influential PI for each position group.
Two Parts to the Task:
The first part is to use past performance to predict each PI to the nearest quartile and then randomize the actual number within that quartile's range (sketched below).
The second part is to use those projected numbers to predict the outcome of matches not yet played, just as the actual data would be used to predict actual match outcomes.
That raises two questions: which position group's PI most influence match outcome, and what most influences each position group's PI?
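Here is a minimal sketch of the quartile idea, with a made-up PI history and a hypothetical function name:

import numpy as np
import pandas as pd

# Made-up history of one PI (e.g. team kicks per round).
history = pd.Series([210, 198, 225, 240, 188, 232, 215, 205, 220, 228])
edges = history.quantile([0.0, 0.25, 0.5, 0.75, 1.0]).to_numpy()  # quartile boundaries

def simulate_pi(predicted_quartile, rng=np.random.default_rng()):
    # Draw a plausible value uniformly from inside the predicted quartile's range.
    low, high = edges[predicted_quartile - 1], edges[predicted_quartile]
    return rng.uniform(low, high)

print(simulate_pi(3))  # e.g. the model predicts a third-quartile performance next round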
Here you can find the Python files (.ipynb files) and database (CSV) files that I used to create the dataset.
https://github.com/michaelschlitzer/2019_Footy_Player_Dictionary
The first step is contained in the Preparatory work ipynb file.
After downloading and initializing the player biometric data from github, this file does the key job of importing the data from afltables.com for every match for every team from 2012 to 2020. This takes a long time (approximately 15 minutes). The 2020 season was interrupted by COVID, so it had an unusually low number of rounds. Therefore, I treated 2020 separately and padded it so that the dataframes would all be the same length.
I then did some data cleaning on player names before taking on the main task of the preparatory ipynb file.
The data from the web site is organized by player and by Performance Indicator (PI), with each web page representing a different round. I needed the information organized as a single PI for every round played in a season.
Once I have a common PI layout, I execute a groupby, applying a sum to the PI and a mean to the biometric data before re-merging the two results. This gives me the total performance and the average height, weight, and age of each position group (Forwards, Defenders, Midfielders, and Rucks) for every round.
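A minimal sketch of that aggregation, assuming a per-player dataframe with hypothetical column names (POSGRU for position group, plus PI and biometric columns):

import pandas as pd

# Hypothetical per-player rows for one round.
players = pd.DataFrame({
    'POSGRU':    ['Forward', 'Forward', 'Ruck', 'Defender'],
    'Kicks':     [12, 9, 7, 14],          # PI columns are summed per group
    'Handballs': [8, 11, 5, 6],
    'Height':    [188, 193, 204, 190],    # biometric columns are averaged per group
    'Weight':    [86, 92, 105, 88],
    'Age':       [24.1, 27.3, 29.0, 22.5],
})

pi_cols, bio_cols = ['Kicks', 'Handballs'], ['Height', 'Weight', 'Age']
totals = players.groupby('POSGRU')[pi_cols].sum()
bios = players.groupby('POSGRU')[bio_cols].mean()
by_group = totals.join(bios)   # one row per position group for the round
print(by_group)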
Then I create one long row of all of the PI for all 4 position groups to have a dataframe with round as the index and PI as the columns by position.
The final step is to save all of this information as a dictionary of nested dictionaries and export it as a JSON file. This JSON file is then saved on github for easier access in the next step, analysis.
The next step picks up with the Capstone ipynb file.
The first step here is to unpack the JSON file, turn the nested dictionaries into a functional dataframe, and do some data cleaning prior to shaping the data for analysis.
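A minimal sketch of the unpacking step; the file name and the season -> round -> column nesting shown here are assumptions about the structure:

import json
import pandas as pd

# Assumed structure: {season: {round: {column: value, ...}, ...}, ...}
with open('footy_by_round.json') as f:   # hypothetical file name
    nested = json.load(f)

rows = []
for season, rounds in nested.items():
    for rnd, columns in rounds.items():
        rows.append({'season': season, 'round': rnd, **columns})

df = pd.DataFrame(rows).set_index(['season', 'round'])
print(df.head())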
Data Cleaning
No team can play without Forwards, Defenders, or Midfielders, but most teams do not have a substitute for the Ruck (Center) position. Therefore, if a Ruck becomes injured during the season, another position player will be called upon to perform Ruck duties in addition to his "regular job".
This happened roughly 90 times over the 9-year period.
The main PI for a Ruck is the Hit Out (HO), so I determined the average number of Hit Outs per position group with and without a Ruck and used that as a baseline for games played without one. In those cases, I assigned the average number of Hit Outs to the Forward or Defender taking on the role and moved the remainder into the Ruck aggregate for that game. Then I took that player's weight, height, and age (or the averages, if the Hit Outs were shared between a Forward and a Defender) and put them in the Ruck biometric columns.
In the rare instance where subtracting the average would have created a negative number, I took 1 Hit Out from the actual position group and assigned it to the Ruck.
In this way I think I address the height, weight, and performance disadvantages that those teams faced better than if I had simply left those slots at 0.
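A rough sketch of that imputation rule; the column names and baseline average below are hypothetical:

def impute_ruck(row, fill_in_group='Forward', group_avg_ho=8.0):
    # Sketch: the fill-in group keeps its baseline average Hit Outs and the
    # excess is moved into the otherwise-empty Ruck aggregate.
    excess = row[f'{fill_in_group}_HO'] - group_avg_ho
    if excess <= 0:
        excess = 1   # never push the real group's Hit Outs negative
    row[f'{fill_in_group}_HO'] -= excess
    row['Ruck_HO'] = excess
    # Copy the fill-in group's biometrics into the Ruck slots.
    for col in ('Height', 'Weight', 'Age'):
        row[f'Ruck_{col}'] = row[f'{fill_in_group}_{col}']
    return row

game = {'Forward_HO': 12.0, 'Forward_Height': 193, 'Forward_Weight': 96, 'Forward_Age': 26.0}
print(impute_ruck(game))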
Data Shaping
Now that I have every game for every round for every team across all of the seasons in the sample, I need to turn that data into something that can be compared.
At this point I merged the PI / biometric dataframe with the Home and Away information to create two separate dataframes. Then I subtract the Away dataframe from the Home dataframe, which gives me the difference in all of the PI and all of the biometric data, alongside the unchanged categorical data that was included in the Home and Away files (state of origin, home stadium, game venue, etc.). This file also has the target variables: the final score differential and whether or not the Home team won.
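A minimal sketch of the home-minus-away differencing, using hypothetical home and away dataframes aligned on the same match index:

import pandas as pd

home = pd.DataFrame({'Kicks': [230, 198], 'Handballs': [160, 171], 'Score': [95, 67]},
                    index=['2019_R1_M1', '2019_R1_M2'])
away = pd.DataFrame({'Kicks': [215, 214], 'Handballs': [150, 180], 'Score': [80, 91]},
                    index=['2019_R1_M1', '2019_R1_M2'])

delta = home - away                                    # positive values favour the Home side
delta['home_win'] = (delta['Score'] > 0).astype(int)   # binary target
print(delta)                                           # delta['Score'] is the continuous target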
Adding Average Meters Gained
After reviewing the literature from Deakin University, I wanted to add Average Meters Gained (AMG) as a PI. My main source of PI did not include it, but I was able to find it from another source. However, the player names used were different enough from the other data that it took considerable effort to merge the biometric data, AMG data, and player PI data.
Everything has a normal distribution
Looking at the data across all PI for all position groups, there is enough data and it is all approximately normally distributed.
There is variability between team performances
If you look at any PI within a specific year, you can see the performance range from round to round. In some cases, the highs for some teams are lower than the lows for other teams. That matches the "eye test" from watching games, and the variability will be useful when it comes to estimating future performance.
Covariance / Collinearity
I set the correlation threshold at r = 0.90, and very few of the PI, taken over the entire data sample and grouped by position group, are collinear at that threshold.
Originally, I thought I might be able to use this information to reduce dimensionality, but the key collinear features are basically gaining possession of the ball and disposing of the ball, and I don't think it would be wise to discard that information in the neural network approach I am planning to use.
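A minimal sketch of the collinearity check at the r = 0.90 threshold, assuming a dataframe X whose columns are the PI:

import numpy as np
import pandas as pd

def collinear_pairs(X: pd.DataFrame, threshold: float = 0.90) -> pd.Series:
    # Report each PI pair whose absolute Pearson correlation exceeds the threshold.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
    return upper.stack().loc[lambda s: s > threshold].sort_values(ascending=False)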
The Key to the Analysis - game day deltas
Game Data
I have the fixture schedule - what games were played in what round, which team was home, which team was away, and where the game was played.
Merging PI with Game Data
Since I have all of the PI broken out by year, team, and round, I created a master Home dataframe with the PI for every Home team and an Away dataframe with the PI for every Away team. The key part of the analysis is not how many of any PI one team has, but the difference between the Home team's performance and the Away team's performance in the same game.
The Features
For my dataset of all games from 2012 - 2020, I wind up with a dataset of 1736 samples x 125 features. 116 of them are PI and 9 of them are categorical.
My target variable is Home win / Away win, and a second, continuous target variable is the score differential.
You win the game by scoring more points!
I do have a PI of points scored - Goals (6 points) and Behinds (1 point).
Fortunately, there is a strong linear relationship between scoring more points than the other team and winning the game.
I know that I'll need to decide how to handle scoring, since it is directly related to winning.
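For reference, the score itself is just arithmetic on those two PI, so the continuous target can be recomputed from them directly:

def afl_score(goals: int, behinds: int) -> int:
    # A goal is worth 6 points, a behind 1 point.
    return 6 * goals + behinds

# 12.10 (82) beats 11.14 (80) despite fewer scoring shots.
print(afl_score(12, 10) - afl_score(11, 14))   # margin of 2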
One Step Forward, Two Steps Back
As I have worked with this data set more and more, I have settled on some additional questions that I can try to answer.
Young et al. worked with a 15-year sample and found that more recent data had more predictive power and that Average Meters Gained (AMG) in particular was an important PI.
I split up my data and will try to answer four questions:
What is the baseline performance: the full 2012 - 2020 sample without AMG and not grouped by POSGRU?
How does the full sample (2012 - 2020) without any AMG PI, but grouped by POSGRU perform in comparison?
How does a sub-sample (2015 - 2020) with the AMG PI / POSGRU perform in comparison?
How does that same sub-sample (2015 - 2020) without the AMG PI / POSGRU perform in comparison?
This is, of course, in addition to my other goals:
Beat the bookie benchmark of about 71% accuracy
Use the historical PI data and the model to predict the outcome of games that have not yet been played, one round at a time.
Can I have similar success / findings with binned final score outcome?
T-Test
This analysis comes from https://www.kaggle.com/aaronl87/predicting-winner-and-afl-fantasy-points, and I thought it was interesting insofar as it uses the t-test statistic to provide some sense of feature importance across this big, undifferentiated feature list (a sketch of the approach follows below).
AMG doesn't even factor into the top 25 PI in terms of significance!
The top p-values are 0.00121, 0.026, and 0.026.
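A minimal sketch of that per-feature t-test ranking, assuming a dataframe df with one row per match, a home_win column, and PI-difference columns:

import pandas as pd
from scipy import stats

def ttest_feature_ranking(df: pd.DataFrame, target: str = 'home_win') -> pd.Series:
    # Rank features by the p-value of a two-sample t-test between wins and losses.
    wins, losses = df[df[target] == 1], df[df[target] == 0]
    pvals = {col: stats.ttest_ind(wins[col], losses[col], equal_var=False).pvalue
             for col in df.columns.drop(target)}
    return pd.Series(pvals).sort_values()   # smallest p-value = most separable feature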
Explained Variance
I also looked at explained variance for each dataset. No matter which dataset you look at, the explained-variance graphs are about the same - it takes a lot of the PI to explain the variance.
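A minimal sketch of the explained-variance check with scikit-learn's PCA, for a hypothetical feature matrix X:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def explained_variance_curve(X):
    # Cumulative explained-variance ratio after standardizing the PI.
    X_std = StandardScaler().fit_transform(X)
    return np.cumsum(PCA().fit(X_std).explained_variance_ratio_)

# e.g. components needed to reach 90% of the variance:
# curve = explained_variance_curve(X); n_90 = int(np.argmax(curve >= 0.90)) + 1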
Machine Learning Models - Setup
In order to maximize comparability, I restricted this part of the analysis to the 2015 - 2020 seasons. This allows me to test the influence of the Average Meters Gained PI.
I have 4 different datasets:
Sample 0 is NOT differentiated by POSGRU and does NOT contain AMG
Sample 1 is NOT differentiated by POSGRU and does contain AMG
Sample 2 is broken out by POSGRU and does NOT contain AMG
Sample 3 is broken out by POSGRU and does contain AMG
Initial Machine Learning Models
The problem, in all of its dimensionality, is not easy to separate linearly: it's a game, not a plant or a neatly defined object. Winning and losing can turn on whether the ball drifts just a few millimeters to the right or left.
Still, I created train / test splits (80 / 20) with a common random seed for all of the samples and ran the following classifiers (a sketch of the common setup appears after the list).
Naive Bayes
Logistic Regression
Support Vector Machine
Decision Tree
Random Forest (with n_estimators = 100)
XGBoost (a gradient-boosted tree ensemble)
Multi-layer Perceptron-based Neural Network (Raschka)
TensorFlow Neural Network (here I introduced a validation set, with an 80% train / 10% validation / 10% test split). My final results are reported on model performance on the test set only, not the validation set.
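Here is a minimal sketch of that common setup for the scikit-learn and XGBoost models (the seed value and hyperparameters shown are placeholders; the TensorFlow and Raschka networks were built separately):

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

SEED = 42   # placeholder; the same seed is reused for every sample

def run_classifiers(X, y):
    # One 80 / 20 split per sample, then the same classifiers fitted and scored.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=SEED)
    models = {
        'Naive Bayes': GaussianNB(),
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'SVM': SVC(),
        'Decision Tree': DecisionTreeClassifier(random_state=SEED),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=SEED),
        'XGBoost': XGBClassifier(eval_metric='logloss', random_state=SEED),
    }
    return {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}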
My T-Test and Explained Variance tests had indicated that certain PI were more influential / important than others.
From the Random Forest it is also possible to grab the key features, marry them back up to the DF and see what features seem to have the most influence.
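A minimal sketch of that step, assuming X is the feature dataframe and y the win / loss target:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def top_features(X: pd.DataFrame, y, n: int = 25) -> pd.Series:
    # Fit a Random Forest and return the n most important features by impurity decrease.
    rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
    return pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False).head(n)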
I cross-referenced the lists of the top 25 features (an admittedly arbitrary number), but running the classifiers on this reduced feature list did not improve performance consistently across all classifiers.
I may try to increase this number of features (while still remaining below the 106 total).
Initial Test Classification Results
Running all of these classifiers on the same data produced some interesting results.
My original idea of treating each POSGRU with its own neural network and then concatenating the layers together did not perform anywhere near as well as even the simplest linear classifiers.
I do think this is because "the magic" is in the mixture. No one POSGRU influences the win by itself. Everything has to work together.
The red line is the "gambling line". EVERYTHING performs well above the gambling line.
The gray line at the top is the best performance reported by Young et al. Very few of the models break that line.
Most interestingly, Sample 0, the data that is undifferentiated by POSGRU, outperforms the other samples, often by a large margin.
Tuning the TensorFlow Model with KFold
I ran extensive grid searches over every tunable hyperparameter of my TF model, for every dataset (a sketch of the grid-search pattern follows the list below).
I tuned and optimized:
batch_size
epochs
optimizer
learn_rate
momentum
init_mode
activation
weight_constraint
dropout_rate
layers
neurons
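A minimal sketch of the grid-search pattern referenced above, using the scikeras KerasClassifier wrapper and an illustrative (not actual) grid:

import tensorflow as tf
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import GridSearchCV, KFold

def build_model(neurons=64, activation='relu', dropout_rate=0.1, n_features=116):
    # n_features is illustrative; it should match the width of the sample being tuned.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(neurons, activation=activation),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

clf = KerasClassifier(model=build_model, verbose=0)
param_grid = {
    'model__neurons': [16, 32, 64],
    'model__dropout_rate': [0.0, 0.1, 0.2],
    'batch_size': [10, 20, 40],
    'epochs': [50, 100],
}
grid = GridSearchCV(clf, param_grid, cv=KFold(n_splits=5, shuffle=True, random_state=42))
# grid.fit(X_train, y_train); grid.best_params_ holds the winning combination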
I learned that:
Adding more layers doesn't always help and often decreases performance.
Reducing the number of initial neurons forces the model to select the best, most significant inputs at the start, which is helpful.
A small dropout layer can help to control overfitting
Still, with all of this tuning, I only improved the performance of the TF model a little.
Nothing performs as well as some of the simpler classifiers (Logistic Regression, SVC, and my simple Neural Network).
Tuning the TensorFlow Model with Optuna
In most scikit-learn estimators there are just a few hyperparameters to tune, and k-fold cross-validation can handle that degree of variability. When you move to a neural network in TensorFlow, the number of hyperparameters, and the interactions between them, escalate. What was almost possible to visualize and conceptualize becomes impossible to track.
Optuna works by sampling many hyperparameters together, dropping combinations that don't appear to be working, and focusing more processing power on the combinations that do appear to add value. The suggest calls below define the search space for each trial.
def objective(trial):
    # Each Optuna trial samples one hyperparameter combination for the Keras model.
    neurons = trial.suggest_int('neurons', 10, length)  # 'length' is defined earlier in the notebook
    momentum = trial.suggest_float('momentum', 0.0, 1.0)
    learning_rate_init = trial.suggest_float('learning_rate_init', 1e-5, 1e-3, log=True)
    initializers = trial.suggest_categorical('initializers', ['uniform', 'lecun_uniform', 'normal',
                                                              'zero', 'glorot_normal', 'glorot_uniform',
                                                              'he_normal', 'he_uniform'])
    activation_methods = trial.suggest_categorical('activations', ['softmax', 'softplus', 'softsign',
                                                                   'relu', 'tanh', 'sigmoid',
                                                                   'hard_sigmoid', 'linear'])
    weight_constraints = trial.suggest_int('weight_constraints', 1, 5)
    EPOCHS = trial.suggest_int('epochs', 20, 100)
    BATCHSIZE = trial.suggest_int('batch', 10, 60)
    optimizers = trial.suggest_categorical('optimizer', ['SGD', 'RMSprop', 'Adagrad', 'Adadelta',
                                                         'Adam', 'Adamax', 'Nadam'])
    n_layers = trial.suggest_int('n_layers', 1, 3)
    # ...build, fit, and evaluate the model with these values, then return its accuracy.
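For completeness, the suggest calls above live inside an objective function that Optuna drives roughly like this (the trial count and pruner are illustrative):

import optuna

study = optuna.create_study(direction='maximize',            # maximize test accuracy
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=100)   # unpromising combinations get pruned early

print(study.best_params)   # the winning hyperparameter combination
print(study.best_value)    # its accuracy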
I ran Optuna on my TensorFlow sequential models for all 4 dataset samples and Optuna did what Kfold or guessing alone could never have achieved.
In my previous model I had done a separate train / test split on the TensorFlow models to accommodate a validation set from the main data set. But, with the 2021 season underway, I created a separate, dedicated validation dataset from the first 4 rounds of 2021 (36 matches). I also changed the random_seed to a different number to grab a fresh split on the data.
Having done this, I went back and reorganized the TensorFlow models to use the same 80 / 20 train / test splits as all of the other models. I was also able to apply the predict method of most of the fitted models to this validation set, as seen in the graphs below.
Drawing Conclusions
Most simple ML classifiers cannot handle the increased dimensionality created by breaking PI out by POSGRU. So it is not until we begin to implement a Neural Network approach that the models can cut through what I consider to be the interesting noise of the additional data.
Research Answers:
I addressed my research questions using McNemar's test for statistical significance. I used McNemar's test rather than a simple t-test because of the k-folds in my general linear models - even though the train / test splits were the same across all classifiers, the actual data that produced the optimized hyperparameters differed. I think McNemar's test takes this variability into account and lets me focus on the outcomes of each classifier (a sketch of the comparison follows below).
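A minimal sketch of the pairwise McNemar's comparison using statsmodels, with hypothetical prediction arrays for two classifiers on the same test set:

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_pvalue(y_true, pred_a, pred_b) -> float:
    # Build the 2x2 agreement table (both right, only A right, only B right, both wrong).
    a_right, b_right = pred_a == y_true, pred_b == y_true
    table = np.array([[np.sum(a_right & b_right),  np.sum(a_right & ~b_right)],
                      [np.sum(~a_right & b_right), np.sum(~a_right & ~b_right)]])
    return mcnemar(table, exact=True).pvalue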
Question 1: Does adding AMG to the sample improve model performance? NO
So, when I compared the performance of all classifiers on Sample 0 to the performance on Sample 1, to evaluate the value that adding AMG provided, I was unable to reject the null hypothesis. The smallest p-value for any classifier was 0.256, so adding AMG into the mix did not significantly improve the performance of any classifier.
Question 2: Does the POSGRU breakout improve model performance? NO
Looking at my second question, whether breaking out the PI by POSGRU adds any significant value, the answer was similarly definitive. Here I compared the performance of Sample 0 to Sample 2 and Sample 1 to Sample 3. The only classifier that showed a statistically significant difference was the Decision Tree classifier, and it broke the wrong way, meaning that the aggregated data was significantly better than the POSGRU breakout for that particular classifier.
So, again, I could not reject the null hypothesis and the POSGRU breakout was not only not as helpful as I had hoped it would be, but was, in fact, detrimental to the overall model performance.
Question 3: Are performance differences across ML classifiers statistically significant? YES
The first two research questions are vertical analyses, meaning that we asked the question within each classifier. I also wanted to evaluate significance horizontally on one data sample across all the classifiers. We have already noted that the performance range is fairly narrow, but when I applied that same McNemar’s test horizontally I found that a few classifiers did provide a statistically significant performance improvement over others.
As you might expect, the Neural Network models were all significantly better than the baseline Naïve Bayes classifier and the similarly poor-performing Decision Tree classifiers. The differences in performance between all the other classifiers (which are really in a fairly narrow range) are not statistically significant.
P-Value Table for Classifier Performance on Sample 0
P-Value Table for Classifier Performance on Sample 1
P-Value Table for Classifier Performance on Sample 2
P-Value Table for Classifier Performance on Sample 3
Graphical Representations
EDA on the Next Steps
I have been in contact with Champion Data, the official stats provider for AFL and believe that I can get a much more detailed data set that will allow me to explore different dimensions of the game by location on the field and hopefully unlock different targets rather than just W / L.
One of the flaws in my Neural Network is that every POSGRU is measured against the final W/L and, quite frankly, it is a stretch to think that every Ruck Hit Out is going to have a significant impact over the course of an 80 minute game.
The bar for accuracy is already quite high, but I do think that a more finely tuned neural network can wring further insight out of the data and explore what the announcers call "our great game" of Australian Football.
The level of detail for just one game is incredible - 920 samples x 230 PI features! Each player's PI is broken out by quarter and by position on the field, allowing for a deep dive into the game. While the data is not sequentialized or time-stamped, it does permit me to look at contests by location on the oval, which is something that I couldn't do before.
By combining my POSGRU data with this Champion Data sample and just looking at one group of PI (Contested Marks) you can begin to see how this data almost paints a picture of what is happening for a team in a particular quarter.
There is so much data available in this sample that it radically changes the possibilities for analysis. I am excited to see where this goes in the future.
This is a 20 minute recap of the entire project.