Mateusz Rychlicki is a Computer Science student. He has been involved in programming since high school, where he earned a finalist title in the Polish Olympiad in Informatics. During his studies, he became passionate about Machine Learning and an active member of the Machine Learning Society. He also combines his scientific passions with social activity as a board member of the Students Union.
Mikołaj Pacek is a student of Computer Science. His long-standing passion for Mathematics led him to become a finalist of the Polish Mathematical Olympiad. At university, he quickly became passionate about Machine Learning. In his free time (basically never) he plays bass and drums.
Maciej Mikuła is an aspiring student of Computer Science and Mathematics. He is passionate about algorithms, mathematics, artificial intelligence, and cutting-edge technologies. In the AI field, he has obtained several Coursera certificates in Machine Learning and Deep Learning, and he is currently enrolled in several university courses devoted to those fields. In his leisure time he attends CrossFit classes and improves his cooking skills.
The particular result we wanted to build on was about observations. There are three distinct observation types in GFootball: Pixels, Super MiniMap, and Simple115 ("Floats").
While the Pixels representation seems natural, it fails due to a large share of noise and an enormous representation size. The Super MiniMap and Simple115 representations are comparable, although Simple115 contains more data about the environment. These two approaches were tested by the authors of the paper.
It turned out that the Super MiniMap representation outperformed the "Floats" (Simple115) representation of the game state. We found this result counterintuitive, because the floats reveal much more information and incur lower computational overhead. We decided to create a powerful agent that uses the "Floats" representation.
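For context, this is roughly how the representation is selected when creating an environment. It is a minimal sketch assuming the gfootball package's create_environment API; the scenario name is just an example.

```python
import gfootball.env as football_env

# 'pixels'    -> raw rendered frames
# 'extracted' -> Super MiniMap (stacked binary planes)
# 'simple115' -> the "Floats" vector of 115 values
env = football_env.create_environment(
    env_name='11_vs_11_easy_stochastic',  # scenario with the 'easy' built-in bot
    representation='simple115',           # the "Floats" representation we focus on
    rewards='scoring,checkpoints')        # scoring reward + checkpoint shaping

obs = env.reset()                         # a vector of 115 floats for a single agent
```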
So we began our journey with the experiments. As the main setup, we used the implementation of the PPO2 algorithm from OpenAI Baselines, and we ran our experiments on Google Kubernetes Engine.
We performed a number of experiments to tune the hyperparameters of PPO and of the policy, including the number of layers and their sizes. Then, using a 5-layer MLP with 128 nodes in each layer, we trained two agents: one against the 'easy' bot and one against the 'hard' bot. While the easy bot was easy to defeat, the performance on the hard scenario was unsatisfactory: after almost 800M frames, the average reward was below 2, which meant that the average goal difference was below 1.
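Our setup looked roughly like the sketch below, assuming OpenAI Baselines' ppo2 and GFootball's create_environment; the hyperparameters shown are illustrative rather than the tuned values.

```python
import gfootball.env as football_env
from baselines import logger
from baselines.ppo2 import ppo2
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv


def make_env():
    return football_env.create_environment(
        env_name='11_vs_11_hard_stochastic',  # the 'hard' built-in bot
        representation='simple115',
        rewards='scoring,checkpoints')


if __name__ == '__main__':
    logger.configure()
    vec_env = SubprocVecEnv([make_env for _ in range(16)])  # 16 parallel environments

    # 5-layer MLP with 128 units per layer, as described above.
    ppo2.learn(
        network='mlp',
        env=vec_env,
        total_timesteps=int(8e8),
        num_layers=5,
        num_hidden=128)
```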
Easy scenarios have an obvious advantage: the agent rapidly attains high rewards and starts producing valid behaviours. But then it overfits. We therefore decided to continuously adapt the difficulty of the scenario, which is expressed as a real number between 0 and 1. Whenever the mean reward of our policy reached a selected threshold, the difficulty was increased by epsilon starting from the following round. We performed a grid search over these parameters. Below is a comparison of the adaptive and non-adaptive approaches to difficulty.
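Conceptually, the adaptive difficulty can be expressed as a small gym wrapper like the sketch below. How the new difficulty is actually pushed into the engine depends on GFootball internals, so the `_apply_difficulty` hook here is hypothetical, and the 100-episode averaging window is an assumption.

```python
import collections
import gym
import numpy as np


class AdaptiveDifficultyWrapper(gym.Wrapper):
    """Raises the opponent difficulty once the mean reward crosses a threshold."""

    def __init__(self, env, threshold=1.1, epsilon=0.001, start_difficulty=0.05):
        super().__init__(env)
        self._threshold = threshold
        self._epsilon = epsilon
        self._difficulty = start_difficulty
        self._episode_reward = 0.0
        self._recent_rewards = collections.deque(maxlen=100)  # rolling window

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._episode_reward += reward
        if done:
            self._recent_rewards.append(self._episode_reward)
            self._episode_reward = 0.0
            if np.mean(self._recent_rewards) >= self._threshold:
                # Bump the difficulty starting from the next episode.
                self._difficulty = min(1.0, self._difficulty + self._epsilon)
                self._apply_difficulty(self._difficulty)
        return obs, reward, done, info

    def _apply_difficulty(self, difficulty):
        # Hypothetical hook: in practice this has to modify the scenario's
        # opponent difficulty before the next episode starts.
        pass
```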
This approach brought a significant improvement. The rewards at the end of training were substantially higher, even though the scenario was harder. The adaptive difficulty easily reached the maximum level of 1, surpassing even the constant 0.95 difficulty of the hard bot. The GFootball paper reports an average goal difference below 2 after 500M steps, and here we managed to keep the goal difference above 2 after 600M steps. Compared to that paper, we obtained similar results in a similar number of environment frames.
In this case, threshold = 1.1 and epsilon = 0.001.
We ran a grid search over the threshold and epsilon. We concluded that increasing the threshold should generally improve the results, at least up to a mean reward of 3 (and an average goal difference of around 2). In the graph below, the mean rewards after 790M steps are plotted against the threshold.
Epsilon was also included in the hyperparameter search. Intuitively, training would be slow if epsilon were too low. On the other hand, if epsilon were too high, the rewards would be halted by a 'ceiling' related to the current difficulty, and the agent would need to 'recover' before it could carry on learning. The difficulty curves are plotted against time on the right, with the threshold set to 2.5.
Our computational demands grew rapidly. Initially, a single training run of 800M steps took 15 days. This was a bottleneck that prevented us from iterating quickly on our models. Based on our usage data, we decided to use 32 CPUs instead of 16 and to run 64 parallel environments instead of 16. We also slightly tuned our PPO and difficulty-wrapper settings. In this setting, training ended once the mean reward exceeded 2. Eventually, our training time was reduced from 15 days to 1.5 days, although it now covered only 175M steps.
As a reminder, the checkpoint reward gives an additional bonus of between 0 and 1 for moving forward with the ball. Initially, it effectively incentivises attacking, but it loses its purpose once the agent starts winning matches. It can distract agents in the later stages of training, as they are rewarded for running with the ball rather than for scoring goals. In our experiments, we decreased the checkpoint reward over time, down to zero at the end of training.
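One way to implement this is to create the environment with the scoring reward only and re-apply the checkpoint shaping through a wrapper whose bonus decays over time. The sketch below subclasses GFootball's CheckpointRewardWrapper as we understand its interface, and the linear decay schedule is just one possible choice.

```python
from gfootball.env import wrappers


class AnnealedCheckpointReward(wrappers.CheckpointRewardWrapper):
    """Checkpoint shaping whose weight decays linearly to zero."""

    def __init__(self, env, total_steps):
        super().__init__(env)
        self._total_steps = total_steps
        self._steps_done = 0

    def reward(self, reward):
        shaped = super().reward(reward)
        bonus = shaped - reward                    # the checkpoint component only
        self._steps_done += 1
        # Coefficient decays from 1.0 at the start of training to 0.0 at the end.
        coef = max(0.0, 1.0 - self._steps_done / self._total_steps)
        return reward + coef * bonus
```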
In the single-agent setting, the policy controls one player at a time; the others are controlled by the same built-in bot that controls the opponent. By default, these teammates have maximum "firepower" (difficulty set to 1). We decided to gradually decrease this "firepower" parameter. The evaluation criteria become less clear, as the mean rewards go down, but it seems likely that an agent trained with weaker teammates would outperform standard agents if given a "smarter" team.
Having conducted a variety of experiments, it was high time we compared their results in a more comprehensive way.
The GRF league (a.k.a. the 'public leaderboard') is an excellent place to test your agents by playing friendly matches against other people's agents. But what about picking the best among your own agents? How do you evaluate and pick the best checkpoint once you have plenty of them?
This is where the 'private leaderboard' comes in, enabling us to conduct matches in a more controlled manner.
Keeping in mind our final goal, which was climbing the ladder of the GRF league, we recreated its environment and logic, which enabled us to use the same inference models on both leaderboards. Not only has this tool allowed us to get the most out of our models in the GRF league, it has also provided a lot of insight and ideas on how to train our future agents.
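A single evaluation match on our private leaderboard looks roughly like the sketch below. It assumes GFootball's option to control one player on each team; `left_policy` and `right_policy` stand in for two loaded checkpoints that map an observation to an action.

```python
import gfootball.env as football_env


def play_match(left_policy, right_policy, n_steps=3000):
    """Plays one match between two policies and returns the left side's goal difference."""
    env = football_env.create_environment(
        env_name='11_vs_11_stochastic',
        representation='simple115',
        rewards='scoring',                        # evaluate on goals only
        number_of_left_players_agent_controls=1,
        number_of_right_players_agent_controls=1)
    obs = env.reset()
    goal_difference = 0
    for _ in range(n_steps):
        # With one controlled player per team, observations and rewards come in pairs.
        actions = [left_policy(obs[0]), right_policy(obs[1])]
        obs, reward, done, _ = env.step(actions)
        goal_difference += reward[0]              # +1 / -1 from the left side's point of view
        if done:
            break
    return goal_difference
```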
How to climb up the ladder? Fit the model against other agents!
Having trained our own 'league of players', we decided to train our models against them. To get the most out of this training, we introduced so-called 'swappers'. You can think of a swapper as a wrapper around a pool of players. Each swapper follows some logic and picks one of the players, which becomes the current opponent of the agent being trained. As training progresses, the swapper reacts to recent events and swaps the opponent accordingly.
Unsurprisingly, the models produced by this setup outperform any other model on the private leaderboard. As always, it is crucial to prevent overfitting in order to achieve better scores in the GRF league.
Exemplary types of swapper:
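As an illustration, here is a minimal sketch of one possible swapper. The "lowest win rate first" logic below is hypothetical, but every swapper shares the same shape: keep a pool of frozen opponents and decide, after each episode, whom the trained agent plays next.

```python
import random


class WinRateSwapper:
    """Picks the opponent the trained agent currently struggles with the most."""

    def __init__(self, opponent_pool):
        self._pool = list(opponent_pool)                 # frozen checkpoints / policies
        self._wins = {opp: 0 for opp in self._pool}
        self._games = {opp: 0 for opp in self._pool}
        self.current = random.choice(self._pool)         # current opponent

    def report_result(self, won):
        """Called by the training loop at the end of every episode."""
        self._games[self.current] += 1
        self._wins[self.current] += int(won)
        self.current = self._pick_next()

    def _pick_next(self):
        # Favour the opponent with the lowest observed win rate so far.
        def win_rate(opp):
            games = self._games[opp]
            return self._wins[opp] / games if games else 0.0
        return min(self._pool, key=win_rate)
```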
Statistical information provides easily accessible, valuable insights into the behaviour of our agents. We created an environment wrapper that tracks various pieces of information.
Here we see how frequently each action is taken during training.
How shooting accuracy changes during training.
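A stripped-down version of this wrapper might look like the sketch below; the shot-action id and the scoring-reward convention are assumptions about GFootball's default action set rather than a definitive mapping.

```python
import collections
import gym

SHOT_ACTION = 12   # assumed id of the 'shot' action in GFootball's default action set


class StatisticsWrapper(gym.Wrapper):
    """Counts actions and tracks a rough shooting accuracy during training."""

    def __init__(self, env):
        super().__init__(env)
        self.action_counts = collections.Counter()
        self.shots = 0
        self.goals = 0

    def step(self, action):
        self.action_counts[int(action)] += 1
        if int(action) == SHOT_ACTION:
            self.shots += 1
        obs, reward, done, info = self.env.step(action)
        if reward >= 1:                            # a scoring reward of +1 means we scored
            self.goals += 1
        return obs, reward, done, info

    def shot_accuracy(self):
        return self.goals / self.shots if self.shots else 0.0
```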
The leaderboard generates a lot of data from agent-to-agent matches. These matches are also frequent: over 1000 games have been played already. Our Statistics Wrapper helps us analyse these matches and use this knowledge to improve our agents.
On the graphs below, the x-axis represents each player's rank on the leaderboard (the best player is on the left and the worst on the right), and the y-axis represents the mean value of a statistic for each player. A best-fit line is also plotted.
No surprise here: the better the player, the more accurately they shoot.
The brutal truth: fouling is key to victory!
Each square in the heatmap on the right represents the cosine similarity between mean statistics vectors, computed for each player separately. Two types of agents emerge. It turns out that type 1 agents were trained with swappers, while type 2 agents were trained with increasing difficulty of the built-in bots. This suggests that swappers help diversify our agents, and that we can work on further ways to introduce diversity.
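For reference, the heatmap itself is straightforward to compute: each player is described by the vector of their mean per-match statistics, and every cell holds the cosine similarity between two such vectors. A small sketch with illustrative variable names:

```python
import numpy as np


def cosine_similarity_matrix(stats):
    """stats: array of shape (n_players, n_statistics) holding mean statistics."""
    normed = stats / np.linalg.norm(stats, axis=1, keepdims=True)
    return normed @ normed.T   # (n_players, n_players) matrix of pairwise similarities
```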
Thanks to the insights from the private leaderboard, we discovered that our agents seem to be homogeneous within their own groups. This need for diversity can be addressed by altering the observation space.
The Floats (Simple115) state representation encodes each player's position as absolute coordinates, relative to the centre of the pitch. We added the coordinates of each player relative to the active player's position.
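A sketch of this augmentation as an observation wrapper is shown below; the Simple115 index layout assumed here (left-team positions first, the active-player one-hot at indices 97-107) follows our reading of the format and may need adjusting.

```python
import gym
import numpy as np


class RelativeObservationWrapper(gym.ObservationWrapper):
    """Appends (x, y) of every player relative to the active player to Simple115."""

    def __init__(self, env):
        super().__init__(env)
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(115 + 44,), dtype=np.float32)

    def observation(self, obs):
        active = int(np.argmax(obs[97:108]))              # index of the controlled player
        active_xy = obs[2 * active: 2 * active + 2]       # its absolute (x, y)
        # Absolute (x, y) of all 22 players: left team first, then right team.
        positions = np.concatenate([obs[0:22], obs[44:66]]).reshape(-1, 2)
        relative = (positions - active_xy).flatten()      # 44 additional features
        return np.concatenate([obs, relative]).astype(np.float32)
```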
While the key idea behind Relative Observations is to enrich our observations with higher-quality inputs, Really Poor Observations take the opposite approach. By deliberately erasing most of the information, retaining only the ball location, player positions, and player directions, we hope to gain new insights into the game from previously neglected data. These agents would likely be worse than the full-observation ones, but could still attain decent results.
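A companion sketch for the Really Poor Observations, using the same assumed index layout as above, simply keeps the first 91 features (player positions, player directions and the ball location) and drops the rest.

```python
import gym
import numpy as np


class ReallyPoorObservationWrapper(gym.ObservationWrapper):
    """Keeps only player positions, player directions and the ball location."""

    def __init__(self, env):
        super().__init__(env)
        # 44 position values + 44 direction values + 3 ball coordinates = 91 features.
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(91,), dtype=np.float32)

    def observation(self, obs):
        return obs[:91].astype(np.float32)   # everything else is deliberately dropped
```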