Mateusz Rychlicki is a Computer Science student. He has been involved in programming since high school, where he earned a finalist title in the Polish Olympiad in Informatics. During his studies, he became passionate about Machine Learning and an active member of the Machine Learning Society. He also combines his scientific passions with social activity as a board member of the Students Union.
Mikołaj Pacek is a student of Computer Science. His long-standing passion for Mathematics led him to become a finalist of the Polish Mathematical Olympiad. At university, he quickly became passionate about Machine Learning. In his free time (basically never) he plays bass and drums.
Maciej Mikuła is an aspiring student of Computer Science and Mathematics. He is passionate about algorithms, mathematics, artificial intelligence, and cutting-edge technologies. In the AI field, he has obtained several Coursera certificates in Machine Learning and Deep Learning, and he is currently enrolled in several university courses devoted to those fields. In his leisure time he attends CrossFit classes and improves his cooking skills.
The particular result we wanted to build on was about observations. There are three distinct observation types in GFootball: Pixels, Super MiniMap, and Simple115 ("Floats").
While the Pixels representation seems natural, it fails due to a large share of noise and an enormous representation size. The Super MiniMap and Simple115 representations are comparable, although Simple115 contains more data about the environment. These two approaches were tested by the authors of the paper.
It turned out that the Super MiniMap representation outperformed the "Floats" (Simple115) representation of the game state. We found this result counterintuitive, because the floats reveal much more information and incur lower computational overhead. We decided to create a powerful agent that uses the "Floats" representation.
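For context, this is roughly how the representation is selected when creating an environment. It is a minimal sketch assuming the gfootball package's create_environment API; the scenario name is just an example.

```python
import gfootball.env as football_env

# 'pixels'    -> raw rendered frames
# 'extracted' -> Super MiniMap (stacked binary planes)
# 'simple115' -> the "Floats" vector of 115 values
env = football_env.create_environment(
    env_name='11_vs_11_easy_stochastic',  # scenario with the 'easy' built-in bot
    representation='simple115',           # the "Floats" representation we focus on
    rewards='scoring,checkpoints')        # scoring reward + checkpoint shaping

obs = env.reset()                         # a vector of 115 floats for a single agent
```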
So we began our journey with the experiments. As the main setup, we used the implementation of the PPO2 algorithm from OpenAI Baselines, and we ran our experiments on Google Kubernetes Engine.
We performed a number of experiments to tune the hyperparameters of PPO and of the policy, including the number of layers and their sizes. Then, using a 5-layer MLP with 128 nodes in each layer, we trained two agents: one against the 'easy' bot and one against the 'hard' bot. While the easy bot was easy to defeat, the performance on the hard scenario was unsatisfactory: after almost 800M frames, the average reward was below 2, which meant that the average goal difference was below 1.
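Our setup looked roughly like the sketch below, assuming OpenAI Baselines' ppo2 and GFootball's create_environment; the hyperparameters shown are illustrative rather than the tuned values.

```python
import gfootball.env as football_env
from baselines import logger
from baselines.ppo2 import ppo2
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv


def make_env():
    return football_env.create_environment(
        env_name='11_vs_11_hard_stochastic',  # the 'hard' built-in bot
        representation='simple115',
        rewards='scoring,checkpoints')


if __name__ == '__main__':
    logger.configure()
    vec_env = SubprocVecEnv([make_env for _ in range(16)])  # 16 parallel environments

    # 5-layer MLP with 128 units per layer, as described above.
    ppo2.learn(
        network='mlp',
        env=vec_env,
        total_timesteps=int(8e8),
        num_layers=5,
        num_hidden=128)
```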
Easy scenarios have an obvious advantage: the agent rapidly attains high rewards and starts producing valid behaviours. But then it overfits. We therefore decided to continuously adapt the difficulty of the scenario, which is expressed as a real number between 0 and 1. Whenever the mean reward of our policy reached a selected threshold, the difficulty was increased by epsilon starting from the following round. We performed a grid search over these parameters. Below is a comparison of the adaptive and non-adaptive approaches to difficulty.
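Conceptually, the adaptive difficulty can be expressed as a small gym wrapper like the sketch below. How the new difficulty is actually pushed into the engine depends on GFootball internals, so the `_apply_difficulty` hook here is hypothetical, and the 100-episode averaging window is an assumption.

```python
import collections
import gym
import numpy as np


class AdaptiveDifficultyWrapper(gym.Wrapper):
    """Raises the opponent difficulty once the mean reward crosses a threshold."""

    def __init__(self, env, threshold=1.1, epsilon=0.001, start_difficulty=0.05):
        super().__init__(env)
        self._threshold = threshold
        self._epsilon = epsilon
        self._difficulty = start_difficulty
        self._episode_reward = 0.0
        self._recent_rewards = collections.deque(maxlen=100)  # rolling window

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._episode_reward += reward
        if done:
            self._recent_rewards.append(self._episode_reward)
            self._episode_reward = 0.0
            if np.mean(self._recent_rewards) >= self._threshold:
                # Bump the difficulty starting from the next episode.
                self._difficulty = min(1.0, self._difficulty + self._epsilon)
                self._apply_difficulty(self._difficulty)
        return obs, reward, done, info

    def _apply_difficulty(self, difficulty):
        # Hypothetical hook: in practice this has to modify the scenario's
        # opponent difficulty before the next episode starts.
        pass
```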
This approach brought a significant improvement. The rewards at the end of training were substantially higher, even though the scenario was harder. The adaptive difficulty easily reached the maximum level of 1, surpassing even the constant 0.95 difficulty of the hard bot. The GFootball paper reports an average goal difference below 2 after 500M steps, and here we managed to keep the goal difference above 2 after 600M steps. Compared to that paper, we obtained similar results in a similar number of environment frames.
In this case, threshold = 1.1 and epsilon = 0.001.
We ran a grid search over the threshold and epsilon. We concluded that increasing the threshold should generally improve the results, at least up to a mean reward of 3 (and an average goal difference of around 2). In the graph below, the mean rewards after 790M steps are plotted against the threshold.
Epsilon was also included in the hyperparameter search. Intuitively, training would be slow if epsilon were too low. On the other hand, if epsilon were too high, the rewards would be halted by a 'ceiling' related to the current difficulty, and the agent would need to 'recover' before it could carry on learning. The difficulty curves are plotted against time on the right, with the threshold set to 2.5.
Our computational demands grew rapidly. Initially, a single training run of 800M steps took 15 days. This was a bottleneck that prevented us from iterating quickly on our models. Based on our usage data, we decided to use 32 CPUs instead of 16 and to run 64 parallel environments instead of 16. We also slightly tuned our PPO and difficulty-wrapper settings. In this setting, training ended once the mean reward exceeded 2. Eventually, our training time was reduced from 15 days to 1.5 days, although it now covered only 175M steps.
As a reminder, the checkpoint reward gives an additional bonus of between 0 and 1 for moving forward with the ball. Initially, it effectively incentivises attacking, but it loses its purpose once the agent starts winning matches. It can distract agents in the later stages of training, as they are rewarded for running with the ball rather than for scoring goals. In our experiments, we decreased the checkpoint reward over time, down to zero at the end of training.
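One way to implement this is to create the environment with the scoring reward only and re-apply the checkpoint shaping through a wrapper whose bonus decays over time. The sketch below subclasses GFootball's CheckpointRewardWrapper as we understand its interface, and the linear decay schedule is just one possible choice.

```python
from gfootball.env import wrappers


class AnnealedCheckpointReward(wrappers.CheckpointRewardWrapper):
    """Checkpoint shaping whose weight decays linearly to zero."""

    def __init__(self, env, total_steps):
        super().__init__(env)
        self._total_steps = total_steps
        self._steps_done = 0

    def reward(self, reward):
        shaped = super().reward(reward)
        bonus = shaped - reward                    # the checkpoint component only
        self._steps_done += 1
        # Coefficient decays from 1.0 at the start of training to 0.0 at the end.
        coef = max(0.0, 1.0 - self._steps_done / self._total_steps)
        return reward + coef * bonus
```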
In the single-agent setting, the policy controls one player at a time; the others are controlled by the same built-in bot that controls the opponent. By default, these teammates have maximum "firepower" (difficulty set to 1). We decided to gradually decrease this "firepower" parameter. The evaluation criteria become less clear, as the mean rewards go down, but it seems likely that an agent trained with weaker teammates would outperform standard agents if given a "smarter" team.
Having conducted a variety of experiments, it was high time we compared their results in a more comprehensive way.
The GRF league (a.k.a. the 'public leaderboard') is an excellent place to test your agents by playing friendly matches against other people's agents. But what about picking the best among your own agents? How do you evaluate and pick the best checkpoint once you have plenty of them?
This is where the 'private leaderboard' comes in, enabling us to conduct matches in a more controlled manner.
Keeping in mind our final goal, which was climbing the ladder of the GRF league, we recreated its environment and logic, which enabled us to use the same inference models on both leaderboards. Not only has this tool allowed us to get the most out of our models in the GRF league, it has also provided a lot of insight and ideas on how to train our future agents.
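A single evaluation match on our private leaderboard looks roughly like the sketch below. It assumes GFootball's option to control one player on each team; `left_policy` and `right_policy` stand in for two loaded checkpoints that map an observation to an action.

```python
import gfootball.env as football_env


def play_match(left_policy, right_policy, n_steps=3000):
    """Plays one match between two policies and returns the left side's goal difference."""
    env = football_env.create_environment(
        env_name='11_vs_11_stochastic',
        representation='simple115',
        rewards='scoring',                        # evaluate on goals only
        number_of_left_players_agent_controls=1,
        number_of_right_players_agent_controls=1)
    obs = env.reset()
    goal_difference = 0
    for _ in range(n_steps):
        # With one controlled player per team, observations and rewards come in pairs.
        actions = [left_policy(obs[0]), right_policy(obs[1])]
        obs, reward, done, _ = env.step(actions)
        goal_difference += reward[0]              # +1 / -1 from the left side's point of view
        if done:
            break
    return goal_difference
```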
How to climb up the ladder? Fit the model against other agents!
Having trained our own 'league of players', we decided to train our models against them. To get the most out of this training, we introduced so-called 'swappers'. You can think of a swapper as a wrapper around a pool of players. Each swapper follows some logic and picks one of the players, which becomes the current opponent of the agent being trained. As training progresses, the swapper reacts to recent events and swaps the opponent accordingly.
Unsurprisingly, the models produced by this setup outperform any other model on the private leaderboard. As always, it is crucial to prevent overfitting in order to achieve better scores in the GRF league.
Exemplary types of swapper:
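As an illustration, here is a minimal sketch of one possible swapper. The "lowest win rate first" logic below is hypothetical, but every swapper shares the same shape: keep a pool of frozen opponents and decide, after each episode, whom the trained agent plays next.

```python
import random


class WinRateSwapper:
    """Picks the opponent the trained agent currently struggles with the most."""

    def __init__(self, opponent_pool):
        self._pool = list(opponent_pool)                 # frozen checkpoints / policies
        self._wins = {opp: 0 for opp in self._pool}
        self._games = {opp: 0 for opp in self._pool}
        self.current = random.choice(self._pool)         # current opponent

    def report_result(self, won):
        """Called by the training loop at the end of every episode."""
        self._games[self.current] += 1
        self._wins[self.current] += int(won)
        self.current = self._pick_next()

    def _pick_next(self):
        # Favour the opponent with the lowest observed win rate so far.
        def win_rate(opp):
            games = self._games[opp]
            return self._wins[opp] / games if games else 0.0
        return min(self._pool, key=win_rate)
```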
Statistical information provides easily accessible, valuable insights into the behaviour of our agents. We created an environment wrapper that tracks various pieces of information.
Here we see how frequently each action is taken during training.
How shooting accuracy changes during training.
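A stripped-down version of this wrapper might look like the sketch below; the shot-action id and the scoring-reward convention are assumptions about GFootball's default action set rather than a definitive mapping.

```python
import collections
import gym

SHOT_ACTION = 12   # assumed id of the 'shot' action in GFootball's default action set


class StatisticsWrapper(gym.Wrapper):
    """Counts actions and tracks a rough shooting accuracy during training."""

    def __init__(self, env):
        super().__init__(env)
        self.action_counts = collections.Counter()
        self.shots = 0
        self.goals = 0

    def step(self, action):
        self.action_counts[int(action)] += 1
        if int(action) == SHOT_ACTION:
            self.shots += 1
        obs, reward, done, info = self.env.step(action)
        if reward >= 1:                            # a scoring reward of +1 means we scored
            self.goals += 1
        return obs, reward, done, info

    def shot_accuracy(self):
        return self.goals / self.shots if self.shots else 0.0
```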
The leaderboard generates a lot of data from agent-to-agent matches. These matches are also frequent: over 1000 games have been played already. Our Statistics Wrapper helps us analyse these matches and use this knowledge to improve our agents.
On the graphs below, the x-axis represents each player's rank on the leaderboard (the best player is on the left and the worst on the right), and the y-axis represents the mean value of a statistic for each player. A best-fit line is also plotted.
No surprise here: the better the player, the more accurately they shoot.
The brutal truth: fouling is key to victory!
Each square in the heatmap on the right represents the cosine similarity between mean statistics vectors, computed for each player separately. Two types of agents emerge. It turns out that type 1 agents were trained with swappers, while type 2 agents were trained with increasing difficulty of the built-in bots. This suggests that swappers help diversify our agents, and that we can work on further ways to introduce diversity.
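For reference, the heatmap itself is straightforward to compute: each player is described by the vector of their mean per-match statistics, and every cell holds the cosine similarity between two such vectors. A small sketch with illustrative variable names:

```python
import numpy as np


def cosine_similarity_matrix(stats):
    """stats: array of shape (n_players, n_statistics) holding mean statistics."""
    normed = stats / np.linalg.norm(stats, axis=1, keepdims=True)
    return normed @ normed.T   # (n_players, n_players) matrix of pairwise similarities
```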
Thanks to the insights from the private leaderboard, we discovered that our agents seem to be homogeneous within their own groups. This need for diversity can be addressed by altering the observation space.
The Floats (Simple115) state representation encodes each player's position as absolute coordinates, relative to the centre of the pitch. We added the coordinates of each player relative to the active player's position.
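A sketch of this augmentation as an observation wrapper is shown below; the Simple115 index layout assumed here (left-team positions first, the active-player one-hot at indices 97-107) follows our reading of the format and may need adjusting.

```python
import gym
import numpy as np


class RelativeObservationWrapper(gym.ObservationWrapper):
    """Appends (x, y) of every player relative to the active player to Simple115."""

    def __init__(self, env):
        super().__init__(env)
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(115 + 44,), dtype=np.float32)

    def observation(self, obs):
        active = int(np.argmax(obs[97:108]))              # index of the controlled player
        active_xy = obs[2 * active: 2 * active + 2]       # its absolute (x, y)
        # Absolute (x, y) of all 22 players: left team first, then right team.
        positions = np.concatenate([obs[0:22], obs[44:66]]).reshape(-1, 2)
        relative = (positions - active_xy).flatten()      # 44 additional features
        return np.concatenate([obs, relative]).astype(np.float32)
```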
While the key idea behind Relative Observations is to enrich our observations with higher-quality inputs, Really Poor Observations take the opposite approach. By deliberately erasing most of the information, retaining only the ball location, player positions, and player directions, we hope to gain new insights into the game from previously neglected data. These agents would likely be worse than the full-observation ones, but could still attain decent results.
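A companion sketch for the Really Poor Observations, using the same assumed index layout as above, simply keeps the first 91 features (player positions, player directions and the ball location) and drops the rest.

```python
import gym
import numpy as np


class ReallyPoorObservationWrapper(gym.ObservationWrapper):
    """Keeps only player positions, player directions and the ball location."""

    def __init__(self, env):
        super().__init__(env)
        # 44 position values + 44 direction values + 3 ball coordinates = 91 features.
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(91,), dtype=np.float32)

    def observation(self, obs):
        return obs[:91].astype(np.float32)   # everything else is deliberately dropped
```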