Deep reinforcement learning (DRL) has achieved superhuman performance on complex video games (e.g., StarCraft II and Dota 2). However, current DRL systems still suffer from challenges such as multi-agent coordination, sparse rewards, and stochastic environments. In seeking to address these challenges, we employ a football video game, namely Google Research Football (GRF), as our testbed and develop an end-to-end learning-based AI system (denoted as TiKick) to complete this challenging task. In this work, we first generated a large replay dataset from the self-play of single-agent experts obtained from league training. We then developed a distributed learning system and new offline algorithms to learn a powerful multi-agent AI from the fixed single-agent dataset. To the best of our knowledge, TiKick is the first learning-based AI system that can take over the multi-agent Google Research Football full game, while previous work could either control a single agent or only experiment on toy academic scenarios. Extensive experiments further show that our pre-trained model can accelerate the training process of modern multi-agent algorithms and that our method achieves state-of-the-art performance on various academic scenarios.
Deep reinforcement learning (DRL) has shown great success in many video games, including Atari games, StarCraft II, and Dota 2. However, current DRL systems still suffer from challenges such as multi-agent coordination, sparse rewards, and stochastic environments. In seeking to address these challenges, we employ a football video game, namely Google Research Football (GRF), as our testbed.
Even though much work has been done recently, many problems remain in building agents for the GRF: (1) Multiple Players: the GRF involves both cooperative and competitive players. For cooperative players, the joint action space is extremely large, so it is hard to build a single agent that controls all the players. Moreover, competitive players mean that the opponents are not fixed, so the agents must adapt to various opponents. (2) Sparse Rewards: the goal of the football game is to maximize the score, which can only be achieved after a long sequence of near-perfect decisions, and it is almost impossible to receive a positive reward when starting from random agents. (3) Stochastic Environments: the GRF introduces stochasticity into the environment, which means the outcome of taking a certain action is not deterministic. This can improve the robustness of the trained agents but also increases the training difficulty.
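To make this setting concrete, the snippet below shows one way to instantiate the GRF full game with all ten non-goalkeeper players controlled by the learning side and only the sparse scoring reward. The parameter names follow the public gfootball package as we understand it and may differ slightly across versions.

```python
# A minimal sketch of instantiating the GRF full game for multi-agent control,
# using the public gfootball API (parameter names may differ across versions).
import gfootball.env as football_env

env = football_env.create_environment(
    env_name="11_vs_11_stochastic",            # full game against the built-in AI
    representation="simple115v2",              # compact 115-dim vector per player
    rewards="scoring",                         # sparse reward: only goals are rewarded
    number_of_left_players_agent_controls=10,  # control all ten non-goalkeeper players
)

obs = env.reset()   # one observation per controlled player
print(len(obs))     # -> 10
```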
To address the aforementioned issues, we develop an end-to-end learning-based AI system (denoted as TiKick) to complete this challenging task. In this work, we first generated a large replay dataset from the self-play of single-agent experts, which were obtained from league training. We then developed a distributed learning system and new offline algorithms to learn a powerful multi-agent AI from the fixed single-agent dataset. To the best of our knowledge, TiKick is the first learning-based AI system that can take over the multi-agent Google Research Football full game, while previous work could either control a single agent or only experiment on toy academic scenarios. Extensive experiments further show that our pre-trained model can accelerate the training process of modern multi-agent algorithms and that our method achieves state-of-the-art performance on various academic scenarios.
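As a rough sketch of the data-collection stage (not the exact TiKick pipeline), one could roll out the expert against itself and store the per-step observations, actions, and rewards. Here `wekick_policy` is a hypothetical placeholder for the trained single-agent expert, and the assumption that each side controls one designated player follows the competition setting.

```python
# Sketch of collecting self-play replay episodes into a fixed offline dataset.
# `wekick_policy` is a hypothetical stand-in for the trained single-agent expert.
import numpy as np
import gfootball.env as football_env

def wekick_policy(player_obs):
    # Placeholder expert: the real WeKick model would go here.
    return np.random.randint(0, 19)  # 19 discrete actions in the default action set

def collect_episode(env):
    """Roll out one self-play episode and store per-step observations/actions/rewards."""
    episode = {"observations": [], "actions": [], "rewards": []}
    obs = env.reset()
    done = False
    while not done:
        # One designated player per side, both driven by the same expert.
        actions = [wekick_policy(o) for o in obs]
        next_obs, rewards, done, _ = env.step(actions)
        episode["observations"].append(obs)
        episode["actions"].append(actions)
        episode["rewards"].append(rewards)
        obs = next_obs
    return episode

env = football_env.create_environment(
    env_name="11_vs_11_stochastic",
    representation="simple115v2",
    rewards="scoring",
    number_of_left_players_agent_controls=1,   # designated player, left team
    number_of_right_players_agent_controls=1,  # designated player, right team
)
dataset = [collect_episode(env) for _ in range(10)]  # the real dataset contains 21,947 episodes
```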
We collect a single-agent dataset and use it to learn multi-agent control models. To collect an expert single-agent dataset, we first obtain a single-agent AI, denoted as WeKick, from self-play league training. WeKick took first place at the Google Research Football Competition 2020 and is the most powerful football AI in the world to date. We let WeKick play against itself and store all the battle data, including raw observations, actions, and rewards. During self-play, only the designated player on each side is controlled by WeKick, and the designated player is not fixed: it switches automatically according to the built-in strategy. Each round (or episode) lasts 3,000 steps. In total, we collected 21,947 episodes from self-play. This dataset is then used to train our multi-agent offline RL model and other offline RL baselines.

Since the dataset is collected from single-agent control, it is easy to train a single-agent model with behavior cloning. However, such a model cannot be directly applied to multi-agent control, and many problems arise. For example, we find that if we control all ten players on the court with the same trained model, the players huddle together because every player tries to get the ball. In single-agent play, we only need to control the player who is closest to the ball most of the time, and the designated player changes dynamically, which means the observation inputs to the model switch between different players. For multi-agent control, however, we also need to control players who are far from the ball, and the observation inputs to each control model always come from a specific player. To handle these differences between single-agent and multi-agent play, we carefully designed the observations, actions, and our learning algorithm (a minimal sketch of an offline update on this dataset is given after the video below). We show the full game video of TiKick (yellow) trained with the offline dataset below:
TiKick (yellow) plays against the built-in AI (blue). All ten players are controlled by TiKick.
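For context, a minimal behavior-cloning-style update on such a fixed dataset might look like the sketch below. This is a generic illustration rather than TiKick's actual offline algorithm, and the network sizes and batch interface are placeholders.

```python
# Minimal behavior-cloning-style update on a fixed replay dataset (PyTorch).
# A generic illustration only; not TiKick's actual offline algorithm.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, obs_dim=115, num_actions=19):  # simple115v2 obs, default action set
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, obs):
        return self.net(obs)  # action logits

policy = PolicyNet()
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

def bc_update(obs_batch, action_batch):
    """One supervised step: imitate the expert action recorded in the dataset."""
    logits = policy(obs_batch)              # (batch, num_actions)
    loss = loss_fn(logits, action_batch)    # action_batch: (batch,) integer action labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```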
We also train TiKick with multi-agent reinforcement learning and achieve state-of-the-art performance on the academic scenarios. We show the videos of TiKick below; a short snippet listing the corresponding GRF scenario names follows the videos:
Hard Counter-attack
Corner
3 vs 1 with Keeper
Run, Pass and Shoot with Keeper
Run to Score with Keeper
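For reference, these scenarios correspond to the following environment names in the gfootball package (verify the names against your installed version):

```python
# GRF academy scenario names corresponding to the videos above
# (as shipped with the gfootball package; check your installed version).
academy_scenarios = [
    "academy_counterattack_hard",
    "academy_corner",
    "academy_3_vs_1_with_keeper",
    "academy_run_pass_and_shoot_with_keeper",
    "academy_run_to_score_with_keeper",
]
```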
Please cite our paper if you use our code or weights in your own work: