Published as a conference paper at ICLR 2019 | http://arxiv.org/abs/1902.07151

Siqi Liu*, Guy Lever*, Josh Merel, Saran Tunyasuvunakool, Nicolas Heess, Thore Graepel


* Equal contributions

Emergent Coordinated Multi-Agent Behaviors through Competition

We study the emergence of cooperative behaviors in reinforcement learning agents using a challenging competitive multi-agent soccer environment with continuous simulated physics. We demonstrate that decentralized, population-based training with co-play can lead to a progression in agents' behaviors: from random, to simple ball chasing, and finally showing evidence of cooperation. Our study highlights several of the challenges encountered in large-scale multi-agent training in continuous control. In particular, we demonstrate that the automatic optimization of simple shaping rewards, not themselves conducive to cooperative behavior, can lead to long-horizon team behavior. We further propose an evaluation scheme, grounded in game-theoretic principles, that can assess agent performance in the absence of pre-defined evaluation tasks or human baselines.

Open-sourcing Soccer Environment!

We are excited to release the MuJoCo Soccer environment at github.com/deepmind/dm_control/locomotion/soccer as an open-source research platform for physical, competitive-cooperative multi-agent interactions.

If you use this package, please cite our accompanying paper: http://arxiv.org/abs/1902.07151.

Supplementary material for ICLR submission

Sample games

Video 1: bird's-eye perspective showing separate reward channels.

Video 2: representative consecutive games from a different angle.


In order to meaningfully evaluate our learned agents, we need to bootstrap our evaluation process. Concretely, we choose a set of fixed evaluation teams by Nash-averaging over a population of 10 teams previously produced by diverse training schemes, with 25B frames of learning experience each. We collected 1M tournament matches among the set of 10 agents.
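The tournament results can be summarized as a pairwise expected goal difference table, which is the payoff matrix of the meta-game that Nash averaging operates on. A minimal sketch of that estimation step follows; the match records and team names are illustrative stand-ins, not the paper's data.

```python
from collections import defaultdict

# Hypothetical match records: (team_i, team_j, goals_i, goals_j).
# The paper collects ~1M tournament matches among 10 teams; these
# few rows are placeholders for illustration only.
matches = [
    ("A", "B", 2, 1), ("A", "B", 1, 1), ("B", "A", 0, 3),
    ("B", "C", 2, 0), ("C", "B", 1, 2), ("C", "A", 2, 1),
]

def expected_goal_difference(matches):
    """Estimate the pairwise expected goal difference: the mean of
    (goals_i - goals_j) over all matches between teams i and j.
    The result is antisymmetric: diff[i, j] == -diff[j, i]."""
    totals = defaultdict(lambda: [0.0, 0])  # (i, j) -> [sum_diff, n_games]
    for i, j, gi, gj in matches:
        # record the match from both teams' perspectives
        for (a, b), d in (((i, j), gi - gj), ((j, i), gj - gi)):
            totals[(a, b)][0] += d
            totals[(a, b)][1] += 1
    return {pair: s / n for pair, (s, n) in totals.items()}

diff = expected_goal_difference(matches)
print(diff[("A", "B")])  # mean goal difference of A over B
```

Because every match contributes symmetrically to both orderings of the pair, the table is a zero-sum (antisymmetric) payoff matrix by construction, which is exactly the form Nash averaging requires.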

The figure above shows the pairwise expected goal difference among the 3 agents in the support set. Nash averaging assigned non-zero weights to 3 teams that exhibit diverse policies with non-transitive performance, a structure that would not have been apparent under alternative evaluation schemes: agent A wins or draws against agent B in 59.7% of the games; agent B wins or draws against agent C in 71.1% of the games; and agent C wins or draws against agent A in 65.3% of the matches. We show recordings of example tournament matches between agents A, B and C to demonstrate qualitatively the diversity in their policies.
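The non-transitive cycle above is what gives Nash averaging its support: against a rock-paper-scissors-like payoff matrix, no single team dominates, and the equilibrium spreads weight across the cycle. Below is a minimal sketch of computing those weights by fictitious play on a hypothetical antisymmetric goal-difference matrix; the numbers are illustrative, not the paper's values.

```python
import numpy as np

# Hypothetical antisymmetric payoff matrix: P[i, j] is the expected
# goal difference when team i plays team j (illustrative numbers).
# P[i, j] > 0 means i beats j on average; the cycle A > B, B > C,
# C > A makes the meta-game non-transitive.
P = np.array([
    [ 0.0,  0.2, -0.3],   # A: beats B, loses to C
    [-0.2,  0.0,  0.4],   # B: loses to A, beats C
    [ 0.3, -0.4,  0.0],   # C: beats A, loses to B
])

def fictitious_play(P, iters=100_000):
    """Approximate a Nash equilibrium of the symmetric zero-sum
    meta-game: at each step, best-respond to the empirical mixture
    of past plays; the running average converges to equilibrium."""
    counts = np.ones(len(P))              # start from a uniform prior
    for _ in range(iters):
        mixture = counts / counts.sum()
        counts[np.argmax(P @ mixture)] += 1
    return counts / counts.sum()

nash = fictitious_play(P)
# For a 3-cycle [[0, a, -c], [-a, 0, b], [c, -b, 0]] the unique
# equilibrium is (b, c, a) / (a + b + c); here (4/9, 3/9, 2/9).
print(np.round(nash, 2))
```

Each pure strategy earns exactly the game value (zero) against this mixture, so no single evaluation team can be dropped without becoming exploitable; this is why all three teams receive non-zero Nash weight.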

Video 3: Example game play recordings between evaluators in the Nash Averaging support.

Multi-Agent Coordination Probing

Video 4: additional recordings from multi-agent probe tasks.

In the figure above we show typical traces of agents' behaviors. At 5B steps, when agents play more individualistically, we observe that blue0 always tries to dribble the ball by itself, regardless of the position of blue1. Later in training, blue0 actively seeks to pass, and its behavior is driven by the configuration of its teammate, showing a high level of coordination. In "8e10_left" in particular, we observe two consecutive passes (blue0 to blue1 and back), in the spirit of the 2-on-1 passes that emerge frequently in human soccer games.