MAESTRO: Open-Ended Environment Design
for Multi-Agent Reinforcement Learning

Mikayel Samvelyan, Akbir Khan, Michael Dennis, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Roberta Raileanu, Tim Rocktäschel

International Conference on Learning Representations (ICLR) 2023



Open-ended learning methods that automatically generate a curriculum of increasingly challenging tasks serve as a promising avenue toward generally capable reinforcement learning agents. Existing methods adapt curricula independently over either environment parameters (in single-agent settings) or co-player policies (in multi-agent settings). However, the strengths and weaknesses of co-players can manifest themselves differently depending on environmental features. It is thus crucial to consider the dependency between the environment and co-player when shaping a curriculum in multi-agent domains. In this work, we use this insight and extend Unsupervised Environment Design (UED) to multi-agent environments. We then introduce Multi-Agent Environment Design Strategist for Open-Ended Learning (MAESTRO), the first multi-agent UED approach for two-player zero-sum settings. MAESTRO efficiently produces adversarial, joint curricula over both environments and co-players and attains minimax-regret guarantees at Nash equilibrium. Our experiments show that MAESTRO outperforms a number of strong baselines on competitive two-player games, spanning discrete and continuous control settings.


Multi-Agent Environment Design Strategist for Open-Ended Learning (MAESTRO) is an approach to train generally capable agents in two-player Underspecified Partially Observable Stochastic Games (UPOSG) such that they are robust to changes in the environment and co-player policies. MAESTRO is a replay-guided approach that explicitly considers the dependence between agents and environments by jointly sampling over environment/co-player pairs using a regret-based curriculum and population learning.

MAESTRO maintains a population of co-players, each having an individual buffer of high-regret environments. When new environments are sampled, the student’s regret is calculated with respect to the corresponding co-player and added to the co-player’s buffer. MAESTRO continually provides high-regret environment/co-player pairs for training the student.
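The mechanics above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name, the `replay_prob` exploration/replay split, and the `regret_fn`/`random_env_fn` callbacks are all hypothetical stand-ins for MAESTRO's actual regret estimator and level generator.

```python
import random

class MaestroCurriculum:
    """Illustrative sketch of MAESTRO's joint env/co-player curriculum:
    a population of co-players, each with its own buffer of high-regret
    environments, sampled to produce training pairs for the student."""

    def __init__(self, coplayers, buffer_size=32, replay_prob=0.5):
        self.coplayers = coplayers                   # population of co-player policies
        self.buffers = {c: [] for c in coplayers}    # per-co-player (env, regret) buffers
        self.buffer_size = buffer_size
        self.replay_prob = replay_prob               # chance of replaying vs. exploring

    def sample(self, random_env_fn, regret_fn):
        """Return an (environment, co-player) pair for the student to train on."""
        coplayer = random.choice(self.coplayers)
        buf = self.buffers[coplayer]
        if buf and random.random() < self.replay_prob:
            # Replay: revisit the highest-regret environment for this co-player.
            env, _ = max(buf, key=lambda entry: entry[1])
        else:
            # Explore: generate a new environment, score the student's regret
            # against this specific co-player, and store it in that
            # co-player's buffer, keeping only the top-regret entries.
            env = random_env_fn()
            buf.append((env, regret_fn(env, coplayer)))
            buf.sort(key=lambda entry: entry[1], reverse=True)
            del buf[self.buffer_size:]
        return env, coplayer
```

The key design point this sketch captures is that regret is indexed by the environment/co-player *pair* rather than by the environment alone, so a level that is trivial against one opponent can still be replayed as high-regret against another.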


We evaluate MAESTRO and other baselines in two distinct competitive games: LaserTag (discrete control) and MultiCarRacing (continuous control).

Training environments are generated randomly using procedural content generation. After training, agents are evaluated zero-shot on previously unseen out-of-distribution environments against unseen opponents.

Emergent complexity of autocurricula induced by MAESTRO. Example environments provided to the MAESTRO student agent at (a) start, (b) middle, and (c) end of training. Environments become more complex over time. LaserTag levels (top row) increase in wall density and active engagement between the student and opponent. MultiCarRacing tracks (bottom row) become increasingly challenging, with many sharp turns. (d) Example held-out human-designed LaserTag levels and Formula 1 benchmark tracks (Jiang et al., 2021) used for out-of-distribution evaluation.


LaserTag Cross-Play Results: (Left) normalised and (Middle) unnormalised RR returns during training, and (Right) RR returns at the end of training (mean and standard error over 10 seeds).

MultiCarRacing Cross-Play Results: (Left) RR returns, (Middle) cross-play win rate, and (Right) grass time between MAESTRO and baselines (mean and standard error over 5 seeds).

LaserTag Policies

Example Skills

Hiding behind walls

Searching for opponent

Dodging bullets

Running from bullets


Attacking from every row


Zero-shot cross-play performance on out-of-distribution (OOD) LaserTag levels.

Multi-CarRacing Policies

Example Skills

Forcing opponent off the road

Overtaking via cutting the corner

Blocking via line adjustments

Blocking by early cornering

Hit-and-run on the opponent

Blocking the opponent's cornering


Zero-shot cross-play performance on out-of-distribution (OOD) Formula 1 tracks (Jiang et al., 2021).