MAESTRO: Open-Ended Environment Design
for Multi-Agent Reinforcement Learning

Mikayel Samvelyan, Akbir Khan, Michael Dennis, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Roberta Raileanu, Tim Rocktäschel

International Conference on Learning Representations (ICLR) 2023



Open-ended learning methods that automatically generate a curriculum of increasingly challenging tasks serve as a promising avenue toward generally capable reinforcement learning agents. Existing methods adapt curricula independently over either environment parameters (in single-agent settings) or co-player policies (in multi-agent settings). However, the strengths and weaknesses of co-players can manifest themselves differently depending on environmental features. It is thus crucial to consider the dependency between the environment and co-player when shaping a curriculum in multi-agent domains. In this work, we use this insight and extend Unsupervised Environment Design (UED) to multi-agent environments. We then introduce Multi-Agent Environment Design Strategist for Open-Ended Learning (MAESTRO), the first multi-agent UED approach for two-player zero-sum settings. MAESTRO efficiently produces adversarial, joint curricula over both environments and co-players and attains minimax-regret guarantees at Nash equilibrium. Our experiments show that MAESTRO outperforms a number of strong baselines on competitive two-player games, spanning discrete and continuous control settings.


Multi-Agent Environment Design Strategist for Open-Ended Learning (MAESTRO) is an approach to train generally capable agents in two-player Underspecified Partially Observable Stochastic Games (UPOSG) such that they are robust to changes in the environment and co-player policies. MAESTRO is a replay-guided approach that explicitly considers the dependence between agents and environments by jointly sampling over environment/co-player pairs using a regret-based curriculum and population learning.

MAESTRO maintains a population of co-players, each having an individual buffer of high-regret environments. When new environments are sampled, the student’s regret is calculated with respect to the corresponding co-player and added to the co-player’s buffer. MAESTRO continually provides high-regret environment/co-player pairs for training the student.
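The mechanics above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name, the `replay_prob` exploration/replay split, and the `regret_fn`/`random_env_fn` callbacks are all hypothetical stand-ins for MAESTRO's actual regret estimator and level generator.

```python
import random

class MaestroCurriculum:
    """Illustrative sketch of MAESTRO's joint env/co-player curriculum:
    a population of co-players, each with its own buffer of high-regret
    environments, sampled to produce training pairs for the student."""

    def __init__(self, coplayers, buffer_size=32, replay_prob=0.5):
        self.coplayers = coplayers                   # population of co-player policies
        self.buffers = {c: [] for c in coplayers}    # per-co-player (env, regret) buffers
        self.buffer_size = buffer_size
        self.replay_prob = replay_prob               # chance of replaying vs. exploring

    def sample(self, random_env_fn, regret_fn):
        """Return an (environment, co-player) pair for the student to train on."""
        coplayer = random.choice(self.coplayers)
        buf = self.buffers[coplayer]
        if buf and random.random() < self.replay_prob:
            # Replay: revisit the highest-regret environment for this co-player.
            env, _ = max(buf, key=lambda entry: entry[1])
        else:
            # Explore: generate a new environment, score the student's regret
            # against this specific co-player, and store it in that
            # co-player's buffer, keeping only the top-regret entries.
            env = random_env_fn()
            buf.append((env, regret_fn(env, coplayer)))
            buf.sort(key=lambda entry: entry[1], reverse=True)
            del buf[self.buffer_size:]
        return env, coplayer
```

The key design point this sketch captures is that regret is indexed by the environment/co-player *pair* rather than by the environment alone, so a level that is trivial against one opponent can still be replayed as high-regret against another.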


We evaluate MAESTRO and other baselines in two distinct competitive games: LaserTag (discrete control) and MultiCarRacing (continuous control).

Training environments are generated randomly using procedural content generation. After training, agents are evaluated zero-shot on previously unseen out-of-distribution environments against unseen opponents.

Emergent complexity of autocurricula induced by MAESTRO. Example environments provided to the MAESTRO student agent at (a) start, (b) middle, and (c) end of training. Environments become more complex over time. LaserTag levels (top row) increase in wall density and active engagement between the student and opponent. MultiCarRacing tracks (bottom row) become increasingly challenging, with many sharp turns. (d) Example held-out human-designed LaserTag levels and Formula 1 benchmark tracks (Jiang et al., 2021) used for out-of-distribution evaluation.


LaserTag Cross-Play Results: (Left) normalised and (Middle) unnormalised RR returns during training, and (Right) RR returns at the end of training (mean and standard error over 10 seeds).

MultiCarRacing Cross-Play Results: (Left) RR returns, (Middle) cross-play win rate, and (Right) grass time between MAESTRO and baselines (mean and standard error over 5 seeds).

LaserTag Policies

Example Skills

Hiding behind walls

Searching for opponent

Dodging bullets

Running from bullets


Attacking from every row


Zero-shot cross-play performance on out-of-distribution (OOD) LaserTag levels.

Multi-CarRacing Policies

Example Skills

Forcing opponent off the road

Overtaking via cutting the corner

Blocking via line adjustments

Blocking by early cornering

Hit-and-run on the opponent

Blocking the opponent's cornering


Zero-shot cross-play performance on out-of-distribution (OOD) Formula 1 tracks (Jiang et al., 2021).