Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning

Learning Nash equilibrium (NE) in complex zero-sum games with multi-agent reinforcement learning (MARL) can be extremely computationally expensive. Curriculum learning is an effective way to accelerate learning, but an under-explored dimension for generating a curriculum is the difficulty-to-learn of the subgames -- games induced by starting from a specific state. In this work, we present a novel subgame curriculum learning framework for zero-sum games. It adopts an adaptive initial state distribution by resetting agents to some previously visited states where they can quickly learn to improve performance.

Building upon this framework, we derive a subgame selection metric that approximates the squared distance to NE values and further adopt a particle-based state sampler for subgame generation. Integrating these techniques leads to our new algorithm, Subgame Automatic Curriculum Learning (SACL), which is a realization of the subgame curriculum learning framework. SACL can be combined with any MARL algorithm such as MAPPO. Experiments in the particle-world environment and Google Research Football environment show SACL produces much stronger policies than baselines. In the challenging hide-and-seek quadrant environment, SACL produces all four emergent stages and uses only half the samples of MAPPO with self-play.

Visualization of behaviors of learned policies in MPE hard

SACL

SP

FSP

PSRO

NeuRD

1. Introduction

Applying reinforcement learning (RL) to zero-sum games has led to enormous success, with trained agents defeating professional humans in Go, StarCraft II, and Dota 2. To find an approximate Nash equilibrium (NE) in complex games, these works often require a tremendous amount of training resources, including hundreds of GPUs and weeks or even months of time. This unaffordable cost keeps RL out of most real-world applications beyond these flagship projects backed by large companies, and makes it important to develop algorithms that can learn close-to-equilibrium strategies substantially more efficiently.

In this paper, we propose a general subgame curriculum learning framework to further accelerate MARL training for zero-sum games. It leverages ideas from goal-conditioned RL. 

Complementary to policy-level curriculum methods like self-play and PBT, our framework generates subgames (i.e., games induced by starting from a specific state) with growing difficulty for agents to learn and eventually solve the full game.

3. Method

3.1 Subgame Curriculum Learning

The key issue with the standard sample-and-update framework is that rollout trajectories always start from the initial state distribution, so reaching the states that matter most for efficient learning can consume a large number of samples. To accelerate training, we can instead reset the environment directly to those critical states and start rollouts from there.

Suppose we have an oracle state sampler oracle(S) that can propose suitable states for the current policy to learn from, i.e., generate appropriate induced subgames. We can then derive a general-purpose framework, which we call subgame curriculum learning (Algorithm 1).
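To make the framework concrete, below is a minimal Python sketch of this training loop. It assumes a gym-style environment exposing a hypothetical reset_to(state) method, a generic MARL trainer, and a state sampler object; all of these names are illustrative placeholders rather than the exact interfaces used in the paper.

```python
import random

def subgame_curriculum_learning(env, trainer, state_sampler,
                                num_iterations, p_subgame=0.7):
    """Generic subgame curriculum learning loop (sketch).

    With probability p_subgame an episode starts from a state proposed by
    the oracle sampler (inducing a subgame); otherwise it starts from the
    environment's original initial state distribution.
    """
    for _ in range(num_iterations):
        if random.random() < p_subgame:
            start_state = state_sampler.sample()   # oracle-proposed subgame
            obs = env.reset_to(start_state)        # hypothetical reset-to-state API
        else:
            obs = env.reset()                      # original initial distribution

        trajectory = trainer.rollout(env, obs)     # collect one episode
        trainer.update(trajectory)                 # MARL update (e.g., MAPPO)
        state_sampler.observe(trajectory.states)   # feed visited states back to the sampler
    return trainer.policies
```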

3.2 Subgame Sampling Metric

A key question is how to instantiate the oracle sampler, i.e., which subgame should we train on for faster convergence?

Intuitively, for a particular state s, if its value has converged to the NE value for every player i, that is, V_i(s) = V_i^*(s), we should no longer train on the subgame induced by it.

By contrast, if the gap between its current value and the NE value is substantial, we should probably train more on the induced subgame. A simple approach is therefore to use the squared difference between the current value and the NE value as the weight of a state, and to sample states with probabilities proportional to these weights.

Concretely, the state weight can be written as

w(s) = sum_i (V_i^*(s) - V_i(s))^2,

where the sum is over all players i.

However, this weight is very hard to compute in practice because the NE value is generally unknown. Inspired by this equation, we propose the following alternative state weight:

which takes a hyperparameter alpha and uses the difference between two consecutive value-function checkpoints in place of the difference between the NE value and the current value.
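For illustration, here is a hedged sketch of how such checkpoint-based weights could be computed and used to sample initial states. It simply sums the squared value change over players and samples states proportionally to that weight; the hyperparameter alpha from the alternative weight above is omitted for simplicity, so this is not the paper's exact formula.

```python
import numpy as np

def subgame_weights(states, value_fns_new, value_fns_old):
    """Per-state weights from the change between two consecutive
    value-function checkpoints, summed over players (sketch)."""
    weights = np.zeros(len(states))
    for v_new, v_old in zip(value_fns_new, value_fns_old):
        delta = np.array([v_new(s) - v_old(s) for s in states])
        weights += delta ** 2
    return weights

def sample_subgame_states(states, weights, num_samples, rng=None):
    """Sample initial states with probability proportional to their weights."""
    rng = np.random.default_rng() if rng is None else rng
    probs = (weights + 1e-8) / (weights + 1e-8).sum()  # guard against all-zero weights
    idx = rng.choice(len(states), size=num_samples, p=probs)
    return [states[i] for i in idx]
```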

3.3 Particle-Based Subgame Sampler

We adopt a particle-based approach and maintain a large buffer M of states visited throughout training to approximate the state space.

Since the size of the buffer is limited while the state space can be infinitely large, it is important to keep representative samples that are sufficiently far from each other to ensure good coverage of the state space.

When the number of states exceeds the buffer's capacity K, we use farthest point sampling (FPS), which iteratively selects the point farthest from the already-selected set. In our implementation, we first normalize each dimension of the state and then use the Deep Graph Library (DGL) package to run FPS on GPUs for fast and stable results.
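For readability, here is a minimal NumPy version of greedy farthest point sampling over normalized state vectors; the GPU implementation from DGL used in practice follows the same idea, so treat this only as an illustrative CPU sketch.

```python
import numpy as np

def farthest_point_sampling(points, k, rng=None):
    """Greedily pick k points that are pairwise far apart (sketch).

    points: (N, D) array of normalized states, with k <= N.
    Returns the indices of the k selected points.
    """
    rng = np.random.default_rng() if rng is None else rng
    selected = [int(rng.integers(len(points)))]                # arbitrary first point
    dists = np.linalg.norm(points - points[selected[0]], axis=1)
    for _ in range(k - 1):
        next_idx = int(np.argmax(dists))                       # farthest from the selected set
        selected.append(next_idx)
        new_d = np.linalg.norm(points - points[next_idx], axis=1)
        dists = np.minimum(dists, new_d)                       # distance to nearest selected point
    return np.array(selected)

# Usage: downsample an over-full buffer back to capacity K.
# buffer = buffer[farthest_point_sampling(buffer, K)]
```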

3.4 Overall Algorithm

Combining the subgame sampling metric and the particle-based sampler, we present a realization of the subgame curriculum learning framework, i.e., the Subgame Automatic Curriculum Learning (SACL) algorithm. An overview of SACL in the hide-and-seek game is illustrated below.
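Putting the pieces together, the sketch below shows one way the sampler object used in the training loop above might combine the particle buffer (pruned by FPS) with the checkpoint-based weights. Class and method names are illustrative, and it reuses the farthest_point_sampling, subgame_weights, and sample_subgame_states helpers sketched in the previous subsections.

```python
import numpy as np

class ParticleSubgameSampler:
    """Particle-based subgame sampler (sketch): an FPS-pruned state buffer
    sampled with probability proportional to value-difference weights."""

    def __init__(self, capacity, value_fns_new, value_fns_old):
        self.capacity = capacity
        self.states = []                    # particle buffer M of visited states
        self.value_fns_new = value_fns_new  # current value checkpoint, one per player
        self.value_fns_old = value_fns_old  # previous value checkpoint, one per player

    def observe(self, visited_states):
        """Add newly visited (normalized) states; prune with FPS when over capacity."""
        self.states.extend(visited_states)
        if len(self.states) > self.capacity:
            keep = farthest_point_sampling(np.array(self.states), self.capacity)
            self.states = [self.states[i] for i in keep]

    def sample(self):
        """Propose an initial state for the next subgame."""
        w = subgame_weights(self.states, self.value_fns_new, self.value_fns_old)
        return sample_subgame_states(self.states, w, num_samples=1)[0]
```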

4. The Multi-Agent Particle-World Environment

4.1 Environment

We consider the predator-prey scenario in MPE, where three slower cooperating predators chase one faster prey in a square space with two obstacles. 

In the default setting, all agents are spawned uniformly in the square. 

We also consider a harder setting where the predators are spawned in the top-right corner and the prey is spawned in the bottom-left corner.

4.2 Exploitability

All algorithms are trained for 40M environment samples, and the curves of approximate exploitability w.r.t. samples over three seeds are shown on the right. SACL converges faster and achieves lower exploitability than all baselines in both settings, and its advantage is more obvious in the hard scenario. This is because the corner initial state distribution makes the full game challenging to solve, while SACL generates an adaptive initial state distribution and learns on increasingly harder subgames to accelerate NE learning.

4.3 Cross-play

The results of cross-play at 40M samples in MPE and MPE hard show that both the predators and the prey trained by SACL beat all baselines.

We also show the initial state distributions of the predators in SP and SACL at 40M training samples. We find that in MPE hard, the initial distance between the prey and the predators is too large. As a result, the prey trained by SP learns little about how to stay away from the predators, and the predators hardly learn how to catch the prey.

MPE: cross-play

MPE hard: cross-play

Visualization of the state distributions in MPE hard

4.4 Ablation study

Metric: The results in MPE hard show that SACL's subgame sampling metric outperforms the alternative metrics.

Generator: The results in MPE hard show that our particle-based generator with FPS update leads to the fastest convergence and lowest exploitability.

MPE hard: Ablation on metric.

MPE hard: Ablation on generator.

Buffer size: The results show that the buffer size K must be large enough to approximate the state space accurately. We finally choose K = 10000.

Subgame sample probability: The results show that most initial states in a training batch should come from the subgame buffer rather than uniform sampling, while keeping some uniformly sampled states from the state space ensures global exploration. We finally choose p = 0.7.

MPE hard: Ablation on buffer size.

MPE hard: Ablation on sample probability.

Ensemble size: The results show that training an ensemble of value functions for each player improves the value estimation. We finally choose n = 3.

Weight of the value difference: The results show that our algorithm is insensitive to the weight of the value difference. Empirically, we prefer alpha less than 1, and we finally choose alpha = 0.7 in MPE and MPE hard.

MPE hard: Ablation on ensemble size.

MPE hard: Ablation on weight of the value difference.

5. Google Research Football

5.1 Environment

We evaluate SACL in three GRF academy scenarios, namely pass and shoot, run, pass and shoot, and 3 vs 1 with keeper. In all scenarios, the left team's agents cooperate to score a goal and the right team's agents try to defend. If the left team scores a goal, all left players get a reward of +1 and all right players get -1.

There are also 10 concentric circles with the goal in the center, called checkpoint regions.

The left team obtains an additional checkpoint reward of +0.1 when it possesses the ball and first moves into the next checkpoint region, and the right team gets -0.1. Each checkpoint reward is given only once per episode.
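As a rough illustration of this shaping scheme (not GRF's exact implementation), the sketch below hands out the +0.1 checkpoint reward the first time the ball-possessing left team enters a new checkpoint region; the region radii and all helper names are assumptions made for the example.

```python
import numpy as np

NUM_CHECKPOINTS = 10      # concentric checkpoint regions centered at the goal
CHECKPOINT_REWARD = 0.1
MAX_DIST = 1.0            # assumed distance spanned by the outermost region

def checkpoint_region(ball_pos, goal_pos):
    """Index of the region the ball is in (0 = closest to the goal)."""
    dist = np.linalg.norm(np.asarray(ball_pos) - np.asarray(goal_pos))
    return min(int(dist / (MAX_DIST / NUM_CHECKPOINTS)), NUM_CHECKPOINTS - 1)

def checkpoint_rewards(ball_pos, goal_pos, left_has_ball, collected):
    """Zero-sum checkpoint reward for (left team, right team).

    `collected` holds region indices already rewarded, so each checkpoint
    reward is handed out at most once per episode.
    """
    if not left_has_ball:
        return 0.0, 0.0
    region = checkpoint_region(ball_pos, goal_pos)
    if region in collected:
        return 0.0, 0.0
    collected.add(region)
    return CHECKPOINT_REWARD, -CHECKPOINT_REWARD
```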

pass and shoot

run, pass and shoot

3 vs 1 with keeper

5.2 Visualization of behaviors of learned policy in GRF

pass and shoot

run, pass and shoot

3 vs 1 with keeper

5.3 Exploitability

The first scenario is trained for 300M environment samples and the last two scenarios are trained for 400M samples.

The table lists the approximate exploitabilities of different methods' policies over three seeds, and SACL achieves the lowest exploitability.

5.4 Cross-play

In the three scenarios, SACL is comparable to FSP and PSRO, and better than SP and NeuRD.

pass and shoot

run, pass and shoot

3 vs 1 with keeper

6. The Hide-and-Seek Environment

6.1 Environment

The environment is fully observable, and the observation is a concatenation of the positions and velocities of all the agents, the box, and the ramp, plus indicators of whether the box and ramp are locked. In this environment, when the hider is spotted, the seeker gets a reward of +1 and the hider gets -1; otherwise, the hider gets +1 and the seeker gets -1.

6.2 Results

There is a total of four stages of emergent strategies in HnS, i.e., Running and Chasing, Fort-Building, Ramp-Use, and Ramp-Defense. The seeker's NE policy is to use the ramp to get into the room and the hiders' NE policy is to lock the ramp and block the door with the box.

SACL with the MAPPO backbone produces all four stages and converges to the NE policy with 50% fewer samples than MAPPO with self-play. This result also shows that SACL preserves the backbone's convergence property while accelerating NE learning in zero-sum games.

6.3 Illustration of behaviors in HnS

Running and Chasing

Fort-Building

Ramp-Use

Ramp-Defense