ICLR 2023
Despite recent advances in multi-agent reinforcement learning (MARL), MARL agents often overfit the training environment and perform poorly in evaluation scenarios where other agents behave differently. Obtaining generalizable policies for MARL agents is therefore necessary but challenging, mainly because of complex multi-agent interactions. In this work, we model the problem with Markov Games and propose a simple yet effective method, Ranked Policy Memory (RPM), to collect diverse multi-agent trajectories for training MARL policies that generalize well. The main idea of RPM is to maintain a look-up memory of policies: during training, we save policies ranked by their training episode return, i.e., the episode return of agents in the training environment, so that the memory covers behaviors of various skill levels; when an episode starts, each learning agent can then choose a policy from RPM as its behavior policy. This self-play-style training framework leverages agents' past policies and guarantees the diversity of multi-agent interactions in the training data. We implement RPM on top of MARL algorithms and conduct extensive experiments on Melting Pot. The results demonstrate that RPM enables MARL agents to interact with unseen agents in multi-agent generalization evaluation scenarios and complete the given tasks, significantly boosting performance by up to 402% on average.
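To make the idea concrete, below is a minimal Python sketch of the ranked policy memory described above. It only illustrates the core mechanism (saving policy checkpoints under discretized return ranks and sampling one as the behavior policy at episode start); the names `RankedPolicyMemory`, `rank_interval`, and `sample_ratio` are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of the ranked policy memory (RPM) idea, under the assumptions above.
import copy
import random
from collections import defaultdict


class RankedPolicyMemory:
    """Look-up memory of past policies keyed by ranked training episode return."""

    def __init__(self, rank_interval: float = 1.0, sample_ratio: float = 0.5):
        self.memory = defaultdict(list)      # discretized return rank -> list of policy snapshots
        self.rank_interval = rank_interval   # width of each return bin used as a rank
        self.sample_ratio = sample_ratio     # probability of replaying a stored policy

    def save(self, policy, episode_return: float):
        """Store a frozen copy of the current policy under its return rank."""
        key = int(episode_return // self.rank_interval)
        self.memory[key].append(copy.deepcopy(policy))

    def sample(self, current_policy):
        """At episode start, pick a behavior policy: a stored one with prob. sample_ratio."""
        if self.memory and random.random() < self.sample_ratio:
            key = random.choice(list(self.memory.keys()))   # uniform over ranks -> varied skill levels
            return random.choice(self.memory[key])
        return current_policy
```

In this sketch, the return rank is simply the episode return discretized into fixed-width bins, and sampling uniformly over ranks is one way to expose the learner to teammates of varied skill levels.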
In this work, we aim to train MARL agents that can adapt to new evaluation scenarios in which other agents' policies are unseen during training.
A Motivating Example: A Two-Agent Hunting Game. The objective of the game is for two agents to catch the stag together, since an agent acting alone cannot catch the stag and risks being killed. Two agents trained together may perform well in evaluation scenarios similar to the training environment, as shown in Fig. 1 (a) and (b), but they often fail when evaluated in scenarios that differ from the training ones. As shown in Fig. 1 (c), the learning agent (called the focal agent, following (Leibo et al., 2021)) is supposed to work together with the other agent (called the background agent, also following (Leibo et al., 2021)), which is pre-trained and can capture both the hare and the stag. In this case, the focal agent cannot capture the stag without help from its teammate. The teammate may be tempted to catch the hare alone instead of cooperating, or may only cooperate with the focal agent after capturing the hare. The focal agent should therefore adapt to its teammate's behavior in order to catch the stag. However, the policy of the background agent is unseen by the focal agent during training. Without generalization, agents trained as in Fig. 1 (left) cannot achieve an optimal policy in the new evaluation scenario.
The training and evaluation workflows.
An example of RPM's workflow.
The agent's policy network, observation, and the global state.
Melting Pot Scenarios.
The distribution of RPM keys for each substrate. Interestingly, long-tail distributions of RPM keys are common.
This figure shows the results of HFSP with different sampling ratios. Note that HFSP relies heavily on the sampling ratio: it must be carefully tuned on each substrate to attain good performance, which is not feasible in practice. In contrast, RPM is stable with a fixed sampling ratio of 0.5. HFSP can also perform well on substrates such as Pure Coordination and Prisoners' Dilemma, where the return-checkpoint count distribution is more uniform; this observation is also discussed with the previous figure.
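As a toy, self-contained illustration of what the sampling ratio controls (all names below are hypothetical, not taken from the HFSP or RPM implementations): with probability equal to the ratio, the episode's behavior policy is drawn from stored checkpoints; otherwise the latest policy is used.

```python
# Toy illustration of the sampling ratio: RPM fixes it at 0.5 for all substrates,
# whereas HFSP's analogous ratio must be tuned per substrate.
import random

def choose_behavior(current, stored, sample_ratio=0.5):
    """Return a stored policy with probability sample_ratio, else the current policy."""
    if stored and random.random() < sample_ratio:
        return random.choice(stored)
    return current

# Roughly half of the episodes replay a stored "policy" when the ratio is 0.5.
stored_policies = ["ckpt_low_return", "ckpt_mid_return", "ckpt_high_return"]
choices = [choose_behavior("latest", stored_policies) for _ in range(10_000)]
print(sum(c != "latest" for c in choices) / len(choices))   # ~0.5
```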
We present videos showing that our trained RPM agents are able to interact with unseen agents, whereas MAPPO agents perform poorly or even fail to complete multi-agent tasks when paired with unseen agents.
RPM in Chicken Game. RPM performs well.
MAPPO in Chicken Game. MAPPO performs poorly.
RPM in Prisoners' Dilemma. RPM agents perform well.
MAPPO in Prisoners' Dilemma. MAPPO agents perform poorly.
RPM in Pure Coordination. RPM agents perform well.
MAPPO in Pure Coordination. MAPPO agents perform poorly.
RPM in Clean Up. The RPM agent is able to collect apples.
MAPPO in Clean Up. The MAPPO agent cannot collect apples.
RPM in Rationalizable Coordination. RPM agents perform well.
MAPPO in Rationalizable Coordination. MAPPO agents perform poorly.
RPM in Stag Hunt. RPM agents perform well.
MAPPO in Stag Hunt. MAPPO agents perform poorly.
Thanks for visiting!