Despite recent advances in multi-agent reinforcement learning (MARL), MARL agents easily overfit the training environment and perform poorly in evaluation scenarios where other agents behave differently. Obtaining generalizable policies for MARL agents is thus necessary but challenging, mainly due to complex multi-agent interactions. In this work, we model the problem with Markov Games and propose a simple yet effective method, ranked policy memory (RPM), to collect diverse multi-agent trajectories for training MARL policies with good generalizability. The main idea of RPM is to maintain a look-up memory of policies. Specifically, we acquire behaviors at various levels by saving policies ranked by their training episode return, i.e., the episode return of the agents in the training environment; when an episode starts, the learning agent can then choose a policy from RPM as its behavior policy. This self-play training framework leverages agents' past policies and guarantees the diversity of multi-agent interactions in the training data. We implement RPM on top of MARL algorithms and conduct extensive experiments on Melting Pot. The results demonstrate that RPM enables MARL agents to interact with unseen agents in multi-agent generalization evaluation scenarios and complete the given tasks, boosting performance by up to 402% on average.
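To make the memory idea concrete, here is a minimal Python sketch of a ranked policy memory. The class name, the fixed-width bucketing of episode returns into ranks, and the per-bucket cap are illustrative assumptions, not the exact implementation used in our code.

```python
import random
from collections import defaultdict


class RankedPolicyMemory:
    """Minimal sketch (illustrative, not the paper's exact implementation):
    policy checkpoints are bucketed by discretized training episode return,
    and a behavior policy is sampled from a randomly chosen rank bucket."""

    def __init__(self, bucket_size: float = 1.0, max_per_bucket: int = 10):
        self.bucket_size = bucket_size        # width of each return bucket (rank)
        self.max_per_bucket = max_per_bucket  # cap on stored checkpoints per rank
        self.memory = defaultdict(list)       # rank -> list of policy checkpoints

    def save(self, episode_return: float, policy_params):
        """Store a snapshot of the current policy, keyed by its ranked return."""
        rank = int(episode_return // self.bucket_size)
        bucket = self.memory[rank]
        bucket.append(policy_params)
        if len(bucket) > self.max_per_bucket:
            bucket.pop(0)                     # drop the oldest checkpoint in this rank

    def sample(self):
        """Sample a behavior policy: pick a rank uniformly, then a checkpoint."""
        if not self.memory:
            return None                       # caller falls back to the current policy
        rank = random.choice(list(self.memory.keys()))
        return random.choice(self.memory[rank])
```

In a training loop, one would snapshot the learning policy after each episode with `save(episode_return, params)` and, at the start of each episode, assign some agents a behavior policy drawn via `sample()`, so that the learning agent keeps interacting with partners of varying skill levels.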
In this work, we aim to train MARL agents that can adapt to new scenarios where other agents' policies are unseen during training.
A Motivating Example: a two-agent Stag-Hunt game. The agents are trained to obtain policies that maximize the group reward by shooting arrows at the stag. As a result, they may perform well in evaluation scenarios similar to the training environment, as shown in Figures (left) and (middle), respectively. However, these agents may fail when evaluated in scenarios different from the training scenarios. As shown in Figure (right), the learning agent (called the focal agent, following the convention in Leibo et al. (2021)) is supposed to work together with another agent (called the background agent, following the naming in Leibo et al. (2021)) that is pre-trained to be selfish (i.e., it only captures the hare). In this case, the focal agent cannot capture the stag without help from its teammate, and the optimal policy is to capture the hare. However, the background agent is unseen by the focal agent during training. Therefore, without generalization, the agents trained as in Figure (left) cannot achieve the optimal policy in the new evaluation scenario.
The training and evaluation workflows.
The example of RPM's workflow.
The agent's policy network, observation, and the global state.
Melting Pot Scenarios.
We present videos showing that our trained RPM agents are able to interact with unseen agents, while MAPPO agents perform poorly or even fail to complete the multi-agent tasks when paired with unseen agents.
RPM in Chicken Game. RPM agents perform well.
MAPPO in Chicken Game. MAPPO agents perform poorly.
RPM in Prisoners' Dilemma. RPM agents perform well.
MAPPO in Prisoners' Dilemma. MAPPO agents perform poorly.
RPM in Pure Coordination. RPM agents perform well.
MAPPO in Pure Coordination. MAPPO agents perform poorly.
RPM in Clean Up. The RPM agent is able to collect apples.
MAPPO in Clean Up. The MAPPO agent cannot collect apples.
RPM in Rational Coordination. RPM agents perform well.
MAPPO in Rational Coordination. MAPPO agents perform poorly.
RPM in Stag Hunt. RPM agents perform well.
MAPPO in Stag Hunt. MAPPO agents perform poorly.
Thanks for visiting!