Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification

Ling Pan, Longbo Huang, Tengyu Ma, Huazhe Xu

IIIS, Tsinghua University and Stanford University

Thirty-Ninth International Conference on Machine Learning (ICML 2022)

Abstract:

The idea of conservatism has led to significant progress in offline reinforcement learning (RL), where an agent learns from pre-collected datasets. However, as many real-world scenarios involve interaction among multiple agents, it is important to study offline RL in the more practical multi-agent setting. Given the recent success of transferring online RL algorithms to the multi-agent setting, one may expect that offline RL algorithms will also transfer directly. Surprisingly, our empirical analysis shows that when conservatism-based algorithms are applied to the multi-agent setting, performance degrades significantly as the number of agents increases. Towards mitigating the degradation, we identify a key issue: the landscape of the value function can be non-concave, so policy gradient improvements are prone to local optima. Multiple agents exacerbate the problem severely, since a suboptimal policy from any single agent can lead to uncoordinated global failure. Following this intuition, we propose a simple yet effective method, Offline Multi-Agent RL with Actor Rectification (OMAR), which tackles this challenge by effectively combining first-order policy gradients with zeroth-order optimization methods so that the actor can better optimize the conservative value function. Despite its simplicity, OMAR achieves state-of-the-art performance in a variety of multi-agent continuous and discrete control tasks.
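To make the actor rectification idea concrete, below is a minimal PyTorch-style sketch of a per-agent actor update in the spirit of OMAR: an iterative Gaussian sampling (CEM-like) zeroth-order search proposes a high-value action under the conservative critic, and the actor is regressed toward that action alongside the usual first-order policy gradient term. All names (`critic`, `actor`, `omar_coef`, `zeroth_order_best_action`), the action range, and the hyperparameters are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of an OMAR-style actor update (hypothetical names throughout).
# Assumes a deterministic per-agent actor pi(s) and a conservative critic Q(s, a),
# both torch modules, with actions assumed to lie in [-1, 1].
import torch

def zeroth_order_best_action(critic, obs, act_dim, n_iters=3, n_samples=32, elite_frac=0.25):
    """CEM-like iterative Gaussian sampling to find a high-value action under the critic."""
    batch = obs.shape[0]
    mu = torch.zeros(batch, act_dim)
    sigma = torch.ones(batch, act_dim)
    for _ in range(n_iters):
        # Sample candidate actions around the current mean: (n_samples, batch, act_dim).
        samples = (mu + sigma * torch.randn(n_samples, batch, act_dim)).clamp(-1.0, 1.0)
        with torch.no_grad():
            values = torch.stack([critic(obs, a).squeeze(-1) for a in samples])  # (n_samples, batch)
        n_elite = max(1, int(elite_frac * n_samples))
        elite_idx = values.topk(n_elite, dim=0).indices  # (n_elite, batch)
        elite = samples.gather(0, elite_idx.unsqueeze(-1).expand(-1, -1, act_dim))
        # Refit the sampling distribution to the elite actions.
        mu, sigma = elite.mean(0), elite.std(0) + 1e-6
    return mu

def omar_actor_loss(actor, critic, obs, act_dim, omar_coef=0.7):
    pi_act = actor(obs)
    best_act = zeroth_order_best_action(critic, obs, act_dim)
    # First-order term: ascend the conservative critic via the policy gradient.
    pg_loss = -critic(obs, pi_act).mean()
    # Rectification term: regress the actor toward the zeroth-order solution.
    rect_loss = ((pi_act - best_act) ** 2).mean()
    return omar_coef * rect_loss + (1.0 - omar_coef) * pg_loss
```

The rectification term acts as a regularizer that pulls each agent's actor out of poor local optima of the non-concave value landscape, which the first-order policy gradient alone can get stuck in.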

  • Multi-Agent Particle Environments

Coop-Navigation-Random

Coop-Navigation-Med-Replay

Coop-Navigation-Medium

Coop-Navigation-Expert

  • Multi-Agent MuJoCo

Note: the following videos are played at one quarter of the original speed for better visualization.

MA-HalfCheetah-Random

MA-HalfCheetah-Med-Replay

MA-HalfCheetah-Medium

MA-HalfCheetah-Expert

  • StarCraft II Micromanagement Benchmark

2s3z

3s5z

1c3s5z

2c_vs_64zg

  • Maze2D

Umaze

Medium

Large