Maximum Entropy Heterogeneous-Agent Mirror Learning (MEHAML)
MEHAML is a theoretically sound maximum entropy actor-critic learning framework for multi-agent reinforcement learning (MARL). It guarantees that any method derived from it enjoys two desired properties: monotonic improvement of the joint maximum entropy objective and convergence to a quantal response equilibrium (QRE).
Overview
In this work, we focus on cooperative multi-agent tasks, in which a group of agents jointly optimizes a shared reward function. To address the challenges of exploration and robustness in MARL, we incorporate the maximum entropy principle into the multi-agent setting, leading to the MEHAML theory and the HASAC algorithm.
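In standard maximum entropy RL notation, the joint objective takes the following form (a sketch in our own notation, not fixed by this page: $\alpha$ is the entropy temperature, $\mathcal{H}$ the entropy, $n$ the number of agents):

```latex
J(\boldsymbol{\pi}) \;=\; \mathbb{E}_{\boldsymbol{\pi}}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\Big(r(s_t,\mathbf{a}_t) \;+\; \alpha\sum_{i=1}^{n}\mathcal{H}\big(\pi^{i}(\cdot \mid s_t)\big)\Big)\right]
```

Relative to the standard MARL objective, each agent's policy entropy is added to the reward, which encourages exploration and yields stochastic equilibrium policies.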
The key contributions of our work are outlined below:
We extend the maximum entropy principle from reinforcement learning to MARL settings, incorporating its benefits into our research.
We propose the MEHAML framework, leveraging the maximum entropy principle to enhance the performance of cooperative multi-agent tasks.
As a natural outcome of the MEHAML framework, we derive the heterogeneous-agent soft actor-critic (HASAC) algorithm.
Through extensive experiments, we demonstrate the superior performance of HASAC across a variety of environments, including Multi-Agent MuJoCo, StarCraft II, Google Research Football, and Light Aircraft Game.
A building block of our theory is the multi-agent advantage decomposition theorem, a discovery from HATRPO/HAPPO (Kuba et al., ICLR 2022). The following figure illustrates the sequential update scheme employed by HASAC to achieve coordinated agent updates.
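The sequential scheme can be illustrated with a minimal sketch (all names hypothetical, with a binary-action coordination game standing in for the soft Q-function): agents revise their actions one at a time in a random permutation, each conditioning on the new actions of the agents updated before it, which avoids the miscoordination that simultaneous independent updates can cause.

```python
import random

def sequential_update(actions, score):
    """Toy sketch of the sequential update scheme: agents best-respond
    one at a time, in a random permutation, each against the joint
    action formed by previously updated agents.

    `actions` maps agent id -> current discrete action (0 or 1);
    `score(joint)` evaluates a joint action (a stand-in for the soft
    Q-function). Both are illustrative, not the authors' implementation.
    """
    joint = dict(actions)
    order = list(joint)
    random.shuffle(order)  # draw a random permutation of agents
    for agent in order:
        # Best-respond over the binary action set, others held fixed.
        joint[agent] = max((0, 1), key=lambda a: score({**joint, agent: a}))
    return joint
```

For example, on a matching game where agreement scores 1 and disagreement 0, any update order leaves all agents agreed, whereas simultaneous best responses from a mismatched start can keep the agents swapping past each other.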
Performance Comparisons on Cooperative MARL Benchmarks
In the majority of environments, HASAC achieves stronger results than current state-of-the-art off-policy and on-policy methods, while exhibiting higher sample efficiency and robustness. Videos on each benchmark are shown below.
MAMuJoCo Ant-2x4
MAMuJoCo HalfCheetah-2x3
SMAC MMM2
SMAC 2c_vs_64zg
GRF Corner
LAG ShootMissile-2v2
MAMuJoCo Results
MuJoCo tasks challenge a robot to learn an optimal way of motion; Multi-Agent MuJoCo (MAMuJoCo) models each part of a robot as an independent agent, for example, a leg for a spider or an arm for a swimmer. HASAC consistently outperforms its rivals, establishing a new state of the art in MARL.
SMAC Results
The StarCraft Multi-Agent Challenge (SMAC) contains a set of StarCraft II maps in which a team of ally units aims to defeat the opponent team. HASAC achieves performance that is comparable to, or even superior to, the other three algorithms. Importantly, this is achieved without employing techniques such as PopArt, value normalization, death masking, and parameter sharing, which have been demonstrated to substantially enhance the performance of these algorithms.
GRF and LAG Results
The Google Research Football Environment (GRF) contains a set of cooperative multi-agent challenges in which a team of agents plays against a team of bots in various football scenarios. Light Aircraft Game (LAG) is a recently developed cooperative-competitive environment for red-versus-blue aircraft games, offering settings such as single control, 1v1, and 2v2 scenarios. We evaluate HASAC on two GRF tasks and one LAG task, and again observe that HASAC generally outperforms its rivals.