Maximum Entropy Heterogeneous-Agent Mirror Learning



Maximum Entropy Heterogeneous-Agent Mirror Learning (MEHAML) is a theoretically sound maximum-entropy actor-critic framework for multi-agent reinforcement learning (MARL). It guarantees that any method derived from it enjoys two desired properties: monotonic improvement of the joint maximum-entropy objective and convergence to the quantal response equilibrium (QRE).
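In a standard form (the temperature α and the entropy notation below follow the usual maximum-entropy RL convention and are illustrative rather than quoted from the paper), the joint objective being improved monotonically is the expected return augmented with each agent's policy entropy:

```latex
J(\boldsymbol{\pi}) \;=\; \mathbb{E}_{\boldsymbol{\pi}}\!\left[
  \sum_{t=0}^{\infty} \gamma^{t}\Big( r(s_t, \boldsymbol{a}_t)
  \;+\; \alpha \sum_{i=1}^{n} \mathcal{H}\big(\pi^{i}(\cdot \mid s_t)\big) \Big)
\right]
```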

Overview

In this work, we focus on cooperative multi-agent tasks, in which a group of agents optimizes a shared reward function. To address the challenges of exploration and robustness in MARL, we incorporate the maximum entropy principle into the MARL setting, leading to the development of the MEHAML theory and the HASAC algorithm.

The key contributions of our work are outlined below:

A building block of our theory is the multi-agent advantage decomposition theorem (introduced in HATRPO/HAPPO [Kuba et al., ICLR 2022]). The following figure illustrates the sequential update scheme employed by HASAC to achieve coordinated agent updates.
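Informally, the theorem states that the joint advantage of agents i_1, ..., i_m decomposes into a sum of per-agent advantages, each conditioned on the actions already chosen by the preceding agents. In the notation of Kuba et al.:

```latex
A^{i_{1:m}}_{\boldsymbol{\pi}}\big(s, \boldsymbol{a}^{i_{1:m}}\big)
  \;=\; \sum_{j=1}^{m} A^{i_j}_{\boldsymbol{\pi}}\big(s, \boldsymbol{a}^{i_{1:j-1}}, a^{i_j}\big)
```

The sketch below shows the shape of one sequential update round under this decomposition; the `Agent` interface and helper names are illustrative assumptions, not this repository's actual API.

```python
import random
from typing import Any, Protocol, Sequence

class Agent(Protocol):
    """Illustrative agent interface (an assumption, not the repo's API)."""
    def advantage(self, batch: Any, prior_actions: dict) -> Any: ...
    def update(self, advantage: Any) -> None: ...
    def sample_actions(self, batch: Any) -> Any: ...

def sequential_update(agents: Sequence[Agent], batch: Any) -> None:
    """One round of sequential, coordinated policy updates.

    Agents update one at a time in a random order; each agent's advantage
    is conditioned on the freshly sampled actions of the agents that have
    already updated, mirroring the advantage decomposition above.
    """
    order = random.sample(range(len(agents)), len(agents))  # random permutation
    prior_actions: dict = {}
    for idx in order:
        adv = agents[idx].advantage(batch, prior_actions)  # conditioned advantage
        agents[idx].update(adv)
        prior_actions[idx] = agents[idx].sample_actions(batch)
```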

Performance Comparisons on Cooperative MARL Benchmarks

In the majority of environments, we find that HASAC achieves stronger results than current state-of-the-art off-policy and on-policy methods while exhibiting higher sample efficiency and robustness. Videos for each benchmark are shown below.

Videos (one per benchmark): MAMuJoCo Ant-2x4, MAMuJoCo HalfCheetah-2x3, SMAC MMM2, SMAC 2c_vs_64zg, GRF Corner, LAG ShootMissile-2v2, DexHands ShadowHandCatchAbreast, DexHands ShadowHandCatchOver2Underarm, DexHands ShadowHandOver.

MAMuJoCo Results

MuJoCo tasks challenge a robot to learn an optimal way of moving; Multi-Agent MuJoCo (MAMuJoCo) models each part of a robot as an independent agent, for example, a leg of a spider or an arm of a swimmer. HASAC consistently outperforms its rivals, establishing a new state of the art for MARL.
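For reference, the snippet below sketches how a MAMuJoCo task such as Ant-2x4 is typically constructed with the `multiagent_mujoco` package; the argument names follow that package's public documentation and may differ from this repository's wrappers.

```python
# A minimal sketch, assuming the `multiagent_mujoco` package
# (https://github.com/schroederdewitt/multiagent_mujoco).
from multiagent_mujoco.mujoco_multi import MujocoMulti

env = MujocoMulti(env_args={
    "scenario": "Ant-v2",   # underlying single-robot MuJoCo task
    "agent_conf": "2x4",    # 2 agents, each controlling 4 of the ant's joints
    "agent_obsk": 1,        # agents observe joints up to graph distance 1
    "episode_limit": 1000,  # maximum episode length
})
env.reset()
```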

SMAC Results

The StarCraft II Multi-Agent Challenge (SMAC) contains a set of StarCraft maps in which a team of ally units aims to defeat the opposing team. HASAC achieves performance comparable, or even superior, to the other three algorithms. Importantly, it does so without employing techniques such as PopArt, value normalization, death masking, and parameter sharing, which have been shown to substantially enhance the performance of those algorithms.

GRF and LAG Results

The Google Research Football Environment (GRF) contains a set of cooperative multi-agent challenges in which a team of agents plays against a team of bots in various football scenarios. Light Aircraft Game (LAG) is a recently developed cooperative-competitive environment for red-versus-blue aircraft games, offering settings such as single control, 1v1, and 2v2. We evaluate HASAC on two GRF tasks and one LAG task, and again observe that HASAC generally outperforms its rivals.