Maximum Entropy Heterogeneous-Agent Reinforcement Learning (MEHARL)
MEHARL is a framework for learning stochastic policies in cooperative multi-agent reinforcement learning (MARL). It comprises three key components: the probabilistic graphical model derivation of maximum entropy MARL, the HASAC algorithm with monotonic improvement and quantal response equilibrium (QRE) convergence properties, and the unified MEHAML template, which endows any induced maximum entropy method with the same theoretical guarantees as HASAC.
Overview
In this work, we focus on cooperative multi-agent tasks, in which a group of agents optimizes a shared reward function. To address the challenges of exploration and robustness in MARL, we derive the maximum entropy objective of MARL from a probabilistic inference perspective. This derivation leads to the development of the HASAC algorithm and the MEHAML theory.
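Concretely, the maximum entropy MARL objective augments the expected joint return with the entropy of each agent's policy. The following is a sketch in standard notation (the temperature α and the symbols follow the single-agent soft actor-critic convention and may differ from the paper's exact formulation):

```latex
J(\boldsymbol{\pi}) \;=\; \mathbb{E}_{\boldsymbol{\pi}}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\Big(r(s_t,\mathbf{a}_t) \;+\; \alpha \sum_{i=1}^{n} \mathcal{H}\big(\pi^{i}(\,\cdot\mid s_t)\big)\Big)\right]
```

Setting α = 0 recovers the standard cooperative MARL objective; a larger α encourages more stochastic, exploratory policies.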
The key contributions of our work are outlined below:
We model cooperative MARL as a probabilistic graphical inference problem and derive the maximum entropy MARL objective.
We introduce heterogeneous-agent soft policy iteration (HASPI) and develop the heterogeneous-agent soft actor-critic (HASAC) algorithm.
We generalize the HASPI procedure to the Maximum Entropy Heterogeneous-Agent Mirror Learning (MEHAML) template, which provides any induced method with the desired properties of monotonic improvement and QRE convergence.
Through extensive experiments, we demonstrate the superior performance of HASAC across a variety of environments, including Bi-DexHands, Multi-Agent MuJoCo, StarCraft II, Google Research Football, Multi-Agent Particle Environment, and Light Aircraft Game.
A building block of our theory is the multi-agent advantage decomposition theorem, introduced with HATRPO/HAPPO (Kuba et al., ICLR 2022). The following figure illustrates the sequential update scheme employed by HASAC to achieve coordinated agent updates.
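To make the sequential update scheme concrete, here is a minimal, hypothetical sketch on a two-agent cooperative matrix game: agents are updated one at a time in a random permutation order, each taking a soft (Boltzmann) response to the already-updated policy of its predecessor. The toy game and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, temp):
    # Temperature-scaled softmax; the soft (entropy-regularized) response.
    z = (x - x.max()) / temp
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n_actions = 3
# Shared payoff matrix of a toy two-agent cooperative game (an assumption
# for illustration; HASAC operates on general Markov games).
payoff = rng.normal(size=(n_actions, n_actions))
alpha = 0.5  # entropy temperature

# Start from uniform stochastic policies.
pi = [np.full(n_actions, 1.0 / n_actions) for _ in range(2)]

for _ in range(200):
    # Draw a random permutation of agents, then update them sequentially:
    # each agent responds to the other agents' most recent policies,
    # mirroring HASAC's coordinated sequential update scheme.
    for i in rng.permutation(2):
        other = 1 - i
        # Expected (soft Q-style) value of agent i's actions under the
        # other agent's current policy.
        q_i = payoff @ pi[other] if i == 0 else payoff.T @ pi[other]
        # Soft response toward higher-value actions, keeping stochasticity.
        pi[i] = softmax(q_i, alpha)

print([p.round(3) for p in pi])
```

Iterating these soft responses drives the joint policy toward a quantal response equilibrium of the toy game, which is the solution concept HASAC converges to.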
Performance Comparisons on Cooperative MARL Benchmarks
In the majority of environments, HASAC achieves stronger results than current state-of-the-art off-policy and on-policy methods while exhibiting higher sample efficiency and robustness. Videos on each benchmark are shown below.
Bi-DexHands Results
Bi-DexHands offers numerous bimanual manipulation tasks that are designed to match various human skill levels. Built on the Isaac Gym simulator, Bi-DexHands supports running thousands of environments simultaneously. HASAC outperforms the other five methods by a large margin, showcasing faster convergence and lower variance.
MAMuJoCo Results
MuJoCo tasks challenge a robot to learn an optimal way of moving; Multi-Agent MuJoCo (MAMuJoCo) models each part of a robot as an independent agent, for example, a leg of a spider or an arm of a swimmer. HASAC consistently outperforms its rivals, establishing a new state of the art for MARL.
SMAC Results
The StarCraft II Multi-Agent Challenge (SMAC) contains a set of StarCraft maps in which a team of ally units aims to defeat the enemy team. HASAC achieves over 90% win rates on 7 out of 8 maps and outperforms other strong baselines on most maps.
GRF and LAG Results
Google Research Football Environment (GRF) contains a set of cooperative multi-agent challenges in which a team of agents plays a team of bots in various football scenarios. Light Aircraft Game (LAG) is a recently developed cooperative-competitive environment for red and blue aircraft games, offering various settings such as single control, 1v1, and 2v2 scenarios. We evaluate HASAC on two GRF tasks and one LAG task. We again observe that HASAC generally outperforms its rivals.
MPE Results
We evaluate HASAC on the Spread, Reference, and Speaker_Listener tasks of the Multi-Agent Particle Environment (MPE), as implemented in PettingZoo. HASAC consistently outperforms the baselines in both average return and sample efficiency.