MEHARL, a framework for learning stochastic policies in cooperative multi-agent reinforcement learning (MARL), comprises three key components: the probabilistic graphical model derivation of maximum entropy MARL, the HASAC algorithm with monotonic improvement and quantal response equilibrium (QRE) convergence properties, and the unified MEHAML template, which endows any induced maximum entropy method with the same theoretical guarantees as HASAC. HASAC is robust against various real-world uncertainties, including perturbations in rewards, dynamics, states, and actions. It consistently outperforms strong baselines, demonstrating improved stability, sample efficiency, and exploration.
In this work, we focus on cooperative multi-agent tasks, in which a group of agents tries to optimize a shared reward function. To address the challenges of exploration and robustness in MARL, we derive the maximum entropy objective of MARL from a probabilistic inference perspective. This leads to the development of the HASAC algorithm and the MEHAML theory. Building on our conference work, which introduced the HASAC algorithm and the MEHAML framework for cooperative MARL, our journal version expands these concepts to address robustness more comprehensively.
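Concretely, the inference view yields an objective that augments the expected return with each agent's policy entropy. A sketch of its general form, using a joint policy \(\boldsymbol{\pi} = (\pi^1, \dots, \pi^n)\), temperature \(\alpha\), and discount \(\gamma\) (the notation follows standard MaxEnt conventions rather than reproducing the paper's exact symbols), is
\[
J(\boldsymbol{\pi}) \;=\; \mathbb{E}_{\boldsymbol{\pi}}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\Big( r(s_t, \mathbf{a}_t) \;+\; \alpha \sum_{i=1}^{n} \mathcal{H}\big(\pi^{i}(\cdot \mid s_t)\big) \Big)\right].
\]
Maximizing entropy alongside return drives broader exploration and, as discussed below, underlies the robustness properties established in the journal version.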
Key Contributions of Our Conference Work:
MaxEnt MARL Formulation: We model cooperative MARL as a probabilistic graphical inference problem, deriving the maximum entropy MARL objective.
HASAC Algorithm: We develop the HASAC algorithm, ensuring monotonic improvement and convergence to QRE without restrictive assumptions (a sketch of its sequential update rule follows this list).
MEHAML Framework: We generalize heterogeneous-agent soft policy iteration (HASPI) to the MEHAML framework, supporting the design of multiple MaxEnt MARL algorithms with consistent guarantees.
Experimental Results: HASAC demonstrates superior performance across various environments, showing improved training stability, sample efficiency, and enhanced exploration.
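As a sketch of the sequential scheme referenced above (illustrative notation only: \(Q_{\boldsymbol{\pi}_{\mathrm{old}}}\) denotes a soft Q-function under the previous joint policy, \(\mathcal{D}\) a replay buffer, and \(i_{1:n}\) a randomly drawn agent ordering), each agent \(i_m\) in turn solves
\[
\pi^{i_m}_{\mathrm{new}} \;=\; \arg\max_{\pi^{i_m}} \; \mathbb{E}_{s \sim \mathcal{D},\; \mathbf{a}^{i_{1:m-1}} \sim \boldsymbol{\pi}^{i_{1:m-1}}_{\mathrm{new}},\; a^{i_m} \sim \pi^{i_m}} \Big[ Q_{\boldsymbol{\pi}_{\mathrm{old}}}\big(s, \mathbf{a}^{i_{1:m-1}}, a^{i_m}\big) \;-\; \alpha \log \pi^{i_m}\big(a^{i_m} \mid s\big) \Big],
\]
conditioning on the freshly updated actions of agents \(i_{1:m-1}\); iterating these per-agent maximizations is what delivers the monotonic improvement and QRE convergence guarantees.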
Journal Version Additions:
Robustness Guarantee: We prove that our MaxEnt MARL objective is inherently robust to perturbations in rewards, environment dynamics, states, and actions under certain conditions.
Superior Robustness Results: When tested against diverse uncertainties, HASAC consistently outperforms baselines in all scenarios.
Real-World Robustness: HASAC is deployed in a real-world robotic arena, demonstrating superior robustness against multiple types of perturbations.
The following figure illustrates the sequential update scheme employed by HASAC to achieve coordinated agent updates. A building block of our theory, and of this scheme, is the multi-agent advantage decomposition theorem, a discovery from HATRPO/HAPPO [Kuba et al., ICLR 2022].
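For reference, in its standard (non-entropy-regularized) form the theorem states that for any state \(s\), any ordered subset of agents \(i_{1:m}\), and any joint action \(\mathbf{a}^{i_{1:m}}\), the joint advantage decomposes into a sum of per-agent advantages:
\[
A^{i_{1:m}}_{\boldsymbol{\pi}}\big(s, \mathbf{a}^{i_{1:m}}\big) \;=\; \sum_{j=1}^{m} A^{i_j}_{\boldsymbol{\pi}}\big(s, \mathbf{a}^{i_{1:j-1}}, a^{i_j}\big).
\]
This is why updating agents one at a time, each conditioning on its predecessors' freshly sampled actions, suffices to improve the joint policy.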
In the majority of environments, we find that HASAC achieves strong results compared to current state-of-the-art off-policy and on-policy methods, while exhibiting higher sample efficiency and robustness. Videos for each benchmark are shown below.
Bi-DexHands offers numerous bimanual manipulation tasks designed to match various human skill levels. Built on the Isaac Gym simulator, Bi-DexHands supports running thousands of environments simultaneously. HASAC outperforms the other five methods by a large margin, showcasing faster convergence and lower variance.
MuJoCo tasks challenge a robot to learn an optimal way of moving; Multi-Agent MuJoCo (MAMuJoCo) models each part of a robot as an independent agent, for example, a leg of a spider or an arm of a swimmer. HASAC consistently outperforms its rivals, thereby establishing itself as a new state-of-the-art algorithm for MARL.
The StarCraft II Multi-Agent Challenge (SMAC) contains a set of StarCraft maps in which a team of ally units aims to defeat the opponent team. HASAC achieves win rates of over 90% on 7 out of 8 maps and outperforms other strong baselines on most maps.
The Google Research Football Environment (GRF) contains a set of cooperative multi-agent challenges in which a team of agents plays against a team of bots in various football scenarios. Light Aircraft Game (LAG) is a recently developed cooperative-competitive environment for red-versus-blue aircraft games, offering settings such as single control, 1v1, and 2v2 scenarios. We evaluate HASAC on two GRF tasks and one LAG task and again observe that HASAC generally outperforms its rivals.
We evaluate HASAC on the Spread, Reference, and Speaker_Listener tasks of the Multi-Agent Particle Environment (MPE), as implemented in PettingZoo. HASAC consistently outperforms the baselines in terms of both average return and sample efficiency.
We evaluate HASAC's robustness in simulations and real-world scenarios using the Pursuit-Evade testbed. HASAC demonstrates inherent robustness against uncertainties in rewards, dynamics, states, and actions without additional tuning. In real-world deployments, HASAC consistently maintains robustness, while baseline methods struggle with uncertainties.
Figures: (a) environment uncertainty and (b) action uncertainty; simulation results and real-world results.