MOMA-PPO: A Model-Based Solution to the Offline Multi-Agent Reinforcement Learning Coordination Problem

Paul Barde, Jakob Foerster, Derek Nowrouzezahrai, Amy Zhang. 

International Conference on Autonomous Agents and Multiagent Systems (AAMAS) 2024.

Link to paper

Strategy Agreement

Partially observable two-agent reacher

Agents only observe the joint they control (theta_1 and theta_2 for the red and blue agents, respectively) and the target location (in black).
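
For concreteness, here is a minimal sketch of this observation split. The function and variable names are illustrative assumptions, not the repository's actual environment API:

```python
import numpy as np

def reacher_observations(theta_1, theta_2, target_xy):
    """Illustrative per-agent observation split for the two-agent reacher.

    Each agent only sees the angle of the joint it controls plus the shared
    target position; neither agent observes the other agent's joint.
    """
    return {
        "red": np.array([theta_1, *target_xy]),   # red agent: its joint + target
        "blue": np.array([theta_2, *target_xy]),  # blue agent: its joint + target
    }
```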

clockwise expert (in dataset)

counter-clockwise expert (in dataset)

ITD3+BC agents fail to agree on a convention

MOMA-PPO agents are able to agree on conventions and can even alternate between the two conventions depending on the target position

Strategy finetuning

Partially observable four-agent ant

Each agent controls a different limb and only observes the joints of the limb it controls. The yellow agent (shown in white in the gifs) is the only one that additionally observes the torso (in white).
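
A minimal sketch of how such an observation split could look. The agent names and index ranges below are hypothetical placeholders, not the repository's actual state layout:

```python
import numpy as np

# Hypothetical index ranges into the flattened ant state vector.
LIMB_SLICES = {
    "white": slice(0, 4),    # this agent additionally observes the torso
    "agent_1": slice(4, 8),
    "agent_2": slice(8, 12),
    "agent_3": slice(12, 16),
}
TORSO_SLICE = slice(16, 27)

def ant_observations(full_state, torso_agent="white"):
    """Return each agent's partial observation: the joints of its own limb,
    plus the torso state for the single torso-observing agent."""
    obs = {}
    for agent, limb_slice in LIMB_SLICES.items():
        parts = [full_state[limb_slice]]
        if agent == torso_agent:
            parts.append(full_state[TORSO_SLICE])
        obs[agent] = np.concatenate(parts)
    return obs
```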

itd3bc_circle_full_replay_partial.mp4

The ITD3+BC-trained team fails to coordinate and runs in circles because the white agent does not manage to compensate for the other agents.


moma-ppo_steering_full-replay_partial.mp4

MOMA-PPO-trained teams produce satisfactory behaviors, and the white agent learns to steer the ant in the correct direction.