MOMA-PPO: A Model-Based Solution to the Offline Multi-Agent Reinforcement Learning Coordination Problem
Paul Barde, Jakob Foerster, Derek Nowrouzezahrai, Amy Zhang.
International Conference on Autonomous Agents and Multiagent Systems (AAMAS) 2024.
Link to paper
Strategy Agreement
Partially observable two-agent reacher
Each agent observes only the joint it controls (theta_1 for the red agent, theta_2 for the blue agent) and the target location (in black).
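The observation split described above can be sketched as follows. This is only an illustration under stated assumptions, not the paper's environment code: the function name `split_reacher_observation`, the agent keys `"red"` and `"blue"`, and the choice of a 2D target position are hypothetical.

```python
def split_reacher_observation(theta_1, theta_2, target_xy):
    """Build per-agent local observations for a two-agent reacher.

    Hypothetical sketch: each agent sees only its own joint angle plus
    the target position, never the other agent's joint.
    """
    target = [float(c) for c in target_xy]
    return {
        "red": [float(theta_1)] + target,   # red agent controls joint 1
        "blue": [float(theta_2)] + target,  # blue agent controls joint 2
    }


# Example values are arbitrary and only serve the illustration.
obs = split_reacher_observation(0.3, -1.2, (0.5, 0.1))
```

Because neither agent sees the other's joint, the two agents cannot infer from their observations alone which convention (clockwise or counter-clockwise) their teammate is following.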
clockwise expert (in dataset)
counter-clockwise expert (in dataset)
ITD3+BC agents fail to agree on a convention
MOMA-PPO agents agree on a convention and can even alternate between the two conventions depending on the target position
Strategy Finetuning
Partially observable four-agent ant
Each agent controls a different limb and observes only the joints of that limb. The yellow agent (shown in white in the GIFs) is the only one that additionally observes the torso (in white).
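The asymmetric observation structure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `split_ant_observation`, the agent keys other than `"yellow"`, and the feature values are hypothetical.

```python
def split_ant_observation(limb_joints, torso, torso_agent="yellow"):
    """Build per-agent local observations for a four-agent ant.

    Hypothetical sketch: every agent sees the joints of its own limb,
    and only `torso_agent` additionally sees the torso features.
    """
    local_obs = {}
    for agent, joints in limb_joints.items():
        obs = list(joints)
        if agent == torso_agent:
            obs = obs + list(torso)  # only this agent observes the torso
        local_obs[agent] = obs
    return local_obs


# Arbitrary example values; agent names besides "yellow" are placeholders.
limbs = {
    "yellow": [0.1, 0.2],
    "agent_b": [0.3, 0.4],
    "agent_c": [0.5, 0.6],
    "agent_d": [0.7, 0.8],
}
ant_obs = split_ant_observation(limbs, torso=[1.0, 0.0, 0.0])
```

In this layout only the torso-observing agent has any information about the ant's heading, so it is the only one that can correct the team's direction.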
ITD3+BC-trained teams fail to coordinate and run in circles because the white agent cannot compensate for the other agents.
MOMA-PPO-trained teams produce satisfactory behaviors, and the white agent learns to steer the ant in the correct direction.