Revisiting Some Common Practices in Cooperative Multi-Agent Reinforcement Learning

Wei Fu, Chao Yu, Zelai Xu, Jiaqi Yang, and Yi Wu

Abstract

We revisit two common practices in cooperative MARL, value decomposition (VD) and parameter sharing, and show that in certain scenarios, e.g., environments with a highly multi-modal reward landscape, these practices can be problematic and lead to undesired outcomes. In contrast, policy gradient (PG) methods with individual policies provably converge to an optimal solution in these cases. Inspired by our theoretical analysis, we present practical suggestions on implementing multi-agent PG algorithms for either high rewards or diverse emergent behaviors.

XOR/PERMUTATION GAME

Graphical illustration of the 4-player permutation game.

We first consider an n-player permutation game, where each agent has n actions {1, 2, ..., n}. The agents receive a reward of +1 if and only if their joint action is a permutation of {1, 2, ..., n}. The XOR game is the 2-player permutation game, where the agents receive a reward of +1 if and only if they output different actions.

There are n! symmetric and equally optimal strategies in this game!
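For concreteness, here is a minimal sketch of the permutation-game reward (function and variable names are illustrative, not from the released code):

```python
import itertools

def permutation_reward(joint_action, n):
    """Return +1 iff the n agents' actions form a permutation of {1, ..., n}."""
    return 1.0 if sorted(joint_action) == list(range(1, n + 1)) else 0.0

# XOR game = 2-player permutation game: reward is +1 iff the two actions differ.
assert permutation_reward((1, 2), n=2) == 1.0
assert permutation_reward((2, 2), n=2) == 0.0

# Enumerate all optimal joint actions for n = 4: there are 4! = 24 of them.
optima = [a for a in itertools.product(range(1, 5), repeat=4)
          if permutation_reward(a, n=4) == 1.0]
assert len(optima) == 24
```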

VD Loss In XOR Game

Due to their limited expressive power, VD methods fundamentally fail to converge in the XOR game, even for the most representative method, QPLEX.
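To make the expressiveness issue concrete, the sketch below fits the simplest additive (VDN-style) decomposition Q_tot(a1, a2) = Q1(a1) + Q2(a2) to the XOR payoff by least squares. The best additive fit is flat, so greedy action selection cannot separate optimal from suboptimal joint actions. (This only illustrates the additive case; the paper's argument for more expressive mixers such as QPLEX is not reproduced here.)

```python
import numpy as np

# XOR payoff: +1 iff the two agents choose different actions.
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# Additive (VDN-style) decomposition: Q_tot(a1, a2) = Q1(a1) + Q2(a2).
# Build the design matrix over all 4 joint actions and solve least squares.
A, y = [], []
for a1 in range(2):
    for a2 in range(2):
        row = np.zeros(4)
        row[a1] = 1.0      # indicator for Q1(a1)
        row[2 + a2] = 1.0  # indicator for Q2(a2)
        A.append(row)
        y.append(R[a1, a2])
theta, *_ = np.linalg.lstsq(np.array(A), np.array(y), rcond=None)
Q1, Q2 = theta[:2], theta[2:]

Q_tot = Q1[:, None] + Q2[None, :]
print(Q_tot)  # ~0.5 everywhere: the fit assigns the same value to optimal
              # (different actions) and suboptimal (same actions) joint
              # actions, so greedy selection cannot recover the XOR optimum.
```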

PG methods in XOR/permutation game

Heatmap showing the frequency of every possible joint action in the 4-player permutation game.

We first show that, in contrast to VD methods, individual policy learning (PG-Ind), i.e., learning a separate set of policy parameters for each agent, provably converges to an optimum in the XOR game. We then propose an auto-regressive policy representation (PG-AR) to learn a single policy that covers all possible optimal modes.
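A minimal tabular illustration of the auto-regressive factorization pi(a1, a2) = pi_1(a1) · pi_2(a2 | a1) on the XOR game (an illustrative instance with hand-picked conditionals, not the PG-AR implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_joint_action():
    """Auto-regressive sampling: agent 2's action is conditioned on agent 1's."""
    a1 = rng.integers(2)  # pi_1: uniform over the two actions (labeled 0/1 here)
    a2 = 1 - a1           # pi_2(. | a1): deterministically pick the other action
    return a1, a2

# A single auto-regressive policy covers BOTH optimal modes, (0, 1) and (1, 0),
# whereas a deterministic policy with independent per-agent outputs can
# represent only one of them.
samples = [sample_joint_action() for _ in range(1000)]
assert all(a1 != a2 for a1, a2 in samples)
print({mode: samples.count(mode) for mode in [(0, 1), (1, 0)]})
```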

BRIDGE GAME

VD Methods vs PG Methods

In the Bridge game, each agent needs to cross the bridge to reach the spawn point of the other agent.

Bridge can be interpreted as a temporal version of the XOR game, since the two agents need to perform different macro actions, i.e., wait or move, to achieve the optimal reward.

While VD methods may fail to converge, PG methods can converge to one of the optima.

Individual Policy

Uni-modal behavior by PG-Ind.

Agent 1 (red) always passes the bridge first.

Auto-Regressive Policy

Multi-modal behavior by PG-AR in the Bridge game.

Depending on the action of agent 1 (red), agent 2 (blue) makes different decisions.

SMAC & GRF

VD Methods vs PG Methods

SMAC

Agent ID can be critical (PG-ID, i.e., parameter sharing with an agent-specific ID input, vs. PG-sh, i.e., plain parameter sharing).
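A minimal sketch of what the agent ID means here, assuming the common convention of appending a one-hot ID to each agent's observation before the shared policy network (names are hypothetical):

```python
import numpy as np

def add_agent_id(obs, agent_idx, n_agents):
    """PG-ID: concatenate a one-hot agent ID to the observation fed into the
    shared policy; PG-sh feeds the raw observation without any ID."""
    one_hot = np.zeros(n_agents)
    one_hot[agent_idx] = 1.0
    return np.concatenate([obs, one_hot])

obs = np.random.randn(10)  # hypothetical per-agent observation
print(add_agent_id(obs, agent_idx=2, n_agents=5).shape)  # (15,)
```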

GRF

PG-Ind can achieve better results than PG-ID on 5 of 6 scenarios.

EMERGENT BEHAVIOR LEARNED by PG-AR

We observe interesting emergent behaviors discovered by PG-AR that require particularly strong coordination among agents.

PG-Ind (MAPPO)

Marines keep moving around the map to maintain a safe distance from the enemy.

PG-AR

Marines stand still and attack alternately, ensuring that only one marine attacks at each timestep.

PG-Ind (HAPPO)

Plain pass-and-shoot.

PG-AR

"Tiki-Taka" style behavior: each player keeps passing the ball to their teammates.