Revisiting Some Common Practices in Cooperative Multi-Agent Reinforcement Learning

Wei Fu, Chao Yu, Zelai Xu, Jiaqi Yang, and Yi Wu

Abstract

We revisit two common practices in cooperative MARL, value decomposition (VD) and parameter sharing, and show that in certain scenarios, e.g., environments with a highly multi-modal reward landscape, these practices can be problematic and lead to undesired outcomes. In contrast, policy gradient (PG) methods with individual policies provably converge to an optimal solution in these cases. Inspired by our theoretical analysis, we present practical suggestions on implementing multi-agent PG algorithms for either high rewards or diverse emergent behaviors.

XOR/PERMUTATION GAME

Graphical illustration of the 4-player permutation game.

We first consider an n-player permutation game, where each agent has n actions {1, 2, ..., n}. The agents receive a reward of +1 if and only if their joint action is a permutation of {1, 2, ..., n}. The XOR game is the 2-player permutation game, where the agents receive a reward of +1 if and only if they output different actions.

There are n! symmetric and equally optimal strategies in this game!
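For concreteness, here is a minimal sketch of the permutation-game reward (function and variable names are illustrative, not from the released code):

```python
import itertools

def permutation_reward(joint_action, n):
    """Return +1 iff the n agents' actions form a permutation of {1, ..., n}."""
    return 1.0 if sorted(joint_action) == list(range(1, n + 1)) else 0.0

# XOR game = 2-player permutation game: reward is +1 iff the two actions differ.
assert permutation_reward((1, 2), n=2) == 1.0
assert permutation_reward((2, 2), n=2) == 0.0

# Enumerate all optimal joint actions for n = 4: there are 4! = 24 of them.
optima = [a for a in itertools.product(range(1, 5), repeat=4)
          if permutation_reward(a, n=4) == 1.0]
assert len(optima) == 24
```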

VD Loss In XOR Game

Due to their limited expressive power, VD methods fundamentally fail to converge in the XOR game, even for the most representative method, QPLEX.
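To make the expressiveness issue concrete, the sketch below fits the simplest additive (VDN-style) decomposition Q_tot(a1, a2) = Q1(a1) + Q2(a2) to the XOR payoff by least squares. The best additive fit is flat, so greedy action selection cannot separate optimal from suboptimal joint actions. (This only illustrates the additive case; the paper's argument for more expressive mixers such as QPLEX is not reproduced here.)

```python
import numpy as np

# XOR payoff: +1 iff the two agents choose different actions.
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# Additive (VDN-style) decomposition: Q_tot(a1, a2) = Q1(a1) + Q2(a2).
# Build the design matrix over all 4 joint actions and solve least squares.
A, y = [], []
for a1 in range(2):
    for a2 in range(2):
        row = np.zeros(4)
        row[a1] = 1.0      # indicator for Q1(a1)
        row[2 + a2] = 1.0  # indicator for Q2(a2)
        A.append(row)
        y.append(R[a1, a2])
theta, *_ = np.linalg.lstsq(np.array(A), np.array(y), rcond=None)
Q1, Q2 = theta[:2], theta[2:]

Q_tot = Q1[:, None] + Q2[None, :]
print(Q_tot)  # ~0.5 everywhere: the fit assigns the same value to optimal
              # (different actions) and suboptimal (same actions) joint
              # actions, so greedy selection cannot recover the XOR optimum.
```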

PG methods in XOR/permutation game

Heatmap showing the frequency of every possible joint action in the 4-player permutation game.

We first show that, in contrast to VD methods, individual policy learning (PG-Ind), i.e., learning a separate set of policy parameters for each agent, provably converges to an optimum in the XOR game. We then propose an auto-regressive policy representation (PG-AR) to learn a single policy that covers all possible optimal modes.
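A minimal tabular illustration of the auto-regressive factorization pi(a1, a2) = pi_1(a1) · pi_2(a2 | a1) on the XOR game (an illustrative instance with hand-picked conditionals, not the PG-AR implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_joint_action():
    """Auto-regressive sampling: agent 2's action is conditioned on agent 1's."""
    a1 = rng.integers(2)  # pi_1: uniform over the two actions (labeled 0/1 here)
    a2 = 1 - a1           # pi_2(. | a1): deterministically pick the other action
    return a1, a2

# A single auto-regressive policy covers BOTH optimal modes, (0, 1) and (1, 0),
# whereas a deterministic policy with independent per-agent outputs can
# represent only one of them.
samples = [sample_joint_action() for _ in range(1000)]
assert all(a1 != a2 for a1, a2 in samples)
print({mode: samples.count(mode) for mode in [(0, 1), (1, 0)]})
```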

BRIDGE GAME

VD Methods vs PG Methods

In the Bridge game, each agent needs to cross the bridge to reach the spawn point of the other agent.

Bridge can be interpreted as a temporal version of the XOR game, since the two agents need to perform different macro actions, i.e., wait or move, to achieve the optimal reward.

While VD methods may fail to converge, PG methods can converge to one of the optima.

Individual Policy

Uni-modal behavior by PG-Ind.

Agent 1 (red) always passes the bridge first.

Auto-Regressive Policy

Multi-modal behavior by PG-AR in the Bridge game.

Depending on the action of agent 1 (red), agent 2 (blue) makes different decisions.

SMAC & GRF

VD Methods vs PG Methods

SMAC

Agent ID can be critical (PG-ID, i.e., parameter sharing with an agent-specific ID input, vs. PG-sh, i.e., plain parameter sharing).
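A minimal sketch of what the agent ID means here, assuming the common convention of appending a one-hot ID to each agent's observation before the shared policy network (names are hypothetical):

```python
import numpy as np

def add_agent_id(obs, agent_idx, n_agents):
    """PG-ID: concatenate a one-hot agent ID to the observation fed into the
    shared policy; PG-sh feeds the raw observation without any ID."""
    one_hot = np.zeros(n_agents)
    one_hot[agent_idx] = 1.0
    return np.concatenate([obs, one_hot])

obs = np.random.randn(10)  # hypothetical per-agent observation
print(add_agent_id(obs, agent_idx=2, n_agents=5).shape)  # (15,)
```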

GRF

PG-Ind can achieve better results than PG-ID on 5 of 6 scenarios.

EMERGENT BEHAVIOR LEARNED by PG-AR

We observe interesting emergent behaviors discovered by PG-AR that require particularly strong coordination among agents.

PG-Ind (MAPPO)

Marines keep moving around the map to maintain a safe distance from the enemy.

PG-AR

Marines stand still and attack alternately, ensuring that only one marine attacks at each timestep.

PG-Ind (HAPPO)

Plain pass-and-shoot.

PG-AR

"Tiki-Taka" style behavior: each player keeps passing the ball to their teammates.