Revisiting Some Common Practices in Cooperative Multi-Agent Reinforcement Learning
Wei Fu, Chao Yu, Zelai Xu, Jiaqi Yang, and Yi Wu
We revisit two common practices in cooperative MARL, value decomposition (VD) and parameter sharing, and show that in certain scenarios, e.g., environments with a highly multi-modal reward landscape, these principles can be problematic and lead to undesired outcomes. In contrast, policy gradient (PG) methods with individual policies provably converge to an optimal solution in these cases. Inspired by our theoretical analysis, we present practical suggestions for implementing multi-agent PG algorithms for either high rewards or diverse emergent behaviors.
We first consider an n-player permutation game, where each agent has n actions {1, 2, ..., n}. The agents receive a reward of +1 if and only if their joint action is a permutation of {1, 2, ..., n}, and 0 otherwise. The XOR game is the 2-player permutation game, where the two agents receive a reward of +1 if and only if they output different actions.
There are n! symmetric and equally optimal strategies in this game!
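As a minimal sketch (our own illustration with 0-indexed actions, not the authors' environment code), the permutation game is a one-step matrix game:

import itertools  # only needed if you want to enumerate the n! optimal joint actions

class PermutationGame:
    """One-step n-player matrix game: reward is +1 iff the joint action is a permutation of {0, ..., n-1}."""
    def __init__(self, n):
        self.n = n

    def step(self, joint_action):
        # joint_action: a tuple with one action per agent, each in {0, ..., n-1}
        reward = 1.0 if sorted(joint_action) == list(range(self.n)) else 0.0
        return reward

# The XOR game is the n = 2 special case: only (0, 1) and (1, 0) are rewarded.
game = PermutationGame(2)
assert game.step((0, 1)) == 1.0 and game.step((1, 1)) == 0.0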
Due to limited expressive power, VD methods fundamentally fail to converge in the XOR game, even the most expressive VD method, QPLEX.
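To see why expressiveness is the bottleneck, here is the elementary argument for the simplest additive decomposition (the full analysis for monotonic and IGM-based mixers is in the paper). If Q_tot(a_1, a_2) = Q_1(a_1) + Q_2(a_2) matched the XOR payoff exactly, we would need

Q_1(1) + Q_2(2) = 1,   Q_1(2) + Q_2(1) = 1,   Q_1(1) + Q_2(1) = 0,   Q_1(2) + Q_2(2) = 0.

Summing the first two equations gives Q_1(1) + Q_1(2) + Q_2(1) + Q_2(2) = 2, while summing the last two gives the same quantity equal to 0, a contradiction. Hence no additive decomposition can represent the XOR reward, and more expressive VD classes still fail to converge to an optimal joint policy in this game.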
We first show that, in contrast to VD methods, individual policy learning (PG-Ind), i.e., learning a separate set of policy parameters for each agent, can provably converge to an optimum in the XOR game. We then propose an auto-regressive policy representation (PG-AR) to learn a single joint policy that covers all possible optimal modes.
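A minimal sketch of such an auto-regressive factorization pi(a_1, ..., a_n | s) = prod_i pi_i(a_i | s, a_1, ..., a_{i-1}); the per-agent heads, layer sizes, and method names below are our own illustration, not the authors' implementation:

import torch
import torch.nn as nn

class AutoRegressivePolicy(nn.Module):
    """Joint policy factorized as pi(a | s) = prod_i pi_i(a_i | s, a_1, ..., a_{i-1})."""
    def __init__(self, n_agents, n_actions, state_dim, hidden_dim=64):
        super().__init__()
        self.n_agents, self.n_actions = n_agents, n_actions
        # One head per agent; each head sees the state plus the one-hot actions of all earlier agents.
        in_dim = state_dim + n_agents * n_actions
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, n_actions))
            for _ in range(n_agents)
        ])

    def sample_joint_action(self, state):
        prev = torch.zeros(self.n_agents * self.n_actions)  # one-hot actions of earlier agents (zeros otherwise)
        actions, log_probs = [], []
        for i in range(self.n_agents):
            logits = self.heads[i](torch.cat([state, prev]))
            dist = torch.distributions.Categorical(logits=logits)
            a = dist.sample()
            actions.append(a.item())
            log_probs.append(dist.log_prob(a))
            prev[i * self.n_actions + a.item()] = 1.0  # later agents condition on this choice
        return actions, torch.stack(log_probs).sum()  # joint log-prob for a policy-gradient update

Because later agents observe the sampled actions of earlier agents, a single policy can place probability mass on several optimal joint actions at once, e.g., both (1, 2) and (2, 1) in the XOR game.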
In the Bridge game, each agent needs to cross the bridge to reach the spawn point of the other agent.
Bridge can be interpreted as a temporal version of the XOR game, since the two agents need to perform different macro actions, i.e., either wait or move first, to achieve the optimal reward.
While VD methods may fail to converge, PG methods can converge to one of the optima.
Uni-modal behavior by PG-Ind in the Bridge game.
Agent 1 (red) always crosses the bridge first.
Multi-modal behavior by PG-AR in the Bridge game.
Depending on the action of agent 1 (red), agent 2 (blue) makes different decisions.
The agent ID can be critical: compare PG-ID (parameter sharing with an agent-ID input) against PG-sh (plain parameter sharing without it).
PG-Ind can achieve higher returns than PG-ID on 5 of 6 scenarios.
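For concreteness, a minimal sketch of the three parameterizations compared above (the single linear layers and helper names are our simplification, not the paper's networks):

import torch
import torch.nn as nn

def make_policies(n_agents, obs_dim, n_actions, variant):
    """Illustrative constructors for the three parameterizations (names follow the text above)."""
    if variant == "PG-sh":   # plain parameter sharing: one module, identical inputs for all agents
        shared = nn.Linear(obs_dim, n_actions)
        return [shared] * n_agents
    if variant == "PG-ID":   # parameter sharing, but a one-hot agent ID is appended to each observation
        shared = nn.Linear(obs_dim + n_agents, n_actions)
        return [shared] * n_agents
    if variant == "PG-Ind":  # a separate set of parameters per agent
        return [nn.Linear(obs_dim, n_actions) for _ in range(n_agents)]
    raise ValueError(variant)

def policy_input(obs, agent_id, n_agents, variant):
    """Append the one-hot agent ID only for the PG-ID variant."""
    if variant == "PG-ID":
        one_hot = torch.zeros(n_agents)
        one_hot[agent_id] = 1.0
        return torch.cat([obs, one_hot])
    return obs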
Marines keep moving within the map to maintain a safe distance from the enemy.
Marines stand still and attack alternately, ensuring that only one marine is attacking at each timestep.
Plain pass-and-shoot.
"Tiki-Taka" style behavior: each player keeps passing the ball to their teammates.