Multi-Agent Collaboration via Reward Attribution Decomposition

Abstract

Recent advances in multi-agent reinforcement learning (MARL) have achieved super-human performance in games like Quake 3 and Dota 2. Unfortunately, these techniques require orders-of-magnitude more training rounds than humans and do not generalize to new agent configurations even on the same game. In this work, we propose Collaborative Q-learning (CollaQ), which achieves state-of-the-art performance on the StarCraft Multi-Agent Challenge and supports ad hoc team play. For this, we first formulate multi-agent collaboration as a joint optimization on reward assignment and show that each agent has an approximate optimal policy that decomposes into two parts: one part that relies only on the agent's own state, and another part related to the states of nearby agents. Following this novel finding, CollaQ decomposes the Q-function of each agent into a self term and an interactive term, with a Multi-Agent Reward Attribution (MARA) loss that regularizes the training. CollaQ is evaluated on various StarCraft maps and outperforms existing state-of-the-art techniques (i.e., QMIX, QTRAN, and VDN), improving the win rate by 40% with the same number of samples. In the more challenging ad hoc team play setting (i.e., reweight/add/remove units without retraining or finetuning), CollaQ outperforms the previous SoTA by over 30%.
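To make the decomposition concrete, below is a minimal PyTorch-style sketch of the idea described above. The network sizes, observation split, and masking scheme are illustrative assumptions, not the paper's exact implementation: the per-agent Q-value is the sum of a self term (own observation only) and an interactive term (full observation), and the MARA loss pushes the interactive term toward zero when teammates' information is masked out.

```python
import torch
import torch.nn as nn

class CollaQAgentSketch(nn.Module):
    """Sketch of the per-agent Q decomposition: Q_i = Q_alone + Q_collab.
    Dimensions and architectures are illustrative only."""

    def __init__(self, self_dim, full_dim, n_actions, hidden=64):
        super().__init__()
        # Self term: depends only on the agent's own observation.
        self.q_alone = nn.Sequential(
            nn.Linear(self_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))
        # Interactive term: depends on the full observation (self + nearby agents).
        self.q_collab = nn.Sequential(
            nn.Linear(full_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, obs_self, obs_full):
        # Total per-agent Q-values over actions.
        return self.q_alone(obs_self) + self.q_collab(obs_full)

    def mara_loss(self, obs_full_masked):
        # obs_full_masked: full observation with other agents' features zeroed out.
        # Without teammates' information, the interactive term should contribute nothing.
        return self.q_collab(obs_full_masked).pow(2).mean()
```

In training, a term like `mara_loss` would be added to the usual TD objective, so that collaboration-specific credit is attributed to the interactive part rather than leaking into the self term.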

Video for the StarCraft Multi-Agent Challenge on map MMM2. CollaQ and QMIX control the red team; the blue team is controlled by the built-in AI. CollaQ manages to keep 9 agents alive.

Results

The tables provide results for CollaQ and other multi-agent reinforcement learning methods on the StarCraft Multi-Agent Challenge. CollaQ outperforms other methods by over 30% and successfully generalizes to ad hoc team play (we tested scenarios including randomly assigning VIP agents and swapping, adding, and removing agents at test time); a toy illustration of these scenarios is sketched below.
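The following sketch illustrates how such test-time perturbations might be generated. The helper, scenario names, and team representation are hypothetical and for illustration only; they are not the evaluation code used in the paper.

```python
import random

def make_adhoc_team(base_team, scenario, rng=random):
    """Build a perturbed team configuration for ad hoc evaluation.
    base_team: list of unit identifiers; scenario: one of the strings below."""
    team = list(base_team)
    if scenario == "assign_vip":
        vip = rng.choice(team)                  # pick a random VIP whose survival matters
        return team, vip
    if scenario == "swap":
        i, j = rng.sample(range(len(team)), 2)  # swap two unit slots
        team[i], team[j] = team[j], team[i]
    elif scenario == "add":
        team.append(rng.choice(base_team))      # add one more unit of an existing type
    elif scenario == "remove":
        team.pop(rng.randrange(len(team)))      # drop a unit
    return team, None
```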

Videos on StarCraft Multi-Agent Challenge

CollaQ and QMIX are trained in the standard StarCraft II Multi-Agent Challenge setting. The challenge requires controlling a team of units to defeat the enemies. The red side is controlled by the RL agents and the blue side by the built-in AI.

CollaQ learns the following behaviors: (1) the Medivac dropship heals only the unit under attack, (2) units with low HP are pulled back to avoid focused fire from the opponent, (3) healthy units move forward to absorb the focused fire, and (4) coordinated rotation between healthy and damaged units, so that about 5 out of 10 units survive. QMIX only learns (1) and (2).

CollaQ learns to focus fire on one side of the attack to clear one of the corridors. QMIX does not show this behavior clearly.

Videos on StarCraft Multi-Agent Challenge VIP Ad-Hoc

CollaQ and QMIX are trained in the StarCraft II Multi-Agent Challenge setting with one VIP agent in the team whose survival matters. This challenge requires controlling a team of units to defeat the enemies while keeping the VIP agent alive. The red side is controlled by the RL agents and the blue side by the built-in AI. We select the VIP agent ad hoc at test time.
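As an illustration of how VIP survival could enter the team reward, here is a hedged sketch; the coefficients and exact reward terms are assumptions and may differ from the reward design used in the experiments.

```python
def vip_team_reward(damage_dealt, allies_alive, vip_alive, vip_bonus=5.0):
    """Hypothetical shaped reward: the usual battle reward plus a term
    that depends on the VIP's survival. All weights are illustrative."""
    battle_reward = damage_dealt + 0.1 * allies_alive
    return battle_reward + (vip_bonus if vip_alive else -vip_bonus)
```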

CollaQ learns to protect the VIP agent: when the team has a very high chance to win, the VIP agent hides behind other agents to avoid being attacked. Such behavior is not clearly shown by QMIX.

Snapshots on StarCraft Multi-Agent Challenge

The snapshots show several interesting behaviors:

(1) The Medivac is healing the agent under attack;

(2) The agents under attack are pulled back;

(3) The healthy agents are moving forward to absorb the attacks.

Snapshots on StarCraft Multi-Agent Challenge VIP Ad Hoc

The snapshots show a notable behavior learned by CollaQ: when the team has a high chance to win, the VIP agent hides behind all other agents to avoid enemy attacks.