Learning Nearly Decomposable Value Functions via Communication Minimization

This page shows replays of our method and two baselines, QMIX and QMIX+TarMAC, on six StarCraft II scenarios.

In our method, the message length is set to 3, and we show the results when 80% of the messages (by bits) are cut off. For QMIX+TarMAC, all messages are retained.
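To make "cutting off 80% of messages by bits" concrete, here is a minimal sketch. It assumes each message is a real-valued vector of length 3 and that bits are dropped by zeroing the entries with the smallest absolute value across all messages. This ranking rule and the function name `cut_off_messages` are illustrative assumptions, not necessarily the criterion used in the paper.

```python
import math


def cut_off_messages(messages, drop_fraction=0.8):
    """Zero out the smallest-magnitude fraction of all message bits.

    messages: dict mapping (sender, receiver) -> list of floats (length 3 here).
    Assumed pruning rule (hypothetical): rank every bit by absolute value
    and drop the lowest `drop_fraction` of them.
    """
    # Flatten all bits together with their location so they can be ranked jointly.
    bits = [(abs(v), key, i) for key, msg in messages.items() for i, v in enumerate(msg)]
    n_drop = math.floor(drop_fraction * len(bits))
    bits.sort(key=lambda b: b[0])  # stable sort: ties keep their original order
    dropped = {(key, i) for _, key, i in bits[:n_drop]}
    # Rebuild the messages with dropped bits replaced by zero.
    return {
        key: [0.0 if (key, i) in dropped else v for i, v in enumerate(msg)]
        for key, msg in messages.items()
    }


msgs = {("a", "b"): [0.1, -2.0, 0.3], ("b", "a"): [1.5, 0.05, -0.2]}
print(cut_off_messages(msgs))
```

With 6 bits in total, 4 are dropped and only the two largest-magnitude bits (-2.0 and 1.5) survive.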

3s_vs_5z (3 Stalkers vs. 5 Zealots)

3 Stalkers encounter 5 Zealots on a map. Zealots deal high damage but are melee units. To win the game, the Stalkers have to learn a micro-trick called kiting: attacking from range and moving away before the Zealots get close enough to attack.

Kiting requires knowing the exact positions of the enemies. Because we reduce the sight range of units, the 3 Stalkers have to coordinate and learn to communicate the necessary information to win.

Ours


QMIX


QMIX+TarMAC


3b_vs_1h1m (3 Banelings vs. 1 Hydralisk & 1 Medivac)

Three Banelings spawning randomly on the map try to kill a Hydralisk assisted by a Medivac. Together, the three Banelings deal exactly enough damage to destroy the Hydralisk (in StarCraft II, each Baneling deals 30 damage, and the Hydralisk has 90 health). This requires the Banelings to attack the Hydralisk simultaneously; otherwise, the Medivac heals the Hydralisk and there is no chance of winning. This task is designed to test whether our algorithm can learn coordinated decentralized policies.
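The following toy model illustrates why attack timing matters. It assumes hits land at discrete time steps and that the Medivac restores a fixed amount of health per step between hits. The heal rate `heal_per_step` and the function `hydralisk_survives` are hypothetical illustrations, not StarCraft II values or code from our method.

```python
def hydralisk_survives(hit_times, damage=30, max_health=90, heal_per_step=10):
    """Toy model: return True if the Hydralisk survives all Baneling hits.

    hit_times: time step at which each Baneling explodes (one hit each).
    heal_per_step: made-up Medivac heal rate, NOT an actual game value.
    """
    health = max_health
    last_t = None
    for t in sorted(hit_times):
        if last_t is not None:
            # The Medivac heals between hits, capped at maximum health.
            health = min(max_health, health + heal_per_step * (t - last_t))
        health -= damage
        if health <= 0:
            return False  # Hydralisk destroyed
        last_t = t
    return True  # all Banelings used up, Hydralisk still alive


print(hydralisk_survives([5, 5, 5]))  # simultaneous attack
print(hydralisk_survives([0, 2, 4]))  # staggered attack
```

In this model the simultaneous attack kills the Hydralisk, while the staggered one leaves it at 40 health; since each Baneling explodes on impact, there is no second chance.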

Ours

Three Banelings learn to communicate their positions and coordinate their attack timing.

QMIX

Without communication, the Banelings rush to attack the Hydralisk. Such a strategy wins only 18% of the episodes.

QMIX+TarMAC

With attentional communication, the left Baneling learns to wait in some cases. Compared with QMIX, this tactic improves performance by about 20%.

1o2r_vs_4r (1 Overseer & 2 Roaches vs. 4 Reapers)

An Overseer has found 4 Reapers. The Overseer's allies, 2 Roaches, need to get there and kill the Reapers. At the beginning of each episode, the Overseer and the Reapers spawn at a random point on the map, while the Roaches are initialized at another random point. Because the sight range of the Roaches is limited, only the Overseer knows the position of the enemy. Therefore, a learning algorithm has to learn to communicate the target position to win the combat effectively.

Ours

Our method learns the desired communication strategy, and the necessary information is retained even after 80% of the messages are dropped.

QMIX

Without communication, the QMIX Roaches learn to patrol the left side of the map. This strategy results in a win rate of 40%. This is reasonable: the Reapers spawn randomly on the map and appear on the left with a probability of 50%, so a left-side patrol can win at most about half of the episodes.

QMIX+TarMAC

The QMIX+TarMAC Overseer learns to send the position of the Reapers to the Roaches. However, the attention mechanism does not effectively learn the importance of messages: when 80% of the messages are cut off, its performance drops to that of QMIX (see Fig. 8 in the paper).

5z_vs_1ul (5 Zealots vs. 1 Ultralisk)

5 Zealots try to kill a powerful Ultralisk. To win, they have to learn a sophisticated micro-trick that demands correct positioning and attack timing.

Ours


QMIX


QMIX+TarMAC


MMM (7 Marines, 2 Marauders, 1 Medivac)

Two symmetric teams, each consisting of 7 Marines, 2 Marauders, and 1 Medivac, spawn at two fixed points, and the enemy team is tasked with attacking the ally team. We use this task to demonstrate the scalability of our method.

Ours


QMIX


QMIX+TarMAC


1o10b_vs_1r (1 Overseer & 10 Banelings vs. 1 Roach)

On a map full of cliffs, an Overseer detects a Roach. The Overseer's teammates, 10 Banelings, need to kill this Roach to receive the winning reward. Each Baneling deals 20 damage to the Roach, which has 145 health, so at least 8 Banelings are needed to kill it. The Overseer and the Roach spawn at a random point, while the Banelings spawn randomly all around the map.
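The minimum number of Banelings follows from a ceiling division. This sketch assumes, as the numbers above imply, that damage simply adds up with no armor, healing, or regeneration; `banelings_needed` is a hypothetical helper name.

```python
import math


def banelings_needed(target_health, damage_per_baneling):
    # Each Baneling explodes once, so n Banelings deal n * damage_per_baneling.
    # The smallest n with n * damage >= health is the ceiling of the ratio.
    return math.ceil(target_health / damage_per_baneling)


print(banelings_needed(145, 20))  # Roach in 1o10b_vs_1r
print(banelings_needed(90, 30))   # Hydralisk in 3b_vs_1h1m
```

For the Roach, 7 Banelings deal only 140 damage, while 8 deal 160, so 8 is the minimum; the same calculation gives 3 for the Hydralisk in 3b_vs_1h1m.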

In the minimal communication strategy, the Banelings can stay silent, and the Overseer needs to encode its position and send it to the Banelings, which otherwise do not know where the Roach is. We use this task to test the performance of our method in complex scenarios.

Ours

Our method learns an economical strategy in which 8 Banelings attack the Roach. 8 is exactly the number of Banelings needed to kill the Roach.

QMIX

QMIX agents learn to patrol the right side of the map. This strategy leads to a win rate of 40%.

QMIX+TarMAC

Attentional communication is not effective in this 10-agent scenario. Some Banelings do not learn to attack the Roach (for example, in the first episode shown here, only 7 Banelings participate in the battle, one fewer than the 8 needed to kill the Roach). The agents also seem to pay attention only to the bottom part of the map.