Off-Policy Correction For Multi-Agent Reinforcement Learning

Michał Zawalski, Błażej Osiński, Henryk Michalewski, Piotr Miłoś

Multi-agent reinforcement learning (MARL) provides a framework for problems involving multiple interacting agents. Despite the apparent similarity to the single-agent case, multi-agent problems are often harder to train and analyze theoretically. To address this, we propose MA-Trace - a new actor-critic algorithm for multi-agent reinforcement learning.

MA-Trace is a natural generalization of V-Trace to the multi-agent domain. The key idea behind our algorithm is to use importance sampling as an off-policy correction. This allows for highly scalable multi-worker (multi-node) training while preserving simplicity.
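
For illustration, below is a minimal NumPy sketch of a V-Trace-style value target in which the joint importance ratio is assumed to factorize into per-agent ratios (the agents act independently given their observations). The function name and the clipping constants rho_bar and c_bar are illustrative rather than the exact formulation from the paper.

    import numpy as np

    def ma_vtrace_targets(rewards, values, bootstrap_value,
                          target_logp, behaviour_logp,
                          gamma=0.99, rho_bar=1.0, c_bar=1.0):
        # rewards, values: arrays of shape [T] (shared team reward, critic values)
        # target_logp, behaviour_logp: arrays of shape [T, n_agents] holding the
        # per-agent log-probabilities of the actions that were actually executed
        T = len(rewards)

        # Joint importance ratio: product of per-agent ratios between the current
        # (target) policy and the behaviour policy that collected the data.
        log_ratio = (target_logp - behaviour_logp).sum(axis=-1)
        ratio = np.exp(log_ratio)
        rhos = np.minimum(rho_bar, ratio)  # clipped weights for the TD terms
        cs = np.minimum(c_bar, ratio)      # clipped trace-cutting weights

        next_values = np.append(values[1:], bootstrap_value)
        deltas = rhos * (rewards + gamma * next_values - values)

        # Backward recursion:
        # v_t - V(x_t) = delta_t + gamma * c_t * (v_{t+1} - V(x_{t+1}))
        acc, corrections = 0.0, np.zeros(T)
        for t in reversed(range(T)):
            acc = deltas[t] + gamma * cs[t] * acc
            corrections[t] = acc
        return values + corrections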

Importantly, MA-Trace enjoys theoretical grounding - we prove a fixed-point theorem that guarantees convergence. 

We evaluate the algorithm extensively on the StarCraft Multi-Agent Challenge - a standard benchmark for multi-agent algorithms. MA-Trace, despite its simplicity, achieves high performance on all the tasks and exceeds state-of-the-art results on some of them.

Read the detailed description of MA-Trace here.

Experimental results

We evaluate our algorithm on the StarCraft Multi-Agent Challenge; read the description of this benchmark here.

Experiments show that MA-Trace masters all the tasks in the benchmark, with one exception: the 3s_vs_5z task, on which our algorithm learns to exploit the reward scheme instead (see below).

Particularly interestingly, our algorithm learns several techniques associated with professional human players, such as focusing fire, withdrawing low-health units, hit-and-run tactics, sacrificing a unit to deceive the opponent, and others.

Comparison with the state of the art

MA-Trace reaches performance comparable to the state of the art on all the tasks and even exceeds it on some. In the figure below, we compare MA-Trace with the best-performing algorithms; see the paper for the full numerical data.

Importance of importance sampling

The strong performance of MA-Trace is due to the importance weights - the key feature of our algorithm. Indeed, our experiments show that without these corrections the algorithm fails to reach decent performance, even on easy tasks. The deterioration worsens as the number of workers increases.
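
To make this ablation concrete: in the sketch shown earlier, dropping the correction amounts to forcing every importance weight to 1, i.e. treating the increasingly stale off-policy data collected by the workers as if it were on-policy.

    # Hypothetical ablation of the earlier sketch: passing the behaviour
    # log-probabilities as the target ones makes every ratio equal to 1,
    # which removes the off-policy correction entirely.
    uncorrected_targets = ma_vtrace_targets(
        rewards, values, bootstrap_value,
        target_logp=behaviour_logp,  # zero log-ratio => all weights equal 1
        behaviour_logp=behaviour_logp)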

Gameplays of MA-Trace

Here we present a few fights performed by MA-Trace on the hardest tasks. In every game, we control the red team, while the blue one is controlled by the built-in AI.

27m_vs_30m

27m_vs_30m.mp4

MMM2

MMM2.mp4

3s5z_vs_3s6z

3s5z_vs_3s6z.mp4

10m_vs_11m

10m_vs_11m.mp4

6h_vs_8z

6h_vs_8z.mp4

corridor

corridor.mp4

Learning on the 3s_vs_5z task

As noted before, MA-Trace masters all the tasks except 3s_vs_5z. In this scenario, 3 Stalkers (right) fight against 5 Zealots (left). Stalkers can attack the enemy from a distance but are no match for Zealots in close combat. They can gain an advantage by shooting the enemies from afar and fleeing when the Zealots get close.

MA-Trace discovers this strategy quite quickly and manages to win almost every episode. During later training, it learns to refrain from killing the Zealots. This seems surprising but is perfectly in line with the reward structure of the task: the Stalkers are rewarded for inflicting damage, so it pays to keep the Zealots alive - since they regenerate health, over time they can absorb more total damage.

This is most likely unintended by SMAC's creators. Interestingly, the described strategy is not found by algorithms such as QMIX, IQL, or VDN.

 

Learning on the corridor task

Another super-hard scenario (according to the authors of SMAC) is corridor. In this task, we control a team of 6 Zealots against 24 enemy Zerglings. Though Zealots are far more powerful, they are outnumbered; thus, an open fight is not a viable option. However, the fighting arena contains a narrow passage. The authors of SMAC suggest that a winning strategy is to gather the forces in that passage, where the number of enemy units is irrelevant (possibly inspired by the Battle of Thermopylae). Instead, MA-Trace chooses another interesting option, not mentioned by the authors.

corridor.mp4

First, our forces split into two groups. One (the stronger) hides in a corner, where it easily defeats a few enemies, while the other (one or two units) lures the majority of the enemies to the other side and sacrifices itself. After defeating the second group, the enemies move to the far side of the arena and wait, unaware of the hidden group. The strong group then attacks them from behind and defeats the Zerglings one by one. See the recording of an example episode.

This strategy can be seen as an exploit of the built-in AI. It appears to be either easier to learn or more stable to execute than the one proposed by the authors of SMAC. It is discovered consistently across hundreds of experiments.

Observation vs full state

The centralized critic of MA-Trace is used only during training. Therefore, it can utilize any information accessible at that time. A natural choice is the team knowledge - the aggregated observations of all agents. We call this approach MA-Trace (obs). It can be applied to any environment; we consider it the main version of our algorithm.


To facilitate centralized training, SMAC provides access to the full state of the environment. The full state carries additional information when there are enemy units beyond the sight range of every ally. We also experimented with a critic network conditioned on the full state, called MA-Trace (full).
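
As a sketch of the difference between the two critic inputs (assuming SMAC's Python interface with get_obs() and get_state(); the wrapper function itself is purely illustrative):

    import numpy as np

    def critic_input(env, variant="obs"):
        if variant == "obs":
            # MA-Trace (obs): concatenate the per-agent observations the team
            # already has; applicable to any multi-agent environment.
            return np.concatenate(env.get_obs())
        # MA-Trace (full): use the full state exposed by SMAC for centralized
        # training; it may describe enemies outside every ally's sight range.
        return env.get_state()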


Our experiments show that in some tasks using the full state provides a minor advantage. Perhaps surprisingly, in some of the hardest scenarios (corridor, 6h_vs_8z), MA-Trace (full) fails to win any episode. We hypothesize that the benefit of more accurate value estimates is outweighed by the increased variance in policy training.