ABSTRACT

The availability of challenging benchmarks has played a key role in the recent progress of machine learning. In cooperative multi-agent reinforcement learning, the StarCraft Multi-Agent Challenge (SMAC) has become a popular testbed for the paradigm of centralised training with decentralised execution. However, after several years of sustained improvement, algorithms now achieve near-perfect performance on this benchmark. In this work, we conduct new analysis demonstrating that SMAC is not sufficiently stochastic to require complex closed-loop policies. In particular, we show that an open-loop policy conditioned only on the timestep and agent ID can achieve non-trivial win rates for many SMAC scenarios. To address this limitation, we introduce SMACv2, a new version of the benchmark in which scenarios are procedurally generated and require agents to generalise to previously unseen settings (from the same distribution) during evaluation. We show that these changes ensure the benchmark requires the use of closed-loop policies. We evaluate several state-of-the-art algorithms on SMACv2 and show that it presents significant challenges not present in the original benchmark. Our analysis illustrates that SMACv2 addresses the discovered deficiencies of SMAC and can help benchmark the next generation of MARL methods.

Stochasticity in SMAC

SMAC has been a popular MARL benchmark for a number of years, but recent work has demonstrated strong results on all of its scenarios, meaning SMAC can no longer distinguish between MARL algorithms due to ceiling effects.

In the plot below (which shows the mean and standard deviation across 3 seeds), we demonstrate that SMAC is insufficiently stochastic by training a policy on SMAC that observes only the timestep and agent ID. Such a policy can only learn a fixed distribution over actions at each timestep; we call this an open-loop policy. The open-loop policy (in blue) is compared with the results of a standard closed-loop policy (in orange). In total, there are only 4 SMAC maps on which the open-loop policy fails to learn a good policy at all.
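To make the setup concrete, below is a minimal sketch of what we mean by an open-loop policy: a network whose only inputs are a one-hot timestep and a one-hot agent ID, so it has no access to any state information. The class and its interface are illustrative assumptions, not the code used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OpenLoopPolicy(nn.Module):
    """Illustrative open-loop policy: conditions only on timestep and agent ID.

    Because the input carries no information about the state, the network can
    only represent a fixed distribution over actions per (timestep, agent) pair.
    """

    def __init__(self, max_timesteps: int, n_agents: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.max_timesteps = max_timesteps
        self.n_agents = n_agents
        self.net = nn.Sequential(
            nn.Linear(max_timesteps + n_agents, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, timestep: torch.Tensor, agent_id: torch.Tensor) -> torch.Tensor:
        # One-hot encode the timestep and agent ID; these are the *only* inputs.
        t = F.one_hot(timestep, self.max_timesteps).float()
        a = F.one_hot(agent_id, self.n_agents).float()
        return self.net(torch.cat([t, a], dim=-1))  # action logits
```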

SMACv2

SMACv2 addresses this lack of stochasticity by adding procedural content generation (PCG) elements to SMAC. Specifically, the start positions are now randomised, as are the unit types on each team. Start positions come in two flavours: reflect, where the ally positions are randomly chosen on the left half of the map and the enemy start positions are a reflection of them, and surround, where the allies start surrounded by enemy units along each of the diagonals. These two types of start position are shown on the right.
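The sketch below conveys the idea behind the two start-position schemes. The coordinates, offsets, and handling of unequal team sizes are assumptions made for illustration; the actual SMACv2 generation code differs in its details.

```python
import numpy as np


def reflect_positions(n_allies, n_enemies, map_width, map_height, rng):
    """Sketch of 'reflect': allies are placed uniformly at random on the left
    half of the map; enemies are reflections of (a sample of) those positions."""
    allies = np.stack([
        rng.uniform(0, map_width / 2, size=n_allies),  # x restricted to the left half
        rng.uniform(0, map_height, size=n_allies),     # y unrestricted
    ], axis=-1)
    mirrored = allies.copy()
    mirrored[:, 0] = map_width - mirrored[:, 0]        # reflect across the vertical centre line
    # Assumption: with unequal team sizes, enemy positions are sampled from the reflections.
    enemies = mirrored[rng.choice(n_allies, size=n_enemies, replace=True)]
    return allies, enemies


def surround_positions(n_allies, n_enemies, map_width, map_height, rng):
    """Sketch of 'surround': allies start near the centre and enemies are split
    across the four diagonals around them."""
    centre = np.array([map_width / 2, map_height / 2])
    allies = centre + rng.uniform(-2.0, 2.0, size=(n_allies, 2))
    diagonals = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
    # Each enemy is assigned to one of the four diagonals at a random distance.
    offsets = diagonals[np.arange(n_enemies) % 4] * rng.uniform(4.0, 8.0, size=(n_enemies, 1))
    enemies = centre + offsets
    return allies, enemies
```

For example, `reflect_positions(5, 5, 32, 32, np.random.default_rng(0))` returns one random ally/enemy layout of the reflect type.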

There are 15 scenarios in SMACv2 -- 5 team-size configurations for each of the 3 races. The configurations are 5_vs_5, 10_vs_10, 10_vs_11, 20_vs_20 and 20_vs_23, where the first number is the number of allies and the second the number of enemies. Below are videos of trained agents in a range of SMACv2 scenarios.
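As a quick reference, the 15 scenarios can be enumerated as follows. The race_N_vs_M naming follows the protoss_5_vs_5 convention used in the results below; the exact identifiers in the benchmark's configuration files may differ.

```python
races = ["protoss", "terran", "zerg"]
matchups = ["5_vs_5", "10_vs_10", "10_vs_11", "20_vs_20", "20_vs_23"]

# 3 races x 5 team-size configurations = 15 scenarios, e.g. "protoss_5_vs_5".
scenarios = [f"{race}_{matchup}" for race in races for matchup in matchups]
assert len(scenarios) == 15
```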

SMACv2 Training Results

The results below (mean and standard deviation across 3 seeds) compare MAPPO, QMIX and an open-loop version of MAPPO trained on all 15 SMACv2 scenarios. Some scenarios are very difficult: the 20_vs_23 scenarios, for example, have very low win rates, whereas others, such as protoss_5_vs_5, have very high win rates. Crucially, the open-loop policy fails to learn on all scenarios, demonstrating that SMACv2 is now sufficiently stochastic to necessitate closed-loop policies.

Extended Partial Observability Challenge

Whilst agents in SMAC do not observe the global state, this partial observability alone is not particularly meaningful. The hidden information must also be uninferrable (and therefore stochastic), relevant to the task, and known to some, but not all, of the agents. To extend SMAC with more meaningful partial observability, we introduce a setting in which enemies are stochastically masked from each agent's observation.

The results below compare the mean test win rate in SMACv2 for QMIX (left) and MAPPO (right) when the probability that an enemy is masked for a given agent (for the remainder of an episode) is 0, 0.5 or 1, with 5 agents against 5 enemy units.
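The following is a minimal sketch of the masking mechanism described above, under the assumption that each (agent, enemy) pair is masked independently at episode start and that masked enemies are simply zeroed out of the corresponding agent's observation; the class and its interface are hypothetical.

```python
import numpy as np


class EpisodeEnemyMask:
    """Hypothetical sketch of per-agent stochastic enemy masking.

    At episode start, each (agent, enemy) pair is masked with probability `p`;
    the mask is then fixed for the remainder of the episode, so a masked enemy
    stays hidden from that agent even when it would otherwise be observable.
    """

    def __init__(self, n_agents: int, n_enemies: int, p: float, seed=None):
        self.n_agents = n_agents
        self.n_enemies = n_enemies
        self.p = p
        self.rng = np.random.default_rng(seed)
        self.mask = None

    def reset(self):
        # True means "hidden"; drawn once per episode and then kept fixed.
        self.mask = self.rng.random((self.n_agents, self.n_enemies)) < self.p

    def apply(self, agent_id: int, enemy_features: np.ndarray) -> np.ndarray:
        # enemy_features: (n_enemies, feature_dim) slice of one agent's observation.
        masked = enemy_features.copy()
        masked[self.mask[agent_id]] = 0.0
        return masked
```

With p = 0 this reduces to the standard SMACv2 observations, while p = 1 hides all enemies from every agent.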