Evolutionary Population Curriculum For Scaling Multi-Agent Reinforcement Learning

Qian Long∗, Zihan Zhou∗, Abhinav Gupta, Fei Fang, Yi Wu†, Xiaolong Wang† 

The source code is released at https://github.com/qian18long/epciclr2020

Abstract

In multi-agent games, the complexity of the environment can grow exponentially as the number of agents increases, so it is particularly challenging to learn good policies when the agent population is large. In this paper, we introduce Evolutionary Population Curriculum (EPC), a curriculum learning paradigm that scales up Multi-Agent Reinforcement Learning (MARL) by progressively increasing the population of training agents in a stage-wise manner. Furthermore, EPC uses an evolutionary approach to fix an objective misalignment issue throughout the curriculum: agents successfully trained in an early stage with a small population are not necessarily the best candidates for adapting to later stages with scaled populations. Concretely, EPC maintains multiple sets of agents in each stage, performs mix-and-match and fine-tuning over these sets and promotes the sets of agents with the best adaptability to the next stage. We implement EPC on a popular MARL algorithm, MADDPG, and empirically show that our approach consistently outperforms baselines by a large margin as the number of agents grows exponentially.
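The stage-wise evolutionary loop can be summarized as a short sketch. The snippet below is illustrative only; the callables passed in (`train`, `evaluate`, `mix`) stand in for MADDPG fine-tuning, fitness evaluation, and the mix-and-match population scaling, and are assumptions rather than the repository's actual API.

```python
def epc(initial_sets, num_stages, k_promote, train, evaluate, mix):
    """Sketch of a stage-wise evolutionary population curriculum.

    initial_sets: candidate agent sets trained with a small population.
    Each stage mix-and-matches candidate sets into a larger population,
    fine-tunes the combinations, and promotes the k best-adapting sets.
    """
    candidates = [train(s) for s in initial_sets]            # warm-up stage
    for _ in range(1, num_stages):
        scored = []
        for a in candidates:
            for b in candidates:
                scaled = mix(a, b)                           # e.g. two 3-sheep sets -> one 6-sheep set
                scaled = train(scaled)                       # fine-tune at the new scale
                scored.append((evaluate(scaled), scaled))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        candidates = [s for _, s in scored[:k_promote]]      # promote the best adapters
    return candidates
```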



Paper link: https://openreview.net/pdf?id=SJxbHkrKDH 

Results

Grassland Game

In this game we have Ω = 2 roles of agents (sheep and wolf). A wolf is rewarded when it collides with (eats) a sheep; the eaten sheep receives a negative reward and becomes inactive (dead). A sheep is rewarded when it reaches a grass pellet, at which point the pellet is collected and respawned at another random position. Note that in this survival game, each individual agent has its own reward and does not share rewards with others.
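The per-agent reward structure can be sketched as follows. The reward magnitudes and the `collide` helper are illustrative assumptions, not the values or interfaces of the released environment.

```python
def grassland_step_rewards(wolves, sheep, grass, collide):
    """Return per-agent rewards for one timestep (no reward sharing)."""
    rewards = {}
    for w in wolves:
        # A wolf is rewarded for each active sheep it eats this step.
        rewards[w] = sum(1.0 for s in sheep if s.alive and collide(w, s))
    for s in sheep:
        r = 0.0
        if s.alive:
            if any(collide(w, s) for w in wolves):
                r -= 1.0                                   # eaten: penalized and deactivated
                s.alive = False
            r += sum(1.0 for g in grass if collide(s, g))  # grass pellets collected
        rewards[s] = r
    return rewards
```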

Scale 3-2

MADDPG model:

EPC model:

Scale 6-4

MADDPG model:

EPC model:

Scale 12-8

MADDPG model:

EPC model:

Scale 24-16

MADDPG model:

EPC model:

Adversarial Battle Game

This scenario consists of L units of resources, shown as green landmarks, and two teams of agents competing for them (Ω = 2, one role per team). Both teams have the same number of agents (N1 = N2). When an agent collects a unit of resource, the resource is respawned and every agent on that team receives a positive reward. Furthermore, if more than two agents from team 1 collide with a single agent from team 2, the whole of team 1 is rewarded, while the trapped agent from team 2 is deactivated (dead) and the whole of team 2 is penalized, and vice versa.
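A rough sketch of the team-level rewards is given below. The reward magnitudes, the `collide` and `respawn` helpers, and the trapping threshold are assumptions for exposition only and may differ from the released environment.

```python
def battle_step_rewards(team1, team2, resources, collide, respawn, min_attackers=3):
    """Return one scalar reward per team for a single timestep."""
    rewards = {1: 0.0, 2: 0.0}
    teams = {1: team1, 2: team2}
    # Resource collection: the collector's whole team is rewarded, resource respawns.
    for res in list(resources):
        for tid, team in teams.items():
            if any(collide(a, res) for a in team if a.alive):
                rewards[tid] += 1.0
                respawn(res)
                break
    # Trapping: enough opponents colliding with one agent deactivate it,
    # rewarding the attackers' team and penalizing the victim's team.
    for tid, other in ((1, 2), (2, 1)):
        for victim in teams[other]:
            attackers = sum(collide(a, victim) for a in teams[tid] if a.alive)
            if victim.alive and attackers >= min_attackers:
                victim.alive = False
                rewards[tid] += 1.0
                rewards[other] -= 1.0
    return rewards[1], rewards[2]
```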

Scale 4-4

MADDPG model:

EPC model:

Scale 8-8

MADDPG model:

EPC model:

Scale 16-16

MADDPG model:

EPC model:

Food Collection Game

This game has N food locations and N fully cooperative agents (Ω = 1). The agents need to collaboratively occupy as many food locations as possible within the game horizon. Whenever a food location is occupied by any agent, the whole team receives a reward of 6/N at that timestep for that location. The more locations occupied, the more reward the team collects.
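The shared reward follows directly from the description above. In this minimal sketch, the `collide` occupancy test is a hypothetical helper, not the environment's actual API.

```python
def food_collection_reward(agents, food_locations, collide):
    """Shared team reward for one timestep: 6/N per occupied food location."""
    n = len(food_locations)                      # N food locations and N agents
    occupied = sum(
        1 for f in food_locations
        if any(collide(agent, f) for agent in agents)
    )
    return occupied * 6.0 / n
```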

Scale 3

MADDPG model:

EPC model:

Scale 6

MADDPG model:

EPC model:

Scale 12

MADDPG model:

EPC model:

Scale 24

MADDPG model:

EPC model: