We introduce a curriculum learning algorithm, Variational Automatic Curriculum Learning (VACL), for solving challenging goal-conditioned cooperative multi-agent reinforcement learning problems. We motivate our paradigm through a variational perspective, where the learning objective can be decomposed into two terms: task learning on the current task distribution, and curriculum update to a new task distribution. Local optimization over the second term suggests that the curriculum should gradually expand the training tasks from easy to hard. Our VACL algorithm implements this variational paradigm with two practical components, task expansion and entity progression, which produce training curricula over both the task configurations and the number of entities in the task. Experimental results show that VACL solves a collection of sparse-reward problems with a large number of agents. Notably, using a single desktop machine, VACL achieves a 98% coverage rate with 100 agents in the Simple-Spread benchmark and reproduces the ramp-use behavior originally shown in OpenAI’s hide-and-seek project.
Building intelligent agents in complex multi-agent games remains a long-standing challenge in artificial intelligence. Recently, it has become a trend to apply multi-agent reinforcement learning (MARL) to extremely challenging multi-agent games, such as Dota 2, StarCraft II, and Hanabi. Despite these successes, learning intelligent multi-agent policies in general remains a great RL challenge. Multi-agent games allow sophisticated interactions between agents and the environment. Feasible solutions may require non-trivial inter-agent coordination, which leads to substantially more complex strategies than in the single-agent setting. Moreover, as the number of agents increases, the joint action space grows at an exponential rate, which results in an exponentially large policy search space. Thus, most existing MARL applications typically require shaped rewards, assume simplified environment dynamics, or only handle a limited number of agents.
We tackle goal-conditioned cooperative MARL problems with sparse rewards through a novel variational inference perspective. Assuming each task can be parameterized by a continuous representation, we introduce a variational proposal distribution over the task space and then decompose the overall training objective into two separate terms: a policy improvement term and a task proposal update term. By treating the proposal distribution updates as the training curriculum, this variational objective naturally suggests a curriculum learning framework that alternates between curriculum updates and MARL training. In addition, we propose a continuous relaxation technique for discrete variables so that the variational curricula can also be applied over the number of agents and objects in a task, which leads to a generic and unified training paradigm for the cooperative MARL setting.
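As a rough illustration of this decomposition (a sketch in our own notation, not necessarily the exact derivation in the paper), let R^π(φ) denote the expected success of policy π on task φ, p(φ) the target task distribution, and q(φ) the variational proposal. A Jensen-type lower bound then separates task learning on the current curriculum from the curriculum update itself:

```latex
\log \mathbb{E}_{\phi \sim p(\phi)}\!\left[ R^{\pi}(\phi) \right]
\;\geq\;
\underbrace{\mathbb{E}_{\phi \sim q(\phi)}\!\left[ \log R^{\pi}(\phi) \right]}_{\text{task learning on the current curriculum } q}
\;-\;
\underbrace{D_{\mathrm{KL}}\!\bigl( q(\phi) \,\|\, p(\phi) \bigr)}_{\text{curriculum update toward the target } p}
```

Alternately maximizing such a bound over π (with tasks sampled from q) and over q recovers the alternation between MARL training and curriculum update described above.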
We implement our variational training paradigm through a computationally efficient algorithm, Variational Automatic Curriculum Learning (VACL). VACL consists of two components, task expansion and entity progression, which generate a series of effective training tasks in a hierarchical manner. Intuitively, entity progression leverages the inductive bias that tasks with more entities are usually harder in MARL and therefore progressively increases the number of entities in the environment. Task expansion assumes a fixed number of entities and is motivated by Stein variational gradient descent (SVGD) to efficiently expand the task distribution towards the entire task space; a sketch of such an SVGD-style update is given below. More details can be found in our paper.
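The following is a minimal sketch of an SVGD-style repulsive update over a set of task particles, assuming a uniform target distribution over a bounded task space. All function names, radii, and step sizes here are illustrative; this is not the full VACL task-expansion procedure, which additionally maintains the active set of training tasks.

```python
import numpy as np

def rbf_kernel(particles, bandwidth=None):
    """RBF kernel matrix and its gradient w.r.t. the first argument."""
    diff = particles[:, None, :] - particles[None, :, :]   # diff[j, i] = x_j - x_i
    sq_dist = np.sum(diff ** 2, axis=-1)
    if bandwidth is None:                                   # median heuristic
        bandwidth = np.median(sq_dist) / np.log(len(particles) + 1) + 1e-8
    k = np.exp(-sq_dist / bandwidth)
    grad_k = -2.0 / bandwidth * diff * k[:, :, None]        # grad_{x_j} k(x_j, x_i)
    return k, grad_k

def svgd_step(particles, score_fn, step_size=0.05):
    """One SVGD update over task particles.

    With a uniform target over the task space, score_fn returns zeros, so only
    the repulsive kernel-gradient term remains; it pushes particles apart and
    expands the curriculum from an easy region toward the whole task space.
    """
    n = len(particles)
    k, grad_k = rbf_kernel(particles)
    drive = k @ score_fn(particles)     # attraction toward high-density regions of the target
    repulse = grad_k.sum(axis=0)        # mutual repulsion between particles
    return particles + step_size * (drive + repulse) / n

# Toy usage: 2D task parameters starting from an "easy" corner of the unit square.
tasks = np.random.uniform(0.0, 0.1, size=(64, 2))
for _ in range(100):
    tasks = svgd_step(tasks, score_fn=lambda x: np.zeros_like(x))
    tasks = np.clip(tasks, 0.0, 1.0)    # keep particles inside the task space
```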
To intuitively understand the task distribution in the active set, we visualize the 2D projections of the particles in the active set (orange) over the entire task space (blue) throughout the training process. We find that the current task distribution gradually converges to the uniform distribution over the task space as training proceeds.
2D projections of particles
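For reference, one way such a projection could be rendered is sketched below, assuming PCA is used for the 2D projection; the function and variable names are ours, and the paper's figures may be produced differently.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_active_set(active_tasks, task_space_samples):
    """Project task parameters to 2D with PCA and overlay the active set on the task space."""
    all_tasks = np.vstack([task_space_samples, active_tasks])
    centered = all_tasks - all_tasks.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)   # PCA via SVD
    proj = centered @ vt[:2].T                                 # keep the top-2 directions
    n_space = len(task_space_samples)
    plt.scatter(proj[:n_space, 0], proj[:n_space, 1], s=4, c="tab:blue", label="task space")
    plt.scatter(proj[n_space:, 0], proj[n_space:, 1], s=8, c="tab:orange", label="active set")
    plt.legend()
    plt.show()
```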
In Simple-Spread, agents receive a +4 reward when all landmarks are occupied and a -1 reward when any agents collide with each other; the collision penalty is applied at most once per timestep. In Push-Ball, there are n agents, n balls, and n landmarks. Agents receive a shared reward of 2/n per timestep when any ball occupies a landmark, and an extra +1 reward when all landmarks are occupied. The collision penalty is the same as in Simple-Spread.
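As one concrete reading of these reward rules, here is a minimal sketch of the per-timestep shared rewards; the occupy and collision radii and the array layouts are our assumptions, not taken from the environment code.

```python
import numpy as np

def simple_spread_reward(agent_pos, landmark_pos, occupy_radius=0.1, collide_radius=0.15):
    """Shared sparse reward: +4 if every landmark is occupied by some agent,
    -1 if any pair of agents collides (at most one penalty per timestep)."""
    dists = np.linalg.norm(landmark_pos[:, None, :] - agent_pos[None, :, :], axis=-1)
    all_covered = np.all(dists.min(axis=1) < occupy_radius)

    pair = np.linalg.norm(agent_pos[:, None, :] - agent_pos[None, :, :], axis=-1)
    np.fill_diagonal(pair, np.inf)                 # ignore self-distances
    any_collision = np.any(pair < collide_radius)

    return (4.0 if all_covered else 0.0) - (1.0 if any_collision else 0.0)

def push_ball_reward(ball_pos, landmark_pos, n_agents, occupy_radius=0.1):
    """Shared sparse reward: 2/n per timestep when any ball occupies a landmark,
    plus +1 when all landmarks are occupied (collision penalty as above)."""
    dists = np.linalg.norm(landmark_pos[:, None, :] - ball_pos[None, :, :], axis=-1)
    occupied = dists.min(axis=1) < occupy_radius
    reward = (2.0 / n_agents) if occupied.any() else 0.0
    if occupied.all():
        reward += 1.0
    return reward
```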
Simple-Spread (n = 100)
We compare VACL with 5 baselines: (1) multi-agent PPO with uniform task sampling (Uniform); (2) population curriculum only (PC-Unif); (3) reverse curriculum generation (RCG); (4) automatic goal generation (GoalGAN), which uses a GAN to generate training tasks; and (5) adversarially motivated intrinsic goals (AMIGo), which learns a teacher to generate increasingly challenging goals. The results show that VACL outperforms all baselines by a clear margin.
We also consider the massive-agent setting on Simple-Spread. We compare VACL with two existing works: attentional communication (ATOC), which adopts inter-agent communication channels and trains n = 50 and n = 100 agents with dense rewards, and evolutionary population curriculum (EPC), which reports the coverage rate with n = 24 agents using dense rewards and separate policies for each agent. VACL achieves a 98% coverage rate with n = 100 agents, which, to the best of our knowledge, surpasses the best previously reported result. We also visualize the policy on the left.
Ramp-Use is a single-agent scenario for the ramp-use strategy, with 1 ramp, 1 movable seeker, and 1 fixed hider in the quadrant room. We want the seeker to learn how to use the ramp to enter the enclosed room and catch the hider. When the hider is spotted, the seeker gets a reward of +1; otherwise, it gets -1. In Lock-and-Return, agents need to lock all the boxes and return to their birthplaces. Agents get a reward of +0.2 when all boxes are locked and another success reward of +1 when the task is finished, i.e., all boxes are locked and all agents are back at their birthplaces. Both environments are fully observable.
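As one concrete reading of these rules, here is a minimal sketch of the rewards; the visibility test, the home radius, and the array layouts are our assumptions.

```python
import numpy as np

def ramp_use_reward(hider_visible):
    """Seeker reward: +1 per timestep when the hider is spotted, -1 otherwise.
    How visibility is determined (line of sight, view cone) is environment-specific."""
    return 1.0 if hider_visible else -1.0

def lock_and_return_reward(boxes_locked, agent_pos, birth_pos, home_radius=0.2):
    """Shared sparse reward: +0.2 once all boxes are locked, plus a +1 success
    bonus when, in addition, every agent is back near its birthplace."""
    all_locked = bool(np.all(boxes_locked))
    all_home = bool(np.all(np.linalg.norm(agent_pos - birth_pos, axis=-1) < home_radius))
    reward = 0.2 if all_locked else 0.0
    if all_locked and all_home:
        reward += 1.0
    return reward
```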
In the Hide-and-Seek scenarios, VACL achieves over 90% success rates in both games, including reproducing the ramp-use behavior and solving the sparse-reward multi-agent Lock-and-Return challenge, where none of the baselines can reach a success rate above 15% with the same amount of training timesteps. We also visualize the ramp-use behavior.