In this paper, we provide our insight into the following question: can an elaborate combination of principles from automatic curriculum learning and hierarchical learning enable complex cooperation with sparse rewards in MARL?
Recent advances in multi-agent reinforcement learning (MARL) allow agents to coordinate their behaviors in complex environments. However, common MARL algorithms still suffer from scalability and sparse reward issues. One promising approach to resolving them is automatic curriculum learning (ACL). ACL involves a student (curriculum learner) training on tasks of increasing difficulty controlled by a teacher (curriculum generator). Despite its success, ACL's applicability is limited by
(1) the lack of a general student framework for dealing with the varying number of agents across tasks and the sparse reward problem, and
(2) the non-stationarity of the teacher's task due to ever-changing student strategies.
To address these limitations, we introduce a novel automatic curriculum learning framework, Skilled Population Curriculum (SPC), which adapts curriculum learning to multi-agent coordination. Specifically, we endow the student with population-invariant communication and a hierarchical skill set, allowing it to learn cooperation and behavior skills from distinct tasks with varying numbers of agents. In addition, we model the teacher as a contextual bandit conditioned on student policies, enabling a team of agents to change its size while still retaining previously acquired skills. We also analyze the inherent non-stationarity of this multi-agent automatic curriculum teaching problem and provide a corresponding regret bound. Empirical results show that our method improves performance, scalability, and sample efficiency in several MARL environments.
SPC Framework
The contextual bandit teacher uses an RNN-based imitation model to represent student policies and generate the bandit’s context.
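As an illustration of how such a teacher could be wired up, here is a minimal PyTorch sketch; BanditTeacher and its heads are hypothetical names, not the released implementation. An RNN is trained to imitate the student's actions from its trajectories, its final hidden state serves as the bandit's context, and a value head scores each candidate task (arm):

```python
import torch
import torch.nn as nn

class BanditTeacher(nn.Module):
    """Illustrative contextual-bandit teacher (hypothetical names)."""

    def __init__(self, obs_dim, act_dim, n_tasks, hidden=64):
        super().__init__()
        # GRU trained to imitate the student from (obs, action) trajectories
        self.encoder = nn.GRU(obs_dim + act_dim, hidden, batch_first=True)
        self.imitation_head = nn.Linear(hidden, act_dim)  # predicts student actions
        self.arm_values = nn.Linear(hidden, n_tasks)      # value estimate per task (arm)

    def forward(self, traj):
        # traj: (batch, T, obs_dim + act_dim) student rollouts
        out, h = self.encoder(traj)
        context = h.squeeze(0)  # final hidden state summarizes the student policy
        return self.imitation_head(out), self.arm_values(context)

    def select_task(self, traj, epsilon=0.1):
        # epsilon-greedy choice over the per-task value estimates
        _, values = self.forward(traj)
        if torch.rand(()).item() < epsilon:
            return torch.randint(values.shape[-1], (values.shape[0],))
        return values.argmax(dim=-1)
```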
Population-invariant communication is implemented in the student module to handle the varying number of agents across tasks. By treating each agent's message as a word and using a self-attention communication channel, SPC allows an arbitrary number of agents to share messages.
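Below is a minimal sketch of such a channel, assuming PyTorch; the class name CommChannel is ours for illustration. Because self-attention operates over a set of tokens, the same weights process teams of any size:

```python
import torch
import torch.nn as nn

class CommChannel(nn.Module):
    """Sketch of a population-invariant communication channel.

    Each agent's outgoing message is treated as a token; self-attention
    aggregates all messages, so the module is agnostic to team size.
    """

    def __init__(self, msg_dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(msg_dim, n_heads, batch_first=True)

    def forward(self, messages):
        # messages: (batch, n_agents, msg_dim); n_agents may vary per task
        aggregated, _ = self.attn(messages, messages, messages)
        return aggregated  # one incoming message per agent, any team size

# The same module handles 3-agent and 5-agent teams without retraining:
channel = CommChannel(msg_dim=32)
out3 = channel(torch.randn(1, 3, 32))  # shape (1, 3, 32)
out5 = channel(torch.randn(1, 5, 32))  # shape (1, 5, 32)
```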
A hierarchical skill framework is used in the student module to learn transferable skills in the sparse-reward setting: agents communicate at the high level while sharing a set of low-level skill policies.
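The following sketch illustrates this high/low-level split under strong simplifying assumptions (linear policies, greedy action selection, a fixed skill-commitment interval k); SkilledAgent and its members are illustrative names, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SkilledAgent(nn.Module):
    """Sketch of a hierarchical student (illustrative names).

    A high-level policy picks one of `n_skills` shared low-level policies
    every `k` environment steps; the chosen skill then emits primitive
    actions until the next high-level decision.
    """

    def __init__(self, obs_dim, act_dim, n_skills=4, k=10):
        super().__init__()
        self.k = k
        self.high = nn.Linear(obs_dim, n_skills)  # high-level skill selector
        self.skills = nn.ModuleList(
            [nn.Linear(obs_dim, act_dim) for _ in range(n_skills)]
        )  # low-level policies shared across tasks and team sizes
        self.t, self.skill = 0, 0

    def act(self, obs):
        if self.t % self.k == 0:  # re-select the skill every k steps
            self.skill = self.high(obs).argmax(dim=-1).item()
        self.t += 1
        return self.skills[self.skill](obs).argmax(dim=-1)  # greedy action
```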
Experiments
In the experiments, we aim to investigate the following research questions:
Q1: Is curriculum learning necessary in complex large-scale MARL problems?
Q2: Can SPC outperform previous curriculum-based MARL methods? If so, which components of SPC contribute the most to performance gains?
Q3: Can SPC effectively learn a curriculum for the student?
We provide a clean implementation of SPC and all the baselines based on Ray RLlib.
We will release a link to our GitHub repository after the accept/reject decisions are communicated.
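For orientation, here is a minimal sketch of how a shared-policy baseline could be trained with RLlib's multi-agent API (RLlib 2.x style); the environment id grf_5v5 is hypothetical, and this is not our released configuration:

```python
# Hedged sketch: PPO with a single policy shared by all agents.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("grf_5v5")  # hypothetical env id, registered elsewhere
    .multi_agent(
        policies={"shared"},  # all teammates map to one shared policy
        policy_mapping_fn=lambda agent_id, *args, **kwargs: "shared",
    )
)
algo = config.build()
result = algo.train()  # one training iteration; inspect result for metrics
```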
In the Google Research Football (GRF) 5v5 scenario, SPC achieves a win rate of about 80% with a goal difference of +3. We show one of the game replays and the learning curves here.