Efficient sequential Meta-RL without gathering new experience on prior tasks
We develop a new continual meta-learning method to address challenges in sequential multi-task learning. In this setting, the agent's goal is to quickly achieve high reward over any sequence of tasks. Prior meta-reinforcement learning algorithms have demonstrated promising results in accelerating the acquisition of new tasks, but they require access to all tasks during training. Beyond simply transferring past experience to new tasks, our goal is to devise continual reinforcement learning algorithms that learn to learn, using their experience on previous tasks to learn new tasks more quickly. We introduce a new method, continual meta-policy search (CoMPS), that removes this limitation by meta-training incrementally, over each task in a sequence, without revisiting prior tasks. CoMPS repeatedly alternates between two subroutines: learning a new task using RL, and using the experience from RL to perform entirely offline meta-learning in preparation for subsequent tasks. We find that CoMPS outperforms prior continual learning and off-policy meta-reinforcement learning methods on several sequences of challenging continuous control tasks.
Real robots can’t revisit old tasks to collect on-policy data
How can we train meta-RL without revisiting prior tasks?
We need an off-policy, MAML-based meta-learning algorithm
Method: Continual Meta-Policy Search (CoMPS)
We construct a continual meta-learning method for efficient sequential multi-task learning
The algorithm uses reinforcement learning to interact with the world and collect experience on a single task.
The reinforcement learning policy starts from an initial set of policy parameters.
The agent trains these parameters on task k. During reinforcement learning training, two collections of experience are kept: all of the experience, and the best experience, i.e., the trajectories that achieved the highest average reward.
This data is stored for meta-training.
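To make the bookkeeping concrete, here is a minimal Python sketch of one way to maintain the two experience collections for a single task. The Trajectory and TaskExperience containers and the top_k cutoff are our own illustrative assumptions, not details taken from CoMPS.

```python
# Sketch (our assumptions, not the paper's exact data structures) of keeping
# both "all experience" and "best experience" while training on one task.
from dataclasses import dataclass, field
from typing import List
import numpy as np


@dataclass
class Trajectory:
    observations: np.ndarray  # [T, obs_dim]
    actions: np.ndarray       # [T, act_dim]
    rewards: np.ndarray       # [T]

    @property
    def total_reward(self) -> float:
        return float(self.rewards.sum())


@dataclass
class TaskExperience:
    """All trajectories collected on one task, plus access to the best ones."""
    all_trajectories: List[Trajectory] = field(default_factory=list)
    top_k: int = 10  # assumed cutoff for "best experience"

    def add(self, traj: Trajectory) -> None:
        self.all_trajectories.append(traj)

    def best_trajectories(self) -> List[Trajectory]:
        # "Best experience": the trajectories that achieved the highest reward.
        ranked = sorted(self.all_trajectories,
                        key=lambda t: t.total_reward, reverse=True)
        return ranked[: self.top_k]
```

The best_trajectories() set is what the offline meta self-imitation phase imitates, while the full set of trajectories can be used for the adaptation step described next.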
Our offline meta self-imitation method trains a policy that can quickly match the performance of the best experience from prior tasks.
The policy parameters produced by this meta self-imitation process are used to initialize the RL policy for the next task.
This initialization lets the agent collect good experience on the new task more quickly. The process then repeats for each new task, as sketched below.
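As a rough illustration of how the two phases fit together, the sketch below alternates RL on the current task with a first-order, MAML-style offline meta self-imitation step over all stored experience. The function names (rl_train, collect_experience, adaptation_loss, imitation_loss), the first-order approximation, and the hyperparameters are simplifying assumptions of ours and do not reproduce the exact CoMPS update.

```python
# Hedged sketch of a CoMPS-style loop: RL on the current task, then offline
# meta self-imitation on stored data only (no prior task is revisited).
import copy
import torch
import torch.nn as nn


def meta_self_imitation(policy: nn.Module, experience,
                        adaptation_loss, imitation_loss,
                        n_meta_steps=100, inner_lr=0.1, outer_lr=1e-3):
    """First-order MAML-style offline meta self-imitation (illustrative).

    experience: list of (all_data, best_data) pairs, one per completed task.
    adaptation_loss(policy, all_data) -> scalar inner-loop loss (assumed).
    imitation_loss(policy, best_data) -> scalar outer-loop loss (assumed).
    """
    meta_opt = torch.optim.Adam(policy.parameters(), lr=outer_lr)
    for _ in range(n_meta_steps):
        meta_opt.zero_grad()
        for all_data, best_data in experience:
            # Inner loop: adapt a copy of the policy to this task's data.
            adapted = copy.deepcopy(policy)
            inner = adaptation_loss(adapted, all_data)
            grads = torch.autograd.grad(inner, list(adapted.parameters()))
            with torch.no_grad():
                for p, g in zip(adapted.parameters(), grads):
                    p -= inner_lr * g
            # Outer loop: the adapted policy should imitate this task's best
            # experience. First-order approximation: accumulate the adapted
            # policy's gradients directly onto the meta-parameters.
            outer = imitation_loss(adapted, best_data)
            task_grads = torch.autograd.grad(outer, list(adapted.parameters()))
            with torch.no_grad():
                for p, g in zip(policy.parameters(), task_grads):
                    p.grad = g if p.grad is None else p.grad + g
        meta_opt.step()
    return policy


def comps(task_sequence, policy, rl_train, collect_experience,
          adaptation_loss, imitation_loss):
    """Alternate RL on each new task with offline meta self-imitation."""
    experience = []  # (all_data, best_data) per completed task
    for task in task_sequence:
        # RL phase: learn the current task from the meta-learned initialization.
        policy = rl_train(policy, task)
        experience.append(collect_experience(policy, task))
        # Offline meta-learning phase: uses only stored data from past tasks.
        policy = meta_self_imitation(policy, experience,
                                     adaptation_loss, imitation_loss)
    return policy
```

The property this sketch mirrors is that the meta-learning phase consumes only stored experience, so no prior task ever needs to be revisited to collect new data.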
Can CoMPS train without revisiting prior tasks?
How sample efficient is CoMPS?
Does learning efficiency increase as more tasks are solved?
Sequential task learning on Ant Goal
Ant agents trained with CoMPS can successfully reach designated goals on the plane
Agents trained with CoMPS can successfully track target velocities.
Agents trained with CoMPS can successfully perform complex Meta-World tasks
We experimented with both stationary task distributions and non-stationary task distributions.
Stationary task distributions
Tasks are sampled from a uniform distribution without replacement. CoMPS solves the tasks in the fewest episodes. PNC and PPO struggle in this problem setting, where an algorithm that can transfer experience forward to new tasks is important.
Non-stationary task distributions
Task sequences are structured to be diverse, with each new task differing from past tasks. CoMPS solves the tasks with the largest accumulated returns. The prior meta-learning methods MAML, PEARL, and GMPS + PPO all struggle in this problem setting, as they cannot make adequate use of prior experience for meta-training.
Introduced the sequential multi-task setting for reinforcement learning
This setting is more realistic for agents that can only interact with one task at a time.
A new Meta-RL algorithm that outperforms prior methods
Performance improves as more tasks are solved
The code to reproduce the experiments can be downloaded here