Efficient sequential Meta-RL without gathering new experience on prior tasks
We develop a new continual meta-learning method to address challenges in sequential multi-task learning. In this setting, the agent's goal is to quickly achieve high reward over any sequence of tasks. Prior meta-reinforcement learning algorithms have demonstrated promising results in accelerating the acquisition of new tasks, but they require access to all tasks during training. Beyond simply transferring past experience to new tasks, our goal is to devise continual reinforcement learning algorithms that learn to learn, using their experience on previous tasks to learn new tasks more quickly. We introduce a new method, continual meta-policy search (CoMPS), that removes this limitation by meta-training incrementally, over each task in a sequence, without revisiting prior tasks. CoMPS repeatedly alternates between two subroutines: learning a new task using RL, and using the experience from RL to perform entirely offline meta-learning in preparation for subsequent tasks. We find that CoMPS outperforms prior continual learning and off-policy meta-reinforcement learning methods on several sequences of challenging continuous control tasks.
Real robots can’t revisit old tasks to collect on-policy data
How can we train meta-RL without revisiting prior tasks?
We need an off-policy, MAML-based meta-learning algorithm
Method: Continual Meta-Policy Search (CoMPS)
We construct a continual meta-learning method for efficient sequential multi-task learning
The algorithm uses reinforcement learning to interact with the world and collect experience on a single task.
The reinforcement learning policy starts from an initial set of policy parameters.
The agent trains these parameters on task k. During reinforcement learning training, two collections of experience are kept: all of the experience, and the best experience, i.e., the trajectories that achieved the highest average reward.
This data is stored for meta-training.
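To make the bookkeeping concrete, here is a minimal Python sketch of one way to maintain the two experience collections for a single task. The Trajectory and TaskExperience containers and the top_k cutoff are our own illustrative assumptions, not details taken from CoMPS.

```python
# Sketch (our assumptions, not the paper's exact data structures) of keeping
# both "all experience" and "best experience" while training on one task.
from dataclasses import dataclass, field
from typing import List
import numpy as np


@dataclass
class Trajectory:
    observations: np.ndarray  # [T, obs_dim]
    actions: np.ndarray       # [T, act_dim]
    rewards: np.ndarray       # [T]

    @property
    def total_reward(self) -> float:
        return float(self.rewards.sum())


@dataclass
class TaskExperience:
    """All trajectories collected on one task, plus access to the best ones."""
    all_trajectories: List[Trajectory] = field(default_factory=list)
    top_k: int = 10  # assumed cutoff for "best experience"

    def add(self, traj: Trajectory) -> None:
        self.all_trajectories.append(traj)

    def best_trajectories(self) -> List[Trajectory]:
        # "Best experience": the trajectories that achieved the highest reward.
        ranked = sorted(self.all_trajectories,
                        key=lambda t: t.total_reward, reverse=True)
        return ranked[: self.top_k]
```

The best_trajectories() set is what the offline meta self-imitation phase imitates, while the full set of trajectories can be used for the adaptation step described next.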
Our offline meta self-imitation method trains a policy that can quickly match the performance of the best experience from prior tasks.
The policy parameters produced by this meta self-imitation process are used to initialize the RL policy for the next task.
This initialization lets the agent collect good experience on the new task more quickly. The process then repeats for each new task, as sketched below.
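As a rough illustration of how the two phases fit together, the sketch below alternates RL on the current task with a first-order, MAML-style offline meta self-imitation step over all stored experience. The function names (rl_train, collect_experience, adaptation_loss, imitation_loss), the first-order approximation, and the hyperparameters are simplifying assumptions of ours and do not reproduce the exact CoMPS update.

```python
# Hedged sketch of a CoMPS-style loop: RL on the current task, then offline
# meta self-imitation on stored data only (no prior task is revisited).
import copy
import torch
import torch.nn as nn


def meta_self_imitation(policy: nn.Module, experience,
                        adaptation_loss, imitation_loss,
                        n_meta_steps=100, inner_lr=0.1, outer_lr=1e-3):
    """First-order MAML-style offline meta self-imitation (illustrative).

    experience: list of (all_data, best_data) pairs, one per completed task.
    adaptation_loss(policy, all_data) -> scalar inner-loop loss (assumed).
    imitation_loss(policy, best_data) -> scalar outer-loop loss (assumed).
    """
    meta_opt = torch.optim.Adam(policy.parameters(), lr=outer_lr)
    for _ in range(n_meta_steps):
        meta_opt.zero_grad()
        for all_data, best_data in experience:
            # Inner loop: adapt a copy of the policy to this task's data.
            adapted = copy.deepcopy(policy)
            inner = adaptation_loss(adapted, all_data)
            grads = torch.autograd.grad(inner, list(adapted.parameters()))
            with torch.no_grad():
                for p, g in zip(adapted.parameters(), grads):
                    p -= inner_lr * g
            # Outer loop: the adapted policy should imitate this task's best
            # experience. First-order approximation: accumulate the adapted
            # policy's gradients directly onto the meta-parameters.
            outer = imitation_loss(adapted, best_data)
            task_grads = torch.autograd.grad(outer, list(adapted.parameters()))
            with torch.no_grad():
                for p, g in zip(policy.parameters(), task_grads):
                    p.grad = g if p.grad is None else p.grad + g
        meta_opt.step()
    return policy


def comps(task_sequence, policy, rl_train, collect_experience,
          adaptation_loss, imitation_loss):
    """Alternate RL on each new task with offline meta self-imitation."""
    experience = []  # (all_data, best_data) per completed task
    for task in task_sequence:
        # RL phase: learn the current task from the meta-learned initialization.
        policy = rl_train(policy, task)
        experience.append(collect_experience(policy, task))
        # Offline meta-learning phase: uses only stored data from past tasks.
        policy = meta_self_imitation(policy, experience,
                                     adaptation_loss, imitation_loss)
    return policy
```

The property this sketch mirrors is that the meta-learning phase consumes only stored experience, so no prior task ever needs to be revisited to collect new data.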
Can CoMPS train without revisiting prior tasks?
How sample efficient is CoMPS?
Does learning efficiency increase as more tasks are solved?
Sequential task learning on Ant Goal
Ant agents trained with CoMPS can successfully reach designated goals on the plane
Agents trained with CoMPS can successfully track target velocities.
Agents trained with CoMPS can successfully perform complex Meta-World tasks
We experimented with both stationary task distributions and non-stationary task distributions.
Stationary task distributions
Tasks are sampled from a uniform distribution without replacement. CoMPS solves the tasks in the fewest episodes. PNC and PPO struggle in this problem setting, where an algorithm that can transfer experience forward to new tasks is important.
Non-stationary task distributions
Task sequences are structured to be diverse, with each new task differing from past tasks. CoMPS solves the tasks with the largest accumulated returns. The prior meta-learning methods MAML, PEARL, and GMPS + PPO all struggle in this problem setting, as they cannot make adequate use of prior experience for meta-training.
Introduced the sequential multi-task setting for reinforcement learning
This setting is more realistic for agents that can only interact with one task at a time.
A new Meta-RL algorithm that outperforms prior methods
Performance improves as more tasks are solved
The code to reproduce the experiments can be downloaded here