Guided Meta-Policy Search

Russell Mendonca, Abhishek Gupta, Rosen Kralev, Pieter Abbeel, Sergey Levine, Chelsea Finn

Abstract: Reinforcement learning (RL) algorithms have demonstrated promising results on complex tasks, yet often require impractical numbers of samples because they learn from scratch. Meta-RL aims to address this challenge by leveraging experience from previous tasks in order to more quickly solve new tasks. However, in practice, these algorithms generally also require impractical amounts of on-policy experience during the meta-training process. Unlike these algorithms, we propose to learn a reinforcement learning procedure through imitation of expert policies that solve previously-seen tasks. This involves a nested optimization, with RL in the inner loop and supervised imitation learning in the outer loop. Because the outer loop imitation learning can be done with off-policy data, we can achieve significant gains in meta-learning sample efficiency. In this paper, we show how this general idea can be used both for meta-reinforcement learning and for learning fast RL procedures from multi-task demonstration data. The former results in an approach that can leverage policies learned for previous tasks without significant amounts of on-policy data during meta-training, whereas the latter is particularly useful in cases where demonstrations are easy for a person to provide. Across a number of continuous control meta-RL problems, we demonstrate significant improvements in meta-RL sample efficiency in comparison to prior work.
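To make the nested optimization concrete, below is a minimal, self-contained sketch of the idea in PyTorch: an RL (REINFORCE-style) adaptation step in the inner loop, and supervised imitation of per-task experts in the outer loop. This is not the paper's implementation; the toy point-mass reaching tasks, the synthetic "experts", the single-trajectory policy-gradient surrogate, and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of a GMPS-style nested optimization (not the authors' code).
# Inner loop: RL adaptation of the meta-parameters on each task.
# Outer loop: supervised imitation of a per-task expert, evaluated at the adapted parameters.
import torch

torch.manual_seed(0)
STATE_DIM, ACTION_DIM, HORIZON, INNER_LR = 2, 2, 10, 0.1

# Three toy "tasks": reach a different 2-D goal starting from the origin.
tasks = [torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0]), torch.tensor([-1.0, 0.0])]

# Meta-policy parameters (a small MLP kept as raw tensors so the outer loss
# can be differentiated through the inner gradient step).
params = [torch.randn(32, STATE_DIM) * 0.1, torch.zeros(32),
          torch.randn(ACTION_DIM, 32) * 0.1, torch.zeros(ACTION_DIM)]
for p in params:
    p.requires_grad_(True)

def policy_mean(p, s):
    return torch.tanh(s @ p[0].t() + p[1]) @ p[2].t() + p[3]

def inner_rl_loss(p, goal):
    """Crude single-trajectory REINFORCE surrogate on the toy reaching task."""
    s, log_probs, ret = torch.zeros(STATE_DIM), [], 0.0
    for _ in range(HORIZON):
        dist = torch.distributions.Normal(policy_mean(p, s), 0.1)
        a = dist.sample()                      # exploration in the inner loop
        log_probs.append(dist.log_prob(a).sum())
        s = s + 0.1 * a                        # toy point-mass dynamics
        ret += -torch.norm(s - goal).item()    # reward: negative goal distance
    return -ret * torch.stack(log_probs).sum()

def expert_batch(goal, n=32):
    """Synthetic per-task 'expert' that acts proportionally toward the goal
    (stands in for the expert policies assumed to be available)."""
    states = torch.randn(n, STATE_DIM)
    return states, 0.5 * (goal - states)

outer_opt = torch.optim.Adam(params, lr=1e-3)
for it in range(500):
    outer_loss = 0.0
    for goal in tasks:
        # Inner loop: one RL adaptation step starting from the meta-parameters.
        grads = torch.autograd.grad(inner_rl_loss(params, goal), params, create_graph=True)
        adapted = [p - INNER_LR * g for p, g in zip(params, grads)]
        # Outer loop: supervised imitation of the task expert at the adapted parameters.
        states, expert_actions = expert_batch(goal)
        outer_loss = outer_loss + ((policy_mean(adapted, states) - expert_actions) ** 2).mean()
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()
```

Because the outer update needs only expert state-action pairs, it can be computed from stored, off-policy demonstration data, which is the source of the meta-training sample-efficiency gain described in the abstract.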

Code: https://github.com/RussellM2020/GMPS


Environments

For each of the environments, the goal locations are not provided in the observations at either train or test time and must be inferred from the reward signal; they appear in the videos only for visualization purposes.

Sawyer Pushing

Each task consists of pushing the block to a goal position sampled from the green region shown above (the block has a side length of 2 cm, and the goal region is 20 cm x 10 cm). The initial position of the block is fixed across tasks.

Sawyer Door Opening

Each task consists of opening the door to a different target angle, sampled from the range 0 to 60 degrees. The position of the door is fixed across tasks.

Ant Locomotion

Each task consists of the ant walking to a point sampled from a quadrant (the points on the quadrant are at a distance of 2 m from the origin, and the quadrant is delimited by the three points shown above). The initial position of the ant is fixed across tasks.
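For concreteness, the task distributions described above could be sampled roughly as follows. This is a hypothetical sketch: the coordinate frames, region origins, and the exact parameterization used in the released environments are assumptions.

```python
# Hypothetical sampling of the three task distributions (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def sample_push_goal():
    # Block goal position inside the 20 cm x 10 cm region
    # (meters, offsets relative to an assumed region origin).
    return rng.uniform(low=[0.0, 0.0], high=[0.20, 0.10])

def sample_door_angle():
    # Target door angle between 0 and 60 degrees, returned in radians.
    return np.deg2rad(rng.uniform(0.0, 60.0))

def sample_ant_goal():
    # Goal point on a quadrant of a circle of radius 2 m around the origin.
    theta = rng.uniform(0.0, np.pi / 2)
    return 2.0 * np.array([np.cos(theta), np.sin(theta)])

train_tasks = {
    "push": [sample_push_goal() for _ in range(20)],
    "door": [sample_door_angle() for _ in range(20)],
    "ant":  [sample_ant_goal() for _ in range(20)],
}
```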

Sawyer Pushing from Image Observations

The reward is provided through the distance between the pusher and the block plus the distance between the block and the goal (a hypothetical sketch of this reward is given below).

The goal position has been added for visualization purposes only and is not provided to the policy.
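A hypothetical version of this shaped reward, matching the description above (the exact scaling and any additional terms in the released environments are assumptions):

```python
# Hypothetical shaped reward for the pushing tasks (illustrative only).
import numpy as np

def pushing_reward(pusher_pos, block_pos, goal_pos):
    reach_dist = np.linalg.norm(pusher_pos - block_pos)  # pusher-to-block distance
    push_dist = np.linalg.norm(block_pos - goal_pos)     # block-to-goal distance
    return -(reach_dist + push_dist)
```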

GMPS: Performance on New Tasks (<5 adaptation steps)

[Videos: Validation Tasks 0-3]

MAML: Performance on New Tasks (5 adaptation steps) [14x more samples used for meta-training]

[Videos: Validation Tasks 0-3]

Sawyer Door Opening

The reward is provided only through the difference between the target angle and the current angle of the door (a hypothetical sketch is given below).

The goal angle has been added for visualization purposes only and is not provided to the policy.
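A hypothetical version of this reward, following the description above (the sign and scaling are assumptions):

```python
# Hypothetical reward for the door-opening tasks (illustrative only).
import numpy as np

def door_reward(current_angle, target_angle):
    # Penalize the absolute difference between current and target door angles.
    return -np.abs(target_angle - current_angle)
```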

GMPS: Performance on New Tasks (<5 adaptation steps)

[Videos: Validation Tasks 0-3]

MAML, MAESN: Performance on New Tasks (reward signal never detected during training)

[Videos: MAML Validation Tasks 0-1, MAESN Validation Tasks 2-3]