Parrot: Data-Driven Behavioral Priors for Reinforcement Learning

Avi Singh*, Huihan Liu*, Gaoyue Zhou, Albert Yu, Nick Rhinehart, Sergey Levine

(* denotes equal contribution)

Oral at International Conference on Learning Representations (ICLR), 2021


UC Berkeley


Method

1. Start with a diverse multi-task dataset

This can be a dataset covering a wide range of different tasks. As shown in the figure, for each scene, we include trajectories of different agent behaviors in the dataset, e.g. grasp the cube, or place the bottle on the cube.
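To make the data requirement concrete, here is a minimal sketch of one way such a dataset could be organized. The field names, shapes, and toy random data are illustrative assumptions, not the paper's actual data format; the key point is that the prior only needs (observation, action) pairs, with no rewards or task labels.

```python
import numpy as np

def make_dummy_dataset(num_trajectories=100, horizon=25,
                       image_shape=(48, 48, 3), action_dim=4):
    """Build a toy multi-task dataset: each trajectory stores image
    observations and the actions the demonstrating agent took."""
    dataset = []
    for _ in range(num_trajectories):
        dataset.append({
            "observations": np.random.randint(
                0, 256, size=(horizon, *image_shape), dtype=np.uint8),
            "actions": np.random.uniform(-1, 1, size=(horizon, action_dim)),
        })
    return dataset

# The behavioral prior is trained on flattened (observation, action) pairs.
dataset = make_dummy_dataset()
pairs = [(o, a) for traj in dataset
         for o, a in zip(traj["observations"], traj["actions"])]
```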

2. Train a behavioral prior using the dataset

The behavioral prior learns an invertible mapping that maps noise to useful actions. This mapping is conditioned on the current observation, an RGB image, which we encode with a convolutional neural network.
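As a concrete illustration, the sketch below implements a small observation-conditioned normalizing flow in PyTorch: affine coupling layers (RealNVP-style) whose scale and shift networks see CNN features of the image. The layer sizes, the tiny encoder, and all class and variable names are illustrative assumptions, not the paper's exact architecture; the prior is trained by maximizing log p(action | observation) over the dataset.

```python
import math
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """Affine coupling layer: half the action dims are transformed by scale/shift
    networks that see the other half plus the observation features."""
    def __init__(self, action_dim, cond_dim, hidden=256, flip=False):
        super().__init__()
        self.d = action_dim // 2
        self.flip = flip
        in_dim, out_dim = self.d + cond_dim, action_dim - self.d
        self.scale = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, out_dim), nn.Tanh())
        self.shift = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, out_dim))

    def forward(self, z, cond):
        # Sampling direction: noise z -> action a.
        if self.flip:
            z = z.flip(-1)
        z1, z2 = z[:, :self.d], z[:, self.d:]
        h = torch.cat([z1, cond], dim=-1)
        a2 = z2 * torch.exp(self.scale(h)) + self.shift(h)
        a = torch.cat([z1, a2], dim=-1)
        return a.flip(-1) if self.flip else a

    def inverse(self, a, cond):
        # Training direction: action a -> noise z, plus log |det Jacobian|.
        if self.flip:
            a = a.flip(-1)
        a1, a2 = a[:, :self.d], a[:, self.d:]
        h = torch.cat([a1, cond], dim=-1)
        s = self.scale(h)
        z2 = (a2 - self.shift(h)) * torch.exp(-s)
        z = torch.cat([a1, z2], dim=-1)
        return (z.flip(-1) if self.flip else z), -s.sum(dim=-1)

class BehavioralPrior(nn.Module):
    def __init__(self, action_dim=4, cond_dim=64, num_layers=4):
        super().__init__()
        # Small CNN that turns the RGB observation into a conditioning vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, cond_dim))
        self.layers = nn.ModuleList(
            [ConditionalCoupling(action_dim, cond_dim, flip=(i % 2 == 1))
             for i in range(num_layers)])

    def forward(self, z, image):
        # Map noise to an executable action, conditioned on the observation.
        cond = self.encoder(image)
        for layer in self.layers:
            z = layer(z, cond)
        return z

    def log_prob(self, action, image):
        # Change of variables: log p(a|s) = log N(z; 0, I) + log |det dz/da|.
        cond = self.encoder(image)
        z, log_det = action, torch.zeros(action.shape[0], device=action.device)
        for layer in reversed(self.layers):
            z, ld = layer.inverse(z, cond)
            log_det = log_det + ld
        log_base = (-0.5 * z ** 2 - 0.5 * math.log(2 * math.pi)).sum(dim=-1)
        return log_base + log_det

# Training amounts to minimizing -prior.log_prob(actions, images).mean()
# over minibatches of (observation, action) pairs from the dataset.
```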

3. Use behavioral prior to bootstrap exploration for new tasks

Instead of learning a policy that directly executes its actions in the original MDP, we learn a policy that outputs a latent z, which the behavioral prior takes as input. We then execute the behavioral prior's output action in the environment.
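One way to picture this is as an action-space wrapper around the original environment: the RL policy acts in the latent space of the flow, and the wrapper pushes each latent through the prior to obtain the action that is actually executed. The sketch below is an assumption-laden illustration, not the paper's implementation; it uses the older gym step/reset API and the hypothetical BehavioralPrior from the previous snippet, and any standard RL algorithm could then be trained on the wrapped environment.

```python
import gym
import numpy as np
import torch

class PriorWrappedEnv(gym.Wrapper):
    """Expose the flow's latent space as the action space seen by the RL agent."""
    def __init__(self, env, prior, latent_dim):
        super().__init__(env)
        self.prior = prior
        self.action_space = gym.spaces.Box(
            -np.inf, np.inf, shape=(latent_dim,), dtype=np.float32)
        self._last_obs = None

    def reset(self, **kwargs):
        self._last_obs = self.env.reset(**kwargs)
        return self._last_obs

    def step(self, z):
        # Map the policy's latent action to an environment action via the prior.
        with torch.no_grad():
            image = torch.as_tensor(self._last_obs, dtype=torch.float32)
            image = image.permute(2, 0, 1).unsqueeze(0)   # HWC -> NCHW
            z_t = torch.as_tensor(z, dtype=torch.float32).unsqueeze(0)
            action = self.prior(z_t, image).squeeze(0).numpy()
        obs, reward, done, info = self.env.step(action)
        self._last_obs = obs
        return obs, reward, done, info
```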

Learning with a Behavioral Prior

We visualize trajectories from executing a random policy, with and without the behavioral prior. The behavioral prior substantially increases the likelihood of actions that lead to meaningful interactions with objects, while still exploring a diverse set of actions.
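In code, the two exploration schemes being compared amount to the following (again illustrative, reusing the hypothetical prior from above): without the prior, actions are drawn uniformly from the environment's action space; with the prior, standard Gaussian noise is pushed through the observation-conditioned flow.

```python
import torch

def random_action_without_prior(env):
    # Naive exploration: uniform samples from the raw action space.
    return env.action_space.sample()

def random_action_with_prior(prior, obs_image, latent_dim=4):
    # Prior-shaped exploration: z ~ N(0, I), then a = f(z; s).
    z = torch.randn(1, latent_dim)
    image = torch.as_tensor(obs_image, dtype=torch.float32)
    image = image.permute(2, 0, 1).unsqueeze(0)  # HWC -> NCHW
    with torch.no_grad():
        return prior(z, image).squeeze(0).numpy()
```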

Without Behavioral Prior

With Behavioral Prior

Evaluation Tasks

We evaluated our method on eight tasks (shown below). For each task, the positions of all objects in the scene are randomized at the start of every episode. Performance on each task is plotted in the following section.

Place Can in Pan

Place Sculpture in Basket

Place Chair on Checkerboard Table

Place Baseball Cap on Block

Pick up Bar

Pick up Sculpture

Pick up Cup

Pick up Baseball Cap

Results

PARROT learns much faster than prior methods on a majority of the tasks, and shows little variance across runs. Note that methods that failed to make any progress on certain tasks (such as “Place Sculpture in Basket”) have curves that overlap one another at a success rate of zero.


Video