(* denotes equal contribution)
The dataset can span a wide range of different tasks. As shown in the figure, for each scene we include trajectories of different agent behaviors in the dataset, e.g. grasping the cup or placing the bottle on the cube.
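As a rough illustration of what such a multi-task trajectory dataset might look like, here is a minimal sketch in Python. The class and field names (Trajectory, PriorDataset, transitions) are hypothetical stand-ins, not the released data format.

```python
# Hypothetical organization of the prior-training data: trajectories of
# (image observation, action) pairs collected from many scenes and behaviors.
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple
import numpy as np

@dataclass
class Trajectory:
    observations: List[np.ndarray]  # RGB images, e.g. shape (48, 48, 3)
    actions: List[np.ndarray]       # continuous robot actions

@dataclass
class PriorDataset:
    trajectories: List[Trajectory] = field(default_factory=list)

    def transitions(self) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
        """Yield (observation, action) pairs for training the behavioral prior."""
        for traj in self.trajectories:
            yield from zip(traj.observations, traj.actions)
```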
The behavioral prior learns an invertible mapping from noise samples to useful actions. This mapping is conditioned on the current observation, an RGB image, which we encode with a convolutional neural network.
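A minimal sketch of this idea is shown below: a RealNVP-style conditional coupling flow whose parameters are produced from CNN features of the image. The layer sizes, number of coupling layers, and action dimension are illustrative assumptions, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class ObsEncoder(nn.Module):
    """Encode an RGB observation into a conditioning feature vector."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, image):
        return self.conv(image)

class ConditionalCoupling(nn.Module):
    """Affine coupling: half of z is transformed using the other half and the image features,
    so the layer stays invertible for a fixed observation."""
    def __init__(self, action_dim=4, feat_dim=64, hidden=128):
        super().__init__()
        self.half = action_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (action_dim - self.half)),
        )

    def forward(self, z, feat):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        scale, shift = self.net(torch.cat([z1, feat], dim=-1)).chunk(2, dim=-1)
        a2 = z2 * torch.exp(torch.tanh(scale)) + shift
        return torch.cat([z1, a2], dim=-1)

class BehavioralPrior(nn.Module):
    """Maps noise z ~ N(0, I) to an action, conditioned on the current image."""
    def __init__(self, action_dim=4, n_layers=2):
        super().__init__()
        self.encoder = ObsEncoder()
        self.layers = nn.ModuleList(
            [ConditionalCoupling(action_dim) for _ in range(n_layers)]
        )

    def forward(self, z, image):
        feat = self.encoder(image)
        for layer in self.layers:
            z = layer(z, feat)
            z = z.flip(dims=[-1])  # permute so every dimension gets transformed
        return z
```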
Instead of learning a policy that executes its actions directly in the original MDP, we learn a policy that outputs a latent z, which the behavioral prior takes as input. The prior's output is then executed as an action in the environment.
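The sketch below shows what one step of this latent-space execution loop might look like, assuming the BehavioralPrior module above; the environment API and the policy network are hypothetical stand-ins, not the paper's released code.

```python
import torch

@torch.no_grad()
def run_episode(env, policy, prior, max_steps=50):
    """The policy outputs a latent z; the prior maps (z, observation) to an
    action, which is executed in the original environment."""
    obs = env.reset()
    for _ in range(max_steps):
        image = torch.as_tensor(obs, dtype=torch.float32).permute(2, 0, 1).unsqueeze(0)
        z = policy(image)            # latent action from the RL policy
        action = prior(z, image)     # mapped to a real action by the behavioral prior
        obs, reward, done, info = env.step(action.squeeze(0).numpy())
        if done:
            break
```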
We visualize trajectories from executing a random policy, with and without the behavioral prior. The behavioral prior substantially increases the likelihood of actions that lead to meaningful interactions with an object, while still exploring a diverse set of actions.
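The two exploration schemes being compared can be summarized in a few lines; this is only an illustrative sketch (the action range and dimension are assumptions).

```python
import torch

def random_action(action_dim=4):
    # plain random exploration: sample uniformly in the original action space
    return torch.empty(1, action_dim).uniform_(-1.0, 1.0)

def prior_guided_action(prior, image, action_dim=4):
    # exploration through the prior: sample z ~ N(0, I) and map it to an action
    z = torch.randn(1, action_dim)
    return prior(z, image)
```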
We evaluated our method on eight tasks (shown below). For each task, the positions of all objects in the scene are randomized at the start of every episode. We plot the performance for each task in the following section.
Place Can in Pan
Place Sculpture in Basket
Place Chair on Checkerboard Table
Place Baseball Cap on Block
Pick up Bar
Pick up Sculpture
Pick up Cup
Pick up Baseball Cap
PARROT learns much faster than prior methods on a majority of the tasks, and shows little variance across runs. Note that the curves for methods that failed to make any progress on certain tasks (such as “Place Sculpture in Basket”) overlap one another at a success rate of zero.