MELD: Meta-Reinforcement Learning from Images via Latent State Models

Tony Z. Zhao*, Anusha Nagabandi*, Kate Rakelly*, Chelsea Finn, Sergey Levine

UC Berkeley, Stanford University

Conference on Robot Learning (CoRL), 2020


Abstract

Meta-reinforcement learning algorithms can enable autonomous agents, such as robots, to quickly acquire new behaviors by leveraging prior experience in a set of related training tasks. However, the onerous data requirements of meta-training compounded with the challenge of learning from sensory inputs such as images have made meta-RL challenging to apply to real robotic systems. Latent state models, which learn compact state representations from a sequence of observations, can accelerate representation learning from visual inputs. In this paper, we leverage the perspective of meta-learning as task inference to show that latent state models can also perform meta-learning given an appropriately defined observation space. Building on this insight, we develop meta-RL with latent dynamics (MELD), an algorithm for meta-RL from images that performs inference in a latent state model to quickly acquire new skills given observations and rewards. MELD outperforms prior meta-RL methods on several simulated image-based robotic control problems, and enables a real WidowX robotic arm to insert an Ethernet cable into new locations given a sparse task completion signal after only 8 hours of real world meta-training. To our knowledge, MELD is the first meta-RL algorithm trained in a real-world robotic control setting from images.

MELDing State and Task Inference

  • Left: Robots often have incomplete knowledge of the full state, as when operating from images or other sensor data x. This problem can be tackled by learning a latent dynamics model to glean state information z from a history of observations and actions.

  • Middle: The meta-RL problem considers a distribution over tasks: the identity of the current task T is an unobserved variable that determines the dynamics and reward functions.

  • Right (MELD): We interpret the task identity as part of the underlying state, allowing us to leverage latent state models for efficient partially observed meta-RL, as sketched below.
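To make this concrete, below is a minimal sketch of the filtering view MELD takes: the per-step input to the latent state model includes the reward, so the inferred latent z can carry task information in addition to state information. The architecture, sizes, and names are illustrative assumptions, not the released implementation.

```python
# Hedged sketch: a filtering posterior q(z_t | z_{t-1}, a_{t-1}, x_t, r_t).
# Because the reward r_t is folded into the observation, the latent z_t can
# capture task identity in addition to the underlying physical state.
import torch
import torch.nn as nn

class LatentFilter(nn.Module):
    def __init__(self, obs_feat_dim, act_dim, z_dim=32, hidden=256):
        super().__init__()
        # Outputs the mean and log-variance of a diagonal Gaussian over z_t.
        self.posterior = nn.Sequential(
            nn.Linear(z_dim + act_dim + obs_feat_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim),
        )

    def step(self, z_prev, a_prev, obs_feat, reward):
        # reward is expected to have shape (..., 1)
        inp = torch.cat([z_prev, a_prev, obs_feat, reward], dim=-1)
        mean, log_var = self.posterior(inp).chunk(2, dim=-1)
        std = (0.5 * log_var).exp()
        z = mean + std * torch.randn_like(std)  # reparameterized sample
        return z, mean, std
```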

Meta-Learning from Images in the Real World

MELD enables the WidowX robot to insert the Ethernet cable into the correct port with the router in a novel location and orientation within 2 trials of experience.

Only pixel observations are given to the robot, and we directly control the joint velocities of the 5-DoF robot arm. We assume dense reward during meta-training and only a sparse task-completion reward during meta-testing. We obtain the final policy after 80k samples, which amounts to 8 hours in the real world.
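As an illustration of the dense/sparse distinction, here is a rough sketch; the exact reward functions and the completion threshold used on the robot are assumptions, not the released code.

```python
# Hedged sketch of the two reward regimes: a shaped distance reward during
# meta-training versus a binary completion signal at meta-test time.
import numpy as np

def dense_insertion_reward(plug_pose, port_pose):
    # Dense shaping assumed available only during meta-training.
    return -np.linalg.norm(np.asarray(plug_pose) - np.asarray(port_pose))

def sparse_insertion_reward(plug_pose, port_pose, threshold=0.01):
    # Sparse task-completion signal at meta-test time; the threshold is illustrative.
    return float(np.linalg.norm(np.asarray(plug_pose) - np.asarray(port_pose)) < threshold)
```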

Temporally-Extended Exploration

In this environment, only pixel observations are available, and the policy directly controls the joints of the 7-DoF Sawyer robot. The robot must touch the correct button on the control panel (illustrated with a white dot, for visualization purposes only) to obtain a sparse reward signal. During meta-training, we train on tasks with different control panel positions as well as different goal buttons, and we assume a dense reward is available as a prediction target to accelerate training. At meta-test time, only the sparse reward is available to the robot, and it is evaluated on tasks unseen during training.

To succeed at identifying the correct button to push at test time, the robot must acquire efficient exploration strategies during meta-training. As shown on the right, in the first episode, the robot explores each button until it receives reward at the blue button. In the second episode (bottom), the robot returns immediately to the correct button.

On the right, we plot the mean and variance of the model's reward reconstruction over time as the robot explores. In the first episode, both the reconstruction error and the variance are high until the robot finds the correct button. In the second episode, the robot remembers the task and predicts the reward with near-zero error and variance.
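The plotted quantities can be read off a learned reward decoder; a minimal sketch is given below, where the Gaussian reward decoder p(r_t | z_t) and its interface are assumptions rather than the released model.

```python
# Hedged sketch: per-step reward prediction error and predictive variance
# from an assumed Gaussian reward decoder p(r_t | z_t).
def reward_prediction_stats(reward_decoder, z_t, true_reward):
    pred_mean, pred_log_var = reward_decoder(z_t)    # assumed decoder outputs
    error = (pred_mean - true_reward).abs()          # reconstruction error
    variance = pred_log_var.exp()                    # predictive variance
    return error, variance
```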

Simulation Results

We evaluate MELD on 4 problems involving locomotion and manipulation. In the first problem, we control a 6-DoF cheetah robot, while in the remaining problems we control the 7-DoF Sawyer robot. In these problems, reward is given at each time step and corresponds to the error between the current robot state and the desired state. For each problem, we define a "task success metric" that corresponds to qualitatively solving the task, and we plot this success threshold as a dashed black line in each plot.

We compare to the following baselines:

  • PEARL (Rakelly et al. 2019) - meta-RL approach that models a latent task variable but does not perform latent state estimation. For fair comparison, we augment the PEARL encoder network with the same convolutional encoder used in MELD.

  • RL2 (Wang et al. 2016, Duan et al. 2016) - meta-RL approach that models the policy as a recurrent neural network. For fair comparison, we use the same convolutional encoder used in MELD.

  • SLAC (Lee et al. 2020) - a latent state model method that estimates a latent state from a window of image observations, but does not perform meta-learning.

Cheetah

Running at different (unknown) target velocities

Reward: error between current robot velocity and target velocity

Reacher

Reaching to different (unknown, shown for visualization only) goal positions

Reward: distance from end-effector to goal

Peg Insertion

Inserting a peg into a goal box (unknown, shown for visualization only)

Reward: distance from end-effector to insertion point

Shelf Placing

Placing a mug onto the correct location on the shelf. The weight of the mug and the target location change with each task. This problem illustrates both reward function and dynamics changing across tasks.

Reward: distance from end-effector to placing location
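The dense rewards in these four problems are all distance-based; a rough sketch is below, where the exact shaping terms, weights, and units used in the benchmark tasks are assumptions.

```python
# Hedged sketch of the distance-based dense rewards described above.
import numpy as np

def velocity_tracking_reward(current_velocity, target_velocity):
    # Cheetah: penalize the error between the current and (unobserved) target velocity.
    return -abs(current_velocity - target_velocity)

def goal_distance_reward(end_effector_pos, goal_pos):
    # Reacher / peg insertion / shelf placing: penalize the end-effector's distance
    # to the task-specific goal (goal position, insertion point, or placing location).
    return -np.linalg.norm(np.asarray(end_effector_pos) - np.asarray(goal_pos))
```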

Multitask Learning Results

Inserting a peg into one of 3 possible goal locations

We use a 7-DoF Sawyer robotic arm to perform peg insertion in the real world. In this experiment, there are 3 tasks corresponding to different insertion locations.

The policy sends actions over a ROS interface to a low-level PID controller to move the joints of the robot. The reward function is the L2-norm of the translational distances between the current pose of the peg and the goal pose. Note that the goal pose is not provided to the robot, but must be inferred from its history of observations. The observations seen by the robot consist of this reward signal as well as images from the two webcams: one fixed view from the overhead camera and one first-person view from the wrist-mounted camera.

The robot succeeds at all three tasks after collecting 160,000 samples, or 11 hours of data collection time.
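Below is a minimal sketch of how the reward and observation described above could be assembled on the robot; the helper names and camera handling are illustrative assumptions, not the released ROS interface.

```python
# Hedged sketch: reward and observation construction for real-world peg insertion.
import numpy as np

def compute_reward(peg_position, goal_position):
    # L2 norm of the translational distance between the current peg pose and the goal pose.
    return -np.linalg.norm(np.asarray(peg_position) - np.asarray(goal_position))

def build_observation(overhead_image, wrist_image, reward):
    # The goal pose itself is never observed; the robot must infer it from its history.
    return {"images": (overhead_image, wrist_image), "reward": reward}
```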

Algorithm Details

MELD Meta-training


Meta-training alternates between training the latent state model and training the actor and critic, which are conditioned on the model's posterior over the latent state z. Batches of trajectories from a set of training tasks are sampled from the replay buffer and fed into the latent state model, which infers the posterior over the latent state z at each time step given the previous latent state and the current observation, action, and reward. To capture both state and task information in the latent state, the model is trained to minimize the reconstruction error of observations and rewards, subject to a KL divergence constraint that compresses the latent representation.
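The alternating scheme can be summarized with the sketch below; the module interfaces, loss names, and the single combined actor-critic update are simplifying assumptions made for brevity, not the released implementation.

```python
# Hedged sketch of one meta-training step: (1) latent model update via
# reconstruction + KL, (2) actor-critic update conditioned on the inferred latent.
def meta_train_step(model, actor, critic, replay_buffer, model_opt, ac_opt, kl_weight=1.0):
    batch = replay_buffer.sample_trajectories()  # trajectories from training tasks

    # 1) Latent state model: reconstruct observations and rewards, regularize with KL.
    posterior = model.infer(batch.observations, batch.actions, batch.rewards)
    recon_loss = model.reconstruction_loss(posterior, batch.observations, batch.rewards)
    kl_loss = model.kl_divergence(posterior)     # compresses the latent representation
    model_loss = recon_loss + kl_weight * kl_loss
    model_opt.zero_grad()
    model_loss.backward()
    model_opt.step()

    # 2) Actor and critic, conditioned on the (detached) posterior belief over z.
    z = posterior.sample().detach()
    ac_loss = critic.loss(z, batch.actions, batch.rewards) + actor.loss(z, critic)
    ac_opt.zero_grad()
    ac_loss.backward()
    ac_opt.step()
```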

For more details, please refer to the paper and code linked below!