Offline Meta-Reinforcement Learning with Online Self-Supervision

Vitchyr Pong, Ashvin Nair, Laura Smith, Catherine Huang, Sergey Levine
University of California, Berkeley


The Problem

  • Meta reinforcement learning (RL) trains a meta-policy to quickly adapt to a new task, given transition histories collected by a learned exploration policy.

  • Offline meta RL trains the exploration policy and the meta-policy using a fixed dataset of transitions. However, at meta-test time, the exploration policy's trajectory distribution may differ from the states in the offline replay buffer, resulting in a distribution shift in the data used for adaptation.

  • This shift in the adaptation data in turn shifts the distribution of the meta-learned context produced by the adaptation procedure.

Our Approach: Self-supervised Meta Actor-Critic (SMAC)

  • We propose a two-stage training procedure that addresses the distribution shift: offline meta RL training followed by a self-supervised phase.

  • In the self-supervised phase, the meta-policy can interact with the environment, but it receives no additional reward labels.


  • We propose a new meta-learning evaluation domain based on the environment from this paper.

    • A simulated Sawyer gripper can perform various manipulation tasks such as pushing a button, opening drawers, and picking and placing objects.

    • We evaluate on held-out tasks in which different objects may be present, placed at completely different locations.

    • See example offline trajectories.

  • We also evaluate on standard meta-learning tasks (Cheetah & Ant).


  • SMAC significantly improves performance over prior offline meta RL methods.

  • SMAC matches the performance of an oracle comparison that receives reward labels even during the online interactions.

More on the Distribution Shift Problem

Offline RL algorithms suffer from the usual RL distribution shift, in which the states seen by the learned policy differ from the states in the offline replay buffer that the policy was trained on. Offline meta RL must deal with an additional distribution shift that arises from the adaptation procedure. Meta RL algorithms learn a fast adaptation procedure that maps a history of transitions to a context variable z. Different meta RL algorithms represent this context variable differently: z could be the weights of a neural network, the hidden activations of an RNN, or latent variables generated by a neural network. What these meta RL methods have in common is that the context is used to condition a post-adaptation policy.

In offline meta RL, the adaptation procedure is trained using data sampled from a fixed, offline dataset. At meta-test time, the data that the learned exploration policy collects will be distributed differently from the data in the offline replay buffer, so the resulting distribution over z will change. We hypothesize that the policy will perform poorly at meta-test time due to this distribution shift in z-space. To test this hypothesis, we quantify the distribution shift and measure the difference in performance.

We compare the posterior distributions of the latent variable z when conditioned on rollouts from the behavior policy and when conditioned on rollouts from the learned exploration policy. The differences between the KL divergence distributions for the two policies imply the presence of a distribution shift.

We also compare the post-adaptation performance when adapting with offline data and when adapting with rollouts from the exploration policy. The distribution shift in z-space causes the policy to perform poorly.
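One way to make this comparison concrete is to compute a KL divergence for each posterior and compare the values across the two data sources. The sketch below assumes diagonal Gaussian posteriors and a standard normal prior. The function name `kl_diag_gaussians` and all numbers are hypothetical and chosen only for illustration; they are not results from this work.

```python
import math

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(p || q) for two diagonal Gaussians.

    Each argument is a list with one entry per latent dimension.
    """
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total

# Assumed standard normal prior over a 2-D latent z: (means, variances).
prior = ([0.0, 0.0], [1.0, 1.0])

# Hypothetical posteriors q(z | h), one inferred from offline data and one
# inferred from exploration-policy rollouts. Illustrative values only.
offline_posterior = ([0.1, -0.2], [0.9, 1.1])
online_posterior = ([1.5, 2.0], [0.3, 0.2])

kl_offline = kl_diag_gaussians(*offline_posterior, *prior)
kl_online = kl_diag_gaussians(*online_posterior, *prior)
print(f"KL (offline data): {kl_offline:.3f}")
print(f"KL (exploration rollouts): {kl_online:.3f}")
```

Repeating this over many histories from each source gives two sets of KL values, and a visible gap between them is the kind of evidence the paragraph above refers to.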

Self-supervised Meta Actor-Critic (SMAC) Details

To address this distribution shift, we introduce an additional assumption: the agent can interact with the environment, but without additional reward labels. These additional interactions enable the agent to observe trajectories collected by the learned exploration policy. To meta-train on these trajectories, we train a reward decoder that labels them with self-generated rewards.

Offline Phase

In the offline phase, we use reward-labeled data from an offline data buffer to learn a context encoder while running an actor-critic algorithm (in our case, advantage weighted actor critic, AWAC) to update the Q-function and policy. The critic is updated by minimizing the Bellman error.
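For reference, a standard squared Bellman error for a context-conditioned critic can be written as follows. The notation is an assumption made for this sketch and is not taken from the text: Q_w is the critic, Q_{\bar{w}} a target network, \gamma the discount factor, q_\phi the context encoder applied to a history h, \pi_\theta the policy, and \mathcal{D} the offline buffer.

```latex
\mathcal{L}_{\text{critic}}(w) =
  \mathbb{E}_{(s, a, r, s') \sim \mathcal{D},\; z \sim q_\phi(z \mid h),\; a' \sim \pi_\theta(\cdot \mid s', z)}
  \Big[ \big( Q_w(s, a, z) - \big( r + \gamma\, Q_{\bar{w}}(s', a', z) \big) \big)^2 \Big]
```

Note that the target evaluates the Q-function at actions a' drawn from the current policy, which is where out-of-distribution actions can enter.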

Our actor update is based on advantage weighted actor critic (AWAC). AWAC mitigates the bootstrapping error accumulation that occurs when the target Q-function is evaluated at actions a' outside of the training data. It does so by implicitly regularizing the learned policy to stay near the policy that generated the offline dataset. The AWAC actor update minimizes an advantage-weighted negative log-likelihood of the actions in the dataset.
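As a rough illustration of an advantage-weighted actor loss, the sketch below weights the negative log-likelihood of each dataset action by exp(advantage / temperature). It assumes the advantage is Q(s, a, z) minus a value estimate V(s, z) and that the weights are treated as constants during optimization. The function name, the `temperature` parameter, and the batch values are hypothetical.

```python
import math

def awac_actor_loss(log_probs, q_values, v_values, temperature=1.0):
    """Advantage-weighted negative log-likelihood, averaged over a batch.

    log_probs: log pi(a | s, z) for each dataset action.
    q_values:  Q(s, a, z) for the same state-action pairs.
    v_values:  V(s, z) estimates, so that the advantage is Q - V.
    """
    total = 0.0
    for log_p, q, v in zip(log_probs, q_values, v_values):
        # Actions with higher advantage receive exponentially larger weight.
        weight = math.exp((q - v) / temperature)
        total += -weight * log_p
    return total / len(log_probs)

# Toy batch of two transitions with equal log-probabilities. The first has
# advantage 1 and the second has advantage 0, so the weights are e and 1.
loss = awac_actor_loss(
    log_probs=[-1.0, -1.0],
    q_values=[1.0, 0.0],
    v_values=[0.0, 0.0],
)
print(loss)  # (e + 1) / 2, approximately 1.859
```

Because only actions that appear in the dataset are re-weighted, the policy is pulled toward high-advantage dataset actions rather than toward arbitrary actions.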

To prepare for the self-supervised phase, we additionally train a stochastic reward decoder that generates rewards conditioned on the latent variable z. We train the context encoder and reward decoder jointly by backpropagating the context loss, which consists of a reward loss and a KL loss, where the latter regularizes the context encoder.
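One plausible form of such a context loss is sketched below. It assumes a squared-error reward loss and a KL term measured against a prior p(z); both choices, and the symbols r_\psi (reward decoder), q_\phi (context encoder), h (history), and \mathcal{D} (offline buffer), are assumptions made for illustration and are not taken from the text.

```latex
\mathcal{L}_{\text{context}}(\phi, \psi) =
  \mathbb{E}_{h \sim \mathcal{D}}
  \Big[
    \mathbb{E}_{z \sim q_\phi(z \mid h)}
    \Big[ \sum_{(s, a, r) \in h} \big( r - r_\psi(s, a, z) \big)^2 \Big]
    + D_{\mathrm{KL}}\big( q_\phi(z \mid h) \,\|\, p(z) \big)
  \Big]
```

The first term trains the decoder to reproduce the observed rewards from z, and the second term keeps the encoder's posterior close to the prior.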

Although AWAC succeeds in lessening the policy distribution shift, it does not address the z-space distribution shift. We mitigate the effects of the z-space shift in the self-supervised phase by collecting additional, online data.

Online Phase

In the self-supervised phase, we use the reward decoder pretrained in the offline phase to self-generate reward labels as the policy interacts with the environment. We train the Q-function and policy on rollouts from the exploration policy. We collect these exploration rollouts by sampling the context variable z from the prior distribution, and we label their rewards using the reward decoder.
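The data collection loop described above can be sketched as follows. Everything here is a toy stand-in: the one-dimensional environment, the policy, and the reward decoder are hypothetical placeholders for the learned components, and reusing the sampled z to condition the reward decoder is a simplifying assumption of this sketch.

```python
import random

def sample_prior(latent_dim, rng):
    """Sample z from an assumed standard normal prior."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

def toy_policy(state, z, rng):
    """Hypothetical stand-in for the z-conditioned exploration policy."""
    return z[0] + 0.1 * rng.gauss(0.0, 1.0)

def toy_env_step(state, action):
    """Reward-free toy dynamics: the environment returns only the next state."""
    return state + action

def toy_reward_decoder(state, action, z):
    """Hypothetical stand-in for the pretrained reward decoder."""
    return -abs(state + action - z[0])

def collect_self_supervised_rollout(horizon, latent_dim, rng):
    z = sample_prior(latent_dim, rng)
    state = 0.0
    transitions = []
    for _ in range(horizon):
        action = toy_policy(state, z, rng)
        next_state = toy_env_step(state, action)
        # No reward comes from the environment, so label it with the decoder.
        reward = toy_reward_decoder(state, action, z)
        transitions.append((state, action, reward, next_state))
        state = next_state
    return z, transitions

rng = random.Random(0)
replay_buffer = []
for _ in range(3):
    _, rollout = collect_self_supervised_rollout(horizon=5, latent_dim=2, rng=rng)
    replay_buffer.extend(rollout)
print(len(replay_buffer))  # 15 self-labeled transitions
```

The resulting buffer has the same (state, action, reward, next state) layout as reward-labeled offline data, so the same actor and critic updates can be applied to it.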

Summary Video