MURAL: Meta-Learning Uncertainty-Aware Rewards for Outcome-Driven Reinforcement Learning

Kevin Li*, Abhishek Gupta*, Ashwin Reddy, Vitchyr Pong, Aurick Zhou, Justin Yu, Sergey Levine

UC Berkeley

Paper | Blog Post | Code

Presented at the 2021 International Conference on Machine Learning (ICML)


  • Reinforcement learning (RL) in its most general form requires solving a challenging uninformed search problem, in which rewards are sparsely observed

  • Shaped reward functions can help guide learning, but often require domain knowledge and can be misleading if not designed carefully

  • We aim to reformulate the RL problem to:

    • Make it easier to specify the task by simply providing examples of successful outcomes

    • Make learning more tractable by using the success examples for directed exploration


MURAL Algorithm

  1. The user provides examples of desired outcomes. There is no need for demonstrations or any additional guidance.

  2. At each iteration of RL, we use states visited by the policy as negative examples, then train our classifier to distinguish between visited states and goals.

  3. We run RL, where rewards are obtained by querying the classifier with an amortized version of Normalized Maximum Likelihood. We use p_NML(goal | state) as the reward.
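The loop above can be sketched with a simple stand-in classifier. Everything below is illustrative: MURAL trains a neural network classifier and uses the NML-regularized probability p_NML(goal | state) as the reward, whereas this toy uses plain logistic regression and the ordinary MLE probability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_classifier(negatives, positives, steps=500, lr=0.5):
    # Logistic regression distinguishing states visited by the policy
    # (label 0) from user-provided success examples (label 1); a toy
    # stand-in for the neural network classifier used in the paper.
    X = np.vstack([negatives, positives])
    y = np.concatenate([np.zeros(len(negatives)), np.ones(len(positives))])
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def reward(states, w):
    # Reward = classifier's probability that a state is a success.
    # (MURAL instead uses the uncertainty-aware p_NML(goal | state).)
    Xb = np.hstack([states, np.ones((len(states), 1))])
    return sigmoid(Xb @ w)

# Toy 1-D task: the policy currently visits states near x = 0,
# and the user's success examples sit near x = 1.
visited = np.random.RandomState(0).normal(0.0, 0.1, size=(50, 1))
goals = np.random.RandomState(1).normal(1.0, 0.1, size=(10, 1))
w = train_classifier(visited, goals)
r = reward(np.array([[0.0], [1.0]]), w)  # reward rises toward the goal
```

In the full algorithm this classifier is retrained at every RL iteration as the set of visited states grows, so the reward landscape keeps adapting to the policy's current behavior.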

Meta-Learning Normalized Maximum Likelihood (Meta-NML)

Normalized Maximum Likelihood (NML) requires retraining our model to convergence on every new test point, which would be computationally intractable on neural network models. We propose a novel meta-learning variant of NML (meta-NML) which learns a model initialization that can easily adapt to new points with arbitrary labels in just one or a few gradient steps.
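Exact conditional NML makes the retraining cost concrete: for each candidate label of a query point, the model is retrained on the dataset augmented with that labeled point, and the resulting likelihoods are normalized. A brute-force sketch using a tiny logistic-regression model as a stand-in for the paper's network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, steps=500, lr=0.5):
    # Train a tiny logistic-regression model to (approximate)
    # convergence; a stand-in for retraining a neural network.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def nml_probability(X, y, x_query):
    # Conditional NML for binary labels: retrain once per candidate
    # label of the query point, then normalize the two likelihoods.
    # This per-query retraining is exactly what meta-NML amortizes.
    likelihoods = []
    for label in (0.0, 1.0):
        Xa = np.vstack([X, x_query[None, :]])
        ya = np.append(y, label)
        w = fit_logreg(Xa, ya)
        p1 = sigmoid(np.append(x_query, 1.0) @ w)  # p(label=1 | x_query)
        likelihoods.append(p1 if label == 1.0 else 1.0 - p1)
    return likelihoods[1] / sum(likelihoods)  # p_NML(y=1 | x_query)

X = np.array([[0.0], [0.1], [0.9], [1.0]])  # two negatives, two positives
y = np.array([0.0, 0.0, 1.0, 1.0])
p_in = nml_probability(X, y, np.array([1.0]))  # query near the positives
```

Even this toy requires two full training runs per query; with a deep network and thousands of reward queries per RL iteration, exact NML is clearly impractical.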

Using meta-NML, we can approximate the desired NML outputs reasonably well in a fraction of the time, yielding a roughly 2000x speedup over standard NML. This is crucial to making an NML-based classifier method computationally tractable for RL.
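A minimal sketch of the amortization idea (not the actual meta-NML training procedure, which meta-learns the initialization with a MAML-style objective): each query is answered with a single gradient step per candidate label instead of retraining to convergence. Here the initialization is simply a pre-trained logistic-regression solution, used as a stand-in for a meta-learned one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_step(w, Xb, y, lr=1.0):
    # One gradient step on the logistic-regression log-loss.
    p = sigmoid(Xb @ w)
    return w - lr * Xb.T @ (p - y) / len(y)

def amortized_nml(w_init, X, y, x_query, lr=1.0):
    # Approximate p_NML(y=1 | x_query) with ONE adaptation step per
    # candidate label. In meta-NML the initialization is meta-learned
    # so that a single step suffices; here we start from an ordinary
    # pre-trained solution as a stand-in initialization.
    xb = np.append(x_query, 1.0)
    Xb = np.hstack([X, np.ones((len(X), 1))])
    likelihoods = []
    for label in (0.0, 1.0):
        Xa = np.vstack([Xb, xb])
        ya = np.append(y, label)
        w = grad_step(w_init, Xa, ya, lr)
        p1 = sigmoid(xb @ w)
        likelihoods.append(p1 if label == 1.0 else 1.0 - p1)
    return likelihoods[1] / sum(likelihoods)

X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
Xb = np.hstack([X, np.ones((len(X), 1))])
w = np.zeros(2)
for _ in range(500):            # ordinary pre-training of the classifier
    w = grad_step(w, Xb, y, lr=0.5)
p = amortized_nml(w, X, y, np.array([0.5]))  # ambiguous midpoint query
```

Replacing a full training run per label with one gradient step is where the large constant-factor speedup comes from; the meta-learning phase is what makes the one-step answers accurate.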

Evaluation Domains

We evaluate our algorithm on a variety of reinforcement learning tasks:

Robotic manipulation

  • Three robotic tasks — pushing, door opening, and pick-and-place — with a Sawyer robot arm, previously considered in VICE-RAQ (Singh et al. 2019). We use the ground truth robot state (e.g. end effector and gripper positions); details on the state spaces are provided below for each environment.

  • One dexterous manipulation task, which involves controlling a robot claw with a high-dimensional action space (16 DoF) to move an object to a desired location.


Navigation

  • Two 2D maze navigation problems, which require avoiding several local optima before reaching the goal. Previous classifier-based methods often get stuck on the wrong side of a wall, believing they are close enough to the desired goal.

  • One ant locomotion task, where a 15 DoF quadruped ant must navigate around a wall to the desired goal. To succeed on this task, an algorithm must not only avoid local optima as in the other mazes, but also scale to higher-dimensional state and action spaces.

Sawyer Push

The robot must push a puck to a fixed location.

State space: x, y, z coordinates of end effector and x, y coordinates of puck

Sawyer Door Opening

The robot must open the door to a 45 degree angle.

State space: x, y, z coordinates of end effector and angle of door

Sawyer Pick-and-Place

The robot must pick up a randomly placed ball from a table and raise it to a fixed location.

State space: x, y, z coordinates of end effector; x, y, z coordinates of ball; tightness of gripper

Dexterous Hand

The multi-fingered (16 DoF) hand must move the object to a fixed location.

State space: x, y, z coordinates of end effector

Zigzag Maze

The agent must navigate through an S-shaped corridor consisting of two walls.

State space: x and y coordinates of agent

Spiral Maze

The agent must navigate through a spiral-shaped corridor.

State space: x and y coordinates of agent

Ant Locomotion

The quadruped ant must navigate around a wall to the desired target.

State space: Center of mass, joint positions, and joint angles of the ant


Results

We compare to prior classifier-based RL methods (VICE), goal-reaching algorithms (DDL), exploration bonuses (RND and count-based bonuses), a heuristically shaped reward function, and a sparse reward.

Our algorithm quickly learns to solve these challenging exploration tasks, often reaching better asymptotic performance than prior methods while requiring significantly fewer samples. This suggests that MURAL provides directed reward shaping and exploration that is substantially better than standard classifier-based methods such as VICE.