MURAL: Meta-Learning Uncertainty-Aware Rewards for Outcome-Driven Reinforcement Learning

Kevin Li*, Abhishek Gupta*, Ashwin Reddy, Vitchyr Pong, Aurick Zhou, Justin Yu, Sergey Levine

UC Berkeley

Paper | Blog Post | Code

Presented at the 2021 International Conference on Machine Learning (ICML)


  • Reinforcement learning (RL) in its most general form requires solving a challenging uninformed search problem, in which rewards are sparsely observed

  • Shaped reward functions can help guide learning, but often require domain knowledge and can be misleading if not designed carefully

  • We aim to reformulate the RL problem to:

    • Make it easier to specify the task by simply providing examples of successful outcomes

    • Make learning more tractable by using the success examples for directed exploration


MURAL Algorithm

  1. The user provides examples of desired outcomes. There is no need for demonstrations or any additional guidance.

  2. At each iteration of RL, we use states visited by the policy as negative examples, then train our classifier to distinguish between visited states and goals.

  3. We run RL, where rewards are obtained by querying the classifier with an amortized version of Normalized Maximum Likelihood. We use p_NML(goal | state) as the reward.
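The loop above can be sketched with a simple stand-in classifier. Everything below is illustrative: MURAL trains a neural network classifier and uses the NML-regularized probability p_NML(goal | state) as the reward, whereas this toy uses plain logistic regression and the ordinary MLE probability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_classifier(negatives, positives, steps=500, lr=0.5):
    # Logistic regression distinguishing states visited by the policy
    # (label 0) from user-provided success examples (label 1); a toy
    # stand-in for the neural network classifier used in the paper.
    X = np.vstack([negatives, positives])
    y = np.concatenate([np.zeros(len(negatives)), np.ones(len(positives))])
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def reward(states, w):
    # Reward = classifier's probability that a state is a success.
    # (MURAL instead uses the uncertainty-aware p_NML(goal | state).)
    Xb = np.hstack([states, np.ones((len(states), 1))])
    return sigmoid(Xb @ w)

# Toy 1-D task: the policy currently visits states near x = 0,
# and the user's success examples sit near x = 1.
visited = np.random.RandomState(0).normal(0.0, 0.1, size=(50, 1))
goals = np.random.RandomState(1).normal(1.0, 0.1, size=(10, 1))
w = train_classifier(visited, goals)
r = reward(np.array([[0.0], [1.0]]), w)  # reward rises toward the goal
```

In the full algorithm this classifier is retrained at every RL iteration as the set of visited states grows, so the reward landscape keeps adapting to the policy's current behavior.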

Meta-Learning Normalized Maximum Likelihood (Meta-NML)

Normalized Maximum Likelihood (NML) requires retraining our model to convergence on every new test point, which would be computationally intractable on neural network models. We propose a novel meta-learning variant of NML (meta-NML) which learns a model initialization that can easily adapt to new points with arbitrary labels in just one or a few gradient steps.
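Exact conditional NML makes the retraining cost concrete: for each candidate label of a query point, the model is retrained on the dataset augmented with that labeled point, and the resulting likelihoods are normalized. A brute-force sketch using a tiny logistic-regression model as a stand-in for the paper's network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, steps=500, lr=0.5):
    # Train a tiny logistic-regression model to (approximate)
    # convergence; a stand-in for retraining a neural network.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def nml_probability(X, y, x_query):
    # Conditional NML for binary labels: retrain once per candidate
    # label of the query point, then normalize the two likelihoods.
    # This per-query retraining is exactly what meta-NML amortizes.
    likelihoods = []
    for label in (0.0, 1.0):
        Xa = np.vstack([X, x_query[None, :]])
        ya = np.append(y, label)
        w = fit_logreg(Xa, ya)
        p1 = sigmoid(np.append(x_query, 1.0) @ w)  # p(label=1 | x_query)
        likelihoods.append(p1 if label == 1.0 else 1.0 - p1)
    return likelihoods[1] / sum(likelihoods)  # p_NML(y=1 | x_query)

X = np.array([[0.0], [0.1], [0.9], [1.0]])  # two negatives, two positives
y = np.array([0.0, 0.0, 1.0, 1.0])
p_in = nml_probability(X, y, np.array([1.0]))  # query near the positives
```

Even this toy requires two full training runs per query; with a deep network and thousands of reward queries per RL iteration, exact NML is clearly impractical.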

Using meta-NML, we can approximate the desired NML outputs reasonably well in a fraction of the time, yielding a roughly 2000x speedup over standard NML. This is crucial to making an NML-based classifier method computationally tractable for RL.
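A minimal sketch of the amortization idea (not the actual meta-NML training procedure, which meta-learns the initialization with a MAML-style objective): each query is answered with a single gradient step per candidate label instead of retraining to convergence. Here the initialization is simply a pre-trained logistic-regression solution, used as a stand-in for a meta-learned one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_step(w, Xb, y, lr=1.0):
    # One gradient step on the logistic-regression log-loss.
    p = sigmoid(Xb @ w)
    return w - lr * Xb.T @ (p - y) / len(y)

def amortized_nml(w_init, X, y, x_query, lr=1.0):
    # Approximate p_NML(y=1 | x_query) with ONE adaptation step per
    # candidate label. In meta-NML the initialization is meta-learned
    # so that a single step suffices; here we start from an ordinary
    # pre-trained solution as a stand-in initialization.
    xb = np.append(x_query, 1.0)
    Xb = np.hstack([X, np.ones((len(X), 1))])
    likelihoods = []
    for label in (0.0, 1.0):
        Xa = np.vstack([Xb, xb])
        ya = np.append(y, label)
        w = grad_step(w_init, Xa, ya, lr)
        p1 = sigmoid(xb @ w)
        likelihoods.append(p1 if label == 1.0 else 1.0 - p1)
    return likelihoods[1] / sum(likelihoods)

X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
Xb = np.hstack([X, np.ones((len(X), 1))])
w = np.zeros(2)
for _ in range(500):            # ordinary pre-training of the classifier
    w = grad_step(w, Xb, y, lr=0.5)
p = amortized_nml(w, X, y, np.array([0.5]))  # ambiguous midpoint query
```

Replacing a full training run per label with one gradient step is where the large constant-factor speedup comes from; the meta-learning phase is what makes the one-step answers accurate.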

Evaluation Domains

We evaluate our algorithm on a variety of reinforcement learning tasks:

Robotic manipulation

  • Three robotic tasks — pushing, door opening, and pick-and-place — with a Sawyer robot arm, previously considered in VICE-RAQ (Singh et al. 2019). We use the ground truth robot state (e.g. end effector and gripper positions); details on the state spaces are provided below for each environment.

  • One dexterous manipulation task, which involves controlling a robot claw with a high-dimensional action space (16 DoF) to move an object to a desired location.


Navigation

  • Two 2D maze navigation problems, which require avoiding several local optima before reaching the goal. Previous classifier-based methods often get stuck on the wrong side of a wall, believing they are close enough to the desired goal.

  • One ant locomotion task, where a 15 DoF quadruped ant must navigate around a wall to the desired goal. To succeed on this task, an algorithm must not only avoid local optima as in the other mazes, but also scale to higher-dimensional state and action spaces.

Sawyer Push

The robot must push a puck to a fixed location.

State space: x, y, z coordinates of end effector and x, y coordinates of puck

Sawyer Door Opening

The robot must open the door to a 45 degree angle.

State space: x, y, z coordinates of end effector and angle of door

Sawyer Pick-and-Place

The robot must pick up a randomly placed ball from a table and raise it to a fixed location.

State space: x, y, z coordinates of end effector; x, y, z coordinates of ball; tightness of gripper

Dexterous Hand

The multi-fingered (16 DoF) hand must move the object to a fixed location.

State space: x, y, z coordinates of end effector

Zigzag Maze

The agent must navigate through an S-shaped corridor consisting of two walls.

State space: x and y coordinates of agent

Spiral Maze

The agent must navigate through a spiral-shaped corridor.

State space: x and y coordinates of agent

Ant Locomotion

The quadruped ant must navigate around a wall to the desired target.

State space: Center of mass, joint positions, and joint angles of the ant


Results

We compare to prior classifier-based RL methods (VICE), goal-reaching algorithms (DDL), exploration bonuses (RND and count-based bonuses), a heuristically shaped reward function, and a sparse reward.

Our algorithm quickly learns to solve these challenging exploration tasks, often reaching better asymptotic performance than prior methods while requiring significantly fewer samples. This suggests that MURAL provides directed reward shaping and exploration that is substantially better than standard classifier-based methods such as VICE.