R3L: The Ingredients of Real World Robotic Reinforcement Learning

Henry Zhu*, Justin Yu*, Abhishek Gupta*, Dhruv Shah, Kristian Hartikainen, Avi Singh, Vikash Kumar, Sergey Levine

International Conference on Learning Representations, 2020

ICLR 2020 version (OpenReview) | Blog | arXiv

Motivation

  • Current schemes for training robots with reinforcement learning (RL) in the real world require task-specific environment instrumentation to acquire ground truth state, provide reward information, and perform resets.

  • Scalable robotic learning systems should assume access only to on-board sensors, such as proprioception (joint angles and velocities) and an inexpensive RGB camera for pixel inputs.

  • Limiting the amount of hardware instrumentation keeps setups consistent with one another, which enables data sharing and parallelized training across robots.

Challenges

(1) Learning Without Resets

Reinforcement learning training typically assumes full control over the environment, so that the system can be reset between trajectories to some specified initial state distribution. In the real world, such control is difficult to obtain, so the robot must learn without resets.

(2) Learning from Raw Sensory Input

A real-world system typically does not have access to the underlying state of the environment and instead must learn from high-dimensional sensory inputs.

(3) Reward Functions Without Reward Engineering

Without instrumentation, rewards typically must be specified by a human in the loop for the entire duration of a training run. Learning the reward signal instead, for example from a small set of goal images provided up front, greatly reduces the amount of human involvement from start to finish.

Real World Robotic Reinforcement Learning

Guided by the challenges above, we show that a particular instantiation of a robotic learning system can acquire tasks in the real world, without environment instrumentation or prolonged human supervision, by combining (1) perturbation controllers, (2) unsupervised representation learning, and (3) online goal classifiers.
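To make the three ingredients concrete, here is a minimal structural sketch of how they could fit together in a reset-free training loop. This is not the authors' implementation: every name (encode, GoalClassifier, perturbation_action, step_env, the dimensions, the update schedule) is a hypothetical placeholder, and the encoder, policy, and environment are dummy stand-ins used only to show the control flow.

```python
# Structural sketch (assumed names, not the paper's code) of reset-free RL with
# (1) a perturbation controller, (2) a learned latent representation, and
# (3) an online goal classifier that provides reward from goal images.
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, ACT_DIM = 3 * 48 * 48, 16, 4  # hypothetical sizes

def encode(image):
    """(2) Stand-in for an unsupervised (e.g. VAE-style) encoder mapping raw
    pixels to a compact latent vector; here just a frozen dummy projection."""
    W = np.ones((LATENT_DIM, OBS_DIM)) / OBS_DIM
    return W @ image

class GoalClassifier:
    """(3) Logistic goal classifier trained on user-provided goal images
    (positives) vs. on-policy observations (negatives); its predicted
    probability of 'goal reached' is used as the reward."""
    def __init__(self):
        self.w = np.zeros(LATENT_DIM)
    def reward(self, z):
        return 1.0 / (1.0 + np.exp(-self.w @ z))
    def update(self, goal_latents, policy_latents, lr=0.1):
        labeled = [(g, 1.0) for g in goal_latents] + [(p, 0.0) for p in policy_latents]
        for z, y in labeled:
            self.w += lr * (y - self.reward(z)) * z  # one SGD step of logistic regression

def policy_action(z):
    return rng.normal(size=ACT_DIM)           # placeholder task policy

def perturbation_action(z):
    # (1) Perturbation controller: deliberately spreads out the states the
    # robot visits, standing in for the resets a human would otherwise perform.
    return rng.uniform(-1.0, 1.0, size=ACT_DIM)

def step_env(action):
    return rng.normal(size=OBS_DIM)            # placeholder robot/camera interface

classifier = GoalClassifier()
# Goal images are provided once, at the beginning of training.
goal_latents = [encode(rng.normal(size=OBS_DIM)) for _ in range(5)]

obs = step_env(np.zeros(ACT_DIM))
for step in range(200):
    # Alternate between the task policy and the perturbation controller
    # instead of resetting the environment between trajectories.
    controller = policy_action if (step // 50) % 2 == 0 else perturbation_action
    z = encode(obs)
    obs = step_env(controller(z))
    r = classifier.reward(encode(obs))         # learned reward, no instrumentation
    if step % 50 == 49:                        # periodically refit the classifier online
        classifier.update(goal_latents, [encode(obs)])
```

The point of the sketch is the loop structure: pixels are compressed by a learned encoder, reward comes from a classifier queried (and occasionally retrained) online rather than from hand-engineered instrumentation, and a perturbation controller takes the place of manual resets.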

Hardware Experiments

Bead Manipulation Task

Valve Reorienting Task

Simulation Experiments

The goal in each video is specified as a goal image shown in the bottom-left corner. These goal images are examples of what is provided to the reward classifier at the beginning of training.

Free Object Reorienting

Bead Repositioning

Valve Rotation