R3L: The Ingredients of Real World Robotic Reinforcement Learning
Henry Zhu*, Justin Yu*, Abhishek Gupta*, Dhruv Shah, Kristian Hartikainen, Avi Singh, Vikash Kumar, Sergey Levine
International Conference on Learning Representations, 2020
Motivation
Current schemes for training robots with reinforcement learning (RL) in the real world require task-specific environment instrumentation to acquire ground truth state, provide reward information, and perform resets.
Scalable robotic learning systems should only assume access to on-board sensors, such as proprioceptive information (i.e. joint angles and velocities) and an inexpensive RGB camera for pixel inputs.
Limiting hardware instrumentation also keeps setups consistent with one another, which enables data sharing and parallel training across robots.
Challenges
(1) Learning Without Resets
In reinforcement learning, training typically assumes full control over the environment, so that the state can be reset between trajectories to some specified initial state distribution. In the real world, such control is difficult to obtain, so the robot must learn without resets.
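One way to train without resets, as in this work, is to alternate between the task policy and a perturbation controller that drives the system to varied states. The sketch below illustrates that alternation with toy stand-ins; `ToyEnv`, `RandomPolicy`, and all interfaces are hypothetical placeholders, not the paper's implementation.

```python
import random

class ToyEnv:
    """1-D toy environment standing in for the real robot (illustrative only)."""
    def __init__(self):
        self.state = 0.0
    def observe(self):
        return self.state
    def step(self, action):
        self.state += action  # note: the state is never reset
        return self.state

class RandomPolicy:
    """Placeholder for the task policy or the perturbation controller."""
    def __init__(self, scale):
        self.scale = scale
        self.buffer = []  # experience collected for learning
    def act(self, obs):
        return random.uniform(-self.scale, self.scale)
    def record(self, obs, action):
        self.buffer.append((obs, action))
    def update(self):
        pass  # a real agent would take a gradient step on self.buffer here

def train_reset_free(env, forward, perturb, episode_len=20, rounds=3):
    """Alternate controllers each episode; the environment runs continuously."""
    obs = env.observe()
    for _ in range(rounds):
        for policy in (forward, perturb):  # task policy, then perturbation
            for _ in range(episode_len):
                a = policy.act(obs)
                obs = env.step(a)
                policy.record(obs, a)
            policy.update()
    return obs

env = ToyEnv()
fwd, pert = RandomPolicy(0.1), RandomPolicy(1.0)
train_reset_free(env, fwd, pert)
```

The perturbation controller's wider action scale here loosely mimics its role of pushing the system into diverse states so the task policy does not overfit to a narrow initial state distribution.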
(2) Learning from Raw Sensory Input
A real world system typically does not have access to the underlying state information and instead needs to learn from high dimensional sensor inputs.
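The paper addresses this with unsupervised representation learning, compressing camera images into a compact latent space on which the policy operates. As a minimal stand-in for the learned encoder (the paper trains a variational autoencoder; here PCA via SVD plays the same role), one could fit a linear encoder to flattened frames:

```python
import numpy as np

def fit_encoder(frames, latent_dim=8):
    """Fit a PCA encoder mapping flattened images to a low-dimensional latent.
    This is an illustrative stand-in for the unsupervised image encoder.
    frames: array of shape (N, H*W), flattened grayscale images."""
    mean = frames.mean(axis=0)
    centered = frames - mean
    # principal directions via SVD of the centered data
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:latent_dim]
    def encode(x):
        return (x - mean) @ basis.T  # project onto the top components
    return encode

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 64))    # 100 fake 8x8 images (synthetic data)
encode = fit_encoder(frames, latent_dim=8)
z = encode(frames[:5])                 # 5 frames -> 5 latent vectors
```

The policy and reward model then consume `z` instead of raw pixels, which keeps the downstream RL problem low-dimensional.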
(3) Reward Functions Without Reward Engineering
Without instrumentation, rewards must otherwise be supplied by a human in the loop for the entire duration of a training run. Learning a reward signal from a small set of human-provided goal examples therefore greatly reduces human involvement from start to finish.
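Concretely, a classifier can be trained to distinguish goal images from other observations, and its predicted goal probability used as the reward. The sketch below trains a simple logistic classifier on latent features with plain gradient descent; the data and all names are illustrative, not the paper's exact classifier.

```python
import numpy as np

def train_goal_classifier(goal_feats, other_feats, steps=200, lr=0.5):
    """Fit a logistic classifier separating goal examples from other states.
    Its goal probability serves as a learned reward (illustrative sketch)."""
    X = np.vstack([goal_feats, other_feats])
    y = np.concatenate([np.ones(len(goal_feats)), np.zeros(len(other_feats))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted goal probability
        grad = p - y                             # cross-entropy gradient
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    def reward(z):
        return 1.0 / (1.0 + np.exp(-(z @ w + b)))
    return reward

rng = np.random.default_rng(1)
goal = rng.normal(loc=2.0, size=(50, 4))    # synthetic goal-state features
other = rng.normal(loc=-2.0, size=(50, 4))  # synthetic non-goal features
reward = train_goal_classifier(goal, other)
```

During training, negatives can be drawn from the policy's own experience, so the classifier keeps sharpening as the policy improves without a human labeling every step.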
Real World Robotic Reinforcement Learning
Using the framework of challenges above, we show that a real world robotic system can learn tasks without environment instrumentation or prolonged human supervision by combining (1) perturbation controllers, (2) unsupervised representation learning, and (3) online goal classifiers.
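At a high level, the three ingredients fit into one continuous loop: encode pixels into a latent, score the latent with the learned reward, and alternate between the task policy and the perturbation controller, never resetting. The sketch below is purely schematic; every function is a hypothetical stand-in.

```python
import random

# Hypothetical stand-ins for the three ingredients (illustrative only).
def encode(pixels):
    """(2) Unsupervised representation: pixels -> compact latent."""
    return sum(pixels) / len(pixels)

def reward(latent):
    """(3) Learned goal classifier: latent -> reward signal."""
    return 1.0 if latent > 0.5 else 0.0

def policy(latent):
    """Task policy acting on the latent state."""
    return random.choice([-0.1, 0.1])

def perturb(latent):
    """(1) Perturbation controller pushing toward diverse states."""
    return random.choice([-1.0, 1.0])

state = [0.0] * 4  # fake pixel observation
log = []
for episode in range(6):
    # Alternate controllers; the environment is never reset.
    controller = policy if episode % 2 == 0 else perturb
    for _ in range(10):
        z = encode(state)
        a = controller(z)
        state = [s + a for s in state]
        log.append(reward(encode(state)))
```

Each component in the real system is learned (a VAE-style encoder, a classifier-based reward, RL policies), but the control flow follows this shape.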
Hardware Experiments
Bead Manipulation Task
Valve Reorienting Task
Simulation Experiments
The goal in each video is specified as a goal image in the bottom left corner. These goal images are examples of what would be provided at the beginning of training to the reward classifier.