REBOOT: Reuse Data for Bootstrapping Efficient Real-World Dexterous Manipulation

Zheyuan Hu1*, Aaron Rovinsky1*, Jianlan Luo1, Vikash Kumar2, Abhishek Gupta3, Sergey Levine1
1 UC Berkeley, 2 Meta AI Research, 3 University of Washington
* Equal Contribution

Accepted at CoRL 2023

TL;DR: We introduce a new system that learns dexterous manipulation skills autonomously with sample-efficient RL by bootstrapping from prior data. The system is tested on a multi-fingered robot hand learning complex in-hand rotation tasks.

Supplemental Video

Our method utilizes an autonomous reset pipeline to train in-hand manipulation policies, where the reset and in-hand manipulation policies alternate in a forward-backward fashion.

Abstract

Dexterous manipulation tasks involving contact-rich interactions pose a significant challenge for both model-based control systems and imitation learning algorithms. The complexity arises from the need for multi-fingered robotic hands to dynamically establish and break contacts, balance non-prehensile forces, and control a large number of degrees of freedom. Reinforcement learning (RL) offers a promising approach due to its general applicability and capacity to autonomously acquire optimal manipulation strategies. However, its real-world application is often hindered by the need to generate a large number of samples, reset the environment, and obtain reward signals. In this work, we introduce an efficient system for learning dexterous manipulation skills with RL that alleviates these challenges. The main idea of our approach is the integration of recent advances in sample-efficient RL and replay buffer bootstrapping. This combination allows us to utilize data from different tasks or objects as a starting point for training new tasks, significantly improving learning efficiency. Additionally, our system completes the real-world training cycle by incorporating learned resets via an imitation-based pickup policy as well as learned reward functions, eliminating the need for manual resets and reward engineering. We demonstrate the benefits of reusing past data as replay buffer initialization for new tasks, enabling, for instance, the fast acquisition of intricate manipulation skills in the real world on a four-fingered robotic hand.

Robot Setup

In this work, we use a custom-built, 4-finger, 16-DoF robot hand mounted on a 7-DoF Sawyer robotic arm for dexterous object manipulation tasks, as shown in the figure. Our focus is on learning in-hand reorientation skills with reinforcement learning. During the in-hand manipulation phase, the RL policy controls the 16 degrees of freedom of the hand, setting target positions at 10 Hz, with observations provided by the joint encoders in the finger motors and two RGB cameras, one overhead and one embedded in the palm of the hand. To facilitate autonomous training, we also use imitation learning to train a reset policy that picks up the object from the table between in-hand manipulation trials. This imitation policy uses a 19-dimensional action space, controlling the wrist's end-effector position and the 16 finger joints to pick up the object from any location. To collect reset demonstrations, we teleoperate the robot arm's end-effector position with a 3D mouse and press a button on the mouse to command a grasp.
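To make the control interface concrete, the following is a minimal sketch of the observation and action spaces described above, written with Gymnasium spaces. The space names, bounds, and 128x128 image resolution are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np
from gymnasium import spaces

# In-hand RL phase: 16 hand-joint position targets, commanded at 10 Hz.
rl_action_space = spaces.Box(low=-1.0, high=1.0, shape=(16,), dtype=np.float32)

# Reset (imitation) phase: 3-DoF wrist end-effector position + 16 finger joints.
bc_action_space = spaces.Box(low=-1.0, high=1.0, shape=(19,), dtype=np.float32)

# Observations: finger joint encoders plus two RGB cameras
# (one overhead, one embedded in the palm).
observation_space = spaces.Dict({
    "joint_positions": spaces.Box(-np.pi, np.pi, shape=(16,), dtype=np.float32),
    "overhead_rgb": spaces.Box(0, 255, shape=(128, 128, 3), dtype=np.uint8),
    "palm_rgb": spaces.Box(0, 255, shape=(128, 128, 3), dtype=np.uint8),
})
```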

Palm camera

We designed the red palm plates to accommodate a Basler camera with a 135-degree field of view integrated into the palm, providing additional viewpoints of finger-object contacts.

3-Prong Object

The in-hand tasks for the purple 3-prong object consist of: pose A with one leg pointing forward, and pose B with the wide V-shape opening between two legs pointing forward.

T-Shaped Pipe

The T-shaped pipe is a real-world object used for plumbing. In our experiments, the task requires re-orienting the pipe in-hand into the T-shaped goal pose shown in the image.

Football

The football is a real-world sports item. In our experiments, the task involves re-orienting the football in-hand such that its two ends are aligned vertically in the palm.

System Overview: Our system design combines two ideas: dividing the training loop into phases and unifying the learning components across them. To enable autonomous practicing, we observe that the reset skills needed for uninterrupted training with various objects can be acquired efficiently by collecting and imitating a decreasing number of demonstrations per object. Thus, we divide the training loop into two phases: an in-hand training phase driven by sample-efficient RL and a reset phase driven by a behavior-cloned (BC) policy. Both phases share a vision encoder pre-trained on ImageNet data, which reduces computation costs and allows learning from a mixed buffer at a high update-to-data (UTD) ratio.
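The sketch below illustrates this forward-backward loop. The environment, agent, and policy interfaces (e.g. step_reset, object_in_hand, reward_fn) are hypothetical placeholders used for illustration, not the system's actual API.

```python
def train_autonomously(env, rl_agent, reset_policy, encoder,
                       num_episodes, utd_ratio=4):
    """Alternate between a BC reset (pickup) phase and an in-hand RL phase."""
    for _ in range(num_episodes):
        # Reset phase: the imitation-learned pickup policy grasps the object
        # from the table and brings it into the hand.
        obs = env.reset_to_table()
        while not env.object_in_hand():
            obs = env.step_reset(reset_policy(encoder(obs)))

        # In-hand phase: the RL policy practices reorientation. Transitions go
        # into a replay buffer that may be pre-filled with prior-task data.
        obs = env.begin_in_hand_trial()
        for _ in range(env.max_episode_steps):
            action = rl_agent.act(encoder(obs))
            next_obs, done = env.step_hand(action)
            rl_agent.replay_buffer.add(
                (obs, action, rl_agent.reward_fn(next_obs), next_obs))
            # High update-to-data (UTD) ratio: several gradient updates per
            # environment step, made affordable by the shared frozen encoder.
            for _ in range(utd_ratio):
                rl_agent.update()
            obs = next_obs
            if done:
                break
```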

Task Trainings & Learned Behaviors

We evaluated the success rate of the trained policies every 12,000 steps for the three real-world in-hand manipulation tasks considered in the paper, according to the following criteria:

Here is the learned behavior of the hand rotating the object into pose B. The pose B policy is bootstrapped from pose A's replay buffer and trained with our method, achieving an 80% success rate in less than half of pose A's training time.

When bootstrapped from the 3-pronged object's buffer and trained with REBOOT, the hand learns a delicate behavior that rotates the T-pipe to its goal pose, matching the baseline's performance in half the training time.

We bootstrap the football training with data from the 3-prong object and the T-pipe. On this more challenging task, our method exceeds the final performance of the baseline in half the training time.
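The mechanism underlying these results is pre-filling the new task's replay buffer with transitions collected on previous objects before RL training begins. Below is a minimal sketch, assuming a simple list-backed buffer and uniform subsampling; the actual buffer implementation and data format may differ. The 60k default matches the initialization size reported in the simulation experiments below.

```python
import random


class ReplayBuffer:
    """Toy FIFO replay buffer used only for illustration."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.transitions = []

    def add(self, transition):
        if len(self.transitions) >= self.capacity:
            self.transitions.pop(0)  # drop the oldest transition
        self.transitions.append(transition)

    def sample(self, batch_size):
        return random.sample(self.transitions, batch_size)


def bootstrap_buffer(new_task_buffer, prior_task_buffers, init_size=60_000):
    """Pre-fill the new task's buffer with prior-task transitions.

    For the football task, for example, prior_task_buffers would hold data
    collected on the 3-prong object and the T-shaped pipe.
    """
    prior = [t for buf in prior_task_buffers for t in buf.transitions]
    for transition in random.sample(prior, min(init_size, len(prior))):
        new_task_buffer.add(transition)
    return new_task_buffer
```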

Comparison to Baseline & Hyperparameters

This bar plot displays the training time required for each object to reach its respective target performance. Buffer initialization using our method leads to more than a 2x speedup across all of the objects compared to training from scratch.

To achieve this speedup, our training uses the following hyperparameters and settings:

Behavior Cloned Reset Policy Details

In most cases, our behavior cloned reset policy is capable of resetting the environment, or at least of making contact with the object, but there are a few states where the policy is unable to pick up or perturb the object in any way. In order to avoid getting stuck attempting unsuccessful resets in these states, we train two different reset policies. One is trained with reset demonstrations for multiple objects, while the other is trained with demonstrations for only the current experiment's object. For example, when running an experiment with the football, one policy is trained using reset demonstrations for the 3-pronged object, the T-shaped pipe, and the football, while the other is trained only with demonstrations for the football. At the start of each training episode, we select the multi-object reset policy with an 80% probability and the single-object reset policy with a 20% probability. Since the policies behave differently due to being trained on different data, states in which one policy might get stuck are unlikely to cause the same issue for the other policy, which enables training to continue even if one of the two policies is suboptimal.
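A minimal sketch of this selection rule is shown below; the two policy objects are treated as opaque placeholders.

```python
import random


def select_reset_policy(multi_object_policy, single_object_policy, p_multi=0.8):
    """Pick the multi-object reset policy with probability 0.8 at the start of
    each training episode, and the single-object policy otherwise."""
    if random.random() < p_multi:
        return multi_object_policy
    return single_object_policy
```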

Ablation Studies

Initial Buffer Size

Different Orderings of Buffer Init

Policy Transfer & Finetuning w/o Buffer Init

Comparing Everything Together

Longer Training in Simulation

Sim Environment

For testing and iterating on our algorithms, we developed a simulated replica of our real robot setup using MuJoCo and dm_control. The simulation model consists of the same 16-DoF, four-fingered DHand attached to a 6-DoF Sawyer robot arm, mirroring the real-world setup.

The simulation task considered here is to reposition the 3-pronged object from anywhere on the tabletop back to the center. In this environment, correctly solving the task corresponds to a ground-truth episode reward of -20.

Longer Training in Simulation

The red line shows the average evaluation performance of our method across 4 seeds with buffer initialization (UTD = 4, 60k-transition initialization, the same as in the real world), while the brown line shows the average evaluation performance of the baseline method (AVAIL, UTD = 1) without buffer initialization across 4 seeds. Both curves are smoothed with an EMA coefficient of 0.9. Our method is notably more sample-efficient at solving the task than the baseline and is more stable at convergence when trained for up to 500k steps.
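For reference, the EMA smoothing applied to these curves can be computed as follows; this is a standard formulation with smoothing coefficient 0.9, not the paper's actual plotting code.

```python
def ema_smooth(values, alpha=0.9):
    """Exponential moving average: each point retains a fraction alpha of the
    previous smoothed value and mixes in (1 - alpha) of the new value."""
    smoothed, prev = [], values[0]
    for v in values:
        prev = alpha * prev + (1 - alpha) * v
        smoothed.append(prev)
    return smoothed
```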