Dexterous Manipulation from Images: Autonomous Real-World RL via Substep Guidance
Kelvin Xu1*, Zheyuan Hu1*, Ria Doshi1, Aaron Rovinsky1, Vikash Kumar2, Abhishek Gupta3, Sergey Levine1
1 UC Berkeley, 2 Meta AI Research, 3 University of Washington
* Equal Contribution

Accepted at ICRA 2023

Abstract
Complex and contact-rich robotic manipulation tasks, particularly those that involve multi-fingered hands and underactuated object manipulation, present a significant challenge to any control method. Methods based on reinforcement learning offer an appealing choice for such settings, as they can enable robots to learn to delicately balance contact forces and dexterously reposition objects without strong modeling assumptions. However, running reinforcement learning on real-world dexterous manipulation systems often requires significant manual engineering. This negates the benefits of autonomous data collection and ease of use that reinforcement learning should in principle provide. In this paper, we describe a system for vision-based dexterous manipulation that provides a "programming-free" approach for users to define new tasks and enable robots with complex multi-fingered hands to learn to perform them through interaction. The core principle underlying our system is that, in a vision-based setting, users should be able to provide high-level intermediate supervision that circumvents the challenges of teleoperation or kinesthetic teaching, allowing a robot not only to learn a task efficiently but also to practice it autonomously. Our system includes a framework for users to define a final task and intermediate sub-tasks with image examples, a reinforcement learning procedure that learns the task autonomously without interventions, and experimental results with a four-finger robotic hand learning multi-stage object manipulation tasks directly in the real world, without simulation, manual modeling, or reward engineering.

TL;DR

We introduce a new system that allows robots to learn tasks through interaction without manual engineering, using a framework for users to define tasks with image examples and reinforcement learning. The system was tested with a four-finger robot hand learning object manipulation tasks in the real world.

Our Method & Brush Cleaning Task

Physical Setup

Our hardware platform is built to demonstrate dexterous multi-task robot learning. The system consists of a custom-designed 16 DoF, 4-fingered robotic hand mounted on a 7 DoF Sawyer robot arm. The robot operates in a table-top workspace. The robot is also equipped with a top-view webcam and, depending on the task, either a side-view webcam or a fish-eye lens camera installed under the palm. We leverage these visual observations from different angles to enable learning from images with minimal additional instrumentation. The robot is able to train continuously for 30+ hours (the full duration of training across our tasks) without human intervention.

The objects used in our manipulation tasks are a custom-designed, 3D-printed pipe (hose connector) and hook. All objects are manipulated in an arena of overall size 33" x 33", consisting of a 20" x 20" base and 8" x 8" panels.

Fish-eye Lens Palm Camera

This camera is installed to enable in-hand brush rotation with the palm facing downward. It effectively allows the robot to detect contact and determine whether the rotation was successful.

Colored Fingers

The fingertips are covered with colored "skin" layers that provide more friction for contact-rich tasks. 

Hose Connector (Pipe) Object

Hook Object

Dish Brush

Kitchen Sink

Training Footage and Final Behaviors

Hook Task Milestone Collection & Training Video


Hose Insertion Task Milestone Collection & Training Video


Task Descriptions and Example Goal Images

Hose Task

The goal of this task is to have the hand reach, grasp, reorient, and insert a hose connector into an insertion point. The hose connector is attached to a fixed point at the top of the 20in × 20in arena by a rope that is 31cm long. For this environment, we collected 300 milestone images (84 × 84 × 6) for each subtask. We apply data augmentation using a random crop on the provided milestone images, in addition to randomly sampled N(0, 0.02) noise on the state.
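As a concrete illustration, a minimal version of this augmentation might look like the sketch below. We read N(0, 0.02) as a standard deviation of 0.02; the 4-pixel pad size and the function name are our own assumptions, not taken from the released code.

```python
import numpy as np

def augment_milestone(image, state, pad=4, noise_std=0.02, rng=np.random):
    """Randomly crop a padded milestone image and jitter the state vector.

    image: (84, 84, 6) uint8 milestone image (two stacked camera views).
    state: 1-D proprioceptive state vector.
    """
    h, w, _ = image.shape
    # Pad with edge pixels, then take a random crop back to the original size.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = rng.randint(0, 2 * pad + 1)
    left = rng.randint(0, 2 * pad + 1)
    cropped = padded[top:top + h, left:left + w, :]
    # Add zero-mean Gaussian noise N(0, 0.02) to the state observation.
    noisy_state = state + rng.normal(0.0, noise_std, size=state.shape)
    return cropped, noisy_state
```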


Hook Task

This task is designed to have the hand grasp, reorient, and hook a carabiner onto a latch, and then unhook it. The hook is attached to a fixed point at the top of the arena by a rope that is 31cm long. For this environment, we also collect 300 milestone images (84 × 84 × 6) for each subtask. We apply the same data augmentation: a random crop on the provided milestone images, in addition to randomly sampled N(0, 0.02) noise on the state.


Brush Task

This brushing task simulates a real-world kitchen environment. The robot needs to grasp the brush, scrape the plate with the plastic end, rotate the brush 180 degrees in the hand, and clean the plate. The brush is attached to a fixed point at the top of the arena by a rope that is 31cm long. We use the same reward learning setup as in the other two experiments.


Real World Comparison with Forward-Backward (Eysenbach et al. 2017)

In the real-world setting, we evaluate both our method and the Forward-Backward method (Eysenbach et al. 2017). For the Forward-Backward method, we script the reach subtask (removing the need to learn it and making the learning problem easier), combine grasp and flipup into one forward task with horizon T=100 (50 for each subtask) and 600 milestone images, and train insertion as the backward task. Even with the learning problem made easier for the Forward-Backward method by scripting the reach subtask, our method outperforms it. The gap in performance demonstrates the improvement in learning capacity as the granularity of the task division increases. Since the forward-backward controller lacks explicit task supervision, for fairness we compare against a heuristic task graph.

For the hook task, we combine grasp, flipup, and hook into one forward task with horizon T=200 (50 each for grasp and flipup, 100 for hook) and 900 milestone images, and train unhook as the backward task. Once again, our method outperforms the Forward-Backward method, highlighting the improvement in learning capacity as the granularity of the task division increases.
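To make the baseline's structure concrete, the sketch below shows one way the forward-backward alternation for the hook task could be organized. This is our own simplified rendering under the horizons stated above; the environment interface, policy objects, data-storage callable, and the backward horizon of 100 steps are assumptions, not the actual implementation.

```python
# Simplified sketch of the Forward-Backward baseline on the hook task:
# a single forward controller covers grasp + flipup + hook (combined horizon
# T=200), and a backward controller trains unhook, which acts as a reset.
# `env`, `forward_policy`, `backward_policy`, and `store` are placeholders.

def forward_backward_episode(env, forward_policy, backward_policy, store,
                             T_forward=200, T_backward=100):
    obs = env.get_observation()
    for _ in range(T_forward):            # forward: grasp -> flipup -> hook
        action = forward_policy(obs)
        next_obs = env.step(action)
        store("forward", obs, action, next_obs)
        obs = next_obs
    for _ in range(T_backward):           # backward: unhook (resets the scene)
        action = backward_policy(obs)
        next_obs = env.step(action)
        store("backward", obs, action, next_obs)
        obs = next_obs
    return obs
```

In contrast, our method keeps each substep as its own node in the task graph, so each policy trains over a shorter horizon with its own set of milestone images.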

DhandValve3-v0 Environment (Simulation)

The DhandValve3-v0 environment contains a green, three-pronged valve placed on top of a square arena with dimensions 0.55m x 0.55m. The valve has a circular center, and each of its prongs has an equal length of 0.1m. There are three phases in this task: reach, reposition, and pickup. The reach phase succeeds when the hand is within 0.1m of the valve. In the reposition phase, the hand must reach for the valve, grasp it with its fingers, and drag it to within 0.1m of the arena center. Finally, the pickup phase succeeds if the object is picked up and brought to within 0.1m of the target location, which is 0.2m above the table. To prevent the object from falling off the table, the object is constrained to a 0.15m radius by a string attached to the center of the table.
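Because these success criteria are simple distance thresholds, they can be stated compactly in code. The sketch below is a hypothetical rendering of the checks described above; the function and variable names are ours, and the pickup target is passed in directly since only its height (0.2m above the table) is specified.

```python
import numpy as np

# Distance thresholds from the DhandValve3-v0 description above.

def reach_success(hand_pos, valve_pos):
    # Hand within 0.1 m of the valve.
    return np.linalg.norm(hand_pos - valve_pos) < 0.1

def reposition_success(valve_pos, arena_center):
    # Valve dragged to within 0.1 m of the arena center.
    return np.linalg.norm(valve_pos - arena_center) < 0.1

def pickup_success(valve_pos, target_pos):
    # Valve brought to within 0.1 m of the target, which sits 0.2 m above the table.
    return np.linalg.norm(valve_pos - target_pos) < 0.1
```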

The observation space of the environment consists of two camera views of the robot, resized to 84x84x3. In addition, the proprioceptive state of the arm is provided, which consists of a 16-dim hand joint position, a 7-dim Sawyer arm state, and a 6-dim vector representing the end-effector position and Euler angles. We assume no access to a ground-truth reward function, nor to episodic resets. The labels in the bottom-left corners were overlaid for visualization purposes. The time horizon we use in each phase of the environment is T=100. On the left, we provide 300 goal images per phase, which is comparable to the number used in prior work. We also provide the oracle task graph for the valve environment, which we use to evaluate the relative performance of our learned task graph model.
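Put together, the observation described above could be represented with Gym-style spaces as in the sketch below; the dictionary key names are our own, and only the shapes and dtypes follow the description.

```python
import numpy as np
from gym import spaces

# Two RGB camera views plus a 29-dim proprioceptive state
# (16 hand joint positions + 7 Sawyer arm dims + 6 end-effector pose dims).
observation_space = spaces.Dict({
    "camera_view_1": spaces.Box(low=0, high=255, shape=(84, 84, 3), dtype=np.uint8),
    "camera_view_2": spaces.Box(low=0, high=255, shape=(84, 84, 3), dtype=np.uint8),
    "proprio": spaces.Box(low=-np.inf, high=np.inf, shape=(16 + 7 + 6,),
                          dtype=np.float32),
})
```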

Experiment Details

In this section, we describe details of our RL algorithms and provide hyperparameters for each method. For completeness, we describe our procedure for performing image-based RL, which, as noted in prior work, presents significant optimization challenges (Laskin et al. 2020, Kostrikov et al. 2020).

To make learning more practical, we use a combination of data augmentation during training, which has previously been shown to improve image-based reinforcement learning (Kostrikov et al. 2020), and dropout regularization (Hiraoka et al. 2021). For all approaches we evaluate in this work, we make use of random shift perturbations, which pad the image observation with boundary pixels before taking a random crop.
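For reference, a minimal PyTorch version of this random shift augmentation, in the style of Kostrikov et al. (2020), might look like the sketch below; the 4-pixel pad is an assumed value rather than a reported hyperparameter.

```python
import torch
import torch.nn.functional as F

def random_shift(imgs, pad=4):
    """Random shift: replicate-pad the boundary, then randomly crop back.

    imgs: (B, C, H, W) float tensor of image observations.
    """
    b, _, h, w = imgs.shape
    padded = F.pad(imgs, (pad, pad, pad, pad), mode="replicate")
    out = torch.empty_like(imgs)
    for i in range(b):
        # Independent random offset per image in the batch.
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out
```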

We also describe the individual prior methods we compare against in detail for reproducibility. We summarize the shared parameters below and provide baseline-specific parameters separately.

Learned Task Graph Comparisons

Simulated Analysis

We compare success rates on the simulated DhandValve3-v0 domain against an oracle task graph (which uses privileged state information). We find that performance is comparable across approaches, indicating that our framework is robust to differences in task graph construction at convergence. These results suggest, however, that additional sample-complexity gains could be realized through improved task graph construction.

References
[1] B. Eysenbach, S. Gu, J. Ibarz, and S. Levine, “Leave no trace: Learning to reset for safe and autonomous reinforcement learning,” arXiv preprint arXiv:1711.06782, 2017.
[2] I. Kostrikov, D. Yarats, and R. Fergus, “Image augmentation is all you need: Regularizing deep reinforcement learning from pixels,” arXiv preprint arXiv:2004.13649, 2020.
[3] T. Hiraoka, T. Imagawa, T. Hashimoto, T. Onishi, and Y. Tsuruoka, “Dropout q-functions for doubly efficient reinforcement learning,” arXiv preprint arXiv:2110.02034, 2021.
[4] M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas, “Reinforcement learning with augmented data,” arXiv preprint arXiv:2004.14990, 2020.