Learning to Grasp the Ungraspable with Emergent Extrinsic Dexterity

Wenxuan Zhou, David Held

Robotics Institute, Carnegie Mellon University

Conference on Robot Learning (CoRL) 2022 (Oral)

[Press Coverage] IEEE Spectrum: Robots Grip Better When They Grip Smarter

Abstract

A simple gripper can solve more complex manipulation tasks if it can utilize the external environment, such as by pushing the object against the table or a vertical wall, known as "Extrinsic Dexterity." Previous work on extrinsic dexterity usually relies on careful assumptions about contacts, which impose restrictions on robot design, robot motions, and the variations of the physical parameters. In this work, we develop a system based on reinforcement learning (RL) to address these limitations. We study the task of "Occluded Grasping," which aims to grasp the object in configurations that are initially occluded; the robot needs to move the object into a configuration from which these grasps can be achieved. We present a system with model-free RL that successfully achieves this task using a simple gripper with extrinsic dexterity. The policy learns emergent behaviors of pushing the object against the wall to rotate it and then grasping it, without additional reward terms on extrinsic dexterity. We discuss important components of the system, including the design of the RL problem, multi-grasp training and selection, and policy generalization with an automatic curriculum. Most importantly, the policy trained in simulation transfers zero-shot to a physical robot. It demonstrates dynamic and contact-rich motions with a simple gripper that generalize across objects of various sizes, densities, surface frictions, and shapes with a 78% success rate.

Occluded grasping (object starts close to the wall)

Occluded grasping (object starts far from the wall)

Recovery Behaviors

We observe that the trained policies are able to recover from uncertainties during execution. In the left video, the object is supposed to drop onto the right finger at 0:10 (see the explanation of typical behaviors in Section 1.1 below). However, the object unexpectedly bounces on the finger and then stands upright. The policy follows the new pose of the object and finishes the task. In the middle and right videos, the initial attempt to rotate the object fails, potentially due to noise in the object pose or uncertainty in the object's surface friction. The policy tries again and finishes the task within the same episode.

1 - Real Robot Experiments

For real robot experiments, we evaluate three policies trained in simulation: (1) a policy trained with automatic domain randomization (ADR), (2) a policy trained on fixed physical parameters without ADR, and (3) a policy trained with ADR and finetuned with a wider range of initial object locations. We evaluate the performance of these policies over 10 test cases with 10 episodes per test case. The quantitative results are included in the main paper and attached below as a reference. In this section, we first show the evaluation videos for each type of policy. Then we discuss the failure cases and recovery behaviors. Finally, we include visualizations of ICP for estimating the object pose. All of the videos below are played at 2x speed.

1.1 Policy with ADR Full Evaluation

The following videos show the complete evaluation of the policy trained with ADR. Each video contains 1 test case with 10 episodes. The object ID and the success rate are included in the caption. We would like to emphasize that previous work has not demonstrated this level of contact complexity and object generalization simultaneously on a robot with a simple hand.

Two strategies: As discussed in Section 5.5 of the paper, there are two typical emergent strategies for finishing this task. In both strategies, the robot first pushes the object against the wall and rotates it. In the "dropping" strategy, the robot lets the object drop onto the right finger to catch it (see Box-0 episodes 1, 2, 8, 9, 10 for examples). In the "standing" strategy, the robot rotates the object until it stands on its side and then reaches for the grasp (see Box-0 episodes 4, 5, 6, 7). A single policy network may demonstrate both strategies, as shown in the videos below. Among the 78 successful episodes of this policy, 32 use the dropping strategy and 46 use the standing strategy.

Surface friction: We want to highlight that the variation in surface friction plays an important role in this task. For example, Box-2 has very different surface friction from the other boxes - we observe that Box-2 sometimes makes sticking contact with the wall at the beginning of the episode and then drops to the ground (see Box-2 episodes 4, 5, 6, 7, 8). This leads to a different object pose distribution from Box-0, which mostly slides along the wall. As we will see in the next section, Box-2 is very challenging for the policy without ADR.

Non-box objects: The bottle and container test cases have very different shapes from the boxes. The evaluations with these objects demonstrate out-of-distribution generalization over shapes with a policy that is trained only on boxes. We expect the policy to work on non-box objects to the extent that the distribution of the object pose remains similar - just like the boxes, these objects are rotated until they stand on their side or lean against the wall. We do not expect the policy to work with shapes such as a cylinder, which would have a drastically different pose distribution when the robot interacts with it - we leave this to future work.


ADR-Box-0.mov

Box-0 (9/10)

ADR-Box-1.mov

Box-1 (8/10)

ADR-bag.mov

Toy Bag (7/10)

ADR-Box-0-4-erasers.mov

Box-0 with 4 erasers inside (10/10)

ADR-Box-2.mov

Box-2 (9/10)

ADR-bottle.mov

Bottle (8/10)

ADR-container-reverse.mov

Container-reverse (6/10)

ADR-Box-0-8-erasers.mov

Box-0 with 8 erasers inside (4/10)

ADR-Box-3.mov

Box-3 (7/10)

ADR-container.mov

Container (10/10)

1.2 Policy without ADR Full Evaluation

The following videos show the evaluation of a policy trained without ADR. This policy relies on the "standing" strategy much more often. Although the policy is trained on a fixed simulation environment, it still demonstrates some generalization across object variations due to the nature of a closed-loop policy. Overall, its success rate (33%) is 45 percentage points lower than that of the policy trained with ADR shown above (78%).
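As a quick sanity check on this gap, the per-object success counts from the video captions in Sections 1.1 and 1.2 can be tallied directly. The short Python sketch below only reproduces that arithmetic; it is not part of the evaluation code.

```python
# Per-object success counts out of 10 episodes, copied from the video captions
# in Sections 1.1 (with ADR) and 1.2 (without ADR).
adr = {
    "Box-0": 9, "Box-1": 8, "Box-2": 9, "Box-3": 7,
    "Box-0 + 4 erasers": 10, "Box-0 + 8 erasers": 4,
    "Toy Bag": 7, "Bottle": 8, "Container": 10, "Container-reverse": 6,
}
no_adr = {
    "Box-0": 9, "Box-1": 5, "Box-2": 2, "Box-3": 0,
    "Box-0 + 4 erasers": 6, "Box-0 + 8 erasers": 3,
    "Toy Bag": 8, "Bottle": 0, "Container": 0, "Container-reverse": 0,
}

episodes_per_object = 10
total = episodes_per_object * len(adr)        # 100 episodes per policy
adr_rate = sum(adr.values()) / total          # 78 / 100 = 0.78
no_adr_rate = sum(no_adr.values()) / total    # 33 / 100 = 0.33
print(f"ADR: {adr_rate:.0%}, without ADR: {no_adr_rate:.0%}, "
      f"gap: {adr_rate - no_adr_rate:.0%}")   # gap: 45 percentage points
```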

woADR-Box-0.mov

Box-0 (9/10)

woADR-Box-1.mov

Box-1 (5/10)

woADR-bag.mov

Toy Bag (8/10)

woADR-Box-0-4-erasers.mov

Box-0 with 4 erasers inside (6/10)

woADR-Box-2.mov

Box-2 (2/10)

woADR-bottle.mov

Bottle (0/10)

woADR-container-reverse.mov

Container-reverse (0/10)

woADR-Box-0-8-erasers.mov

Box-0 with 8 erasers inside (3/10)

woADR-Box-3.mov

Box-3 (0/10)

woADR-container.mov

Container (0/10)

1.3 Finetune ADR Policy with Initial Object Location

To show the feasibility of achieving the task when the object is not close to the wall, we finetune the policy used in the real robot experiments above with ADR over an increasing range of initial object locations. The resulting policies show two behaviors across different random seeds when facing an object at the center of the table: (1) using the top finger to push the object from the side and then following the "dropping" strategy, or (2) placing both fingers on the object when pushing and then following the "standing" strategy. We zero-shot transfer the finetuned policies to the real robot and include the sim and real video pairs below for both behaviors.
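For readers interested in how such finetuning can be structured, the sketch below illustrates the general idea of an automatic curriculum over the initial object location: the sampling range is widened whenever the recent success rate stays high. The ranges, thresholds, and helper functions are illustrative assumptions, not the exact implementation from the paper.

```python
import random
from collections import deque

# Illustrative automatic curriculum over the initial object location
# (distance from the wall). All ranges, thresholds, and helper names
# are assumptions for illustration, not the paper's exact settings.

def run_episode(policy, init_dist):
    """Hypothetical rollout helper: resets the simulator with the object at
    `init_dist` from the wall, runs the policy, and returns True on success."""
    raise NotImplementedError

def finetune_with_location_curriculum(policy, num_episodes=10_000):
    min_dist, max_dist = 0.0, 0.30     # meters from the wall (hypothetical)
    expand_step = 0.02                 # widen the sampling range by 2 cm at a time
    success_threshold = 0.7            # expand once recent success is high enough
    cur_max = 0.05                     # start with objects close to the wall
    recent = deque(maxlen=100)         # rolling window of recent episode outcomes

    for _ in range(num_episodes):
        init_dist = random.uniform(min_dist, cur_max)
        recent.append(float(run_episode(policy, init_dist)))
        if len(recent) == recent.maxlen and sum(recent) / len(recent) > success_threshold:
            cur_max = min(cur_max + expand_step, max_dist)
            recent.clear()             # re-estimate success on the wider range
    return policy
```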

initial-location-left-finger-sim.mov

(a) Strategy-1 - sim

initial-location-left-finger.mov

(b) Strategy-1 - real

initial-location-right-finger-sim.mov

(c) Strategy-2 - sim

initial-location-right-finger.mov

(d) Strategy-2 - real

We further evaluate one of the above finetuned policies on the full set of test objects. The policy achieves an overall success rate of 56%. When the object starts far from the wall, the task is more challenging for the non-box objects.

finetune-box-0.mp4

Box-0 (7/10)

finetune-box-0-4-erasers.mp4

Box-0 with 4 erasers inside (7/10)

finetune-box-0-8-erasers.mp4

Box-0 with 8 erasers inside (5/10)

finetune-box-1.mp4

Box-1 (8/10)

finetune-box-2.mp4

Box-2 (9/10)

finetune-box-3.mp4

Box-3 (9/10)

finetune-toybag.mp4

Toy Bag (9/10)

finetune-largebottle.mp4

Bottle (1/10)

finetune-container.mp4

Container (1/10)

finetune-container-reverse.mp4

Container-reverse (1/10)

1.5 Failure Cases

Here are the failure cases from the real robot evaluations above.

A failure case that happens before the initial contact:

  • Missing initial contact: The robot fails to make the initial contact needed to rotate the object. This is mostly due to noise in pose estimation and variations in object dimensions.

Failure cases that happen during the rotation:

  • Object drops during rotation: The object drops to the table during rotation. One potential reason for this failure is insufficient delta rotation from the low-level controller due to the sim2real gap. In the "dropping" strategy, the policy is supposed to rotate the object and then let it drop onto the bottom finger. Before the drop happens, the gripper needs to be rotated until the bottom finger is below the object; otherwise, the bottom finger will not be able to catch the object. Another potential reason for this failure is that the finger slips on the object.

  • Repeated rotation: The robot repeatedly rotates and drops the object. This is different from the previous failure case because the robot moves down with the object at the same time. Our hypothesis for this failure case is that the policy gets stuck in a loop in the MDP.

  • Joint limit: The robot hits a joint limit and the policy gets stuck at the joint limit.

Failure cases that happen after the rotation:

  • Unexpected object dynamics: When the robot rotates the object, the object might move in unexpected ways. This mostly happens for the non-box objects.

  • Stop reaching: Following the "standing" strategy, the policy successfully rotates the object to a stable pose on its side. However, it cannot reach the final grasping pose. The gripper tries to move down to reach the pose but collides with the object due to unexpected object dimensions.

  • Timeout: Since we use a fixed episode length during evaluation, the policy sometimes does not have enough time to finish the task even though it is very close to success. This happens when the policy spends time recovering from failed attempts at the beginning of the episode.

We include the counts of each failure case and video examples below.

Counts of failure cases for each policy

initial_contact.mov

Missing initial contact

joint_limit.mov

Joint limit

object_drop.mov

Object drops during rotation

object_dynamics.mov

Unexpected object dynamics

timeout.mov

Timeout

repeat.mov

Repeated rotation

stop.mov

Stop reaching

1.6 ICP Visualization

We use Iterative Closest Point (ICP) to estimate the pose of the object, which is used as policy input. The following videos show the result of ICP across an episode of real robot execution. We assume a given template point cloud of the object. For boxes, we construct the template by measuring the dimensions. For non-box objects, we use scanned models. We use ICP to align the template (red) to the observed partial point cloud of the scene (blue) from the RGB-D camera, which includes both the gripper and the object. As shown in the videos, ICP matches the template to the observed point cloud reasonably well. One failure case of ICP is shown in video (d) below. When the object suddenly drops during execution as a result of a policy failure, the object pose becomes too different from the previous timestep. This may create difficulties for ICP and may require a global registration algorithm.
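As a reference for how such a registration step can be set up, the following is a minimal sketch using Open3D's point-to-point ICP. The choice of Open3D, the file names, the voxel size, and the initial transform are all assumptions for illustration; the actual pipeline used in the experiments may differ.

```python
import numpy as np
import open3d as o3d

# Minimal ICP sketch with Open3D. The file names, voxel size, and initial
# transform are placeholders; the experiments' actual pipeline may differ.
template = o3d.io.read_point_cloud("object_template.ply")  # measured box or scanned model
scene = o3d.io.read_point_cloud("observed_scene.ply")      # partial cloud from the RGB-D camera

voxel = 0.005                                              # 5 mm downsampling
template_ds = template.voxel_down_sample(voxel)
scene_ds = scene.voxel_down_sample(voxel)

# Warm-start from the previous frame's estimate (identity on the first frame).
init = np.eye(4)
result = o3d.pipelines.registration.registration_icp(
    template_ds, scene_ds,
    max_correspondence_distance=0.02,                      # 2 cm
    init=init,
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint(),
)
object_pose = result.transformation                        # 4x4 template-to-scene transform
print("fitness:", result.fitness)
print(object_pose)
```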

icp_Box-0.mov

(a) ICP with a box template

icp_container.mov

(b) ICP with a scanned model of the container

icp_bottle.mov

(c) ICP with a scanned model of the bottle

icp_Box-0_failure.mov

(d) A failure case of ICP

3 - Visualization of the Policies in Simulation

Here are visualizations of the policies in simulation. Video (a) shows a typical strategy for the occluded grasping task: the robot rotates the object against the wall and lets it drop onto the bottom finger to catch it. Video (b) shows an alternative strategy that pushes the object until it stands on its side. Videos (c) and (d) show policies trained to perform the grasping task over a distribution of grasp configurations, demonstrating the versatile contact-rich behaviors learned by the policy.
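To make "a distribution of grasp configurations" concrete, the sketch below shows a common way to condition a policy on a sampled grasp: a desired grasp configuration is sampled at reset and appended to the observation for the rest of the episode. The environment interface, grasp set, and helper names are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

# Illustrative goal-conditioned rollout for multi-grasp training. The grasp
# set, observation layout, and environment interface are assumptions.
def sample_desired_grasp(grasp_set):
    """Pick one desired grasp configuration (e.g., a grasp on the occluded side)."""
    return grasp_set[np.random.randint(len(grasp_set))]

def rollout(env, policy, grasp_set, horizon=100):
    obs = env.reset()
    goal = sample_desired_grasp(grasp_set)           # fixed for the whole episode
    info = {}
    for _ in range(horizon):
        policy_input = np.concatenate([obs, goal])   # goal-conditioned observation
        obs, reward, done, info = env.step(policy(policy_input))
        if done:
            break
    return info.get("is_success", False)
```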

single_grasp_crop.mp4

(a) Single Grasp

single_grasp_stand_crop.mp4

(b) Single Grasp - Stand

multigrasp-front_crop.mp4

(c) MultiGrasp - Front

multigrasp-side_crop.mov

(d) MultiGrasp - Side

4 - Ablations

Video (a) below shows a policy trained without the wall of the bin. The agent is not able to find a strategy to perform this task, which shows the importance of extrinsic dexterity for this task. Video (b) shows a policy trained without the occlusion penalty. The resulting policy is more likely to get stuck at a local optimum because it is only encouraged to align the position and orientation of the desired grasp configuration as fast as possible. However, approaching the object in this way will not achieve the task because, due to physical limitations, the gripper cannot move underneath the object after rotating it.
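The exact reward terms are defined in the paper; purely as an illustration of the idea, the sketch below combines grasp-alignment terms with a penalty that is active while the desired grasp is still occluded, so that merely aligning with the grasp pose as fast as possible is no longer optimal. The occlusion test, distance metrics, and weights are hypothetical assumptions.

```python
import numpy as np

# Purely illustrative reward sketch: grasp-alignment terms plus an occlusion
# penalty. The occlusion test, distance metrics, and weights are hypothetical,
# not the exact reward used in the paper.
FINGER_LENGTH = 0.05   # meters, hypothetical gripper finger length
TABLE_HEIGHT = 0.0     # table surface height in the world frame, hypothetical

def quat_distance(q1, q2):
    """Rough orientation distance in [0, 1] between two unit quaternions."""
    return 1.0 - abs(float(np.dot(q1, q2)))

def reward(gripper_pos, gripper_quat, grasp_pos, grasp_quat,
           w_pos=1.0, w_rot=0.5, w_occ=1.0):
    pos_err = np.linalg.norm(gripper_pos - grasp_pos)
    rot_err = quat_distance(gripper_quat, grasp_quat)

    # Hypothetical occlusion test: the desired grasp is still "occluded" if
    # reaching it would push the fingertips below the table surface.
    occluded = float(grasp_pos[2] - FINGER_LENGTH < TABLE_HEIGHT)

    # Without the occlusion term, the policy is rewarded simply for aligning
    # with the desired grasp as fast as possible, even while the grasp is
    # unreachable, which is the local optimum described above.
    return -(w_pos * pos_err + w_rot * rot_err + w_occ * occluded)
```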

without_wall.mov

(a) Without Wall

local_optima1.mov

(b) Without Occlusion Penalty (Local optimum)

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. IIS-1849154 and LG Electronics. We thank Daniel Seita, Thomas Weng, Tao Chen, Homanga Bharadhwaj, and Chris Paxton for their valuable feedback. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.