This work explores conditions under which multi-finger grasping algorithms can attain robust sim-to-real transfer. While numerous large datasets facilitate learning generative models for multi-finger grasping at scale, reliable real-world dexterous grasping remains challenging, with most methods degrading when deployed on hardware. An alternative strategy is to use discriminative grasp evaluation models for grasp selection and refinement, conditioned on real-world sensor measurements. This paradigm has produced state-of-the-art results for vision-based parallel-jaw grasping, but it remains underutilized in the multi-finger setting. In this work, we contend that existing datasets and methods have been insufficient for training performant discriminative models for multi-finger grasping. To train grasp evaluators at scale, datasets must provide on the order of millions of grasps, including both positive and negative examples, with corresponding perceptual data that resembles the measurements available at inference time. To that end, we release a new, open-source dataset of 3.5M grasps on 4.3K objects, annotated with RGB images, point clouds, and trained NeRFs. Leveraging this dataset, we train multiple vision-based grasp evaluators that outperform both analytic and generative-model baselines without evaluators in extensive simulated and real-world trials across a diverse range of objects. We show via numerous ablations that the key factor for performance is indeed the evaluators, and that their quality degrades as the dataset shrinks, demonstrating the importance of our new dataset.
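To make the evaluator-based selection-and-refinement paradigm concrete, the sketch below shows one minimal way it can be instantiated in PyTorch: a learned evaluator scores sampled grasp candidates conditioned on a perception embedding, the top candidate is selected, and it is then refined by gradient ascent on the predicted success score. The network architecture, grasp parameterization, and all names here are illustrative assumptions, not the models or interfaces used in this work.

```python
import torch
import torch.nn as nn

# Hypothetical evaluator: predicts a success logit for each grasp, conditioned
# on a precomputed point-cloud (or other perception) embedding. Dimensions and
# architecture are placeholders for illustration only.
class GraspEvaluator(nn.Module):
    def __init__(self, grasp_dim=23, pc_embed_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(grasp_dim + pc_embed_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),  # logit of predicted grasp success
        )

    def forward(self, grasps, pc_embed):
        # Broadcast the shared scene embedding to every grasp candidate.
        x = torch.cat([grasps, pc_embed.expand(grasps.shape[0], -1)], dim=-1)
        return self.mlp(x).squeeze(-1)


def select_and_refine(evaluator, candidate_grasps, pc_embed, steps=20, lr=1e-2):
    """Rank sampled grasps by predicted success, then refine the best one
    by gradient ascent on the evaluator's score."""
    with torch.no_grad():
        scores = evaluator(candidate_grasps, pc_embed)
    best = candidate_grasps[scores.argmax()].clone().requires_grad_(True)
    opt = torch.optim.Adam([best], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Negative score so that minimizing the loss maximizes predicted success.
        loss = -evaluator(best.unsqueeze(0), pc_embed).squeeze()
        loss.backward()
        opt.step()
    return best.detach()


# Toy usage: 512 candidate grasps (e.g. drawn from a generative sampler) and a
# randomly generated stand-in for the scene embedding.
evaluator = GraspEvaluator()
candidates = torch.randn(512, 23)
pc_embed = torch.randn(128)
refined_grasp = select_and_refine(evaluator, candidates, pc_embed)
```

In this sketch the generative model only proposes candidates; the discriminative evaluator, trained on both positive and negative grasps with inference-like perceptual inputs, is what decides which candidate to execute and how to locally improve it.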