Grasp as You Dream: Imitating Functional Grasping from
Generated Human Demonstrations
Abstract: Building generalist robots capable of performing functional grasping in everyday, open-world environments remains a significant challenge due to the vast diversity of objects and tasks. Existing methods are either constrained to narrow object/task sets or rely on prohibitively large-scale data collection to capture real-world variability. In this work, we present an alternative approach, GraspDreamer, which leverages human demonstrations synthesized by visual generative models (VGMs) (e.g., video generation models) to enable zero-shot functional grasping without labor-intensive data collection. The key idea is that VGMs pre-trained on internet-scale human data implicitly encode generalized priors about how humans interact with the physical world, which can be combined with embodiment-specific action optimization to enable functional grasping with minimal effort. Extensive experiments on public benchmarks with different robot hands demonstrate the superior data efficiency and generalization of GraspDreamer compared to previous methods. Real-world evaluations further validate its effectiveness on physical robots. Additionally, we showcase that GraspDreamer can (1) be naturally extended to downstream manipulation tasks, and (2) generate data to support visuomotor policy learning.
Overview
Why?
Building generalist robots capable of performing functional grasping in everyday, open-world environments remains a significant challenge.
Gap...
Existing methods are either constrained to narrow object/task sets or rely on prohibitively large-scale data collection to capture real-world variability.
Aha!
VGMs pre-trained on internet-scale human data implicitly encode generalized priors about how humans interact with the physical world.
How:
GraspDreamer is a method that leverages human demonstrations synthesized by VGMs to enable zero-shot functional grasping without labor-intensive data collection.
Video Presentation
Pipeline
An overview of GraspDreamer. The pipeline consists of three stages: (a) Human demonstration generation with VGMs, (b) Human hand motion extraction and optimization, and (c) Human-to-Robot functional retargeting and execution.
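To make the three stages concrete, here is a minimal Python sketch of the pipeline's structure. All names and interfaces here are illustrative assumptions rather than the released GraspDreamer code: the VGM call, the hand-pose estimator, and the fingertip-matching retargeting objective (including robot_fk and q0) are hypothetical placeholders standing in for the paper's components.

```python
# A minimal sketch of the three-stage pipeline described in the caption above.
# Function names, signatures, and the retargeting objective are assumptions,
# not the authors' released API.
import numpy as np
from scipy.optimize import minimize


def generate_human_demo(image, task_prompt):
    """Stage (a): query a pre-trained visual generative model (e.g., a video
    generation model) to synthesize a human demonstration video."""
    raise NotImplementedError("call a VGM of your choice here")


def extract_hand_motion(video_frames):
    """Stage (b): run an off-the-shelf 3D hand reconstruction model on each
    frame and return a sequence of hand keypoints with shape (T, 21, 3)."""
    raise NotImplementedError("e.g., a MANO-based hand pose estimator")


def retarget_to_robot(hand_keypoints, robot_fk, q0):
    """Stage (c): human-to-robot functional retargeting for one frame, posed
    here as a least-squares fit of robot fingertips to human fingertips.

    hand_keypoints: (21, 3) human hand keypoints for a single frame
    robot_fk:       forward kinematics, mapping joint angles q -> (5, 3)
                    robot fingertip positions (hypothetical interface)
    q0:             initial joint configuration for the optimizer
    """
    # Fingertip indices in the standard 21-keypoint hand layout:
    # thumb, index, middle, ring, pinky tips.
    human_tips = hand_keypoints[[4, 8, 12, 16, 20]]

    def objective(q):
        robot_tips = robot_fk(q)  # (5, 3) fingertip positions
        return np.sum((robot_tips - human_tips) ** 2)

    result = minimize(objective, q0, method="L-BFGS-B")
    return result.x  # joint configuration to execute on the robot
```

In this reading, stages (a) and (b) reuse pre-trained models as-is, and only stage (c) involves embodiment-specific optimization, which is what lets the same generated demonstration drive different hands (e.g., an Allegro hand or a parallel-jaw gripper).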
Real-Robot Experiments: Allegro Hand
Real-Robot Experiments: Parallel-Jaw Gripper
Citation
TBA