RTAGrasp: Learning Task-Oriented Grasping from Human Videos via
Retrieval, Transfer, and Alignment
Wenlong Dong1,2, Dehao Huang1,2, Jiangshan Liu2, Chao Tang1,2, Hong Zhang1,2
[1] Shenzhen Key Laboratory of Robotics and Computer Vision, SUSTech, Shenzhen, China.
[2] Department of Electronic and Electrical Engineering, SUSTech, Shenzhen, China.
Abstract: Task-oriented grasping (TOG) is crucial for robots to accomplish manipulation tasks, and it requires determining both the TOG position and direction. Existing methods either rely on costly manual grasp annotations or extract only coarse grasping positions or regions from human demonstrations, limiting their practicality in real-world applications. To address these limitations, we introduce RTAGrasp, a Retrieval, Transfer, and Alignment framework inspired by human grasping strategies. Specifically, our approach first effortlessly constructs a robot memory from human grasping demonstration videos, extracting both TOG position and direction constraints. Then, given a task instruction and a visual observation of the target object, RTAGrasp retrieves the most similar human grasping experience from its memory and leverages the semantic matching capabilities of vision foundation models to transfer the TOG constraints to the target object in a training-free manner. Finally, RTAGrasp aligns the transferred TOG constraints with the robot's action for execution. Evaluations on the public TOG benchmark, the TaskGrasp dataset, show competitive performance in both seen and unseen object categories compared to existing baseline methods. Real-world experiments on a robotic arm further validate its effectiveness.
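As a rough illustration of the memory construction described above, the sketch below shows one plausible way to store a grasping experience extracted from a human demonstration video: the task instruction and its text embedding, a visual feature of the demonstrated object, and the extracted TOG position and direction constraints. All class and field names are our own assumptions for illustration, not the authors' released data structures.

# Hypothetical sketch of a "robot memory" built from human demonstration
# videos; all names and fields are illustrative assumptions.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class GraspExperience:
    task: str                   # e.g. "pour water using a mug"
    task_feature: np.ndarray    # text embedding of the task instruction
    image_feature: np.ndarray   # visual embedding of the demonstrated object
    tog_position: np.ndarray    # grasp contact point extracted from the video
    tog_direction: np.ndarray   # unit grasp/approach direction

@dataclass
class GraspMemory:
    entries: list = field(default_factory=list)

    def add(self, experience: GraspExperience) -> None:
        self.entries.append(experience)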
Robots learn TOG skills from human demonstration videos through Retrieval, Transfer, and Alignment.
We conducted extensive experiments on the Kinova Gen3 robotic arm.
Pipeline:
Overview: the pipeline first utilizes (a) a retrieval module to retrieve the optimal candidate experience from the memory. Next, it uses (b) a transfer module to transfer the retrieved TOG constraints to the target object to obtain the TOG position p_B and the TOG direction v_B. Finally, (c) an alignment module aligns the transferred TOG constraints to the robot's action.
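Below is a minimal sketch of how the three modules could fit together, building on the memory structure sketched above. The similarity weighting in retrieval, the semantic-correspondence function assumed to come from a vision foundation model, and the grasp-candidate interface used in alignment are all assumptions made for illustration, not the authors' implementation.

# Illustrative retrieve -> transfer -> align flow; a sketch, not the released code.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve(memory, task_feature, object_feature, w=0.5):
    # (a) Return the stored experience most similar to the current task and object.
    scores = [w * cosine(task_feature, e.task_feature)
              + (1.0 - w) * cosine(object_feature, e.image_feature)
              for e in memory.entries]
    return memory.entries[int(np.argmax(scores))]

def transfer(experience, target_image, correspond):
    # (b) Map the demonstrated grasp point onto the target object via a
    # semantic-correspondence function (assumed to be provided by a vision
    # foundation model); keep the demonstrated direction as v_B.
    p_B = correspond(experience.tog_position, target_image)
    v_B = experience.tog_direction
    return p_B, v_B

def align(p_B, v_B, grasp_candidates):
    # (c) Pick the candidate grasp whose position is closest to p_B and whose
    # approach direction agrees best with v_B.
    def cost(g):
        return np.linalg.norm(g.position - p_B) - cosine(g.approach, v_B)
    return min(grasp_candidates, key=cost)

The scoring in align is just one simple way to combine the position and direction constraints; in practice the candidate grasps would come from an off-the-shelf grasp sampler, and the selected grasp would then be executed by the arm.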
Qualitative Experiments:
Qualitative results of TOG. Each row is a visualization of the intermediate results for an object in an experimental scene.
Real-Robot Experiments (Grasping):
“Open the bottle.”
“Pour water using a mug.”
“Use a clip to hold the documents.”
“Use the dustpan to clean the desk.”
“Use a hammer to pound the nail.”
“Cook using a pan.”
“Peel the apple using a peeler.”
“Use a pizza cutter to slice the pizza.”
“Hand me the scissors.”
“Use the strainer to scoop out the fruits.”
Real-Robot Experiments (Manipulation):
“Use the brush to clean the coffee beans.”
“Use the hammer to pound the nail.”
“Stir the milk using the scoop.”
Citation:
@misc{dong2024rtagrasplearningtaskorientedgrasping,
title={RTAGrasp: Learning Task-Oriented Grasping from Human Videos via Retrieval, Transfer, and Alignment},
author={Wenlong Dong and Dehao Huang and Jiangshan Liu and Chao Tang and Hong Zhang},
year={2024},
eprint={2409.16033},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2409.16033},
}