FoundationGrasp: Generalizable Task-Oriented Grasping with Foundation Models

Abstract: Task-oriented grasping (TOG), which refers to the problem of synthesizing grasps on an object that are configurationally compatible with the downstream manipulation task, is the first milestone towards tool manipulation. Analogous to the activation of two brain regions responsible for semantic and geometric reasoning during cognitive processes, modeling the complex relationship between objects, tasks, and grasps requires rich semantic and geometric knowledge about objects and tasks. Existing methods typically limit this knowledge to a closed set and cannot generalize to novel objects and tasks outside the training set. To address this challenge, we propose FoundationGrasp, a foundation model-based TOG framework that leverages the open-ended knowledge from foundation models to learn generalizable TOG skills. Comprehensive experiments are conducted on the contributed Language and Vision Augmented TaskGrasp (LaViA-TaskGrasp) dataset, demonstrating the superiority of FoundationGrasp over existing methods when generalizing to novel object instances, object classes, and tasks outside the training set. Furthermore, the effectiveness of FoundationGrasp is validated in physical grasping and manipulation experiments on a 7-DoF robotic arm.

Video

Pipeline

(Left) An overview of the FoundationGrasp framework: The pipeline consists of (1) data generation, (2) multi-modal feature representation, and (3) task-oriented grasp evaluation. When presented with an object, a task, and a language instruction, FoundationGrasp first prompts an LLM to generate semantic and geometric descriptions of the object and the task, while a web-based image retrieval module crowdsources reference images from the Internet. Subsequently, an LLM-based semantic knowledge encoder and a VLM-based geometric knowledge encoder transform these descriptions and retrieved images, together with multi-modal sensory inputs, into feature representations. In the final stage, a Transformer-based task-oriented grasp evaluator with semantic and geometric branches evaluates the task compatibility of each grasp candidate.
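For readers who prefer code to prose, the snippet below is a minimal, runnable Python sketch of this three-stage data flow. Every function name, shape, and return value is a hypothetical placeholder (the stubs return canned text and random arrays); it only illustrates how the stages connect, not the released implementation.

```python
# Minimal sketch of the FoundationGrasp data flow (all names are hypothetical;
# the stubs stand in for LLM/VLM calls so the example runs on its own).
from typing import List
import numpy as np

def prompt_llm_descriptions(obj_name: str, task: str) -> dict:
    # Stage 1a: the real system prompts an LLM; here we return canned text.
    return {
        "semantic": f"A {obj_name} is commonly used to {task}.",
        "geometric": f"A {obj_name} has a handle and a functional head.",
    }

def retrieve_web_images(obj_name: str, num: int = 4) -> List[np.ndarray]:
    # Stage 1b: web-based image retrieval, stubbed with random images.
    return [np.random.rand(224, 224, 3) for _ in range(num)]

def encode_semantics(descriptions: dict) -> np.ndarray:
    # Stage 2a: LLM-based semantic knowledge encoder (stub embedding).
    return np.random.rand(768)

def encode_geometry(point_cloud: np.ndarray, images: List[np.ndarray]) -> np.ndarray:
    # Stage 2b: VLM-based geometric knowledge encoder (stub embedding).
    return np.random.rand(768)

def evaluate_grasps(grasps: np.ndarray, sem: np.ndarray, geo: np.ndarray) -> np.ndarray:
    # Stage 3: Transformer-based evaluator, stubbed with one random
    # task-compatibility score per grasp candidate.
    return np.random.rand(len(grasps))

if __name__ == "__main__":
    point_cloud = np.random.rand(2048, 3)        # observed object point cloud
    grasp_candidates = np.random.rand(20, 4, 4)  # 20 candidate 6-DoF grasp poses
    desc = prompt_llm_descriptions("spatula", "scoop")
    images = retrieve_web_images("spatula")
    sem = encode_semantics(desc)
    geo = encode_geometry(point_cloud, images)
    scores = evaluate_grasps(grasp_candidates, sem, geo)
    print("best grasp index:", int(scores.argmax()))
```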

(Right) The detailed structure of the task-oriented grasp evaluator: The evaluator is a customized Transformer consisting of a geometric branch (left) and a semantic branch (right).
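As a rough illustration of the two-branch design, here is a simplified PyTorch sketch: one Transformer encoder attends over geometric tokens, another over semantic tokens, and a small head fuses the two branch summaries into a task-compatibility score. Layer counts, feature dimensions, token construction, and the fusion head are assumptions made for illustration, not the paper's exact architecture.

```python
# Simplified two-branch grasp evaluator (illustrative only; dimensions,
# depths, and the late-fusion scheme are assumptions).
import torch
import torch.nn as nn

class TwoBranchGraspEvaluator(nn.Module):
    def __init__(self, d_model: int = 256, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        # Geometric branch: attends over grasp + object point-cloud tokens.
        self.geo_branch = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers
        )
        # Semantic branch: attends over task / object-description tokens.
        self.sem_branch = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers
        )
        # Fuse the two branch summaries into a task-compatibility score.
        self.head = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )

    def forward(self, geo_tokens: torch.Tensor, sem_tokens: torch.Tensor) -> torch.Tensor:
        geo = self.geo_branch(geo_tokens).mean(dim=1)   # (B, d_model)
        sem = self.sem_branch(sem_tokens).mean(dim=1)   # (B, d_model)
        return self.head(torch.cat([geo, sem], dim=-1)).squeeze(-1)  # (B,)

# Example: score a batch of 8 grasp candidates.
evaluator = TwoBranchGraspEvaluator()
geo_tokens = torch.randn(8, 64, 256)   # e.g., grasp-pose + point-cloud tokens
sem_tokens = torch.randn(8, 32, 256)   # e.g., language-description tokens
scores = torch.sigmoid(evaluator(geo_tokens, sem_tokens))
print(scores.shape)  # torch.Size([8])
```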

Perception Experiments (LaViA-TaskGrasp Dataset)

Qualitative results of perception experiments: FoundationGrasp, GraspGPT, and GCNGrasp are evaluated under the held-out instance, class, and task settings on the proposed LaViA-TaskGrasp dataset. Grasp poses are colored by their confidence scores (greener indicates higher confidence). FoundationGrasp predicts accurate task-oriented grasp poses with high confidence scores.

Real-Robot Experiments (Task-Oriented Grasping) 

"Fetch the book to handover"

"Wash the tub using this scrubber"

"Squeeze the yellow mustard out from the bottle"

"Help me pound a nail using a hammer"

"If you want to poke, hold the scissors"

"Perform drinking on the milk tea cup"

"Pour water from the mug"

"Clean the table with this brush"

"In order to make soup, boil water with the pot first"

"Show me how to wear hair claw"

"Dispense the soup with a ladle"

"Hold the screwdriver to remove screws"

"Drink with the Coke can"

"Can you hang the headphones to the rack please"

"To drink tea, pour from the teapot"

"Using the pliers for straightening"

"Grasp the spatula in a cooking-friendly way"

"Use the knife the gently spread the butter on the bread"

Real-Robot Experiments (Task-Oriented Manipulation)

"Scoop the coffee bean with a spoon"

"Could you add some ketchup to my dishes"

"Scoop coffee beans from that container"

"Pour from the mug to the pot "

"Remove stains from dishes with a scrub brush "

"Use this bottle brush to clean the mug"

"Handover the knife on the table to me"

"Use the hammer to pound the nail"

"Could you help me open the ketchup bottle"

Authors

Citation

TBA