FoundationGrasp: Generalizable Task-Oriented Grasping with Foundation Models

Abstract: Task-oriented grasping (TOG), which refers to the problem of synthesizing grasps on an object that are configurationally compatible with the downstream manipulation task, is the first milestone towards tool manipulation. Analogous to the activation of two brain regions responsible for semantic and geometric reasoning during cognitive processes, modeling the complex relationship between objects, tasks, and grasps requires rich semantic and geometric knowledge about objects and tasks. Existing methods typically limit this knowledge to a closed set and cannot generalize to novel objects and tasks outside the training set. To address this challenge, we propose FoundationGrasp, a foundation model-based TOG framework that leverages the open-ended knowledge from foundation models to learn generalizable TOG skills. Comprehensive experiments are conducted on the contributed Language and Vision Augmented TaskGrasp (LaViA-TaskGrasp) dataset, demonstrating the superiority of FoundationGrasp over existing methods when generalizing to novel object instances, object classes, and tasks outside the training set. Furthermore, the effectiveness of FoundationGrasp is validated in physical grasping and manipulation experiments on a 7-DoF robotic arm.

Video

Pipeline

(Left) An overview of the FoundationGrasp framework: The pipeline consists of (1) data generation, (2) multi-modal feature representation, and (3) task-oriented grasp evaluation. When presented with an object, a task, and a language instruction, FoundationGrasp first prompts an LLM to generate semantic and geometric descriptions of the object and the task, while a web-based image retrieval module crowdsources reference images from the Internet. Subsequently, an LLM-based semantic knowledge encoder and a VLM-based geometric knowledge encoder transform these descriptions and retrieved images, together with multi-modal sensory inputs, into feature representations. In the final stage, a Transformer-based task-oriented grasp evaluator with semantic and geometric branches evaluates the task compatibility of each grasp candidate.
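For readers who prefer code to prose, the snippet below is a minimal, runnable Python sketch of this three-stage data flow. Every function name, shape, and return value is a hypothetical placeholder (the stubs return canned text and random arrays); it only illustrates how the stages connect, not the released implementation.

```python
# Minimal sketch of the FoundationGrasp data flow (all names are hypothetical;
# the stubs stand in for LLM/VLM calls so the example runs on its own).
from typing import List
import numpy as np

def prompt_llm_descriptions(obj_name: str, task: str) -> dict:
    # Stage 1a: the real system prompts an LLM; here we return canned text.
    return {
        "semantic": f"A {obj_name} is commonly used to {task}.",
        "geometric": f"A {obj_name} has a handle and a functional head.",
    }

def retrieve_web_images(obj_name: str, num: int = 4) -> List[np.ndarray]:
    # Stage 1b: web-based image retrieval, stubbed with random images.
    return [np.random.rand(224, 224, 3) for _ in range(num)]

def encode_semantics(descriptions: dict) -> np.ndarray:
    # Stage 2a: LLM-based semantic knowledge encoder (stub embedding).
    return np.random.rand(768)

def encode_geometry(point_cloud: np.ndarray, images: List[np.ndarray]) -> np.ndarray:
    # Stage 2b: VLM-based geometric knowledge encoder (stub embedding).
    return np.random.rand(768)

def evaluate_grasps(grasps: np.ndarray, sem: np.ndarray, geo: np.ndarray) -> np.ndarray:
    # Stage 3: Transformer-based evaluator, stubbed with one random
    # task-compatibility score per grasp candidate.
    return np.random.rand(len(grasps))

if __name__ == "__main__":
    point_cloud = np.random.rand(2048, 3)        # observed object point cloud
    grasp_candidates = np.random.rand(20, 4, 4)  # 20 candidate 6-DoF grasp poses
    desc = prompt_llm_descriptions("spatula", "scoop")
    images = retrieve_web_images("spatula")
    sem = encode_semantics(desc)
    geo = encode_geometry(point_cloud, images)
    scores = evaluate_grasps(grasp_candidates, sem, geo)
    print("best grasp index:", int(scores.argmax()))
```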

(Right) The detailed structure of the task-oriented grasp evaluator: The evaluator is a customized Transformer consisting of a geometric branch (left) and a semantic branch (right).
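As a rough illustration of the two-branch design, here is a simplified PyTorch sketch: one Transformer encoder attends over geometric tokens, another over semantic tokens, and a small head fuses the two branch summaries into a task-compatibility score. Layer counts, feature dimensions, token construction, and the fusion head are assumptions made for illustration, not the paper's exact architecture.

```python
# Simplified two-branch grasp evaluator (illustrative only; dimensions,
# depths, and the late-fusion scheme are assumptions).
import torch
import torch.nn as nn

class TwoBranchGraspEvaluator(nn.Module):
    def __init__(self, d_model: int = 256, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        # Geometric branch: attends over grasp + object point-cloud tokens.
        self.geo_branch = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers
        )
        # Semantic branch: attends over task / object-description tokens.
        self.sem_branch = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers
        )
        # Fuse the two branch summaries into a task-compatibility score.
        self.head = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )

    def forward(self, geo_tokens: torch.Tensor, sem_tokens: torch.Tensor) -> torch.Tensor:
        geo = self.geo_branch(geo_tokens).mean(dim=1)   # (B, d_model)
        sem = self.sem_branch(sem_tokens).mean(dim=1)   # (B, d_model)
        return self.head(torch.cat([geo, sem], dim=-1)).squeeze(-1)  # (B,)

# Example: score a batch of 8 grasp candidates.
evaluator = TwoBranchGraspEvaluator()
geo_tokens = torch.randn(8, 64, 256)   # e.g., grasp-pose + point-cloud tokens
sem_tokens = torch.randn(8, 32, 256)   # e.g., language-description tokens
scores = torch.sigmoid(evaluator(geo_tokens, sem_tokens))
print(scores.shape)  # torch.Size([8])
```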

Perception Experiments (LaViA-TaskGrasp Dataset)

Qualitative results of perception experiments: FoundationGrasp, GraspGPT, and GCNGrasp are evaluated under the held-out instance, class, and task settings on the proposed LaViA-TaskGrasp dataset. Grasp poses are colored by their confidence scores (greener indicates higher confidence). FoundationGrasp predicts accurate task-oriented grasp poses with high confidence scores.

Real-Robot Experiments (Task-Oriented Grasping) 

"Fetch the book to handover"

"Wash the tub using this scrubber"

"Squeeze the yellow mustard out from the bottle"

"Help me pound a nail using a hammer"

"If you want to poke, hold the scissors"

"Perform drinking on the milk tea cup"

"Pour water from the mug"

"Clean the table with this brush"

"In order to make soup, boil water with the pot first"

"Show me how to wear hair claw"

"Dispense the soup with a ladle"

"Hold the screwdriver to remove screws"

"Drink with the Coke can"

"Can you hang the headphones to the rack please"

"To drink tea, pour from the teapot"

"Using the pliers for straightening"

"Grasp the spatula in a cooking-friendly way"

"Use the knife the gently spread the butter on the bread"

Real-Robot Experiments (Task-Oriented Manipulation)

"Scoop the coffee bean with a spoon"

"Could you add some ketchup to my dishes"

"Scoop coffee beans from that container"

"Pour from the mug to the pot "

"Remove stains from dishes with a scrub brush "

"Use this bottle brush to clean the mug"

"Handover the knife on the table to me"

"Use the hammer to pound the nail"

"Could you help me open the ketchup bottle"

Authors

Citation

TBA