Robotic grasping is a fundamental aspect of robot functionality, defining how robots interact with objects. Despite substantial progress, its generalizability to counter-intuitive or long-tailed scenarios, such as objects with uncommon materials or shapes, remains a challenge. In contrast, humans can easily apply their intuitive physics to grasp skillfully and change grasps efficiently, even for objects they have never seen before. This work infuses such physical commonsense reasoning into robotic manipulation. We introduce PhyGrasp, a multimodal large model that takes inputs from two modalities, natural language and 3D point clouds, integrated through a bridge module. The language modality provides robust reasoning about how diverse physical attributes affect grasping, while the 3D modality captures object shapes and parts. Combining these two capabilities, PhyGrasp accurately assesses the physical properties of object parts and determines optimal grasping positions and angles. Moreover, its language comprehension allows it to interpret human instructions and output grasping poses aligned with human preferences. To train PhyGrasp, we construct a dataset, PhyPartNet, of 195K object instances with varying physical properties, together with corresponding language descriptions of those properties and of human preferences. Extensive experiments in both simulation and on real robots demonstrate that PhyGrasp achieves state-of-the-art performance, particularly in long-tailed cases, e.g., about a 10% improvement over GraspNet.
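To make the two-modality design concrete, the sketch below illustrates one plausible way a bridge module could fuse language-token embeddings with per-point features from a 3D encoder and predict per-point grasp affinity. This is a minimal illustration under assumptions, not the authors' implementation: the class name `BridgeModule`, the cross-attention layout, the dimensions (`lang_dim`, `point_dim`, `hidden_dim`), and the affinity head are all hypothetical.

```python
# Hypothetical sketch of a language/point-cloud bridge module (assumed design,
# not PhyGrasp's actual architecture): point features attend to language tokens
# via cross-attention, and a small head scores each point for grasp affinity.
import torch
import torch.nn as nn


class BridgeModule(nn.Module):
    """Fuses language and point-cloud features (illustrative layout)."""

    def __init__(self, lang_dim=4096, point_dim=256, hidden_dim=256, num_heads=8):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, hidden_dim)    # project language tokens
        self.point_proj = nn.Linear(point_dim, hidden_dim)  # project per-point features
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.affinity_head = nn.Sequential(                 # per-point grasp score
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, lang_tokens, point_feats):
        # lang_tokens: (B, L, lang_dim) token embeddings of the physical description
        # point_feats: (B, N, point_dim) per-point features from a 3D encoder
        q = self.point_proj(point_feats)               # points query the language
        kv = self.lang_proj(lang_tokens)
        fused, _ = self.cross_attn(q, kv, kv)          # (B, N, hidden_dim)
        return self.affinity_head(fused).squeeze(-1)   # (B, N) grasp affinity


if __name__ == "__main__":
    bridge = BridgeModule()
    lang = torch.randn(2, 32, 4096)    # e.g., 32 language tokens per description
    pts = torch.randn(2, 1024, 256)    # e.g., 1024 points with 256-d features
    print(bridge(lang, pts).shape)     # torch.Size([2, 1024])
```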