Publication
Attribute-based Object Grounding and Robot Grasp Detection with Spatial Reasoning [arXiv]
Houjian Yu, Zheming Zhou, Min Sun, Omid Ghasemalizadeh, Yuyin Sun, Cheng-Hao Kuo, Arnie Sen, and Changhyun Choi
Abstract
Enabling robots to grasp objects specified by natural language is crucial for effective human-robot interaction, yet remains a significant challenge. Existing approaches often struggle with open-form language expressions and assume unambiguous target objects without duplicates. Moreover, they frequently depend on costly, dense pixel-wise annotations for both object grounding and grasp configuration. We present Attribute-based Object Grounding and Robotic Grasping (OGRG), a novel model that interprets open-form language expressions and performs spatial reasoning to ground targets and predict planar grasp poses, even in scenes with duplicated objects. We investigate OGRG in two settings: (1) Referring Grasp Synthesis (RGS) under pixel-wise full supervision, and (2) Referring Grasp Affordance (RGA) using weakly supervised learning that requires only single-pixel grasp annotations. Key contributions include a bi-directional vision-language fusion module and the integration of depth information for improved geometric reasoning, enhancing both grounding and grasping performance. Experimental results demonstrate superior performance over strong baselines in tabletop scenes with varied spatial language instructions. For RGS, OGRG operates at 17.59 FPS on a single NVIDIA RTX 2080 Ti GPU, enabling potential use in closed-loop or multi-object sequential grasping, while delivering higher grounding and grasp-prediction accuracy than all baseline methods. Under the weakly supervised RGA setting, the model likewise surpasses baseline grasp success rates in both simulation and real-robot trials, underscoring the effectiveness of its spatial-reasoning design.
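To make the bi-directional vision-language fusion idea concrete, below is a minimal PyTorch sketch in which flattened visual (RGB-D) feature tokens and language tokens cross-attend to each other in both directions. The class name, dimensions, and attention layout are illustrative assumptions, not the OGRG implementation described in the paper.

```python
# Hypothetical sketch of a bi-directional vision-language fusion block.
# Names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn


class BiDirectionalFusion(nn.Module):
    """Fuses visual and language tokens with cross-attention in both directions."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Visual tokens query language tokens, and language tokens query visual tokens.
        self.vis_attends_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lang_attends_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_vis = nn.LayerNorm(dim)
        self.norm_lang = nn.LayerNorm(dim)

    def forward(self, vis_tokens: torch.Tensor, lang_tokens: torch.Tensor):
        # vis_tokens:  (B, N_v, dim) flattened visual feature-map tokens
        # lang_tokens: (B, N_l, dim) encoded referring-expression tokens
        vis_upd, _ = self.vis_attends_lang(vis_tokens, lang_tokens, lang_tokens)
        lang_upd, _ = self.lang_attends_vis(lang_tokens, vis_tokens, vis_tokens)
        vis_out = self.norm_vis(vis_tokens + vis_upd)
        lang_out = self.norm_lang(lang_tokens + lang_upd)
        return vis_out, lang_out


# Example: fuse a 32x32 grid of visual tokens with a 20-token expression embedding.
fusion = BiDirectionalFusion(dim=256, num_heads=8)
vis = torch.randn(1, 32 * 32, 256)
lang = torch.randn(1, 20, 256)
vis_fused, lang_fused = fusion(vis, lang)
```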
The model is designed to solve the attribute-based object grounding and grasp detection task. The RGS subtask predicts grasp rectangles under pixel-wise full supervision, while the RGA subtask predicts grasp affordances from weak, single-pixel grasping supervision.
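As a rough illustration of the two output formats, the sketch below uses the common planar-grasp parameterization g = (x, y, theta, width) and shows how a single grasp can be read off dense per-pixel prediction maps (RGS-style), versus a single annotated grasp pixel (RGA-style weak label). The parameterization and map layout follow standard planar grasp-detection conventions and are assumptions here, not OGRG's exact output format.

```python
# Illustrative planar-grasp representations; conventions assumed, not OGRG-specific.
from dataclasses import dataclass

import numpy as np


@dataclass
class PlanarGrasp:
    x: float      # grasp center column (pixels)
    y: float      # grasp center row (pixels)
    theta: float  # in-plane gripper rotation (radians)
    width: float  # gripper opening width (pixels)


def grasp_from_dense_maps(quality: np.ndarray,
                          angle: np.ndarray,
                          width: np.ndarray) -> PlanarGrasp:
    """Pick the top-scoring grasp from dense per-pixel maps (RGS-style output)."""
    y, x = np.unravel_index(np.argmax(quality), quality.shape)
    return PlanarGrasp(float(x), float(y), float(angle[y, x]), float(width[y, x]))


# Example with random 224x224 prediction maps.
q = np.random.rand(224, 224)
a = np.random.uniform(-np.pi / 2, np.pi / 2, (224, 224))
w = np.random.uniform(0.0, 40.0, (224, 224))
print(grasp_from_dense_maps(q, a, w))

# RGA-style weak supervision: only one annotated grasp pixel per target,
# e.g. a (row, col) label marking a single successful grasp point.
single_pixel_label = (120, 184)
```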