Introductory Video (with audio)
Publication
Interactive Robotic Grasping with Attribute-Guided Disambiguation [arXiv]
Yang Yang, Xibai Lou, and Changhyun Choi
BibTeX
@inproceedings{yang2022interactive,
  title={Interactive Robotic Grasping with Attribute-Guided Disambiguation},
  author={Yang, Yang and Lou, Xibai and Choi, Changhyun},
  booktitle={2022 IEEE International Conference on Robotics and Automation (ICRA)},
  year={2022},
  organization={IEEE}
}
Abstract
Interactive robotic grasping using natural language is one of the most fundamental tasks in human-robot interaction. However, language can be a source of ambiguity, particularly when the visual scene contains similar objects or the command itself is underspecified. This paper investigates the use of object attributes in disambiguation and develops an interactive grasping system capable of effectively resolving ambiguities via dialogues. Our approach first predicts target scores and attribute scores through vision-and-language grounding. To handle ambiguous objects and commands, we propose an attribute-guided formulation of the partially observable Markov decision process (Attr-POMDP) for disambiguation. The Attr-POMDP utilizes the target and attribute scores as its observation model to calculate the expected return of an attribute-based question (e.g., "what is the color of the target, red or green?") or a pointing-based question (e.g., "do you mean this one?"). Our disambiguation module runs in real time on a real robot, and the interactive grasping system achieves a selection accuracy of 91.43% in real-robot experiments, outperforming several baselines by large margins.
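To make the planning step concrete, the sketch below illustrates the kind of one-step lookahead the Attr-POMDP performs for an attribute question: the belief over candidate objects is updated by Bayes' rule using the grounding module's attribute scores as the observation model, and the expected return of asking is compared against grasping immediately. This is a minimal, hypothetical sketch; the function names, reward values, and question cost are illustrative assumptions, not the paper's implementation.

import numpy as np

def grasp_value(belief, reward_correct=1.0, reward_wrong=-1.0):
    """Expected return of grasping the currently most likely candidate."""
    p = belief.max()
    return p * reward_correct + (1.0 - p) * reward_wrong

def attribute_question_value(belief, attr_probs, question_cost=-0.1):
    """Expected return of asking an attribute question (e.g., about color).

    attr_probs[i, v] ~ P(answer = v | target = object i), taken from the
    grounding module's attribute scores (illustrative values below).
    """
    value = question_cost
    for v in range(attr_probs.shape[1]):
        p_answer = belief @ attr_probs[:, v]              # P(answer = v)
        if p_answer < 1e-9:
            continue
        posterior = belief * attr_probs[:, v] / p_answer  # Bayes update
        value += p_answer * grasp_value(posterior)        # value after answer
    return value

# Example: three candidates, two color values (red, green).
belief = np.array([0.5, 0.3, 0.2])
color_probs = np.array([[0.9, 0.1],   # object 0 is likely red
                        [0.1, 0.9],   # object 1 is likely green
                        [0.8, 0.2]])  # object 2 is likely red
print(grasp_value(belief))                              # grasp now
print(attribute_question_value(belief, color_probs))    # ask about color first

In this toy example the color question is worth asking: a "green" answer would concentrate the belief on object 1, so the expected post-answer grasp value exceeds the value of grasping under the current ambiguous belief.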
Figure 1: Overview. If a user wants the green apple but gives an ambiguous command, the red apple and the green pear cause target matching to fail. In our system, the object grounding module grounds each candidate object (detected by the object detector) to predict its matching score with the query language and its attributes (color and location). Using the grounding results as the observation model, the attribute-guided POMDP planner calculates the expected return of each asking action a^q and grasping action a^g. The robot effectively resolves the ambiguity by asking attribute questions (about the color and location of the object) and pointing questions.
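The pointing-based question ("do you mean this one?") admits the same one-step treatment. The hypothetical sketch below values pointing at a single candidate under the same belief; as above, the rewards and question cost are illustrative assumptions rather than the paper's actual parameters.

import numpy as np

def pointing_question_value(belief, i, question_cost=-0.1,
                            reward_correct=1.0, reward_wrong=-1.0):
    """Expected return of asking 'do you mean this one?' about object i."""
    p_yes = belief[i]
    # If the user confirms, grasping object i succeeds.
    value_yes = reward_correct
    # If the user says no, renormalize the belief over the remaining
    # objects and grasp the new most likely candidate.
    rest = np.delete(belief, i)
    rest = rest / rest.sum()
    p = rest.max()
    value_no = p * reward_correct + (1.0 - p) * reward_wrong
    return question_cost + p_yes * value_yes + (1.0 - p_yes) * value_no

belief = np.array([0.5, 0.3, 0.2])
print(pointing_question_value(belief, i=0))  # point at the top candidate

Comparing the values of all asking actions against the immediate grasp value gives the planner's action choice at each dialogue turn: it points or asks about an attribute only when the expected information gain outweighs the cost of one more question.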