Visually Grounding Language Instruction for History-Dependent Manipulation


Hyemin Ahn*, Obin Kwon*, Kyungdo Kim, Jaeyeon Jeong, Howoong Jun, Hongjung Lee, Dongheui Lee, Songhwai Oh

* These authors contributed equally to this work.

Code+Data: [GitHub] / Paper: [arXiv] / Sup. Mat.: [PDF]

Abstract

This work emphasizes the importance of a robot's ability to refer to its task history, especially when it executes a series of pick-and-place manipulations by following language instructions given one by one. The advantage of referring to the manipulation history is twofold: (1) language instructions that omit details or use expressions referring to the past can be interpreted, and (2) visual information about objects occluded by previous manipulations can be inferred. To this end, we introduce a history-dependent manipulation task whose objective is to visually ground a series of language instructions for proper pick-and-place manipulations by referring to the past. We also provide a dataset and a baseline model for this task, and show that a network trained on the proposed dataset can be transferred to the real world using CycleGAN.

 Real Demonstration

VIDEO_FINAL_SUBMISSION.mp4

Dataset

Our history-dependent manipulation (HDM) task consists of several pick-and-place operations, each instructed in natural language written by humans. For each pick-and-place operation, our dataset provides synthetic RGB images of the workspace from two viewpoints, a set of human-written language instructions, bounding boxes, and heatmaps indicating the target object's position before and after the manipulation. We collected images reflecting this task scenario with our simulator, which is built on Blender and adapted from the Python code of the CLEVR dataset.

Each history-dependent manipulation task consists of 3 to 6 pick-and-place operations, and for each pick-and-place operation, one to four language instructions are collected from human subjects. In the language annotation phase, the tasks are evenly distributed among the participants so that each task is annotated by at least two people. Because each participant writes instructions in a different style, the resulting instruction dataset is challenging.

In total, the dataset comprises 300 scenarios of history-dependent manipulation tasks (250 for training and 50 for testing), 1339 x 2 images from the robot and human viewpoints covering 1339 pick-and-place operations, and 4642 language instructions. For more details, please refer to our GitHub repository and paper (see the links at the top).
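For illustration, below is a minimal Python sketch of how one pick-and-place sample could be organized in code. The class and field names (HDMSample, scenario_id, images, and so on) are our own assumptions chosen for readability, not the dataset's actual schema; please check the GitHub repository for the real file layout.

# Hypothetical container for one pick-and-place step of an HDM scenario.
# All names below are illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class HDMSample:
    scenario_id: int                       # which of the 300 scenarios this step belongs to
    step: int                              # index within the 3 to 6 pick-and-place operations
    images: Dict[str, str]                 # RGB image paths from the two viewpoints, e.g. {"robot": ..., "human": ...}
    instructions: List[str]                # 1 to 4 human language instructions for this step
    pick_bbox: Tuple[int, int, int, int]   # bounding box of the target object before the manipulation
    place_bbox: Tuple[int, int, int, int]  # bounding box of the target position after the manipulation
    pick_heatmap_path: str                 # heatmap marking where to pick
    place_heatmap_path: str                # heatmap marking where to place

Under this layout, a full scenario would simply be an ordered list of 3 to 6 such samples, which is the history the model can refer back to.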

Network

The figure above shows the structure of the proposed model. It consists of (1) an hourglass network that encodes the image feature and decodes all information into the desired heatmaps (upper blue zone), (2) a bidirectional LSTM that encodes the language feature (lower left yellow zone), and (3) a bidirectional LSTM that encodes the history feature (lower right green zone). For more details, please refer to our paper.
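As a rough sketch of this three-part composition, the PyTorch code below combines a simple encoder-decoder image branch with two bidirectional LSTMs. The layer sizes, the depth of the hourglass-style branch, and the way the three features are fused are assumptions made for illustration only, not the paper's exact architecture.

# Minimal PyTorch sketch of the three-part composition described above.
# Layer sizes, hourglass depth, and the feature fusion are illustrative
# assumptions, not the architecture from the paper.
import torch
import torch.nn as nn

class HistoryDependentGrounder(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=256, feat_ch=64):
        super().__init__()
        # (1) Hourglass-style image encoder/decoder that outputs heatmaps.
        self.img_encoder = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.img_decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_ch + 2 * hidden_dim, feat_ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(feat_ch, 2, 4, stride=2, padding=1),  # pick & place heatmaps
        )
        # (2) Bidirectional LSTM encoding the language instruction.
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lang_lstm = nn.LSTM(embed_dim, hidden_dim // 2, bidirectional=True, batch_first=True)
        # (3) Bidirectional LSTM encoding features from previous manipulations.
        self.hist_lstm = nn.LSTM(hidden_dim, hidden_dim // 2, bidirectional=True, batch_first=True)

    def forward(self, image, tokens, history):
        # image: (B, 3, H, W), tokens: (B, T) word indices, history: (B, K, hidden_dim)
        img_feat = self.img_encoder(image)                        # (B, C, H/4, W/4)
        _, (lang_h, _) = self.lang_lstm(self.word_embed(tokens))  # final hidden states
        lang_feat = lang_h.transpose(0, 1).reshape(image.size(0), -1)  # (B, hidden_dim)
        _, (hist_h, _) = self.hist_lstm(history)
        hist_feat = hist_h.transpose(0, 1).reshape(image.size(0), -1)  # (B, hidden_dim)
        # Broadcast the language and history features over the spatial grid and fuse.
        ctx = torch.cat([lang_feat, hist_feat], dim=1)
        ctx = ctx[:, :, None, None].expand(-1, -1, img_feat.size(2), img_feat.size(3))
        fused = torch.cat([img_feat, ctx], dim=1)
        return self.img_decoder(fused)                            # (B, 2, H, W) heatmaps

In this sketch the language and history features are simply broadcast over the spatial grid and concatenated with the image feature before decoding; the paper describes the actual fusion scheme used by the proposed network.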

Citation

Please cite our paper with the BibTeX entry below.

@inproceedings{HDM:2022,
  title     = {Visually Grounding Language Instruction for History-Dependent Manipulation},
  author    = {Ahn, Hyemin and Kwon, Obin and Kim, Kyungdo and Jeong, Jaeyeon and Jun, Howoong and Lee, Hongjung and Lee, Dongheui and Oh, Songhwai},
  booktitle = {2022 IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2022},
  month     = {May},
  address   = {Philadelphia, PA, USA}
}