HandMeThat: Human-Robot Communication in Physical and Social Environments

Yanming Wan*, Jiayuan Mao*, and Joshua B. Tenenbaum

[Paper] [GitHub] [SlidesLive] [Poster] (* indicates equal contributions.)


We introduce HandMeThat, a benchmark for the holistic evaluation of instruction understanding and following in physical and social contexts. While previous datasets primarily focus on language grounding and planning, HandMeThat considers the resolution of ambiguous human instructions based on the physical (object states and relations) and social (human intentions) context.

Figure 1: An example HandMeThat task, rendered in images. A robot agent observes a sequence of actions performed by the human (steps 1-3) and receives a quest (step 4). The robot needs to interpret the natural language quest based on the physical and social context, and select the relevant object from the environment: in this case, the knife on the table.

HandMeThat contains a collection of 10,000 episodes of human-robot interaction. In each episode, the robot first observes a trajectory of human actions directed toward the human's internal goal. Next, the robot receives a human instruction and must take actions to accomplish the subgoal specified by the instruction.

We present a textual interface for our benchmark, in which the robot interacts with a physically grounded virtual environment through textual commands. We evaluate several baseline models on HandMeThat and show that both offline and online reinforcement learning algorithms perform poorly, suggesting significant room for future work on physical and social human-robot communication and interaction.
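As Figure 2 details, an episode's final score rewards satisfying the human's quest while penalizing the robot's action costs. Below is a toy sketch of this scoring scheme; the function name and the reward and cost constants are illustrative placeholders, not the benchmark's actual values.

```python
def evaluate_episode(quest_satisfied: bool, num_robot_actions: int,
                     success_reward: float = 10.0, action_cost: float = 0.1) -> float:
    """Final score: a fixed reward on quest success, minus a per-action cost.

    The constants here are placeholders; the benchmark defines its own values.
    """
    reward = success_reward if quest_satisfied else 0.0
    return reward - action_cost * num_robot_actions

# An efficient, successful robot scores higher than a wasteful or failing one.
good = evaluate_episode(quest_satisfied=True, num_robot_actions=5)
bad = evaluate_episode(quest_satisfied=False, num_robot_actions=20)
```

This shape of metric encourages agents both to resolve the quest correctly and to do so with as few actions as possible.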


Here we present two pipelines that introduce HandMeThat: one describes what an episode consists of, and the other explains how a piece of data is generated.

Figure 2: (a) A pipeline for HandMeThat task formulation. Stage 1: The human takes T steps from the initial state s_0 toward a goal g. At state s_T, she generates a quest m for the robot and utters it as u. Stage 2: The robot perceives and acts in the world, following the human's instruction. Evaluation: when the robot stops, we check whether the human's quest has been satisfied and count the robot's action costs to compute a final score.

(b) A pipeline for HandMeThat data generation. We first sample an initial state s_0 and the human's goal g, then solve for a plan for the human to execute. At a randomly sampled step T, the human generates a quest, consisting of both the internal quest m and the utterance u.
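The data-generation pipeline in Figure 2(b) can be sketched as the following schematic Python. Every sampling and planning function here is a toy stand-in (the real generators use a planner over a physically grounded environment), so the specific states, goals, and action strings are purely illustrative.

```python
import random

def sample_initial_state():
    # Toy stand-in: map each object to its location.
    return {"knife": "table", "apple": "fridge"}

def sample_goal(state):
    # Toy stand-in: the human wants the apple on the table.
    return ("move", "apple", "table")

def solve_plan(state, goal):
    # The real generator runs a planner; here we return a fixed trajectory.
    return ["walk-to fridge", "open fridge", "take apple", "walk-to table"]

def generate_quest(state, goal, step):
    m = ("bring", "knife")       # internal quest m
    u = "Give me that thing."    # (possibly ambiguous) utterance u
    return m, u

def generate_episode(seed=0):
    rng = random.Random(seed)
    s0 = sample_initial_state()
    g = sample_goal(s0)
    plan = solve_plan(s0, g)
    T = rng.randrange(1, len(plan) + 1)   # randomly sampled interruption step T
    m, u = generate_quest(s0, g, T)
    return {"s0": s0, "goal": g, "human_actions": plan[:T],
            "quest": m, "utterance": u}

episode = generate_episode(seed=42)
```

The key structural point is that the robot only ever observes the truncated human trajectory `plan[:T]` and the utterance `u`; the internal quest `m` and goal `g` stay hidden and are used for evaluation.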


The code snippet below sets up an episode from HandMeThat.
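Since the environment is textual, an episode follows a read-observation, issue-command loop. The mock below illustrates that loop only; the class and method names are hypothetical, and the actual environment API is defined in the GitHub repo.

```python
class MockHandMeThatEnv:
    """Illustrative mock of a textual episode; not the actual HandMeThat API."""

    def reset(self):
        # Initial textual observation: the observed human trajectory,
        # followed by the human's utterance.
        return ("The human walks to the fridge, opens it, and takes an apple. "
                "The human says: 'Give me that thing.'")

    def step(self, command: str):
        # Execute a textual command; return (observation, reward, done).
        if command == "give human the knife on the table":
            return "The human takes the knife. Quest satisfied.", 10.0, True
        return "Nothing happens.", -0.1, False

env = MockHandMeThatEnv()
obs = env.reset()
obs, reward, done = env.step("give human the knife on the table")
```

An agent succeeds only by picking the command that resolves the ambiguous utterance against the observed human trajectory.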

See the Examples page for demonstrations of how an agent plays the game.


We release the code used to set up the environment for training and evaluating agents in our GitHub repo.

To download the dataset, please visit the Google Drive page.