Abstract
Understanding robot behaviors and experiences through natural language is crucial for developing intelligent and transparent robotic systems. Recent advances in large language models (LLMs) make it possible to translate complex, multi-modal robotic experiences into coherent, human-readable narratives. However, grounding real-world robot experiences in natural language is challenging for several reasons, including the multi-modal nature of the data, differing sample rates, and sheer data volume. We introduce RONAR, an LLM-based system that generates natural language narrations from robot experiences, aiding in behavior announcement, failure analysis, and human interaction for failure recovery. Evaluated across various scenarios, RONAR outperforms state-of-the-art methods and improves failure recovery efficiency. Our contributions include a multi-modal framework for robot experience narration, a comprehensive real-robot dataset, and empirical evidence of RONAR's effectiveness in enhancing user experience in system transparency and failure analysis.
Real-World Natural Language Grounding of Robot Experiences
RONAR: Our framework for real-world robot narration. It takes in four categories of dynamic inputs and one static input: multimodal environmental observations (E), robot internal states (I), task planner (TP), and specified conditions (C), along with robot specifications (SP). RONAR then uses its LLM-based narration engine to process these inputs and generate narrations based on the specified narration mode. The generated narration can be used to address downstream narration-related tasks.
The RoboNar dataset. It includes four daily housekeeping tasks with real failure cases, containing ground-truth failure explanations and recovery descriptions labeled by human experts.
Environment (E): Sensory data used to observe the external world, such as RGB images, point clouds, audio, tactile feedback, etc.
Internal (I): Sensory data related to the internal state of the robot, including internal sensors, joint angles, base velocity, battery levels, and other diagnostic information.
Task Planning (TP): High-level planning data that contains overall task objectives, sub-task sequences, execution history, and plan outcomes.
Robot Narration (RONAR) Framework
RONAR: Our framework for real-world robot narration. It has three parts: key event selection, experience summarization, and narration generation. It takes in the raw multi-modal robot data stream and outputs text describing the robot's past experiences, current observations, and future plans.
Multi-Modal Key Event Selection
Processing multi-modal sensory inputs from real robot systems is challenging. During execution, massive amounts of data stream at high rates from the robot's different sensors, producing data that is too repetitive and dense for users to interpret. Furthermore, the sampling rates of different robot sensors can vary significantly, making it difficult to align information so it can be processed together. These challenges necessitate procedures for aligning the data and sampling the valuable information from it. We call this multi-modal key event selection.
Multi-Sensory Data Alignment. We sample and align the dense, mixed-rate data by dividing the temporal extent of the data at a single sample rate $s$ into a sequence of frames $f_0, f_1, \ldots, f_n$. To each frame we add the robot information from each considered modality, using the reading whose timestamp is closest to the frame timestamp. The result of this procedure is a sequence of frames separated by a fixed interval that captures information across robot sensors with mixed, high-frequency sampling rates.
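As a concrete illustration, the following is a minimal sketch of this alignment step, assuming each sensor stream is a time-sorted list of (timestamp, reading) pairs; the function name, data layout, and sample-rate handling are illustrative rather than RONAR's actual interfaces.

```python
import bisect

def align_streams(streams, sample_rate_hz, t_start, t_end):
    """Resample heterogeneous sensor streams onto one fixed-interval timeline.

    streams: dict mapping a sensor name to a time-sorted list of
    (timestamp, reading) pairs. Each output frame holds, per sensor, the
    reading whose timestamp is closest to the frame timestamp.
    """
    dt = 1.0 / sample_rate_hz
    n_frames = int((t_end - t_start) / dt) + 1
    # Pre-extract timestamps once per sensor for fast nearest-neighbor lookup.
    times = {name: [s[0] for s in samples] for name, samples in streams.items()}

    frames = []
    for i in range(n_frames):
        t = t_start + i * dt
        frame = {"timestamp": t}
        for name, samples in streams.items():
            idx = bisect.bisect_left(times[name], t)
            # Choose whichever neighbor (before or after) is closest in time.
            candidates = [j for j in (idx - 1, idx) if 0 <= j < len(samples)]
            j = min(candidates, key=lambda k: abs(samples[k][0] - t))
            frame[name] = samples[j][1]
        frames.append(frame)
    return frames
```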
Key Event Selection with Multi-Modal Inputs. Key events are selected from the aligned data by heuristically monitoring for notable changes across the different data categories. For environmental data, we compute the optical flow of the RGB images to capture the motion dynamics within the scene, using the running sum of average flow magnitudes as the heuristic for deciding when the perceived scene has changed enough to warrant a key event. For the robot's internal state, changes in the joint states serve as the heuristic for detecting robot motion sufficient for a key event. We also add a key event whenever the task planner changes state, reasoning that state transitions indicate notable events.
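A minimal sketch of the optical-flow heuristic is shown below, assuming a list of grayscale frames and an illustrative accumulation threshold; the joint-state and planner-state triggers would follow the same accumulate-and-threshold or change-detection pattern.

```python
import cv2
import numpy as np

def flow_key_events(gray_frames, flow_threshold=25.0):
    """Mark a key event whenever the running sum of mean optical-flow
    magnitude exceeds a threshold, then reset the accumulator.

    gray_frames: list of single-channel uint8 images from the RGB stream.
    flow_threshold: illustrative value; tune per camera and scene.
    """
    events, running_sum = [], 0.0
    for i in range(1, len(gray_frames)):
        flow = cv2.calcOpticalFlowFarneback(
            gray_frames[i - 1], gray_frames[i], None,
            0.5, 3, 15, 3, 5, 1.2, 0)  # default Farneback parameters
        running_sum += float(np.linalg.norm(flow, axis=2).mean())
        if running_sum >= flow_threshold:
            events.append(i)       # frame i becomes a key event
            running_sum = 0.0      # reset so the next event needs fresh motion
    return events
```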
Experience Summarization
With selected key events, the next step is to ground raw robotic data into experience summaries in natural language. Based on the categories of the robot data, the experience summary is also composed of three components: 1) environment summary, 2) internal summary, and 3) planning summary.
Environment Summary. We use YOLO World to perform open-world object detection on the corresponding RGB images. Each detected object is represented by its bounding box coordinates and a unique object id, forming a detected-object set. Our system then leverages depth information to filter out irrelevant objects based on a distance criterion; the remaining objects form the object set for the scene. We define a spatial relation set over the objects, {left of, right of, above, below, in front of, behind}. From the object set and these spatial relations, we build a scene graph.
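The sketch below illustrates this scene-graph construction under simplifying assumptions: detections are already available as bounding boxes with labels, ids, and a median depth per box, and relations are derived from box centers and depth differences. The field names, depth cutoff, and depth margin are illustrative, not RONAR's exact criteria.

```python
def build_scene_graph(detections, max_depth_m=2.5):
    """Build (subject, relation, object) triples from per-image detections.

    detections: list of dicts with keys "id", "label", "bbox" (x1, y1, x2, y2
    in image coordinates), and "depth" (median depth in meters inside the box).
    Objects farther than max_depth_m are treated as irrelevant and dropped.
    """
    objs = [d for d in detections if d["depth"] <= max_depth_m]
    triples = []
    for a in objs:
        for b in objs:
            if a["id"] == b["id"]:
                continue
            ax = (a["bbox"][0] + a["bbox"][2]) / 2
            ay = (a["bbox"][1] + a["bbox"][3]) / 2
            bx = (b["bbox"][0] + b["bbox"][2]) / 2
            by = (b["bbox"][1] + b["bbox"][3]) / 2
            # Pick the dominant image-plane relation, then override with depth.
            if abs(ax - bx) >= abs(ay - by):
                rel = "left of" if ax < bx else "right of"
            else:
                rel = "above" if ay < by else "below"
            if a["depth"] + 0.3 < b["depth"]:
                rel = "in front of"
            elif a["depth"] > b["depth"] + 0.3:
                rel = "behind"
            triples.append((f'{a["label"]}_{a["id"]}', rel, f'{b["label"]}_{b["id"]}'))
    return triples
```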
Internal Summary. The goal of the internal summary is to ground the numerical values of part states (e.g., base states, joint states) into natural language based on the robot's configuration. Each part of the robot in the configuration has three components: a part description, part limits, and a part type. The robot's configuration is specified as part of the system prompt for internal summary generation.
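In RONAR the configuration itself is handed to the LLM in the system prompt, but the hypothetical snippet below illustrates the kind of mapping the internal summary performs: a part entry with a description, limits, and type, and a coarse grounding of a raw joint reading into a phrase. The schema and thresholds are assumptions, not RONAR's actual format.

```python
# Hypothetical configuration entry; the actual RONAR schema may differ.
ROBOT_CONFIG = {
    "joint_lift": {
        "description": "vertical lift that raises and lowers the arm",
        "limit": (0.0, 1.1),   # travel range in meters
        "type": "prismatic",
    },
}

def ground_part_state(part_name, value, config=ROBOT_CONFIG):
    """Map a raw joint reading to a coarse natural-language phrase."""
    part = config[part_name]
    low, high = part["limit"]
    fraction = (value - low) / (high - low)
    if fraction < 0.1:
        level = "fully lowered"
    elif fraction > 0.9:
        level = "raised high"
    else:
        level = f"at about {round(fraction * 100)}% of its range"
    return f"The {part['description']} is {level}."

print(ground_part_state("joint_lift", 1.05))
# -> "The vertical lift that raises and lowers the arm is raised high."
```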
Planning Summary. A planning summary is generated to capture plan-level status. It summarizes the high-level plan of the task execution: the overall task and its description, the sequential order of sub-goals, the current sub-goal, and, unlike prior methods, a history of sub-goal executions and their outcomes.
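For concreteness, the planning summary for the cup task might carry content along these lines; the field names and sub-goal labels are illustrative, not RONAR's exact schema.

```python
# Illustrative planning-summary payload; field names are assumptions.
planning_summary = {
    "task": "Put the dirty cup in the kitchen sink.",
    "sub_goals": ["navigate_to_table", "detect_cup", "pick_cup",
                  "navigate_to_sink", "detect_sink", "place_cup"],
    "current_sub_goal": "place_cup",
    "history": [
        {"sub_goal": "navigate_to_table", "outcome": "success"},
        {"sub_goal": "detect_cup", "outcome": "success"},
        {"sub_goal": "pick_cup", "outcome": "success"},
        {"sub_goal": "navigate_to_sink", "outcome": "success"},
        {"sub_goal": "detect_sink", "outcome": "success"},
        {"sub_goal": "place_cup", "outcome": "failed: gripper released early"},
    ],
}
```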
Narration Generation
The experience summaries ground a robot's environmental observations, internal status, and task planning during task execution into natural language in great detail. However, not all of these details help a human understand and react to the robot. We therefore need to abstract the information and narrate only what users care about. Below we describe how narrations are generated from the experience summary and the narration history.
Narration Mode. The requirements for narrations can vary significantly depending on the robot’s use cases and the user’s level of expertise. To meet general narration needs, we have defined three narration modes:
Alert Mode: narrates only the important information that requires the user's attention.
Info Mode: narrates robot experiences in multiple sentences, providing a concise summary of the robot's observations, internal status, and planning without any numerical values or part names.
Debug Mode: narrates all details of the robot's environmental observations, internal status, and planning, including numerical values and attribute names.
Progressive Narration Generation. We consider a generated narration to be good if it is non-repetitive and smooth. Non-repetition means the narration should not re-narrate behaviors that have already been narrated to the user. Smoothness means the transitions between narrations should be natural and seamless. To create quality narrations, our method generates them progressively. Consider a robot narration history containing all narration instances up to key event $t-1$, $N_{t-1} = \{n_0, n_1, \ldots, n_{t-1}\}$. When a new key event $k_t$ is detected and an experience summary $s_t$ has been generated, we input both $N_{t-1}$ and $s_t$, together with a specified mode $m$, to an LLM to generate the narration at time $t$: $n_t = \mathrm{LLM}(N_{t-1}, s_t \mid m)$.
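A minimal sketch of this progressive generation step is shown below, using the OpenAI chat completions API as a stand-in for the LLM; the model choice, prompt wording, and mode instructions are assumptions rather than RONAR's actual prompts.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MODE_INSTRUCTIONS = {
    "alert": "Narrate only information that requires the user's attention.",
    "info": "Give a concise, plain-language summary without numerical values or part names.",
    "debug": "Narrate all details, including numerical values and attribute names.",
}

def narrate(narration_history, experience_summary, mode="info"):
    """Generate n_t = LLM(N_{t-1}, s_t | m): condition on past narrations so
    the new narration avoids repetition and transitions smoothly."""
    messages = [
        {"role": "system", "content": "You narrate a robot's experience. "
            + MODE_INSTRUCTIONS[mode]
            + " Do not repeat behaviors already narrated; keep transitions smooth."},
        {"role": "user", "content":
            "Narration history:\n" + "\n".join(narration_history)
            + "\n\nNew experience summary:\n" + experience_summary
            + "\n\nWrite the next narration."},
    ]
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content
```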
RoboNar Dataset
We collect a real-world dataset using a Stretch SE3 robot in a home environment. We created four real-world housekeeping tasks to capture a wide range of use cases in a real-world home environment: 1) pick up a dirty cup and put it in the sink, 2) microwave lunch, 3) hang a hat, and 4) collect dirty clothes. We collected RGB-D observations from two cameras (an Intel RealSense D435i and a D405), joint readings, base readings, state information, and diagnostics. We also save the processed data, which includes downsampled, aligned keyframes. For each demonstration, human experts create ground-truth labels for failure timestamps, failure reasons, and recovery instructions. The dataset contains 70 demonstrations and 76 failure cases across navigation, manipulation, and detection.
Put Cup in Sink (P)
The task is performed in a connected kitchen and lounge environment with a dirty cup in the lounge and a sink in the kitchen. The specific sequence of states executed by the robot is as follows: the robot navigates to a table, looks for the cup, picks up the cup from the table, navigates to the sink, looks for the sink, then places the cup in the sink.
Microwave Lunch (M)
The task is performed in a kitchen environment with a microwave and food near the microwave. The specific sequence of states executed by the robot is as follows: the robot navigates to the microwave, looks for the microwave, opens the microwave door, navigates to the food, looks for the food, picks up the food, navigates to the microwave, looks for the microwave, places the food inside the microwave, then closes the microwave door.
Hang Hat (H)
The task is performed in a lounge environment with a human wearing a hat and a hook. The specific sequence of states executed by the robot is as follows: the robot navigates to the human, is handed a hat from the human, navigates to the hook, looks for the hook, then hangs the hat on the hook.
Collect Dirty Clothes (D)
The task is performed in a lounge environment with a laundry basket and clothes arranged around the room. The specific sequence of states executed by the robot is as follows: the robot navigates to the clothes, looks around for clothes, classifies dirty clothes, picks up dirty clothes, navigates to the laundry basket, looks for the laundry basket, then places the clothes in the laundry basket.
Failure Examples with Human Labels
Experiments and Results
Failure Analysis with Experience Summaries
This experiment aims to demonstrate that RONAR’s generated experience summaries can significantly enhance the failure analysis capabilities of robot systems. We break down the failure analysis problem into four sub-tasks:
Risk Estimation / Failure Prediction (Pred): given the previous key events, the percentage of predicted failures that match the actual failure at the failure key event.
Failure Localization (Loc): given all key events, the percentage of predicted failure times that align with the ground-truth failure time (see the scoring sketch after this list).
Failure Explanation (Exp): given the previous key events and the current key event (when the failure happened), the percentage of generated failure explanations that align with the ground-truth failure explanation.
Recovery Recommendation (Rec): given the previous key events and the current key event (when the failure happened), the percentage of recovery recommendations that align with the ground-truth recovery recommendation.
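As referenced in the list above, a minimal scoring sketch for the localization metric might look like the following, assuming each demo yields one predicted and one labeled failure key-event index and allowing a small tolerance window; the window size and matching rule are assumptions, not necessarily the paper's protocol.

```python
def localization_accuracy(predicted_idx, ground_truth_idx, tolerance=1):
    """Fraction of demos whose predicted failure key event falls within
    `tolerance` key events of the labeled failure (tolerance is illustrative)."""
    hits = sum(abs(p - g) <= tolerance
               for p, g in zip(predicted_idx, ground_truth_idx))
    return hits / len(ground_truth_idx)
```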
These sub-tasks cover most scenarios robot systems face during operation. The methods used for comparison are:
BLIP2: uses BLIP2 to generate a caption for the RGB image of the key event.
REFLECT: the current state-of-the-art LLM-based failure explanation framework.
TEM-LLM: sends all raw sensory and planning data directly to the LLM for failure analysis.
TEM-VLM: sends all raw sensory and planning data directly to the VLM for failure analysis. We use GPT-4o as the VLM.
RONAR-vision only: our method without internal and planning inputs.
RONAR-no prior: our method using only the current key event for failure analysis.
Narration Quality Evaluation (User Study 1)
Participants rate the generated narration snippets on naturalness, informativeness, coherence, and overall quality on a 5-point scale.
Naturalness: Does the narration feel natural and human-like? (1-5)
Informativeness: Does the narration provide useful information about the robot’s behavior? (1-5)
Coherence: Does the narration organize information logically and clearly? (1-5)
Overall: What is your overall assessment of the narration’s quality? (1-5)
Failure Identification by Human with Narration (User Study 2)
This user study evaluates the effectiveness and efficiency of narrations for failure identification. The key question we want to answer is how effectively and efficiently narrations can help users identify failures. We design two tasks for the failure identification problem:
Failure time identification: How accurately and efficiently can participants identify the time of failure in a demo?
Failure explanation: How accurately and efficiently can participants provide a reasonable explanation of the failure?
In this study, we design four interfaces that participants use to identify the failure time and the failure reason. The four failure identification interfaces are:
RAW-VID: a traditional video player interface which only shows the raw video captured by the robot camera.
RAW-ALL: a video player interface which displays the raw video and all raw sensor readings (both joint and base) from the robot. The sensor readings are visualized by line plots and synchronized with the raw video.
KEYFRAME: the RONAR-UI interface without narration. It includes the raw video, selected keyframes, and state information.
RONAR-UI: the RONAR-UI interface with full functionality. It includes narrations in both alert mode and info mode generated by RONAR.
Interfaces Used for Failure Identification
Examples and Applications
Narration Examples Generated by RONAR Using Different Modes
Alert mode: narrates only the important information that requires the user's attention.
Info mode: narrates robot experiences in multiple sentences, providing a concise summary of the robot's observations, internal status, and planning without any numerical values or part names.
Debug mode: narrates all details of the robot's environmental observations, internal status, and planning, including numerical values and attribute names.
Narration Examples Generated by Different Methods
BLIP: a kitchen sink filled with lots of clutter
REFLECT: The robot is positioned such that its camera faces both a cup and a sink, indicating it's likely preparing to handle the cup. Based on the description, the robot is also to the right of a bowl, which is further left in relation to the cup and sink. The primary goal seems to involve the robot using teleoperation to possibly place the cup into or near the sink, leveraging its visual observations to navigate the space efficiently.
LLM-TEM: The robot successfully located and picked up the cup, navigating to the sink afterwards. It has successfully navigated to the table, picked up the cup, and moved towards the sink. However, its initial attempt to place the cup in the sink failed. Consequently, the robot is now in a teleoperation mode, seeking assistance to complete the task of placing the cup in the sink properly.
VLM-TEM: The robot successfully navigated to the cup, identified it, and picked it up. It then moved to the sink, located it, but failed at the initial attempt to place the cup. The robot is currently in a teleoperate mode to resolve the placement issue, possibly requiring user intervention for successful completion.
RONAR: The robot is currently in the process of placing the cup in the sink. Facing difficulties in placing the cup, it is now in teleoperation mode to finish the task. The robot’s arm is extended and positioned, while all other components are stable and almost stationary. The robot plans to adjust its arm positioning under user guidance to accurately place the cup into the sink.
BLIP: a microwave sitting on top of a kitchen counter
REFLECT: The robot, equipped with a camera, is positioned such that it is above cabinet_0 and to the right of microwave_0, and also above cabinet_1. From this camera angle, it observes that cabinet_0 is situated directly below the microwave_0 and to the right of another cabinet, cabinet_1, which is also below the microwave. The robot's immediate goal is to open microwave_0.
LLM-TEM: The robot has successfully navigated and positioned itself to open the microwave, with its camera oriented to accurately view the microwave above a cabinet. Currently, the robot's joints are actively engaged, particularly its gripper fingers, as it performs the task of opening the microwave door. It has completed the initial stages of its task sequence as reflected in the state history, advancing through the start, navigating, and detecting stages to reach the microwave successfully. The robot efficiently maintains its orientation and velocity as it operates within the environment.
VLM-TEM: The robot has successfully navigated to the microwave and opened it, as indicated by the successful completion of the 'NAVIGATE_TO_OPEN_MICROWAVE' and 'DETECT_TO_OPEN_MICROWAVE' stages. The robot's joints, particularly the grippers and wrists, are positioned to interact with the microwave, which is consistent with its task of opening the microwave. The odometer readings indicate precise positioning near the microwave, suited for further interactions like placing food inside. The visual observation confirms the robot's gripper is in close proximity to the now-opened microwave, ready for the next steps in its task.
RONAR: The robot is currently in the process of opening the microwave. Observing its environment, there are no other objects detected nearby, ensuring a clear path. The robot's arm is mostly extended and raised high, poised for the next step, while the grip fingers are halfway open and slowly closing. The robot has successfully navigated to the microwave and detected its handle or opening mechanism. The next step is to manipulate the microwave door to open it.
Narration examples generated by RONAR for Failure Identification
Trajectory Summarization and Customized Trajectory Retrieval
The narration generated by RONAR is event-level: each narration corresponds to a single key event. It captures only a snapshot of the process and cannot show an overview of the trajectory. Therefore, we create a higher-level summarization, the trajectory summarization, which RONAR generates from the narration history and which captures the details of the trajectory in a human-readable way. For each demonstration, RONAR generates a trajectory-level summarization from the event-level narrations. This summarization contains detailed information about the trajectory, and users can retrieve trajectories with customized queries. These retrieved trajectories can be used for further downstream tasks, such as model training and system analysis.
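The sketch below shows one way trajectory summarization and customized retrieval could be wired together, again using the OpenAI client and an off-the-shelf embedding model as stand-ins; the prompts, model names, and similarity-based retrieval are assumptions about a plausible implementation, not RONAR's actual pipeline.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_trajectory(event_narrations):
    """Collapse event-level narrations into one trajectory-level summary."""
    prompt = ("Summarize the following robot narration history into a short "
              "trajectory summary covering the task, key events, failures, "
              "and final outcome:\n" + "\n".join(event_narrations))
    response = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content

def retrieve(query, trajectory_summaries, top_k=3):
    """Return the top-k trajectory summaries most similar to a user query."""
    result = client.embeddings.create(
        model="text-embedding-3-small", input=[query] + trajectory_summaries)
    vectors = np.array([item.embedding for item in result.data])
    q, docs = vectors[0], vectors[1:]
    scores = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    return [trajectory_summaries[i] for i in np.argsort(-scores)[:top_k]]

# Example query: retrieve demos where the robot failed while placing an object.
# retrieve("placement failures near the sink", all_trajectory_summaries)
```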
Narration Based System Overview
Not limited to trajectory-level summaries, RONAR can generate summaries at an even higher level. From a collection of trajectory summaries, RONAR can generate a system-level summary that gives users an overview of the robot system. As shown below, users can generate a robot system overview from a collection of trajectory summaries with RONAR. The system overview is customizable based on users' requirements. In this example, we ask RONAR to generate a system overview of failures, recoveries, and improvement recommendations based on the experiments. The system overview gives users a big picture of the overall system and helps them make systematic improvements. Furthermore, users can compare system overviews across different experiment dates to track the system's improvement over time.
RONAR-UI
RONAR-UI Online Mode
RONAR-UI Offline Mode
Appendix: Demographics and Robot Background of Participants
We recruited 24 participants with different backgrounds to conduct the user studies on narration quality evaluation and failure identification using narration. We created a questionnaire to collect participants' background information at the end of the user study. The questionnaire includes two parts: demographics and robot background. For demographics, we include the following questions:
Age: Select your age range (under 18, 18-24, 25-34, 35-44, 45-54, 55-64, over 65)
Education: What is your highest level of education? (High school diploma / GED, Associate degree, Bachelor’s degree, Master’s degree, Doctorate)
Field: Have you studied or worked in a tech or STEM-related field? (Yes or No)
We also created questions on robot familiarity for participants to answer. These questions include:
Expertise Level (Robot): Rate your expertise in robotics (1-5)
Hours Spent (Robot): Estimate how many hours you have worked with a real robot. (Never, 0-10h, 10-30h, 30-50h, 50-100h, More than 100h)
Expertise Level (Stretch): Rate your expertise with Hello Robot’s Stretch mobile manipulator (1-5)
Hours Spent (Stretch): Estimate how many hours you have worked with the Stretch robot. (Never, 0-10h, 10-30h, 30-50h, 50-100h, More than 100h)
Demographics of the participants
Expertise of the participants with robot (left) and Stretch (right)