Grounding Language Plans in Demonstrations through Counterfactual Perturbations

Yanwei Wang, Tsun-Hsuan Wang, Jiayuan Mao, Michael Hagenow, Julie Shah

ICLR 2024 Spotlight. Paper: https://arxiv.org/abs/2403.17124. Project website: https://yanweiw.github.io/glide/

Framework: What Grounding Problem are we Solving?


Figure 7: High-level overview of our MMLP grounding framework. Given (a) an LLM-proposed mode sequence for solving a task and (b) a few human demonstrations that succeed at the task, we want to learn a grounding classifier that maps continuous states and observations in the demonstrations to one of the discrete manipulation modes. (c) To learn the classifier, our method requires a feasibility matrix prompted from the LLM that describes the feasibility of transitions between modes in terms of cost. (d) Having learned the classifier, we can use it to segment demonstrations into grounded skills and to map the mode boundaries in the configuration space beyond the demonstrated regions. In the case of scooping marbles, the boundary of the scooping mode (blue) separates the configuration space according to whether the spoon is in contact with the marbles. The boundary of the transporting mode (yellow) separates the configuration space according to whether the spoon holds marbles. The rest of the configuration space is in the reaching mode (gray), where the spoon is empty. The downstream applications of our learned grounding are (e) explaining why perturbing a demonstration (black) sometimes fails an execution (dark gray; going from the reaching mode directly to the transporting mode without scooping is not physically feasible) but not other times (light gray); and (f) querying the LLM to replan when detected perturbations cause unexpected mode transitions.

How Do We Use LLMs and Learn the Mode Classifier?

In our method, LLMs are used to (i) select a subset of features relevant to the task, (ii) generate the feasibility matrix and the number of modes from a text description of the task, and (iii) replan at the discrete level, given the classifier-determined modes, to sequence continuous mode-conditioned policies. Using these extracted data, we design a loss that learns the mode classifier.
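To make this concrete, below is a minimal sketch (not the exact objective from the paper) of how an LLM-provided feasibility matrix can supervise a mode classifier: successful demonstrations should incur zero expected transition cost, while counterfactual failures should contain at least one costly transition. The function names and the hinge formulation are our own illustrative choices.

import torch

def transition_cost(probs, cost_matrix):
    """Expected feasibility cost of each consecutive transition.

    probs:       (T, K) soft mode assignments from the classifier.
    cost_matrix: (K, K) LLM-provided cost, 0 for feasible, 1 for infeasible.
    Returns:     (T-1,) expected cost of each step t -> t+1.
    """
    # E_{m_t ~ p_t, m_{t+1} ~ p_{t+1}} [ cost(m_t, m_{t+1}) ]
    return torch.einsum('ti,ij,tj->t', probs[:-1], cost_matrix, probs[1:])

def grounding_loss(success_probs, failure_probs, cost_matrix):
    """Successful trajectories should incur zero cost; failing trajectories
    should contain at least one costly transition (hypothetical hinge form)."""
    success_term = transition_cost(success_probs, cost_matrix).sum()
    failure_term = torch.relu(1.0 - transition_cost(failure_probs, cost_matrix).max())
    return success_term + failure_term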

2D Polygon Experiment

We generate a sequence of connected polygons representing the manipulation modes of a long-horizon task projected onto a 2D domain. Successful demonstrations start in the white space (the first mode) and traverse the colored tiles until reaching the last tile (the final mode). The specific color of each tile carries no semantic meaning other than that distinct colors denote distinct modes. Invalid transitions for the 3-, 4-, and 5-mode tasks are described by the feasibility matrix.
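As an illustration, a hypothetical feasibility (cost) matrix for a 5-mode chain task might mark only self-transitions and transitions to the next mode as zero-cost; whether backward transitions are also feasible is task-dependent and, in our pipeline, specified by the LLM.

import numpy as np

K = 5  # number of modes in the chain task
cost = np.ones((K, K))        # 1 = infeasible transition
for i in range(K):
    cost[i, i] = 0            # staying in the same mode is always feasible
    if i + 1 < K:
        cost[i, i + 1] = 0    # advancing to the adjacent mode is feasible
# e.g., cost[0, 2] == 1: jumping from the first mode directly to the third is invalid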

Polygons: Demonstration Trajectories and Perturbations

Successful demonstrations for a 5-mode task

Perturbed trajectories that are successful

Perturbed trajectories that are failing

Polygons: Discovering Mode Families

The learned classifier explains successful trajectories by segmenting them into valid mode transitions (left column) that incur zero cost (right column; zero cost shown in white).


The learned classifier explains failure trajectories by segmenting them into mode transitions (left column) containing at least one invalid transition (indicated by the black dot in the right column).


Policy execution with automatic recovery. We inject random perturbations at execution time. Our mode-informed policy can automatically recover from them and reach the goal.

rrt_planner_4_success.mp4

(See more execution results below)

Polygons: Qualitative and Quantitative Evaluations

(a) Ground truth + demonstrations. We report the accuracy of each of the following methods, measured by overlap with the ground truth, at the top. (b) Our method. (c) Our method without counterfactual failure data. Without counterfactual failure data (i.e., the failure loss is 0), our method can still learn the correct number of partitions, but the boundaries do not match the ground truth. (d) Our method with an incorrect number of modes. Without knowing the correct number of modes (a 5-mode feasibility matrix for the 3- and 4-mode tasks, and a 4-mode feasibility matrix for the 5-mode task), the learned partitions either over-segment or under-segment. (e) Trajectory segmentation using clustering as a baseline. In this case, we use KMeans to cluster all states along successful trajectories and a simple kNN classifier to classify all states into mode families.
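The clustering baseline in (e) can be approximated with a few lines of scikit-learn; this is a sketch under our own assumptions about hyperparameters (e.g., the number of neighbors), not the exact baseline configuration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def clustering_baseline(demo_trajectories, query_states, n_modes, k=5):
    """Cluster demonstration states into n_modes groups, then label arbitrary
    states with a kNN classifier fit on the cluster assignments."""
    demo_states = np.concatenate(demo_trajectories, axis=0)  # stack all trajectories
    kmeans = KMeans(n_clusters=n_modes, n_init=10).fit(demo_states)
    knn = KNeighborsClassifier(n_neighbors=k).fit(demo_states, kmeans.labels_)
    return knn.predict(query_states)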

Trajectory segmentation accuracies of different models. The accuracy is computed by classifying all states along successful trajectories into mode families and comparing them against the ground-truth modes.

3-mode task: (a) Ground truth, (b) Ours, Overlap=0.989, (c) w/o counterfactuals, Overlap=0.702, (d) 5-Mode Matrix, (e) Clustering

4-mode task: (a) Ground truth, (b) Ours, Overlap=0.989, (c) w/o counterfactuals, Overlap=0.669, (d) 5-Mode Matrix, (e) Clustering

5-mode task: (a) Ground truth, (b) Ours, Overlap=0.975, (c) w/o counterfactuals, Overlap=0.798, (d) 4-Mode Matrix, (e) Clustering
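The overlap numbers above are per-state agreement rates between predicted and ground-truth modes. A minimal sketch of the computation (assuming predicted labels have already been matched to ground-truth mode indices):

import numpy as np

def segmentation_overlap(pred_modes, gt_modes):
    """Fraction of states along successful trajectories whose predicted
    mode matches the ground-truth mode."""
    pred_modes = np.asarray(pred_modes)
    gt_modes = np.asarray(gt_modes)
    return float((pred_modes == gt_modes).mean())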

Polygons: Execution under Perturbation

strict_following_1_fail.mp4

A simple baseline that strictly tracks waypoints.

A straightforward baseline that leverages trajectory segmentations is to extract a sequence of waypoints to follow based on the segmentations. We implemented this baseline and show that it fails under external perturbation because it is unaware of the mode family boundaries.

rrt_planner_1_success.mp4
rrt_planner_2_success.mp4
rrt_planner_4_success.mp4

Our full method.

Our method leverages the learned mode families to choose the next waypoint and uses motion planning to generate paths that avoid crossing invalid mode boundaries. It can automatically recover from perturbations.
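A rough sketch of this idea follows; the function names and the planner interface are illustrative rather than our exact implementation, which uses an RRT planner.

import numpy as np

def feasible_edge(state_a, state_b, classify_mode, cost_matrix, n_checks=10):
    """Reject a planner edge if its interpolated states cross an invalid
    mode boundary according to the learned classifier."""
    samples = np.linspace(state_a, state_b, n_checks)
    modes = [classify_mode(s) for s in samples]
    return all(cost_matrix[m0, m1] == 0 for m0, m1 in zip(modes, modes[1:]))

def next_waypoint(current_state, classify_mode, waypoints_by_mode, mode_sequence):
    """Pick the demonstrated waypoint of the next mode in the plan
    (assumes the current mode appears in the planned mode sequence)."""
    current_mode = classify_mode(current_state)
    next_index = min(mode_sequence.index(current_mode) + 1, len(mode_sequence) - 1)
    return waypoints_by_mode[mode_sequence[next_index]]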

Robosuite Experiments

We leverage the existing manipulation benchmarks of Zhu et al. to test our grounding method. Specifically, we tested our method on the 'can', 'lift', and 'square_peg' domains. For these domains, we trained a standard behavioral cloning policy (using the standard configuration in robomimic) and our proposed mode-conditioned policy. Below we show both quantitative results indicating that our method learns an accurate mode classifier and videos of policy rollouts, focusing in particular on instances where perturbations (i.e., disturbances) are applied to the system at random.

Robosuite: Trajectory Segmentations and Mode Classifications

To verify the performance of our mode classification, we constructed a manual 'ground truth.' For each task, we tested how well the mode classification overlaps with the ground truth under three conditions: (1) our proposed method, which leverages LLMs to ground modes; (2) a baseline trajectory segmentation approach; and (3) a version of our method that uses the full state space instead of the reduced state space selected by the LLM (to highlight the value of a reduced state space). Additionally, we provide qualitative results below comparing our method's segmentations to the ground truth.



Example Segmentations

Our method achieves high mode classification accuracy. To the left are qualitative examples showing the mode classification of our method. You can view pictures of all of the classification results here.

An example of replanning

In this example from the "Lifting" domain, you can see how our method performs under a perturbation that causes the robot to drop the block. In this case, our method recognizes the mode transition, reverts to the initial mode, and moves to grasp the block again.

Robosuite: Execution under Perturbation

"Can" Task

The goal of this task is to grab the can and deposit it in the bottom-right bin. Behavioral cloning performance drops off significantly with perturbation, but our method generally recovers better after dropping the can.

can_BC.mp4

Behavioral Cloning without Perturbations (success rate: 0.93)

can_BC_pert.mp4

Behavioral Cloning with Perturbations (success rate: 0.2)

can_MC_pert.mp4

Mode-conditioned Policy with Perturbations (success rate: 0.4)

"Lifting" Task

The goal of this task is to pick up the block above a specified height. Behavioral cloning performance drops off significantly with perturbation, but our method generally recovers better when perturbations misalign the gripper and block.

lift_BC.mp4

Behavioral Cloning without Perturbations (success rate: 0.99)

lift_BC_pert.mp4

Behavioral Cloning with Perturbations (success rate: 0.18)

lift_MC_pert.mp4

Mode-conditioned Policy with Perturbations (success rate: 0.39)

"Square Peg" Task

The goal of this task is to pick up the square nut and place it on the square peg. Behavioral cloning performance drops off significantly with perturbation, but our method generally recovers better when regrasping the nut and during perturbations near peg insertion.

square_BC.mp4

Behavioral Cloning without Perturbations (success rate: 0.52)

square_BC_pert.mp4

Behavioral Cloning with Perturbations (success rate: 0.13)

square_MC_pert.mp4

Mode-conditioned Policy with Perturbations (success rate: 0.21)

Real Robot Experiments

Real Robot Tracing Task: Trace through the polygons

The goal of the task is for the robot end effector to pass through the polygons from green to red (similar to the simulated polygon task). The only valid order is from green to blue to yellow to red. Anything else (e.g., white to blue) is considered an invalid mode transition.

Data Collection

We record the robot's end-effector position to learn the mode classifier. We collect a small handful of demonstrations and then replay them with random perturbations to create a large dataset without human supervision. From this dataset and a terminal success check (i.e., did the task succeed?), we are able to learn the mode boundaries.
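A sketch of this autonomous data-collection loop is below; function names such as replay_with_perturbation and task_succeeded are placeholders for the robot interface, not our exact code.

import random

def collect_counterfactuals(demos, n_replays, replay_with_perturbation, task_succeeded):
    """Replay kinesthetic demonstrations with random perturbations and label
    each resulting trajectory using the terminal success check."""
    successes, failures = [], []
    for _ in range(n_replays):
        demo = random.choice(demos)
        traj = replay_with_perturbation(demo)   # executes on the robot
        (successes if task_succeeded(traj) else failures).append(traj)
    return successes, failures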

trace_demo_collect.mp4

Collecting a small number of demonstrations by kinesthetically guiding the robot

trace_perturb_collect.mp4

Randomly perturbing the demonstrations to generate new successful and failed trajectories. Note this data collection is autonomous.

Example of the data collected through this pipeline (which is used to learn the classifier)

Leveraging Learned Modes

Once we learn the classifier, we can use it to generate a mode-informed policy that replans when a perturbation would lead the original policy to incur an invalid mode transition (e.g., white to blue).
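In pseudocode terms, the execution loop only triggers replanning when the classifier reports a mode from which continuing the plan would cross an invalid boundary; this is a sketch, and the replan and policy interfaces are placeholders.

def mode_informed_step(state, planned_mode, classify_mode, cost_matrix, policy, replan):
    """Follow the mode-conditioned policy while the observed mode matches the
    plan; if a perturbation pushes the system into a mode from which the planned
    transition is infeasible, request a new plan starting from the observed mode."""
    observed_mode = classify_mode(state)
    if observed_mode != planned_mode and cost_matrix[observed_mode, planned_mode] != 0:
        return replan(start_mode=observed_mode)  # e.g., pushed back from blue to white
    return policy(state, planned_mode)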

Mode classification results: from the perturbed trajectories, our method is generally able to recover the mode boundaries.

tracing_perturb_no_recovery.mp4

Naive policy: without knowledge of the modes, the robot simply continues after a perturbation, which can lead to invalid mode transitions.

tracing_perturb_recovery.mp4

Mode-informed policy: By accurately recognizing the underlying modes, our method allows for recovery when perturbations cause mode transitions.

Robot Scooping Task: Scoop and transport at least one marble to a target bowl

Successful Demonstrations of Scooping Marbles

scoop_succ_collect.mp4

On the left, a human provides successful demonstrations that start from various states and take various paths to the goal state. They share an implicit discrete plan over the mode sequence: Reaching -> Scooping -> Transporting -> Dropping. The scooping mode is defined by the spoon making contact with any marbles; the transporting mode is defined by the spoon holding at least one marble; the dropping mode is defined by the spoon being above the target bowl such that rotating the spoon will drop the marbles into it; everything else is in the reaching mode. The demonstrations we collected consist of the robot end-effector position (x, y, z) and quaternion, the robot joint states, and a mask of the spoon from the wrist-camera view, as shown above on the right. In particular, we prompt GPT-4 Vision (a vision-language foundation model), which informs the system that tracking whether there are marbles on the spoon is critical for this task. The labels "Spoon" and "Red Marbles" are given to SAM (the Segment Anything Model, a vision foundation model) to segment the spoon mask. Note that the mask is a complete ellipse when the spoon is empty; otherwise, it is a partial ellipse due to marble occlusions. We use this binary mask as the state representation for the wrist camera.

Counterfactual Demonstrations of Failing to Scoop Marbles

scoop_fail_collect.mp4

On the left, a human demonstrates various ways in which perturbations can cause a successful scooping trajectory to fail; we use these as counterfactuals to the successful demonstrations shown above. On the right, the corresponding wrist-camera view is shown. Note that the labels "No Marble" and "Has Marble" are not given to the learning system and are displayed only for visualization purposes.

Without Mode Abstractions, Imitation does not Guarantee Execution Success Under Perturbations

scoop_non_reactive.mp4

We show that without mode abstraction, the robot cannot reason that it is in a different mode when perturbations cause the scoop to drop all marbles. Consequently, the task execution fails even when motion imitation is successful.

With Learned Grounding Classifier, Robot Replans with LLM to Achieve Success under Perturbations

scoop_reactive.mp4

We show that the learned grounding classifier allows the system to map from pixel observations to a discrete symbolic space in which the LLM can replan when perturbations derail the originally demonstrated plan.

Visualizing Successful Demonstrations of Scooping and Learned Mode Classifier

We prompt the LLM to generate a subset of features relevant to predicting task success: the X and Y locations of the robot end-effector in the robot base frame, as well as the wrist-camera mask. Due to the lack of contact sensors, we omit the scooping mode and prompt the LLM to generate the plan: Reaching -> Transporting -> Dropping (assuming scooping always succeeds when transitioning from the reaching mode to the transporting mode). The corresponding feasibility matrix is F3. In the top figure, we plot the demonstration data in X and Y and use the color of the scatter plot to indicate the ground-truth modes (reaching is red, transporting is green, and dropping is blue). Examples of logged spoon masks along these trajectories are shown at the top. At the bottom, we visualize the learned classifier, which has correctly learned three modes (indicated by three distinct colors) by partitioning the space according to the X and Y locations as well as the masks. Note that (1) while the shape of the learned dropping mode is not perfect and should improve as we collect more demonstrations and counterfactuals, its location corresponds to that of the dropping bowl; and (2) the classifier successfully learns the threshold function that turns the continuous mask values into the discrete information that differentiates the reaching mode from the transporting mode at the same X, Y locations.
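To illustrate point (2), below is a hypothetical scalar feature on the wrist-camera mask and the kind of threshold rule the classifier effectively learns; the feature definition is ours for illustration, since the classifier actually operates on the mask directly.

import numpy as np

def spoon_mask_feature(mask):
    """Hypothetical scalar feature: fraction of the wrist-camera image covered
    by the spoon mask. Marbles occlude the spoon, so the value drops when the
    spoon is loaded."""
    return float(np.asarray(mask, dtype=bool).mean())

def has_marbles(mask, threshold):
    """Discrete cue distinguishing transporting from reaching at the same X, Y:
    below the (learned) threshold, the spoon is carrying marbles."""
    return spoon_mask_feature(mask) < threshold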

 LLM Prompts and Responses

Prompt:


You are an expert in generating robot action plans and features.
Given a language description of a task, such as "clean the plate in a sink," you should first generate an abstract plan for the task. Put them in <plan></plan>. The plan is a list of steps. You should ignore object finding or localization actions. For example,<plan>steps = [{'id': 1, 'desc': 'Reach the plate'}, {'id': 2, 'desc': 'Close gripper to grasp the plate'}, {'id': 3, 'desc': 'Move to the sink'}, {'id': 4, 'desc': 'Turn on the faucet'}]</plan>
Next, you should generate a feasibility matrix between all the steps. Put it in <feasibility></feasibility>. For example:
<feasibility>feasibility = {
  (1, 2): True,   # after reaching the plate, we can directly close the gripper
  (1, 3): False,  # after reaching the plate, we can't directly move to the sink
  (1, 4): False,  # after reaching the plate, we can't directly turn on the faucet
  (2, 3): True,   # after closing the gripper, we can directly move to the sink
  (2, 4): False,  # after closing the gripper, we can't directly turn on the faucet
  (3, 4): True,   # after moving to the sink, we can directly turn on the faucet
}</feasibility>
The user will also give you a list of available features, such as robot poses and object poses. An example is the following: <available_features>available_features = ['plate_pos', 'plate_quat', 'plate_to_robot0_eef_pos', 'plate_to_robot0_eef_quat', 'sink_pos', 'sink_quat', 'sink_to_robot0_eef_pos', 'sink_to_robot0_eef_quat', 'faucet_pos', 'faucet_quat', 'faucet_to_robot0_eef_pos', 'faucet_to_robot0_eef_quat', 'robot0_eef_pos', 'robot0_eef_quat', 'gripper_state']</available_features> We will use the following convention: XXX_pos and XXX_quat denote the pose of the object. XXX_to_robot0_eef denotes the relative pose between the object and the robot end-effector.
You should select a subset of features, such as: <selected_features>features = ['plate_to_robot0_eef_pos', 'plate_to_robot0_eef_quat', 'sink_to_robot0_eef_pos', 'sink_to_robot0_eef_quat', 'faucet_to_robot0_eef_pos', 'faucet_to_robot0_eef_quat', 'gripper_state']</selected_features>
For each step, you also need to generate a "goal" pose, which is either an absolute pose of the robot or a relative pose between the robot and another object. In this case: <goal>goals = [{'id': '1', 'feature': 'plate_to_robot0_eef_pos'}, {'id': '2', 'feature': 'plate_to_robot0_eef_pos'}, {'id': '3', 'feature': 'sink_to_robot0_eef_pos'}, {'id': '4', 'feature': 'faucet_to_robot0_eef_pos'}]</goal>

Task 1: Can


Task: move the can from tray 1 to tray 2. Both trays are fixed. Available features: <available_features>available_features= ['can_pos', 'can_quat', 'can_to_robot0_eef_pos', 'can_to_robot0_eef_quat', 'robot0_eef_pos', 'robot0_eef_quat',  'gripper_state'] </available_features>
<plan>steps = [{'id': 1, 'desc': 'Reach the can in tray 1'}, {'id': 2, 'desc': 'Close gripper to grasp the can'}, {'id': 3, 'desc': 'Move to tray 2'}, {'id': 4, 'desc': 'Open gripper to release the can'}]</plan>
<feasibility>feasibility = {
  (1, 2): True,   # after reaching the can in tray 1, we can directly close the gripper
  (1, 3): False,  # after reaching the can in tray 1, we can't directly move to tray 2
  (1, 4): False,  # after reaching the can in tray 1, we can't directly open the gripper
  (2, 3): True,   # after closing the gripper, we can directly move to tray 2
  (2, 4): True,   # after closing the gripper, we can directly open the gripper
  (3, 4): False,  # after moving to tray 2, we can't directly open the gripper
}</feasibility>
<selected_features>features = ['can_to_robot0_eef_pos', 'can_to_robot0_eef_quat', 'robot0_eef_pos', 'robot0_eef_quat', 'gripper_state']</selected_features>
<goal>goals = [  {'id': '1', 'feature': 'can_to_robot0_eef_pos'},  {'id': '2', 'feature': 'can_to_robot0_eef_pos'},  {'id': '3', 'feature': 'robot0_eef_pos'},  {'id': '4', 'feature': 'robot0_eef_pos'},]</goal>

Task 2: Lift


Task: lift a block from the table. Available features:<available_features> available_features= ['block_pos', 'block_quat', 'block_to_robot0_eef_pos', 'block_to_robot0_eef_quat', 'robot0_eef_pos', 'robot0_eef_quat',  'gripper_state']</available_features>
<plan>steps = [{'id': 1, 'desc': 'Move to the block'},         {'id': 2, 'desc': 'Close gripper to grasp the block'},         {'id': 3, 'desc': 'Lift the block'}]</plan>
<feasibility>feasibility = {
  (1, 2): True,   # after moving to the block, we can directly close the gripper
  (1, 3): False,  # after moving to the block, we can't directly lift the block
  (2, 3): True,   # after closing the gripper, we can directly lift the block
}</feasibility>
<selected_features>features = ['block_to_robot0_eef_pos', 'block_to_robot0_eef_quat', 'robot0_eef_pos', 'robot0_eef_quat', 'gripper_state']</selected_features>
<goal>goals = [  {'id': '1', 'feature': 'block_to_robot0_eef_pos'},  {'id': '2', 'feature': 'block_to_robot0_eef_pos'},  {'id': '3', 'feature': 'robot0_eef_pos'},]</goal>

Task 3: Nut


Task: insert a nut with a handle to a peg. Available features:<available_features> available_features= ['nut_pos', 'nut_quat', 'nut_to_robot0_eef_pos', 'nut_to_robot0_eef_quat', 'handle_pos', 'handle_quat', 'handle_to_robot0_eef_pos', 'handle_to_robot0_eef_quat', 'peg_pos', 'peg_quat', 'peg_to_robot0_eef_pos', 'peg_to_robot0_eef_quat', 'robot0_eef_pos', 'robot0_eef_quat',  'gripper_state']</available_features>
<plan>steps = [{'id': 1, 'desc': 'Reach the nut'}, {'id': 2, 'desc': 'Close the gripper to grasp the nut'}, {'id': 3, 'desc': 'Move to the peg'}, {'id': 4, 'desc': 'Insert the nut into the peg'}]</plan>
<feasibility>feasibility = {  (1, 2): True,  (1, 3): False,  (1, 4): False,  (2, 3): True,  (2, 4): False,  (3, 4): True,}</feasibility>
<selected_features>features = ['nut_to_robot0_eef_pos', 'nut_to_robot0_eef_quat', 'peg_to_robot0_eef_pos', 'peg_to_robot0_eef_quat', 'gripper_state']</selected_features>
<goals>goals = [  {'id': '1', 'feature': 'nut_to_robot0_eef_pos'},  {'id': '2', 'feature': 'nut_to_robot0_eef_pos'},  {'id': '3', 'feature': 'peg_to_robot0_eef_pos'},  {'id': '4', 'feature': 'peg_to_robot0_eef_pos'},]</goals>
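The tagged responses above can be parsed back into Python objects with standard-library tools. Below is a sketch; the tag format is taken from the prompt above, but the parsing code itself is illustrative rather than our exact implementation.

import ast
import re

def extract_tagged(response, tag):
    """Pull the Python snippet between <tag>...</tag> out of an LLM response
    and evaluate its right-hand side as a literal."""
    match = re.search(rf'<{tag}>(.*?)</{tag}>', response, re.DOTALL)
    if match is None:
        raise ValueError(f'missing <{tag}> block in response')
    # snippets look like "feasibility = {...}"; keep only the right-hand side
    _, _, literal = match.group(1).partition('=')
    return ast.literal_eval(literal.strip())

# Example usage (assuming `llm_response` holds the text of one task's response):
# steps = extract_tagged(llm_response, 'plan')
# feasibility = extract_tagged(llm_response, 'feasibility')
# features = extract_tagged(llm_response, 'selected_features')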