It Takes Two: Learning to Plan for Human-Robot Cooperative Carrying

ICRA 2023    Paper    Code    Video    BibTex

***If you would like to use the cooperative carrying gym environment to benchmark your cooperative algorithms, the repository is HERE.***
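A minimal sketch of what a benchmarking loop might look like, assuming a standard Gym-style interface. The environment id and the exact reset/step signatures (which depend on the gym/gymnasium version) are placeholders; please see the repository's README for the actual usage.

```python
# Hypothetical benchmarking loop; env id and policy are placeholders.
import gym  # or gymnasium, depending on the repository's dependency

env = gym.make("cooperative-table-carrying-v0")  # placeholder env id

obs = env.reset()
done = False
episode_return = 0.0
while not done:
    action = env.action_space.sample()  # replace with your cooperative policy
    obs, reward, done, info = env.step(action)
    episode_return += reward
env.close()
print(f"episode return: {episode_return:.2f}")
```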

Learning to plan cooperatively from human-human demonstrations

Cooperating with humans on a physical joint-action task requires accounting for 1) the multimodality of human strategies, and 2) changes in strategy during the interaction. We address these difficulties by training a sampling-based planner to generate cooperative motion from human-human demonstrations, and demonstrate its ability to synthesize cooperative motion with real humans using online, receding-horizon planning on the cooperative carrying task.


Variational Recurrent Neural Network (VRNN) for Cooperative Planning

We adapt the Variational Recurrent Neural Network (VRNN) to autoregressively predict the change in pose of the table. Given context from the first ~1 sec of observations, the planner samples a batch of planned waypoints covering the next ~3 sec. A provided reward function then selects the best plan, and can be customized to optimize for fluent interaction (e.g., minimizing interaction forces). The planner is used in a receding-horizon manner.
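A minimal sketch of how such a sample-then-score, receding-horizon loop could be wired together. The `vrnn.sample` interface, the `reward_fn`, and the context/horizon lengths below are illustrative assumptions, not the released implementation.

```python
import torch

def plan_receding_horizon(vrnn, reward_fn, context, n_samples=32, horizon_steps=90):
    """Sample candidate waypoint sequences from a (hypothetical) VRNN and
    return the highest-reward plan. The caller re-plans as new observations
    arrive, keeping only the first few steps of each plan.

    context:  (1, T_ctx, obs_dim) tensor of recent observations (~1 sec).
    returns:  (horizon_steps, pose_dim) waypoints of the best-scoring plan.
    """
    with torch.no_grad():
        # Autoregressively roll out a batch of candidate delta-pose sequences.
        ctx = context.repeat(n_samples, 1, 1)                  # (N, T_ctx, obs_dim)
        delta_poses = vrnn.sample(ctx, steps=horizon_steps)    # (N, H, pose_dim), assumed API
        # Integrate pose deltas into waypoints relative to the last observed pose.
        waypoints = context[:, -1, :delta_poses.shape[-1]] + delta_poses.cumsum(dim=1)
        # Score each candidate, e.g. goal progress minus obstacle/effort penalties.
        scores = torch.stack([reward_fn(w) for w in waypoints])  # (N,)
        return waypoints[scores.argmax()]
```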

Evaluating planners for interaction

Evaluating cooperation with humans involves several different aspects, including motion generation quality, task success, and interaction quality.

Motion generation quality

Plans generated by the VRNN show greater similarity to ground-truth trajectories and higher diversity than those of the baseline (Table I).
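As a rough illustration, similarity and diversity might be computed along the lines below; these functions are an assumed sketch (average displacement error for similarity, mean pairwise distance for diversity), and the exact metrics in the paper may differ.

```python
import numpy as np

def similarity_to_ground_truth(plans, gt):
    """Average displacement error between sampled plans and the ground-truth
    trajectory (lower = more similar). plans: (N, H, 2), gt: (H, 2)."""
    return float(np.mean(np.linalg.norm(plans - gt[None], axis=-1)))

def sample_diversity(plans):
    """Mean pairwise distance between sampled plans (higher = more diverse)."""
    n = len(plans)
    dists = [np.mean(np.linalg.norm(plans[i] - plans[j], axis=-1))
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))
```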

Synthesizing with real humans (User study)

We examined the VRNN planner's capabilities with a human-in-the-loop and compared it to a baseline decentralized RRT planner on two additional facets: task success and interaction quality.

Task success

With a human in the loop, the VRNN exhibits higher diversity despite lower similarity to the ground truth (Table II), which contributes to its ability to achieve a higher success rate on novel map configurations (Table III).

Dec-RRT (blue) planning with a human (orange).
Note how the human initially moves above the top obstacle, but begins moving downwards once they realize that the Dec-RRT planner will not move above the obstacle with them. The team is unable to correct for rotation and collides with an obstacle.

VRNN (blue) planning with the same human (orange). Note how the VRNN coordinates with the human to move above the top obstacle, despite initially planning to navigate below it.


Turing Test experimental setup.

Interaction quality

A two-sample t-test showed that participants responded significantly more often that they believed they were playing with a human when they were in fact playing with the VRNN planner than when they were playing with the Dec-RRT planner (VRNN 45% vs. Dec-RRT 22%, p < .001, d = 0.2267, N = 15).
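For reference, a test of this form could be run as follows. The response arrays are placeholders, not the study data, and the pooled-variance Cohen's d shown is one common choice of effect size.

```python
import numpy as np
from scipy import stats

# Placeholder binary responses (1 = participant said "human"), one entry per trial.
# Substitute the actual per-condition responses from the user study.
vrnn_responses = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
rrt_responses  = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 1])

# Two-sample t-test comparing "human" response rates across planner conditions.
t_stat, p_value = stats.ttest_ind(vrnn_responses, rrt_responses)

# Cohen's d using a pooled standard deviation.
pooled_sd = np.sqrt((vrnn_responses.var(ddof=1) + rrt_responses.var(ddof=1)) / 2)
cohens_d = (vrnn_responses.mean() - rrt_responses.mean()) / pooled_sd

print(f"t = {t_stat:.3f}, p = {p_value:.4f}, d = {cohens_d:.3f}")
```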

The violin plot illustrates the confusion matrix of responses for each planner. Conditions are labeled Predicted/Actual, e.g., H/R denotes predicted human, actual robot.

As further insight into the quality of interaction, we plot the interaction forces (min-max normalized over all interactions). Interaction forces capture the "wasted energy" of an interaction, i.e., energy spent compressing or stretching the table due to dissent. The initial and final orientation of the table under each planner is shown for each trial. Across scenarios, the VRNN planner exhibits less dissent with human partners and acts in a more coordinated manner than the Dec-RRT planner.
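One common way to quantify this, sketched below under assumed force measurements and a known table axis, is to treat the opposing component of the two partners' projected forces as the internal (wasted) force; the paper's exact formulation and normalization may differ.

```python
import numpy as np

def interaction_force(f_robot, f_human, axis):
    """Internal ("wasted") force along the table axis: the portion of the two
    partners' forces that compresses or stretches the table rather than moving it.
    f_robot, f_human: (T, 2) force time series; axis: (T, 2) unit vectors along the table.
    Note: this is one common definition, not necessarily the paper's.
    """
    fr = np.einsum("td,td->t", f_robot, axis)   # robot force projected on table axis
    fh = np.einsum("td,td->t", f_human, axis)   # human force projected on table axis
    opposing = np.sign(fr) != np.sign(fh)       # partners pushing/pulling against each other
    return np.where(opposing, np.minimum(np.abs(fr), np.abs(fh)), 0.0)

def minmax_normalize(x, all_values):
    """Normalize to [0, 1] using the min/max pooled over all interactions."""
    lo, hi = np.min(all_values), np.max(all_values)
    return (x - lo) / (hi - lo + 1e-9)
```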

In the left scenario, a test holdout map (i.e., a map configuration that appeared in VRNN training), the team chooses the same path above the obstacle with either planner, but the VRNN planner exhibits clearly lower interaction forces over the trajectory.

In the middle unseen map configuration, the VRNN successfully rotates with the human partner to avoid the obstacle, while the Dec-RRT planner does not.

A majority of the failed trajectories in the unseen-map condition with the Dec-RRT planner (24% success rate) resulted from an inability to negotiate with the human early in the interaction, leading to contact with an obstacle near the table's initial position. In the map configuration on the right, the Dec-RRT planner chooses to move above the obstacle, but fails to negotiate with the human in time to avoid it.

Real robot demonstration

We also ran the VRNN planner on an Interbotix Locobot pinned to another Locobot teleoperated by a real human. Please see the video for more details.