Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos

Annie S. Chen, Suraj Nair, Chelsea Finn

Stanford University

Paper | Code

Robotics: Science and Systems, 2021

Domain-agnostic Video Discriminator (DVD)

Reward functions are critical for developing general-purpose robots, as they allow a robot to determine its own proficiency at specified tasks.

Goal: How can we learn a reward function that generalizes across environments and tasks?

Key Insight: Leverage a diverse dataset of "in-the-wild" human videos. While such data is often challenging for a robot to learn from due to the tremendous domain shift from the robot's observation space, it is plentiful and easily accessible, and its breadth of experience may enable reward functions that generalize more broadly.

We propose a simple approach, Domain-agnostic Video Discriminator (DVD), that learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task.

These reward functions can generalize to unseen environments and tasks by learning from a small amount of robot data and a large, diverse dataset of human videos.
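As a rough illustration of this training objective, the sketch below pairs video clips and trains a binary classifier on whether the two clips show the same task. The encoder, network sizes, and pairing details are placeholder assumptions, not the exact architecture from the paper.

```python
# Minimal sketch of a same-task video discriminator in PyTorch.
# The encoder module, head sizes, and data pairing are illustrative
# assumptions, not the authors' exact implementation.
import torch
import torch.nn as nn

class SameTaskDiscriminator(nn.Module):
    """Scores whether two video clips are performing the same task."""
    def __init__(self, encoder: nn.Module, feat_dim: int = 512):
        super().__init__()
        self.encoder = encoder          # maps a video clip to a feature vector
        self.head = nn.Sequential(      # compares the two clip embeddings
            nn.Linear(2 * feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, video_a, video_b):
        z_a, z_b = self.encoder(video_a), self.encoder(video_b)
        return self.head(torch.cat([z_a, z_b], dim=-1))  # logit: same task?

def training_step(model, optimizer, video_a, video_b, same_task):
    """One gradient step on a batch of video pairs.

    Pairs mix a large human-video dataset with a small robot dataset;
    `same_task` is 1 when both clips perform the same task, else 0.
    """
    logits = model(video_a, video_b).squeeze(-1)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, same_task.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At test time, the discriminator's output can serve directly as a reward: a candidate robot video is scored against a human demonstration of the desired task.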

How do we use DVD to perform tasks?

  • Given a human demo video, we use DVD along with visual model predictive control (VMPC) to choose actions that complete the task specified by the video.

  • We sample action sequences to get "imagined" future trajectories and choose the sequence that DVD scores as having the highest functional similarity to the human demo video.
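A minimal sketch of this planning loop is below, reusing the discriminator from the previous sketch and a placeholder `video_prediction_model`; the Gaussian action sampler stands in for whatever sampling-based optimizer (e.g., CEM) is used in practice.

```python
import torch

def plan_with_dvd(dvd, video_prediction_model, current_obs, human_demo_video,
                  num_samples=100, horizon=20, action_dim=4):
    """Pick the action sequence whose imagined outcome DVD scores as most
    functionally similar to the human demo video (a rough sketch)."""
    # 1. Sample candidate action sequences (a plain Gaussian here;
    #    a CEM-style sampler could be substituted).
    actions = torch.randn(num_samples, horizon, action_dim)

    # 2. Roll them out with a learned video prediction model to obtain
    #    "imagined" future video clips, one per candidate sequence.
    imagined_videos = video_prediction_model(current_obs, actions)

    # 3. Score each imagined clip against the human demo with DVD;
    #    a higher logit means "more likely to be the same task".
    #    (Assumes the demo clip has shape (T, C, H, W).)
    demo_batch = human_demo_video.unsqueeze(0).expand(num_samples, -1, -1, -1, -1)
    scores = dvd(imagined_videos, demo_batch).squeeze(-1)

    # 4. Execute the highest-scoring action sequence.
    return actions[scores.argmax()]
```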

[Video: DVD overview (DVD_website_video.mp4)]

Experiments

Can DVD generalize across environments & tasks using human videos?

Key Takeaways:

  • By leveraging human video datasets, along with as few as 20 robot demos per task, DVD can capture the functional similarity between videos from drastically different visual domains.

  • Training with diverse human videos significantly improves environment and task generalization performance over training with only robot videos.

  • DVD is robust to the number of human video tasks included during training, even if these tasks are unrelated to the target tasks.

Closing a drawer at test time:

  • Below: Human video demo

  • Right: Agent completing the task using DVD in various environments. All but the top left environment are unseen at test time.

Moving a faucet to the right at test time:

  • Below: Human video demo

  • Right: Agent completing the task using DVD in various environments. All but the top left environment are unseen at test time.

Can DVD infer rewards from a human video on a real robot?

[Real-robot videos: Tissue box 2.MOV, Close door 4.mov]