Out-of-Distribution Recovery with Object-Centric Keypoint Inverse Policy For Visuomotor Imitation Learning

Spotlight Paper in Workshop on Lifelong Learning for Home Robots @ CoRL 2024

University of Pennsylvania


Abstract

We propose an object-centric recovery policy framework to address the challenge of out-of-distribution (OOD) scenarios in visuomotor policy learning. Previous behavior cloning (BC) methods rely heavily on broad data coverage and fail in unfamiliar spatial states. Without collecting any extra data, our approach learns a recovery policy built around an inverse policy that follows the gradient of the object-keypoint manifold estimated from the original training data. The recovery policy serves as a simple add-on to any base visuomotor BC policy, agnostic to the specific method, guiding the system back toward the training distribution so the task succeeds even in OOD situations. We demonstrate the effectiveness of our object-centric framework in both simulation and real-robot experiments, achieving a 77.7% improvement over the base policy in OOD scenarios.

The Problem

Base Policy: In-Distribution

Most visuomotor policies handle in-distribution data very well. All demonstration data lie on the left side of the white line; the bottle never appears on the right side.

Base Policy: Out-of-Distribution :(

However, the base policy fails to execute the task when the bottle is placed on the right side (OOD). What if we could always bring the task-relevant object back into the training distribution without collecting any new data?

The OCR Framework

The Object-Centric Recovery (OCR) framework augments a base policy, trained via BC, by returning task-relevant objects to their training manifold, where the base policy takes over. First, we model the distribution of object keypoints in the training data with a Gaussian Mixture Model (GMM). At test time, we compute the gradient of the GMM to derive object-recovery vectors, which are used to plan a recovery trajectory. This trajectory is then converted into robot actions by a Keypoint Inverse Policy trained solely on the base dataset. Finally, the base policy and the recovery policy are combined into a joint policy, allowing seamless interaction between recovery and task execution.
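As a rough illustration of the recovery-vector idea, here is a minimal sketch (not the authors' code) of modeling keypoints with a GMM and following the gradient of its log-density back toward the training manifold. It assumes keypoints are flat vectors (e.g., 2D positions), uses scikit-learn's GaussianMixture, and the function names, the step size, and the density threshold are all hypothetical choices for illustration; converting the resulting waypoints to robot actions with the Keypoint Inverse Policy is not shown.

```python
# Minimal sketch of GMM-gradient-based object recovery (illustrative only).
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_keypoint_gmm(train_keypoints: np.ndarray, n_components: int = 5) -> GaussianMixture:
    """Model the training-data distribution of object keypoints with a GMM.

    train_keypoints: (N, D) array of keypoint positions from the demonstrations.
    """
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(train_keypoints)
    return gmm


def recovery_vector(gmm: GaussianMixture, x: np.ndarray) -> np.ndarray:
    """Gradient of the GMM log-density at keypoint x, i.e. the direction that
    moves the object keypoint back toward high-density (in-distribution) regions."""
    gamma = gmm.predict_proba(x[None, :])[0]            # responsibilities gamma_k(x), shape (K,)
    grad = np.zeros_like(x, dtype=float)
    for k in range(gmm.n_components):
        prec_k = np.linalg.inv(gmm.covariances_[k])      # Sigma_k^{-1}
        grad += gamma[k] * prec_k @ (gmm.means_[k] - x)  # sum_k gamma_k * Sigma_k^{-1} (mu_k - x)
    return grad


def plan_recovery_trajectory(gmm, x0, step_size=0.01, n_steps=200, density_thresh=-5.0):
    """Follow the log-density gradient from an OOD keypoint x0 until the keypoint
    is back in a high-density region (hypothetical threshold), then stop."""
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(n_steps):
        if gmm.score_samples(x[None, :])[0] > density_thresh:
            break                                        # back in distribution: hand off to the base policy
        x = x + step_size * recovery_vector(gmm, x)
        traj.append(x.copy())
    return np.stack(traj)                                # keypoint waypoints for the inverse policy
```

In this sketch, the same density check that terminates the recovery rollout would also serve as the switch in the joint policy: while the keypoint density is low, the recovery policy acts; once the object is back on the training manifold, the base policy resumes task execution.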

Experiments

Push-T

Base Policy: In-Distribution

All demonstration data lie on the left side of the workspace; the T shape never appears on the right side.

Base Policy: Out-of-Distribution

The base policy fails to execute the task when the T shape is placed on the right side.

Base + Recovery (Ours): Out-of-Distribution

Square

Base Policy: In-Distribution

All demonstration data lie on the right side of the workspace; the square never appears on the left side.

Base Policy: Out-of-Distribution

The base policy fails to execute the task when the square is placed on the left side.

Base + Recovery (Ours): Out-of-Distribution

Real Robot

Base Policy: In-Distribution

All demonstration data lie on the left side of the white line; the bottle never appears on the right side.

Base Policy: Out-of-Distribution

The base policy fails to execute the task when the bottle is placed on the right side.

Base + Recovery (Ours): Out-of-Distribution