GenGap 2023

Decomposing the Generalization Gap in Imitation Learning for Visual Robotic Manipulation

Annie Xie* (1, 2), Lisa Lee* (1), Ted Xiao (1), Chelsea Finn (1, 2)

(1) Google DeepMind, (2) Stanford University

Abstract

What makes generalization hard for imitation learning in visual robotic manipulation? This question is difficult to approach at face value, but the environment from the perspective of a robot can often be decomposed into enumerable factors of variation, such as the lighting conditions or the placement of the camera. Empirically, generalization to some of these factors has presented a greater obstacle than generalization to others, but existing work sheds little light on precisely how much each factor contributes to the generalization gap. Towards an answer to this question, we study imitation learning policies in simulation and on a real-robot, language-conditioned manipulation task to quantify the difficulty of generalization to different (sets of) factors. We also design a new simulated benchmark of 19 tasks with 11 factors of variation to facilitate more controlled evaluations of generalization. From our study, we determine an ordering of factors based on generalization difficulty that is consistent across simulation and our real robot setup.
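One way to make the per-factor comparison concrete is sketched below. It assumes the generalization gap for a factor is the success rate in the training environment minus the success rate when only that factor is shifted; this definition, the function names, and the placeholder factor names are illustrative assumptions, not taken from the paper, and the episode outcomes are invented toy values rather than reported results.

```python
from typing import Dict, List, Tuple


def success_rate(outcomes: List[bool]) -> float:
    """Return the fraction of successful episodes."""
    if not outcomes:
        raise ValueError("need at least one episode")
    return sum(outcomes) / len(outcomes)


def generalization_gaps(
    train_outcomes: List[bool],
    shifted_outcomes: Dict[str, List[bool]],
) -> Dict[str, float]:
    """Gap per factor: train success rate minus success rate under that shift.

    Assumed definition for illustration; a larger gap means the factor
    is harder to generalize to.
    """
    base = success_rate(train_outcomes)
    return {
        factor: base - success_rate(outcomes)
        for factor, outcomes in shifted_outcomes.items()
    }


def rank_factors(gaps: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order factors from hardest (largest gap) to easiest."""
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy placeholder outcomes (hypothetical, not measurements from the study).
    train = [True, True, True, True]
    shifted = {
        "factor_a": [True, True, True, False],
        "factor_b": [True, False, False, False],
    }
    for factor, gap in rank_factors(generalization_gaps(train, shifted)):
        print(f"{factor}: gap = {gap:.2f}")
```

Running the same ranking separately on simulated and real-robot outcomes would be one way to check whether the two orderings agree.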

Real Robot Experiments

We evaluate a real robot manipulator on test scenarios featuring new lighting conditions, distractor objects, backgrounds, table textures, and camera positions.

[Figure: example test scenarios. Panels: Original Scene; Table; Background; Table + Background; Distractor Objects; Table + Distractor Objects; Lighting; Camera Position.]

Factor World (Sim)

To supplement our study, we also design Factor World, a suite of simulated tasks with customizable environment factors. With multiple configurations for each factor, Factor World is a rich benchmark for evaluating generalization. We hope it will facilitate more fine-grained evaluations of new models, reveal potential areas of improvement, and inform future model design.