Annie Xie*1,2 Lisa Lee*1 Ted Xiao1 Chelsea Finn1,2
1 Google DeepMind 2 Stanford University
What makes generalization hard for imitation learning in visual robotic manipulation? This question is difficult to approach at face value, but the environment from the perspective of a robot can often be decomposed into enumerable factors of variation, such as the lighting conditions or the placement of the camera. Empirically, generalization to some of these factors has presented a greater obstacle than others, but existing work sheds little light on precisely how much each factor contributes to the generalization gap. Towards an answer to this question, we study imitation learning policies in simulation and on a real robot language-conditioned manipulation task to quantify the difficulty of generalization to different (sets of) factors. We also design a new simulated benchmark of 19 tasks with 11 factors of variation to facilitate more controlled evaluations of generalization. From our study, we determine an ordering of factors based on generalization difficulty that is consistent across simulation and our real robot setup.
We evaluate a real robot manipulator on over test scenarios featuring new lighting conditions, distractor objects, backgrounds, table textures, and camera positions.
Figure: example real-robot evaluation scenes. Panels: Original Scene, Table, Background, Table + Background, Distractor Objects, Table + Distractor Objects, Lighting, Camera Position.
We also design a suite of simulated tasks, equipped with customizable environment factors, which we call Factor World, to supplement our study. With over configurations for each factor, Factor World is a rich benchmark for evaluating generalization, which we hope will facilitate more fine-grained evaluations of new models, reveal potential areas of improvement, and inform future model design.
Figure: the 19 Factor World tasks: Pick-Place, Bin Picking, Door Open, Basketball, Door Lock, Door Unlock, Button Press Topdown, Button Press Topdown Wall, Button Press, Button Press Wall, Drawer Close, Drawer Open, Faucet Close, Faucet Open, Handle Press, Handle Pull Side, Lever Pull, Window Close, Window Open.
Figure: environment factors in Factor World: camera position, table texture, floor texture, lighting, object texture, object position, object size, distractor objects, table position, and arm position.
Behavior cloning. We follow the RT-1 architecture, which uses tokenized image and language inputs with a categorical cross-entropy objective over tokenized action outputs. The model takes as input a natural language instruction along with the 6 most recent RGB robot observations, which are fed through pre-trained language and image encoders (the Universal Sentence Encoder (Cer et al., 2018) and EfficientNet-B3 (Tan et al., 2019), respectively). The two modalities are fused with FiLM conditioning and passed to a TokenLearner (Ryoo et al., 2021) spatial attention module, which reduces the number of tokens for fast on-robot inference. The network then applies 8 decoder-only self-attention Transformer layers, followed by a dense action-decoding MLP layer. Full details of the RT-1 architecture that we follow can be found in Brohan et al. (2022).
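To make the token-reduction step concrete, below is a minimal PyTorch sketch of a TokenLearner-style module; the single 1x1-convolution attention head and the absence of a gating MLP are simplifications of ours, not the RT-1 implementation.

```python
import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    """Simplified TokenLearner: learns `num_tokens` spatial attention maps and
    pools the feature map through each, reducing H*W tokens to `num_tokens`."""

    def __init__(self, in_channels, num_tokens=8):
        super().__init__()
        # A single 1x1 convolution produces one attention map per output token.
        self.attn = nn.Conv2d(in_channels, num_tokens, kernel_size=1)

    def forward(self, x):                                    # x: (B, C, H, W)
        weights = self.attn(x).flatten(2).softmax(dim=-1)    # (B, S, H*W)
        feats = x.flatten(2)                                 # (B, C, H*W)
        return torch.einsum("bsn,bcn->bsc", weights, feats)  # (B, S, C)
```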
Data augmentations. Following the image augmentations introduced in QT-Opt (Kalashnikov et al., 2018), we apply two main types of visual data augmentation during training only: photometric distortions and random cropping. For the photometric distortions, we adjust the brightness, contrast, and saturation by sampling uniformly from [-0.125, 0.125], [0.5, 1.5], and [0.5, 1.5], respectively. For random cropping, we subsample the full-resolution camera image to obtain a 300x300 random crop. Since RT-1 uses a history length of 6, each timestep is cropped independently.
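A minimal sketch of these augmentations in PyTorch is shown below; the clamping to [0, 1], the channel-mean grayscale approximation for saturation, and the per-frame loop are our assumptions, not the released implementation.

```python
import torch

def augment_frame(img):
    """Photometric distortion + random 300x300 crop for one RGB frame.

    `img` is assumed to be a float tensor in [0, 1] with shape (C, H, W),
    where H and W exceed 300 (the full-resolution camera image).
    """
    # Brightness: additive delta sampled from [-0.125, 0.125].
    delta = torch.empty(1).uniform_(-0.125, 0.125).item()
    img = (img + delta).clamp(0.0, 1.0)

    # Contrast: blend toward the per-channel mean with a factor from [0.5, 1.5].
    contrast = torch.empty(1).uniform_(0.5, 1.5).item()
    mean = img.mean(dim=(-2, -1), keepdim=True)
    img = ((img - mean) * contrast + mean).clamp(0.0, 1.0)

    # Saturation: blend toward a channel-mean grayscale with a factor from [0.5, 1.5].
    saturation = torch.empty(1).uniform_(0.5, 1.5).item()
    gray = img.mean(dim=0, keepdim=True)
    img = ((img - gray) * saturation + gray).clamp(0.0, 1.0)

    # Random 300x300 crop from the full-resolution image.
    _, h, w = img.shape
    top = torch.randint(0, h - 300 + 1, (1,)).item()
    left = torch.randint(0, w - 300 + 1, (1,)).item()
    return img[:, top:top + 300, left:left + 300]

# Each of the 6 frames in the observation history is cropped independently:
# augmented = torch.stack([augment_frame(f) for f in frames])  # frames: (6, C, H, W)
```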
Pretrained representations. Following the RT-1 implementation, we use an EfficientNet-B3 model pretrained on ImageNet for image tokenization and the Universal Sentence Encoder for embedding natural language instructions. The rest of the RT-1 model is initialized from scratch.
Behavior cloning. Our behavior cloning policy is parameterized by a convolutional neural network: four convolutional layers with 32, 64, 128, and 128 4x4 filters, respectively. The features are then flattened and passed through a linear layer with an output dimension of 128, followed by LayerNorm and a Tanh activation. The policy head is parameterized as a three-layer feedforward neural network with 256 units per layer. All policies are trained for 100 epochs.
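A minimal PyTorch sketch of this policy follows; the strides, ReLU activations, and the exact head layout (two hidden layers of 256 units before the action output) are assumptions not specified in the text.

```python
import torch.nn as nn

class BCPolicy(nn.Module):
    """Sketch of the simulated behavior cloning policy described above."""

    def __init__(self, action_dim, in_channels=3):
        super().__init__()
        self.encoder = nn.Sequential(
            # Four conv layers with 32, 64, 128, 128 filters of size 4x4
            # (strides and activations are assumptions).
            nn.Conv2d(in_channels, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(128),  # flattened feature size depends on the assumed strides
            nn.LayerNorm(128),
            nn.Tanh(),
        )
        # Three-layer feedforward head with 256 units per layer,
        # ending in the action dimension.
        self.head = nn.Sequential(
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs):  # obs: (B, C, 84, 84)
        return self.head(self.encoder(obs))
```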
Data augmentations. In our simulated experiments, we experiment with shift augmentations (analogous to the crop augmentations used for the real robot policy): we first pad each side of the 84x84 image by 4 pixels and then select a random 84x84 crop. We also experiment with color jitter augmentations (analogous to the photometric distortions studied for the real robot policy), implemented with torchvision. The brightness, contrast, saturation, and hue factors are set to 0.2, and each image in a batch is augmented with probability 0.3. All policies are trained for 100 epochs.
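A minimal torchvision sketch of the two augmentations, assuming edge padding for the shift augmentation and that the 0.3 probability gates the color jitter:

```python
import torchvision.transforms as T

# Shift augmentation: pad each side of the 84x84 image by 4 pixels, then take
# a random 84x84 crop (the "edge" padding mode is an assumption).
shift_aug = T.Compose([
    T.Pad(4, padding_mode="edge"),
    T.RandomCrop(84),
])

# Color jitter with brightness/contrast/saturation/hue factors of 0.2, applied
# to an image with probability 0.3.
color_aug = T.RandomApply([T.ColorJitter(0.2, 0.2, 0.2, 0.2)], p=0.3)
```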
Pretrained representations. We use the ResNet50 versions of the publicly available R3M and CLIP representations. We follow the embedding with a BatchNorm layer and the same policy head parameterization: three feedforward layers with 256 units per layer. All policies are trained for 100 epochs.
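A minimal sketch of placing this head on top of a frozen pretrained embedding; the embedding dimensions noted in the comment (2048 for R3M's ResNet50, 1024 for CLIP's RN50 image encoder) and the ReLU activations are assumptions here.

```python
import torch.nn as nn

def build_policy_head(embed_dim, action_dim):
    """Policy head placed on top of a frozen pretrained visual embedding."""
    return nn.Sequential(
        nn.BatchNorm1d(embed_dim),  # BatchNorm over the frozen embedding
        nn.Linear(embed_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, action_dim),
    )

# e.g. build_policy_head(2048, action_dim) for R3M (ResNet50) features,
#      build_policy_head(1024, action_dim) for CLIP RN50 image features.
```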
Model architectures. We use the ResNet18 architecture. We also design a custom Vision Transformer that takes in 84x84 images with the following configuration: patch size of 6, hidden dimension of 192, MLP dimension of 768, 4 layers, and 4 heads. These encoders are followed by the same policy head parameterization. All policies are trained for 20 epochs.
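A minimal sketch of constructing the two encoders with torchvision; replacing the final classification layers with identities to expose features is our assumption.

```python
import torch.nn as nn
from torchvision.models import resnet18
from torchvision.models.vision_transformer import VisionTransformer

# ResNet18 encoder, dropping the classification layer to expose 512-dim features.
cnn_encoder = resnet18(weights=None)
cnn_encoder.fc = nn.Identity()

# Custom ViT for 84x84 inputs: patch size 6, hidden dim 192, MLP dim 768,
# 4 layers, 4 heads; the class-token feature is exposed by replacing the head.
vit_encoder = VisionTransformer(
    image_size=84,
    patch_size=6,
    num_layers=4,
    num_heads=4,
    hidden_dim=192,
    mlp_dim=768,
)
vit_encoder.heads = nn.Identity()
```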