Pre-Trained Image Encoder for Generalizable Visual Reinforcement Learning


This is the supplementary material for the paper: Pre-Trained Image Encoder for Generalizable Visual Reinforcement Learning

Abstract:

Learning generalizable policies that can adapt to unseen environments remains challenging in visual Reinforcement Learning (RL). Existing approaches try to acquire a robust representation by diversifying the appearances of in-domain observations for better generalization. Limited to the specific observations of the environment, these methods overlook the possibility of exploiting diverse real-world image datasets. In this paper, we investigate how a visual RL agent can benefit from off-the-shelf visual representations. Surprisingly, we find that the early layers of an ImageNet pre-trained ResNet model provide rather generalizable representations for visual RL. Hence, we propose Pre-trained Image Encoder for Generalizable visual reinforcement learning (PIE-G), a simple yet effective framework that can generalize to unseen visual scenarios in a zero-shot manner. Extensive experiments are conducted on the DMControl Generalization Benchmark, DMControl Manipulation Tasks, and Drawer World to verify the effectiveness of PIE-G. Empirical evidence suggests that PIE-G significantly outperforms previous state-of-the-art methods in terms of generalization performance. In particular, PIE-G boasts a 55% generalization performance gain on average in the challenging video background setting.



Method

Overview of PIE-G. This figure shows the framework of PIE-G, where visual encoders embed high-dimensional images into low-dimensional representations for downstream tasks. Instead of training the encoder from scratch, PIE-G uses an ImageNet pre-trained ResNet model as the encoder and freezes its parameters during training.


Choice of Layers

Reward: 966

Layer 1:


Layer 2:


Layer 3:


Layer 4:

As shown in the gifs above, the early layers preserve rich details such as edges and corners, while the later layers provide only very abstract information. Intuitively, control tasks require a trade-off between low-level details and high-level semantics. The feature map of Layer 2 largely preserves the outline of the Walker, which is advantageous for control, while discarding redundant details.


Feature maps on generalization settings

As shown in the figure above, the pre-trained visual representations learned from other domains accurately capture the structure in the image observations and adapt well to changes in visual appearance.

Color Jittered

Reward: 925

Layer 1


Layer 2

Dynamic background

Reward: 832

Layer 1


Layer 2

The gifs above suggest that the feature maps generated by the off-the-shelf pre-trained encoder capture and distinguish the main components of different tasks' observations, regardless of changes in visual appearance.


DMC-GB


PIE-G:

Reward: 685

Reward: 905

Reward: 882

Reward: 801

Reward: 324

Reward: 936


SVEA:

Reward: 484

Reward: 356

Reward: 652

Reward: 489

Reward: 46

Reward: 870

Visualized feature-map differences for two inputs from the same state with different backgrounds. The difference of the feature maps with PIE-G as the encoder is closer to zero than with SVEA, indicating that PIE-G has better generalization ability.
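This comparison can be turned into a simple scalar metric. The helper below is a hypothetical sketch (the function name and dummy encoder are not from the paper): it encodes two observations of the same underlying state, one clean and one with a distracting background, and reports the mean absolute feature difference; an encoder with better generalization yields a smaller gap.

```python
import torch

def feature_gap(encoder, obs_clean, obs_distracted):
    """Mean absolute difference between the encoder's feature maps of
    two observations of the same state with different backgrounds."""
    with torch.no_grad():
        f1 = encoder(obs_clean)
        f2 = encoder(obs_distracted)
    return (f1 - f2).abs().mean().item()

# Usage with a stand-in encoder (a real run would pass the PIE-G or
# SVEA encoder and paired clean/distracted observations):
dummy = torch.nn.Conv2d(3, 8, kernel_size=3)
gap = feature_gap(dummy,
                  torch.zeros(1, 3, 84, 84),
                  torch.zeros(1, 3, 84, 84))
print(gap)  # 0.0 for identical inputs
```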


Drawer World


PIE-G:

Marble (Close)

Wood (Close)

Blanket (Close)

Marble (Open)

Wood (Open)

Blanket (Open)


SVEA:

Marble (Close)

Wood (Close)

Blanket (Close)

Marble (Open)

Wood (Open)

Blanket (Open)

Generalization on Drawer World: evaluation with distracting textures. PIE-G is robust to texture changes.



We conduct experiments on the Drawer World benchmark to test the agent's generalization ability in manipulation tasks with different background textures. PIE-G achieves better or comparable generalization performance in all settings, with a +24% boost on average, while other approaches suffer from CNNs' sensitivity to varied textures.




Manipulation Tasks


PIE-G:

Training

Deforming Arm

Deforming Brick

Modified Arm

Modified Platform

Modified Both


SVEA:

Training

Deforming Arm

Deforming Brick

Modified Arm

Modified Platform

Modified Both


For the manipulation tasks, the colors of different objects (e.g., floors, arms) are modified. The results suggest that the visual representation from the pre-trained model is more robust to color changes than one trained by standard RL algorithms. Furthermore, we modify the shapes of the jaco arm and the target objects. PIE-G also improves the agent's generalization ability across various shapes, while other methods can barely generalize to these changes.






CARLA Autonomous Driving


PIE-G:

Training

SoftRainNoon

WetSunset

HardRainSunset


SVEA:

Training

SoftRainNoon

WetSunset

HardRainSunset

Code: https://anonymous.4open.science/r/PIE-G-EF75