The (Un)Surprising Effectiveness of
Pre-Trained Vision Models for Control

Simone Parisi*, Aravind Rajeswaran*, Senthil Purushwalkam, Abhinav Gupta

Meta AI (FAIR), CMU

Paper | Code

Long Oral at the International Conference on Machine Learning (ICML) 2022

TLDR: Policies trained using frozen pre-trained visual representations (PVRs) can match or outperform policies trained using ground-truth states for a variety of control tasks.

Abstract

Recent years have seen the emergence of pre-trained representations as a powerful abstraction for AI applications in computer vision, natural language, and speech. However, policy learning for control is still dominated by a tabula-rasa learning paradigm, with visuo-motor policies often trained from scratch using data from deployment environments.

In this context, we revisit and study the role of pre-trained visual representations for control, and in particular representations trained on large-scale computer vision datasets.

Through extensive empirical evaluation in diverse control domains (Habitat, DeepMind Control, Adroit, Franka Kitchen), we isolate and study the importance of different representation training methods, data augmentations, and feature hierarchies.

Overall, we find that pre-trained visual representations can be competitive with, or even better than, ground-truth state representations for training control policies. This is despite using only out-of-domain data from standard vision datasets, without any in-domain data from the deployment environments.

Main Findings

  1. Frozen PVRs trained on completely out-of-domain datasets can be competitive with or even outperform ground-truth state features for training policies (with imitation learning). We emphasize that these vision models have never seen even a single frame from our evaluation environments during pre-training.

  2. Self-supervised learning (SSL) provides better features for control policies compared to supervised learning.

  3. Crop augmentations appear to be more important than color augmentations for SSL features used in control.

  4. Early convolutional-layer features are better for fine-grained control tasks (MuJoCo), while later-layer features are better for semantic tasks (Habitat ImageNav).

  5. By combining features from multiple layers of a pre-trained vision model, we propose a single PVR that is competitive with or outperforms ground-truth state features in all the domains we study.

The PVR Framework

Can we build a single vision model, pre-trained entirely on out-of-domain datasets, that works for any control task?

In the classic tabula-rasa paradigm (left), the perception module is part of the control policy and is trained from scratch on data from the environment.
By contrast, in our paradigm (right), the perception module is decoupled from the policy: it is first trained once on out-of-domain data (e.g., ImageNet) and then frozen. Control policies for new tasks are then trained in the deployment environments, re-using the same frozen perception module.
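
To make this pipeline concrete, here is a minimal PyTorch sketch of the paradigm. The supervised ImageNet ResNet-50, the MLP policy head, and the dummy demonstration tensors are illustrative assumptions, not the exact models, dimensions, or data used in the paper.

```python
# Sketch of the PVR paradigm: a frozen, pre-trained encoder feeds a small policy
# head trained with behavior cloning. All concrete choices here are illustrative.
import torch
import torch.nn as nn
from torchvision import models

# 1) Perception module: pre-trained on out-of-domain data (ImageNet) and frozen.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()            # drop the classifier, keep 2048-d features
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

# 2) Control policy: a small MLP trained on top of the frozen features.
action_dim = 8                         # hypothetical action dimension
policy = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# 3) Behavior cloning on (image, expert action) pairs from the deployment
#    environment; random tensors stand in for a real demonstration dataset.
demo_images = torch.randn(32, 3, 224, 224)
demo_actions = torch.randn(32, action_dim)

with torch.no_grad():                  # the perception module is never updated
    features = backbone(demo_images)
loss = nn.functional.mse_loss(policy(features), demo_actions)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Only the policy head ever receives gradients; the same frozen perception module is re-used, unchanged, across every task and domain.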

Why is it important for control?

  • Perfect (ground-truth) features are available only in simulation; in the real world, we must rely on raw sensory inputs.

  • Training the perception module from scratch is data-hungry and requires expertise.

Domains

We evaluate different PVRs on diverse control domains: Habitat (5 scenes), Adroit (2 hardest tasks), DeepMind Control Suite (5 tasks), and Franka Kitchen (5 tasks). These tasks are diverse both in their visual characteristics (realistic images vs cartoonish simulators) as well as desired behaviors (navigation vs low-level locomotion vs manipulation).

Off-The-Shelf Models Evaluation

  • We download PVRs from the internet and plug them into our control policy (see the loading sketch after this list).

  • We use ResNet networks trained with classic supervised learning, with momentum contrast (MoCo), or with contrastive language-image pretraining (CLIP). We also use a vision transformer (ViT) pre-trained with CLIP.

  • The supervised and MoCo models are pre-trained on ImageNet; the CLIP models are pre-trained on a large-scale dataset of image-text pairs.

  • Further baselines use a manually designed, randomly initialized convolutional network that is either frozen (random-features baseline) or trained from scratch together with the policy (end-to-end tabula-rasa baseline).

  • Finally, ground-truth features are compact features provided by the simulator and describe the full state of the agent and environment. They are an "oracle" baseline we strive to compete with.

  • Any PVR is clearly better than both frozen random features and learning the perception module from scratch.

  • No PVR is clearly superior to any other across all four domains. On average, SSL models (MoCo) are better than SL models (RN50, CLIP).

  • MoCo is competitive with ground-truth features in Habitat, but no off-the-shelf PVR can match the ground-truth features in MuJoCo.
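
As a rough illustration of how these off-the-shelf backbones drop into a common interface, the sketch below loads either the supervised torchvision ResNet-50 or a MoCo checkpoint from disk and strips the classification head. The checkpoint path and the `module.encoder_q.` key prefix are assumptions based on how MoCo v2 checkpoints are commonly released; CLIP models would instead be loaded through their own package.

```python
import torch
import torch.nn as nn
from torchvision import models

def load_pvr(name, checkpoint_path=None):
    """Return a frozen ResNet-50 feature extractor with the classifier removed."""
    if name == "supervised":
        model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    elif name == "moco":
        model = models.resnet50(weights=None)
        # MoCo v2 releases typically store the query encoder under a prefixed key;
        # the prefix and the local checkpoint file are assumptions.
        state = torch.load(checkpoint_path, map_location="cpu")["state_dict"]
        state = {k.replace("module.encoder_q.", ""): v
                 for k, v in state.items() if k.startswith("module.encoder_q.")}
        model.load_state_dict(state, strict=False)
    else:
        raise ValueError(f"unknown PVR: {name}")
    model.fc = nn.Identity()
    model.eval()
    for p in model.parameters():
        p.requires_grad = False
    return model

pvr = load_pvr("supervised")
features = pvr(torch.randn(4, 3, 224, 224))   # (4, 2048) embedding for the policy
```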

Feature Hierarchies for Control

After investigating the importance of different pre-training datasets (like Places) and augmentations (color vs. crop), we found that the invariances useful for semantic recognition may not be ideal for control.

Therefore, we use the outputs of different PVR layers as features for the control policy; this yields our most interesting results.

  • Later layer features are better for high-level semantic tasks (Habitat ImageNav).

  • Early layer features are better for fine-grained control tasks (MuJoCo).

  • Why? Because early-layer features encode the spatial information needed for in-hand manipulation, interaction with the environment, and locomotion.
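
As a minimal sketch of reading features out of different depths, the code below attaches forward hooks to two stages of a torchvision ResNet-50 and pools each feature map into a flat vector. The choice of stages and the average pooling are illustrative assumptions; the paper's own layer numbering and feature compression may differ.

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()

# Capture intermediate feature maps with forward hooks. torchvision names the
# ResNet stages layer1..layer4; the stages below are just an early/late example,
# not an exact mapping to the paper's layer numbering.
feats = {}
def save(name):
    def hook(module, inputs, output):
        feats[name] = output
    return hook

backbone.layer3.register_forward_hook(save("early"))   # more spatial detail
backbone.layer4.register_forward_hook(save("late"))    # more semantic abstraction

with torch.no_grad():
    backbone(torch.randn(1, 3, 224, 224))

# Spatially pool each map into a flat vector for the policy (illustrative choice).
early = nn.functional.adaptive_avg_pool2d(feats["early"], 1).flatten(1)  # (1, 1024)
late = nn.functional.adaptive_avg_pool2d(feats["late"], 1).flatten(1)    # (1, 2048)
```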


Can a PVR combining features from many layers work on all domains?

  • Any PVR with layer-5 works on Habitat.

  • Any PVR with layer-3 works on MuJoCo.

  • The PVR with layer-345 works on both.

This PVR has never seen even a single frame from these environments!

This PVR is pre-trained on out-of-domain data and transfers successfully to all the domains we tried!
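
For completeness, here is a compact sketch of how such a multi-layer representation could be assembled with torchvision's feature-extraction utility: several stages are pooled and concatenated into a single embedding that the policy consumes just like any single-layer feature. Which stages correspond to the paper's layer-3/4/5 and how the maps are compressed before concatenation are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models
from torchvision.models.feature_extraction import create_feature_extractor

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()

# Grab three ResNet stages in one forward pass (the stage choice is illustrative).
extractor = create_feature_extractor(backbone, return_nodes=["layer2", "layer3", "layer4"])

with torch.no_grad():
    maps = extractor(torch.randn(1, 3, 224, 224))

# Pool each feature map and concatenate into one "multi-layer" embedding.
embedding = torch.cat(
    [nn.functional.adaptive_avg_pool2d(m, 1).flatten(1) for m in maps.values()],
    dim=1,
)  # (1, 512 + 1024 + 2048) = (1, 3584)
```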