RRL: Resnet as representation for Reinforcement Learning
International Conference on Machine Learning (ICML) 2021

Rutav Shah* (IIT-Kharagpur) | Vikash Kumar* (Facebook AI)

Paper | Code


Resnet as representation for Reinforcement Learning (RRL) is a simple yet effective approach for training behaviors directly from visual inputs. We demonstrate that features learned by standard image classification models are general across different tasks, robust to visual distractors, and, when used in conjunction with standard Imitation Learning or Reinforcement Learning pipelines, can efficiently acquire behaviors directly from proprioceptive inputs.

Overview

The ability to autonomously learn behaviors via direct interactions in uninstrumented environments can lead to generalist robots capable of enhancing productivity or providing care in unstructured settings like homes. Such uninstrumented settings warrant operation using only the robot's proprioceptive sensors, such as onboard cameras, joint encoders, etc., which is challenging owing to high dimensionality and partial observability. We present a surprisingly simple method (RRL) at the intersection of representation learning, imitation learning, and reinforcement learning that leverages features from standard pre-trained image classification models (Resnet34) as representations to deliver contact-rich dexterous manipulation behaviors.

Key Insight

Current successes in learning directly from high-dimensional visual inputs primarily rely on acquiring compressed latent representations that are used as inputs to the RL pipeline.

  • Such representations, learned by supervised/unsupervised methods, are task-specific, brittle to distribution shift, and suffer from non-stationarity issues due to the mismatch between the distributions the policy and the representation are trained over.

Key idea: Representations do not necessarily have to be trained on the exact task distribution; a representation trained on a sufficiently wide distribution of real-world scenarios will be robust to scene variations and will remain effective on any distribution a policy optimizing a 'task in the real world' might induce.

  • Standard image classification models trained on a large corpus of real-world images remain effective over a sufficiently wide real-world distribution, including real-world scenarios involving robots.
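
To make this concrete, below is a minimal sketch of how such a pretrained model can serve as a frozen feature encoder, assuming PyTorch and torchvision; the function name encode_observation is our own illustration, not from the RRL codebase.

    import torch
    import torch.nn as nn
    from torchvision import models, transforms

    # Minimal sketch: a frozen, ImageNet-pretrained Resnet34 as a feature
    # encoder. Illustrative only; names here are not from the RRL codebase.

    # Load the pretrained classifier and drop its final classification layer,
    # keeping everything up to the global average pool (512-d output).
    resnet = models.resnet34(pretrained=True)
    encoder = nn.Sequential(*list(resnet.children())[:-1]).eval()
    for p in encoder.parameters():
        p.requires_grad = False  # the representation is never fine-tuned

    # Standard ImageNet preprocessing, applied to raw camera frames.
    preprocess = transforms.Compose([
        transforms.ToPILImage(),
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    @torch.no_grad()
    def encode_observation(frame):
        """Map an HxWx3 uint8 camera frame to a 512-d feature vector."""
        x = preprocess(frame).unsqueeze(0)  # (1, 3, 224, 224)
        return encoder(x).flatten(1).squeeze(0)  # (512,)

The policy then consumes this 512-d vector (optionally concatenated with other sensor readings) exactly as it would a low-dimensional state.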

Task Suite

We note that commonly used benchmarks for studying learning from visual inputs aren't representative of real-world scenarios. We instead evaluate RRL on more challenging, high-dimensional, contact-rich, dexterous manipulation tasks from the ADROIT Manipulation Benchmark.



Commonly used VisualRL Benchmarks

  1. Simple, low-dimensional, planar tasks.

  2. Devoid of depth perspective.

  3. Not representative of real-world scenarios.

ADROIT Manipulation Benchmark

  1. High-dimensional tasks that are much more difficult to solve.

  2. Complex environments with rich depth perspective.

  3. Contact-rich; better representative of real-world scenes.

Feature Comparison

On our task suite, we first present a qualitative feature comparison between our tasks and closely related real-world images using GradCAM visualizations (layer 4 of the Resnet model, for the top-1 class). We note a close resemblance between the attended features, despite the fact that Resnet never encountered images with a robot hand during training. However, the question remains whether these features are informative enough, and will remain consistent, throughout the distribution induced by the task policies.
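
For reference, here is a minimal GradCAM sketch over layer4 of a torchvision Resnet34 for the top-1 class; it follows the standard GradCAM recipe and is our own illustration, not the exact visualization code used for these figures.

    import torch
    import torch.nn.functional as F
    from torchvision import models

    # Minimal GradCAM over layer4 of Resnet34 for the top-1 predicted class.
    # Standard GradCAM recipe; illustrative, not the paper's exact code.

    model = models.resnet34(pretrained=True).eval()
    acts, grads = {}, {}

    model.layer4.register_forward_hook(
        lambda m, i, o: acts.update(a=o.detach()))
    model.layer4.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0].detach()))

    def gradcam(x):
        """x: preprocessed batch (1, 3, 224, 224) -> (224, 224) heatmap in [0, 1]."""
        logits = model(x)
        top1 = logits[0].argmax().item()
        model.zero_grad()
        logits[0, top1].backward()
        a, g = acts["a"][0], grads["g"][0]    # both (512, 7, 7)
        weights = g.mean(dim=(1, 2))          # per-channel importance
        cam = F.relu((weights[:, None, None] * a).sum(0))
        cam = F.interpolate(cam[None, None], size=x.shape[-2:],
                            mode="bilinear", align_corners=False)[0, 0]
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

The resulting heatmap can be overlaid on the input frame to show which regions drive the model's prediction.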

Results

Leveraging these features, RRL delivers natural, human-like behaviors trained directly from proprioceptive inputs. Presented below are behaviors acquired on the ADROIT manipulation benchmark task suite, rendered from the camera viewpoint. We also overlay the visual features (layer 4 of the Resnet model for the top-1 class, using GradCAM) to highlight the features RRL is attending to during task execution. Even though standard image classification models aren't trained with robot images, we emphasize that the features they acquire, by training on a large corpus of real-world scenes, remain relevant for robotics tasks that are representative of the real world and rich in depth perspective (even in simulated scenes).

Hammering a nail

Opening a door

Pen-twirling

Object relocation

Baseline Comparison

Compared to baselines, which often fail and are sensitive to hyper-parameters, RRL demonstrates relatively stable and monotonic performance, often matching the efficiency of state-based methods. We present comparisons with methods that learn directly from state (oracles) as well as ones that use proprioceptive visual inputs.

NPG(State) : A state-of-the-art policy gradient method; it struggles to solve the suite even with privileged low-level state information, establishing the difficulty of the suite.

DAPG(State) : A demonstration-accelerated method using privileged state information; it can be considered an oracle for our method.

RRL(Ours) : Demonstrates stable performance and approaches the performance of DAPG(State).

FERM : A competing baseline; it shows good initial, but unstable, progress on a few tasks and often saturates in performance before exhausting our computational budget (40 hours/task/seed).

Robust to Visual Distractors

We subject RRL and FERM to various kinds of visual distractors, like changes in color, lighting conditions, introduction of a random object, etc., to test the generalization performance of the agent. RRL(Ours) consistently outperforms FERM in these tests; features learned by FERM are task-specific and therefore struggle to generalize to the distribution shift induced by the visual distractors. RRL features, on the other hand, are task-independent, general, and more robust, as they are acquired from a diverse range of real-world scenes and objects.

RRL

FERM
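
As an illustration of this kind of stress test, the sketch below applies color and lighting perturbations to observations and measures how far the features drift; the perturbation strengths and the helper feature_drift are our own choices, not the paper's exact protocol.

    import torch.nn.functional as F
    from torchvision import transforms

    # Illustrative observation-level distractors (not the paper's exact
    # protocol): color and lighting perturbations applied to camera frames.

    color_shift = transforms.ColorJitter(hue=0.3, saturation=0.5)
    lighting_shift = transforms.ColorJitter(brightness=0.6, contrast=0.4)

    def feature_drift(encode, frame, distractor=color_shift):
        """Cosine similarity between features of a clean and a distracted
        frame. `encode` maps a PIL image to a 1-D feature tensor; a robust
        representation keeps this value close to 1 under distraction."""
        clean = encode(frame)
        perturbed = encode(distractor(frame))
        return F.cosine_similarity(clean[None], perturbed[None]).item()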


Different Representational choices

Is Resnet lucky?

Effect of different feature extractors pretrained on the ImageNet dataset, highlighting that not just Resnet but any feature extractor pretrained on a sufficiently wide distribution of data remains effective.

Influence of Representation

RRL(Ours), using Resnet34 features, outperforms a commonly used representation learning method, the VAE (RRL(VAE)). Amongst the different Resnet variants, Resnet34 strikes the balance between representational capacity and computational overhead.
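
To illustrate how these variants slot in, here is a sketch of swapping the frozen backbone via torchvision; the truncation points are our own reading of each architecture, not the exact ablation code.

    import torch.nn as nn
    from torchvision import models

    # Sketch: building a frozen encoder from different ImageNet-pretrained
    # backbones. Truncation points are our own; not the exact ablation code.

    def build_encoder(name="resnet34"):
        if name in ("resnet18", "resnet34", "resnet50"):
            net = getattr(models, name)(pretrained=True)
            enc = nn.Sequential(*list(net.children())[:-1])  # up to avgpool
        elif name == "shufflenet":
            net = models.shufflenet_v2_x1_0(pretrained=True)
            enc = nn.Sequential(*list(net.children())[:-1],  # drop fc head
                                nn.AdaptiveAvgPool2d(1))
        elif name == "mobilenet":
            net = models.mobilenet_v2(pretrained=True)
            enc = nn.Sequential(net.features, nn.AdaptiveAvgPool2d(1))
        else:
            raise ValueError(f"unknown backbone: {name}")
        enc.eval()
        for p in enc.parameters():
            p.requires_grad = False
        return enc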

Computational Efficiency

Comparison of the computational cost of RRL with Resnet34, i.e. RRL(Ours), against FERM (the strongest baseline), RRL with Resnet18, RRL with Resnet50, RRL(VAE), RRL with ShuffleNet, RRL with MobileNet, and RRL with a Very Deep VAE baseline.

  • RRL(Ours) is five times more efficient than the SOTA approach, FERM.

  • Although RRL(VAE) is the cheapest, its performance is quite poor compared to RRL(Ours).

  • RRL(Ours), which uses Resnet34, strikes a balance between compute and performance.

Supplementary Materials

Feature visualization for the trained behaviors using Guided Backpropagation
(Springenberg et al., 2015)
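
For completeness, a minimal guided backpropagation sketch in PyTorch follows: at every ReLU, gradient flow is blocked where either the forward activation or the incoming gradient is negative. This is the standard recipe from Springenberg et al. (2015), not the exact code behind these visualizations.

    import torch
    from torchvision import models

    # Minimal guided backpropagation (Springenberg et al., 2015) sketch.
    # Illustrative; not the exact code behind the supplementary figures.

    model = models.resnet34(pretrained=True).eval()

    def guided_relu_hook(module, grad_in, grad_out):
        # grad_in is already zero where the forward input was negative;
        # additionally clamp away negative incoming gradient signals.
        return (torch.clamp(grad_in[0], min=0.0),)

    for m in model.modules():
        if isinstance(m, torch.nn.ReLU):
            m.inplace = False  # backward hooks and inplace ReLUs do not mix
            m.register_full_backward_hook(guided_relu_hook)

    def guided_backprop(x):
        """x: preprocessed image (1, 3, 224, 224) -> (224, 224) saliency map."""
        x = x.clone().requires_grad_(True)
        logits = model(x)
        logits[0, logits[0].argmax()].backward()
        return x.grad[0].abs().max(dim=0).values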

Bibliography

@inproceedings{shah2021rrl,
  title={RRL: Resnet as representation for Reinforcement Learning},
  author={Rutav Shah and Vikash Kumar},
  booktitle={International Conference on Machine Learning},
  year={2021},
  organization={PMLR}
}