Robust Policies via Mid-Level Visual Representations

An experimental study in navigation and manipulation

Bryan Chen*, Alexander Sax*, Francis E. Lewis, Silvio Savarese, Jitendra Malik, Amir Zamir, Lerrel Pinto

[CoRL 2020] [Paper] [Code]

Overview video [5 mins]

Abstract:

Vision-based robotics often separates the control loop into one module for perception and a separate module for control. It is possible to train the whole system end-to-end (e.g. with deep RL), but doing it "from scratch" comes with a high sample complexity cost and the final result is often brittle, failing unexpectedly if the test environment differs from that of training.

We study the effects of using mid-level visual representations (features learned asynchronously for traditional computer vision objectives), as a generic and easy-to-decode perceptual state in an end-to-end RL framework. Mid-level representations encode invariances about the world, and we show that they aid generalization, improve sample complexity, and lead to a higher final performance. Compared to other approaches for incorporating invariances, such as domain randomization, asynchronously trained mid-level representations scale better: both to harder problems and to larger domain shifts. In practice, this means that mid-level representations could be used to successfully train policies for tasks where domain randomization and learning-from-scratch failed. We report results on both manipulation and navigation tasks, and for navigation include zero-shot sim-to-real experiments on real robots.

What are mid-level representations?

Our representations come from neural networks, each trained for a traditional computer vision objective. The image above shows the labels from some of the studied objectives that appear in the performance chart below.

For sim-to-real experiments we consider an even larger set of objectives, the same as was examined in this paper.

Studied Tasks:

REACH

PUSH

PICK + PLACE

POINT NAV

Mid-level representations in practice

Using mid-level vision for RL

SIMPLE: Agents are provided with only the encoded feature, never the original image. During training, the feature encoders are frozen and their parameters are not updated at all.
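To make the setup concrete, below is a minimal sketch (in PyTorch) of this frozen-encoder pattern. The `load_midlevel_encoder` function is a hypothetical placeholder for loading a network pretrained on a mid-level objective (e.g. surface normals); it is not the actual code or API used in the paper.

```python
import torch
import torch.nn as nn

def load_midlevel_encoder() -> nn.Module:
    # Hypothetical placeholder: in the real setup this would load a network
    # pretrained for a mid-level objective (e.g. surface normals) and its weights.
    return nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
        nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),  # -> 32 * 4 * 4 = 512-dim feature
    )

class MidLevelPolicy(nn.Module):
    def __init__(self, n_actions: int, feat_dim: int = 512):
        super().__init__()
        self.encoder = load_midlevel_encoder()
        # Freeze the encoder: its parameters are never updated during RL.
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.policy_head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, n_actions)
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # The policy only ever sees phi(image), never the raw pixels.
        with torch.no_grad():
            features = self.encoder(image)
        return self.policy_head(features)
```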

NO TRICKS: We found that we needed very little hyperparameter tuning. The results for mid-level agents in the chart to the right use the same parameters from Sax et al., which were optimized for agents trained from scratch on a different task (navigation, not manipulation) and with a different learning algorithm.

In contrast, agents trained from scratch weren't able to learn anything on the harder manipulation tasks, even after multiple additional hyperparameter sweeps. We also tried additional reward shaping for the from-scratch agents, which helped for Reach but not enough for the harder tasks.

Mid-level-based agents perform better

On three manipulation tasks agents using mid-level vision significantly outperformed agents trained from scratch. In the charts above, hatched bars indicate agents using a mid-level feature. Translucent bars show training performance, and opaque bars show performance on a held-out test set.

On the test environment, agents trained from scratch didn't even outperform blind agents that had no form of perception at all.

* Note that the results shown here use shaped rewards for agents trained from scratch, while the agents using mid-level features were trained with only sparse rewards. The shaped rewards helped the mid-level-based agents as well, but sparse rewards are usually easier to define in practice and in our study we wanted to give every advantage to the baseline methods.
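For readers unfamiliar with the distinction, the sketch below contrasts the two reward styles for a Reach-type task. The threshold and the form of the shaping are illustrative assumptions only, not the exact values used in the experiments.

```python
import numpy as np

def sparse_reward(ee_pos: np.ndarray, goal_pos: np.ndarray, eps: float = 0.05) -> float:
    # Reward only on success: +1 once the end-effector is within eps of the goal.
    return 1.0 if np.linalg.norm(ee_pos - goal_pos) < eps else 0.0

def shaped_reward(ee_pos: np.ndarray, goal_pos: np.ndarray) -> float:
    # Dense signal at every timestep: negative distance to the goal.
    return -float(np.linalg.norm(ee_pos - goal_pos))
```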

And agents using mid-level vision generalize better, too:

New objects

Agents were trained to pick + place using only red cubes, but mid-level-based agents generalized to different colors (green) and different shapes. Agents using mid-level representations even outperformed those trained using ground-truth environment state.

New colors

Agents using mid-level vision also generalized to variations of table and background colors unseen during training.

Sim-to-real

Agents using mid-level vision generalized to the real world. We trained agents in a single room from a single building in the Gibson environment, and then tested them on a real Turtlebot2 in multiple unseen real-world buildings with no adaptation period.

Agents using mid-level representations worked significantly better than those trained from scratch (which didn't even outperform a blind agent, with no perception at all, in the test environment). In contrast, the mid-level agents performed about as well on the real Turtlebot as in simulation, and the features that performed well in one environment performed well in the other (Spearman's rho = 0.77).

Complete episode-level statistics and egocentric dashboard videos are available here.

Why does it work?

Why might a mid-level feature (e.g. image surface normals) aid in downstream tasks, compared to end-to-end learning? This is an important question, since preprocessing observations with the mid-level function 𝜙 might discard information (e.g. color information in surface normals). Good representations preserve important information while discarding spurious details, providing "invariances" that make the train and test sets more similar (so that 𝜙(train) ≈ 𝜙(test)). When the train and test sets become similar, improving performance on the train set generally improves test-time performance, too. The graphic below shows how this might be the case for surface normals (though surface normals also helped with forms of generalization beyond just color!)

Really good representations also simplify training when agents use 𝜙(x_train) instead of x_train. By throwing away unimportant information and providing easily decodable outputs (e.g. linearly separable), great representations can reduce sample complexity and boost performance even on the training set, relative to learning tabula rasa. These ideas are illustrated in Figure 2 of the main paper.
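As a toy illustration of the 𝜙(train) ≈ 𝜙(test) idea, the snippet below uses a crude hand-made "feature" (image gradients, which are unaffected by a global color shift) as a stand-in for a mid-level encoder: the same pair of observations is far apart in raw pixel space but nearly identical in feature space. This is purely illustrative and is not the surface-normals network used in the experiments.

```python
import torch

def phi(img: torch.Tensor) -> torch.Tensor:
    # Crude stand-in for a mid-level feature: image gradients, which are
    # unchanged by a constant color offset (a rough analogue of color invariance).
    gray = img.mean(dim=0)                       # (H, W) intensity image
    dx = gray[:, 1:] - gray[:, :-1]              # horizontal gradients
    dy = gray[1:, :] - gray[:-1, :]              # vertical gradients
    return torch.cat([dx.flatten(), dy.flatten()])

train_img = torch.rand(3, 64, 64)
test_img = train_img + 0.3                        # same scene, globally shifted colors
print(torch.dist(train_img, test_img))            # large distance in raw pixel space
print(torch.dist(phi(train_img), phi(test_img)))  # ~0: the shift is "invariant away"
```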

Graphic: the train and test environments, shown both as the raw RGB input image (raw pixels) and as what an agent using surface normal representations would "see" (mid-level vision).

Mid-level representations vs. other methods of invariance-learning

The core of the mid-level approach is to learn invariances offline, before RL time. We believe this simplifies RL training, and explains why agents using mid-level representations can be successfully trained on harder tasks than agents trained from scratch.

The other main approach to incorporating invariances is known as domain randomization (Tobin et al. 2017). In this paradigm, invariances are learned simultaneously with control by randomizing different axes of variation at training time and training from scratch. This approach works well for one or two variations, but as the randomization becomes more extreme or multiple axes are varied, the learning problem becomes more difficult and training collapses.
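For comparison, here is a minimal gym-style sketch of what color-based domain randomization looks like as an observation wrapper. The jitter range and per-episode resampling are illustrative assumptions, not the exact randomization scheme used in these experiments.

```python
import gym
import numpy as np

class ColorRandomizationWrapper(gym.ObservationWrapper):
    """Applies a random per-channel color scaling, resampled each episode."""

    def __init__(self, env):
        super().__init__(env)
        self.color_scale = np.ones(3, dtype=np.float32)

    def reset(self, **kwargs):
        # Sample a new random color transform at the start of every episode.
        self.color_scale = np.random.uniform(0.5, 1.5, size=3).astype(np.float32)
        return super().reset(**kwargs)

    def observation(self, obs):
        # obs is assumed to be an HxWx3 uint8 image.
        jittered = obs.astype(np.float32) * self.color_scale
        return np.clip(jittered, 0, 255).astype(np.uint8)
```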

The videos below show an example of this on a (relatively simple) training environment where colors are randomized during training. In contrast, the agent using mid-level representations is able to learn in this training environment. In fact, even without doing domain randomization, agents using mid-level representations exhibit the desired invariances and outperform the domain-randomization-from-scratch agent in both the training and test domains (see the table below).

Domain randomization agent

50 episodes in the Pick + Place training environment

Mid-level agent

50 episodes in the Pick + Place training environment

Agents trained with different methods of incorporating invariances.

As shown in the table above, agents using the asynchronous (mid-level) approach perform better across the train and all test environments. In particular, when tested for Reach on colors not seen during training, mid-level vision has a success rate of 100% versus 20% when using pixels with domain randomization.

The domain randomization approach trained from scratch also showed signs of learning collapse (success rate dropping from 100% to 70%) as the randomization made the learning problem more difficult, a problem also found in other works.