Lossless Adaptation of Pretrained Vision Models for Robotic Manipulation

Mohit Sharma1, Claudio Fantacci2, Yuxiang Zhou2, Skanda Koppula2, Nicolas Heess2, Jon Scholz2, Yusuf Aytar2

1Carnegie Mellon University, 2DeepMind

Existing works adapt pretrained general-purpose visual models (a) either through full end-to-end fine-tuning, as shown in (b), which loses the original capabilities of the model, or by adapting frozen pretrained models through top adapters, as shown in (c), which often fails to achieve optimal control performance. By introducing additional mid-level and bottom-level adaptation as in (d), we maintain the existing perceptual capabilities while approaching full fine-tuning performance, as empirically shown on the right.

Parameter Efficient Lossless Adaptation

We show parameter-efficient lossless adaptation across many network architectures and pretraining methods.

Abstract

Recent works show that large models pretrained on common visual learning tasks can provide useful representations for a wide range of specialized perception problems, as well as a variety of robotic manipulation tasks. While prior work on robotic manipulation has predominantly used frozen pretrained features, we demonstrate that in robotics this approach can fail to reach optimal performance, and that fine-tuning of the full model can lead to significantly better results. Unfortunately, fine-tuning disrupts the pretrained visual representation and causes representational drift towards the fine-tuned task, leading to a loss of the versatility of the original model. We introduce lossless adaptation to address this shortcoming of classical fine-tuning. We demonstrate that appropriate placement of our parameter-efficient adapters can significantly reduce the performance gap between frozen pretrained representations and full end-to-end fine-tuning without changing the original representation, thus preserving the original capabilities of the pretrained model. We perform a comprehensive investigation across three major model architectures (ViTs, NFNets, and ResNets), supervised (ImageNet-1K classification) and self-supervised pretrained weights (CLIP, BYOL, Visual MAE) in three manipulation task domains and 35 individual tasks, and demonstrate that our claims are strongly validated in various settings.

Approach

Our main aim is to use fixed pretrained visual models but adapt their representations for improved downstream control performance. To achieve this we use parameter-efficient adapter modules that can be inserted at appropriate locations throughout the deep network. These non-pretrained adapter modules are the only parameters updated during downstream policy learning. A common alternative is to use fixed pretrained visual models with a learned policy head (i.e., a top adapter) trained for the downstream control task. However, such an approach cannot adjust the low-level perception and mid-level abstractions for the downstream control task.

Adapter Modules: Adapter modules are lightweight neural modules that can be inserted at different layers of a pretrained deep neural network. Prior works have explored adapter modules for transfer learning, wherein adapter modules are inserted at each layer of a pretrained deep network and only these adapters are updated during fine-tuning (Houlsby et al. 2019). Overall, adapter modules have two important properties: 1) they are lightweight, i.e., they have far fewer parameters than the original network, and 2) they preserve the initialization provided by the pretrained deep network.
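As a concrete illustration, below is a minimal PyTorch sketch of a Houlsby-style bottleneck adapter: a down-projection, a nonlinearity, and an up-projection added back through a residual connection. The bottleneck width and the near-zero initialization of the up-projection are illustrative assumptions rather than the exact configuration used in our experiments.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Houlsby-style bottleneck adapter (illustrative sketch)."""

    def __init__(self, dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, dim)
        # Zero-initializing the up-projection makes the adapter start as an
        # identity mapping, so the pretrained network's behaviour is preserved
        # at the beginning of policy learning.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

Because the adapter output is a residual on top of its input, inserting it anywhere in the network leaves the pretrained computation untouched at initialization; only the small down/up projections are trained.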

Adapters for convolutional architectures (e.g. NFNets, ResNets)
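A natural analogue for convolutional backbones is a 1x1-convolution bottleneck applied to the feature maps and added back residually, so that feature shapes are unchanged and the module can be dropped in after a residual block. The sketch below shows one reasonable form of such an adapter, assuming PyTorch; the exact placement inside NFNet/ResNet blocks used in our experiments may differ.

import torch
import torch.nn as nn

class ConvAdapter(nn.Module):
    """1x1-convolution bottleneck adapter for convolutional feature maps (sketch)."""

    def __init__(self, channels: int, bottleneck_channels: int = 32):
        super().__init__()
        self.down = nn.Conv2d(channels, bottleneck_channels, kernel_size=1)
        self.act = nn.ReLU()
        self.up = nn.Conv2d(bottleneck_channels, channels, kernel_size=1)
        # Start as an identity mapping, as in the linear adapter above.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width); the output has the same shape,
        # so the adapter can be inserted after any convolutional/residual block.
        return x + self.up(self.act(self.down(x)))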

Adapters for transformer based architectures (e.g. ViTs)
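For ViTs, a bottleneck adapter of the same form can be inserted inside each transformer block, operating on the token embeddings (e.g. after the attention or MLP sub-layer). The wrapper below is a hedged sketch of one way to retrofit a frozen transformer block with such an adapter; here block is assumed to be any module that maps token embeddings of dimension dim to the same shape.

import torch
import torch.nn as nn

class AdaptedTransformerBlock(nn.Module):
    """Wraps a frozen transformer block with a trainable bottleneck adapter (sketch)."""

    def __init__(self, block: nn.Module, dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False  # keep the pretrained block frozen
        self.adapter = nn.Sequential(
            nn.Linear(dim, bottleneck_dim),
            nn.GELU(),
            nn.Linear(bottleneck_dim, dim),
        )
        # Zero-initialize the up-projection so the wrapped block initially
        # behaves exactly like the original pretrained block.
        nn.init.zeros_(self.adapter[-1].weight)
        nn.init.zeros_(self.adapter[-1].bias)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        h = self.block(tokens)
        return h + self.adapter(h)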

How to use adapter modules?


While we could add adapter modules at every layer of the pretrained network, such a choice is highly parameter-inefficient and redundant, especially for large networks with many layers. We therefore coarsely categorize the network layers, based on their functional roles, into bottom, middle and top sections, as visualized below.

The bottom section directly uses the raw images as input. In scenarios where there is a mismatch between the downstream task's image observations and the feature statistics of the pretrained bottom layers, downstream task performance can be sub-optimal. Such scenarios are common for downstream manipulation tasks, since there is a significant domain gap between the data distribution of pretrained vision models (often in-the-wild data) and standard table-top settings with much closer, non-canonical camera views.

The middle section, which contains most of the fixed pretrained network weights (~90%), extracts the appropriate input abstraction. However, these weights are trained on visual learning tasks that often focus on semantic understanding (e.g. image classification) rather than the spatial and causal understanding that is important for control.

The top section uses the spatial representation from the middle section as input and outputs the robot action. This high-dimensional spatial representation (size ~20K) is converted into a smaller representation (~2K) either via average/max pooling or by down-projecting with 1x1 convolutions or a small shared MLP. Finally, this smaller representation is used to directly output the action through a linear policy head.
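To make the top section concrete, the sketch below shows one possible trainable head on top of the frozen spatial features: a 1x1-convolution down-projection, global average pooling, and a linear policy layer. The projection and action dimensions are placeholders, and this is only one illustrative layout of the top adapter, assuming PyTorch.

import torch
import torch.nn as nn

class TopAdapterPolicyHead(nn.Module):
    """Top adapter: down-project and pool spatial features, then a linear policy (sketch)."""

    def __init__(self, in_channels: int, proj_dim: int, action_dim: int):
        super().__init__()
        # Down-project and pool the high-dimensional spatial representation
        # into a compact vector before the linear policy layer.
        self.down_proj = nn.Conv2d(in_channels, proj_dim, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.policy = nn.Linear(proj_dim, action_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, in_channels, height, width) from the middle section.
        x = self.pool(self.down_proj(feats)).flatten(1)
        return self.policy(x)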

Experiments

We consider the Metaworld, Franka-Kitchen and RGB-Stacking task suites. Both Metaworld and Kitchen have been used previously (Nair et al. 2022) to evaluate fixed "off-the-shelf" pretrained visual representations; hence, for both suites we use the same environments and demonstrations as that work. Overall, we use 5 different environments from the Metaworld and Kitchen task suites. For each environment we use 3 different camera configurations as provided by (Nair et al. 2022). Additionally, similar to previous work, we use 25 demonstrations for each environment and train a separate policy for each environment and camera configuration.

While both the Metaworld and Kitchen suites contain many tasks, each task often has a very narrow state-space distribution. For instance, Kitchen tasks use fixed object positions, while MetaWorld tasks have only limited position variance. Hence, we also evaluate our approach on the much more challenging RGB-Stacking suite (Lee et al. 2021). The RGB-Stacking tasks involve three geometric objects colored red, green, and blue, and the goal is to stack the red object on top of the blue object. Object geometries must also be taken into account for successful stacking behaviour.

Network Architectures

We evaluate the effectiveness of adapters for manipulation tasks in the context of three different network architectures: normalizer-free networks (NFNet), residual networks (ResNet) and vision transformers (ViT). Among the different architectures within each category we use NFNet-F0, ResNet-50 and ViT-B/16. In addition to ImageNet pretraining for all three architectures, we also evaluate ALIGN pretraining for NFNet, BYOL for ResNet and masked auto-encoder (MAE) pretraining for ViT.

Number of parameters to be learned for different adapters as well as full fine-tuning.
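For reference, counts like these can be computed directly from a model by summing parameter sizes; a minimal PyTorch helper, assuming that only the adapter (and policy-head) parameters are left with requires_grad=True:

import torch.nn as nn

def parameter_counts(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts for a model."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total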


Results

Fixed Pretrained Features vs Adapter Representations

First, we show that while fixed off-the-shelf representations (without any adaptation) are useful, they can be highly sub-optimal for the given downstream task. To show this, we compare fixed representations extracted using supervised ImageNet-pretrained weights (Pretrained Feat.) with full fine-tuning (Full FT). Figure 1 (below) compares these across all task suites. For a fair comparison with previous works, Figure 1 reports results for the ResNet-50 model, since previous works only evaluate the ResNet architecture. As seen below, fixed off-the-shelf representations are comparatively much worse across all environment suites. For the Metaworld, Kitchen and RGB-Stacking suites the relative change in performance is around 20%, 30% and 100% respectively. Also, for the RGB-Stacking suite the mean performance is much lower (14%), which shows that fixed pretrained representations become significantly less effective for challenging manipulation tasks.


Figure 1: Fixed pretrained representations are suboptimal.

Fixed off-the-shelf representations are useful but become increasingly sub-optimal for more complex manipulation tasks such as RGB-Stacking. This holds across different types of pretrained representations.

Adapter Representations can match Full Finetuning

We now show that our proposed adapters can match full fine-tuning performance on downstream manipulation tasks without losing any existing information. We compare full fine-tuning (Full FT) with our adapters (Adapters) as well as fixed pretrained features (Pretrained Feat.). For these and subsequent results we report metrics using the more robust mean statistic. Additionally, for task suites with limited state-space distributions, i.e., Metaworld and Franka-Kitchen, we avoid using any proprioceptive information (see the main paper's Appendix for results with proprioception). This allows us to robustly verify the visual representation and avoids any proprioceptive information leakage, which could allow the robot to solve the task without using the visual features.

Lossless Adaptation of Pretrained Visual Features

Figure 2: Mean success rate comparisons between using fixed pretrained features, adapters and full fine-tuning across all three different environments with three different architecture choices.

Figure 2 (above) shows the results for each task suite and network architecture combination. For the adapter results we report results with bottom, middle and top adapters. As before, for a fair comparison we use top adapters for all other approaches. As seen above, our parameter-efficient adapters closely match full fine-tuning performance across all environment suites and architectures. Most noticeably, for both the Franka-Kitchen and RGB-Stacking tasks, our adapters exactly match (average difference in performance < 2%) the performance of full fine-tuning (Full FT). A slightly larger gap remains for the Metaworld environments (average performance difference ~6%). However, compared with directly using the fixed pretrained features, we see a large performance increase of around 30% averaged over all tasks and architectures, and this increase holds across all architectures.

Adapters with Different Pretrained Representations

We now show that our proposed adapters give similar benefits with pretrained weights obtained from vastly different pretraining (pretext) tasks. For NFNet we use CLIP pretraining, for ResNet we use BYOL and for ViT we use a masked auto-encoder (MAE). Figure 3 (below) plots the results for each of these architectures across all three task suites. The x-axis shows the number of trainable parameters. The bottom-left point in each plot indicates the performance of fixed pretrained features, while the top-right point shows full fine-tuning performance. The solid lines indicate the performance improvements from inserting bottom and middle adapters (top adapters are used for all approaches).

As seen in Figure 3, for all tasks and pretrained weights, adapters closely match the performance of the full fine-tuning approach while adding only a marginal number of trainable parameters. Additionally, comparing the MetaWorld results in Figure 3 (left) and Figure 1 (above), we see that while there is a minor gap (~6%) between adapters and full fine-tuning with ImageNet-supervised weights, this gap reduces significantly for self-supervised pretrained weights. This advantage of MAE features for control has also been observed by Xiao et al. 2022. Additionally, similar to (Radosavovic et al. 2022), we find that CLIP pretrained weights (with top adapters only) can perform poorly; for instance, they achieve < 5% success on RGB-Stacking tasks. However, with adapters the performance improves significantly (~50%) and closely matches full fine-tuning performance. Importantly, the adapted representations match the performance of more performant models (e.g. MAE).

Figure 3: Results with different pretraining initializations (for 3 different models) across all 3 environment suites -- NFNet with CLIP, ResNet with BYOL and ViT with MAE. Bottom-left points: performance of fixed pretrained features with top adapters. Top-right points: full fine-tuning performance (with top adapters). Solid lines: adapter performance with adapters added first to the bottom and then the middle layers.

Real World Results -- Sim2Real

We also investigate whether large-scale visual pretraining combined with our use of adapters can enable sim2real transfer. Prior works that utilize fixed pretrained vision models for real robot tasks often only evaluate on tasks requiring simple motions (reach/grasp) and almost always train the policy on real robot data (Shridhar et al. 2022, Nair et al. 2022). By contrast, we show results for sim2real transfer, i.e., we use no extra real-world robot data. We use the more challenging RGB-Stacking suite for evaluation, and evaluate sim2real transfer both with and without visual domain randomization data. We report results using 100 rollouts for each policy and object triplet.

Figure 4 (below) shows results for one convolutional (NFNet) and one transformer (ViT) architecture. From these results we see that ViT-based policies perform much worse than NFNet policies. For instance, a fully fine-tuned ViT policy is completely unable to solve any task. We believe one reason for this is the high learning capacity of transformer-based models, which allows them to quickly fit the given task and lose prior information, making real-world transfer challenging. Using adapters instead of fine-tuning, however, achieves non-zero performance: while the average performance is poor (10%), the policy achieves 24% success on the easier setting (triplet 4, see paper).

We also see that NFNet-based policies perform much better than ViTs across different training settings. With full fine-tuning and adapters, NFNet policies achieve 35% and 24% success rates respectively, while training from scratch achieves only 5% and fixed pretrained features do not result in any successes. Finally, we also evaluate NFNet policies trained with visual domain randomization data (DR rows, Figure 5). These policies show much better performance (~53% for both our adapters and full fine-tuning) and closely match the policy performance in simulation.

Figure 4: Sim2Real results for RGB-Stacking without using any visual domain randomized  data for learning the manipulation task policy.

Figure 5: Sim2Real results for RGB-Stacking using visual domain randomized (DR) data for learning the manipulation task policy.

Qualitative Results for NFNet based Adapter policies

Below we show qualitative results for the sim2real performance of our approach without using any visual domain randomization data.

Triplet 1

Triplet 1

Triplet 2

Triplet 2

Triplet 3

Triplet 3

Triplet 4

Triplet 4

Triplet 5

Triplet 5

BibTeX

@inproceedings{sharmalossless,
  title={Lossless Adaptation of Pretrained Vision Models For Robotic Manipulation},
  author={Sharma, Mohit and Fantacci, Claudio and Zhou, Yuxiang and Koppula, Skanda and Heess, Nicolas and Scholz, Jon and Aytar, Yusuf},
  booktitle={The Eleventh International Conference on Learning Representations}
}