Lossless Adaptation of Pretrained Vision Models for Robotic Manipulation

Results

Fixed Pretrained Features vs Adapter Representations

First, we show that while fixed off-the-shelf representations (without any adaptation) are useful, they can be highly sub-optimal for the given downstream task. To show this, we compare the fixed representations extracted using pretrained weights (Pretrained Feat.) obtained via supervised ImageNet pretraining against full fine-tuning (Full FT). Figure 1 (below) compares these across all task suites. For a fair comparison with previous works, Figure 1 reports results for the ResNet-50 model, since previous works only evaluate the ResNet architecture. As seen below, fixed off-the-shelf representations are considerably worse across all environment suites. For the Metaworld, Kitchen and RGB-Stacking suites, the relative change in performance is around 20%, 30% and 100% respectively. Moreover, for the RGB-Stacking suite the mean performance is much lower (14%), which shows that fixed pretrained representations become significantly less effective for challenging manipulation tasks.
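Concretely, the Pretrained Feat. baseline keeps the ImageNet-pretrained encoder frozen and only trains a policy head on top of its features, whereas Full FT updates every encoder weight. The sketch below illustrates the two regimes; it is a minimal example assuming a torchvision ResNet-50 and a hypothetical action dimension, not our exact training code:

```python
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

def build_encoder(full_finetune: bool = False) -> nn.Module:
    """ImageNet-pretrained ResNet-50 encoder, frozen unless full_finetune=True."""
    backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
    backbone.fc = nn.Identity()          # expose the 2048-d pooled features
    for p in backbone.parameters():
        p.requires_grad = full_finetune  # "Pretrained Feat." freezes everything
    return backbone

encoder = build_encoder(full_finetune=False)  # fixed pretrained features
policy_head = nn.Linear(2048, 4)              # hypothetical action dimension
```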


Figure 1: Fixed pretrained representations are suboptimal. Fixed off-the-shelf representations are useful but become increasingly sub-optimal for more complex manipulation tasks such as RGB-Stacking. This holds across different types of pretrained representations.

Adapter Representations can match Full Finetuning

We now show that our proposed adapters can match full fine-tuning performance on downstream manipulation tasks without losing any existing information. We compare full fine-tuning (Full FT) with our adapters (Adapters) as well as fixed pretrained features (Pretrained Feat.). For these and all following results we report metrics using the more robust mean statistic. Additionally, for task suites with limited state-space distributions, i.e., Metaworld and Franka-Kitchen, we avoid using any proprioceptive information (see the main paper's Appendix for results with proprioception). This allows us to robustly verify the visual representation and avoids any proprioceptive information leakage, which can allow the robot to solve the task even without using the visual features.


Figure 2: Mean success rate comparison of fixed pretrained features, adapters, and full fine-tuning across all three environment suites and three different architectures.

Figure 2 (above) shows the results for each task suite and network architecture combination. For the adapter results we report performance with bottom, middle and top adapters inserted. As before, for a fair comparison we use top adapters for all other approaches. As seen above, our parameter-efficient adapters can closely match full fine-tuning performance across all environment suites and architectures. Most notably, for both Franka-Kitchen and RGB-Stacking tasks, our use of adapters exactly matches (average difference in performance < 2%) the performance of full fine-tuning (Full FT). A slightly larger gap exists for the Metaworld environments (average performance difference of ~6%). However, compared with directly using the fixed pretrained features, adapters yield a large performance increase of around 30% averaged over all tasks and architectures, and this increase holds across all architectures.
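To make the setup concrete, the sketch below shows one way to insert bottleneck adapters into a frozen ResNet-50, in the spirit of the bottom and middle adapters discussed above. The adapter design (1x1 bottleneck convolutions with a zero-initialized residual branch) and the insertion points are illustrative assumptions, not our exact implementation:

```python
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class ConvAdapter(nn.Module):
    """1x1 bottleneck adapter with a residual connection (illustrative design)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.down = nn.Conv2d(channels, hidden, kernel_size=1)
        self.act = nn.ReLU(inplace=True)
        self.up = nn.Conv2d(hidden, channels, kernel_size=1)
        nn.init.zeros_(self.up.weight)  # start as identity, so the pretrained
        nn.init.zeros_(self.up.bias)    # features are untouched at initialization

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
for p in backbone.parameters():
    p.requires_grad = False  # pretrained weights stay fixed ("lossless")

# Wrap the bottom and middle stages with trainable adapters; only the adapter
# (and policy head) parameters are updated during policy learning.
backbone.layer1 = nn.Sequential(backbone.layer1, ConvAdapter(256))
backbone.layer2 = nn.Sequential(backbone.layer2, ConvAdapter(512))
```

With a scheme like this, the trainable parameters are a small fraction of the roughly 25M parameters in ResNet-50, which is what makes the approach parameter efficient.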

Adapters with Different Pretrained Representations

We now show that our proposed adapters give similar benefits with pretrained weights obtained from vastly different pretraining (pretext) tasks. For NFNet we use CLIP pretraining, for ResNet we use BYOL, and for ViT we use a masked auto-encoder (MAE). Figure 3 (below) plots the results for each of these architectures across all three task suites. The x-axis shows the number of trainable parameters. The bottom-left points in each plot indicate the performance of fixed pretrained features, while the top-right points show full fine-tuning performance. The solid lines indicate the performance improvement from inserting bottom and middle adapters (top adapters are used for all approaches).
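For reference, the trainable-parameter counts plotted on the x-axis can be computed with a simple helper like the one below (a generic PyTorch snippet, assuming a model where the frozen backbone has requires_grad=False and only adapters plus the policy head are trainable):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts for a PyTorch module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# e.g., for the adapted backbone from the earlier sketch:
# trainable, total = count_parameters(backbone)
```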

As seen in Figure 3, for all tasks and pretrained weights, adapters closely match the performance of the full fine-tuning approach while registering only a marginal increase in the number of trainable parameters. Additionally, comparing the Metaworld results in Figure 3 (left) and Figure 1 (above), we see that while there is a minor gap (~6%) between adapters and Full FT with ImageNet-supervised weights, this gap reduces significantly for self-supervised pretrained weights. This advantage of MAE features for control is also observed in Xiao et al. 2022. Additionally, similar to Radosavovic et al. 2022, we find that CLIP pretrained weights (with top adapters only) can perform poorly. For instance, they achieve < 5% success on RGB-Stacking tasks. However, with adapters the performance improves significantly to ~50% and closely matches full fine-tuning performance. Importantly, the adapted representations match the performance of more performant models (e.g. MAE).

Figure 3: Results with different pretraining initializations (for 3 different models) across all 3 environments: NFNet with CLIP, ResNet with BYOL and ViT with MAE. Bottom-left points plot the performance of fixed pretrained features with top adapters. Top-right points plot full fine-tuning performance (with top adapters). Solid lines indicate adapter performance as adapters are added first to the bottom and then to the middle layers.