Lossless Adaptation of Pretrained Vision Models for Robotic Manipulation

Experiments

We consider the MetaWorld, Franka-Kitchen and RGB-Stacking task suites. Both MetaWorld and Franka-Kitchen have been used previously (Nair et al. 2022) to evaluate fixed ``off-the-shelf'' pretrained visual representations, so for both suites we use the same environments and demonstrations. Overall, we use 5 different environments from the MetaWorld and Franka-Kitchen task suites. For each environment we use 3 different camera configurations, as provided by Nair et al. (2022). Additionally, as in prior work, we use 25 demonstrations per environment and train a separate policy for each environment and camera configuration; a sketch of this training grid is given below.
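The following is a minimal sketch of the per-environment, per-camera training grid described above: one policy per (environment, camera configuration) pair, each trained from 25 demonstrations. The environment and camera names and the helper functions (load_demos, train_bc_policy, evaluate_policy) are hypothetical placeholders, not the paper's released code.

```python
from itertools import product

ENVIRONMENTS = [f"env_{i}" for i in range(5)]        # placeholder names for the 5 environments
CAMERA_CONFIGS = [f"camera_{i}" for i in range(3)]   # placeholder names for the 3 camera configs
NUM_DEMOS = 25

def load_demos(env_name: str, camera: str, num_demos: int):
    """Placeholder: load `num_demos` demonstrations rendered from `camera`."""
    raise NotImplementedError

def train_bc_policy(demos):
    """Placeholder: behaviour-cloning training on the given demonstrations."""
    raise NotImplementedError

def evaluate_policy(policy, env_name: str, camera: str) -> float:
    """Placeholder: return the success rate of `policy` in the environment."""
    raise NotImplementedError

results = {}
for env_name, camera in product(ENVIRONMENTS, CAMERA_CONFIGS):
    demos = load_demos(env_name, camera, NUM_DEMOS)
    policy = train_bc_policy(demos)                  # a separate policy per (env, camera) pair
    results[(env_name, camera)] = evaluate_policy(policy, env_name, camera)
```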

While both the MetaWorld and Franka-Kitchen suites contain many tasks, each task often has a very narrow state-space distribution. For instance, the Kitchen tasks use fixed object positions, while the MetaWorld tasks vary object positions only within a limited range. Hence, we also evaluate our approach on the much more challenging RGB-Stacking suite (Lee et al. 2021). The RGB-Stacking tasks involve three geometric objects colored red, green, and blue, and the goal is to stack the red object on top of the blue object. Object geometries must also be taken into account for successful stacking behaviour.

Network Architectures

We evaluate the effectiveness of adapters for manipulation tasks in the context of three different network architectures: normalizer-free networks (NFNet), residual networks (ResNet) and vision transformers (ViT). Within each family we use NFNet-f0, ResNet-50 and ViT-B/16, respectively. In addition to ImageNet pretraining for all three architectures, we also evaluate ALIGN pretraining for NFNet, BYOL for ResNet and masked auto-encoder (MAE) pretraining for ViT.
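As an illustration, the three ImageNet-pretrained backbones could be instantiated with the `timm` library as sketched below. The model identifiers are assumptions about timm's naming; the ALIGN, BYOL and MAE checkpoints used in the paper are separate pretrained weights and are not loaded here.

```python
import timm

# Assumed timm identifiers for the three backbone families used in the paper.
BACKBONES = {
    "nfnet_f0": "dm_nfnet_f0",          # normalizer-free network
    "resnet50": "resnet50",             # residual network
    "vit_b16": "vit_base_patch16_224",  # vision transformer
}

# num_classes=0 removes the classification head, leaving a feature extractor.
encoders = {
    name: timm.create_model(model_id, pretrained=True, num_classes=0)
    for name, model_id in BACKBONES.items()
}

for name, model in encoders.items():
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```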

Table: Number of parameters to be learned for different adapters as well as full finetuning.
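A rough sketch of how such a trainable-parameter comparison could be computed: freeze the pretrained backbone, register a small set of adapter modules, and count only parameters with requires_grad set. The bottleneck adapter and its placement below are generic illustrations of the idea, not the paper's exact adapter design.

```python
import torch.nn as nn
import torchvision

class BottleneckAdapter(nn.Module):
    """Generic residual bottleneck adapter: down-project, non-linearity, up-project."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

# ImageNet-pretrained ResNet-50; full finetuning would leave all parameters trainable.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
for p in backbone.parameters():
    p.requires_grad = False  # adapter-only training: backbone is frozen

# Hypothetical placement: one adapter per residual stage output dimension.
adapters = nn.ModuleList(BottleneckAdapter(dim) for dim in (256, 512, 1024, 2048))

def count_trainable(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

print("adapter-only trainable params:", count_trainable(adapters))
print("full finetuning params:", sum(p.numel() for p in backbone.parameters()))
```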