Lossless Adaptation of Pretrained Vision Models for Robotic Manipulation

Real-World Results -- Sim2Real

We also investigate whether large-scale visual pretraining, combined with our use of adapters, allows for sim2real transfer. Prior works that use fixed pretrained vision models for real robot tasks often evaluate only on tasks requiring simple motions (reach/grasp) and almost always train the policy on real robot data (Shridhar et al. 2022; Nair et al. 2022). By contrast, we show results for sim2real transfer, i.e., we use no additional real-world robot data. We use the more challenging RGB-Stacking suite for evaluation, and we evaluate sim2real transfer both with and without visual domain randomization data. We report results using 100 rollouts for each policy and object triplet.
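
To make the evaluation protocol concrete, below is a minimal sketch of how success rates could be computed from 100 rollouts per policy and object triplet. The environment and policy interfaces (`make_env`, `policy.act`, the `stack_success` flag) are hypothetical placeholders for illustration, not the actual RGB-Stacking evaluation code.

```python
# Sketch of the evaluation protocol: 100 rollouts per (policy, object triplet).
# All interfaces below are assumed placeholders, not the real RGB-Stacking API.

NUM_ROLLOUTS = 100
TRIPLETS = [1, 2, 3, 4, 5]

def run_rollout(env, policy, max_steps=400):
    """Run one episode and return True if the stack succeeded."""
    obs = env.reset()
    info = {}
    for _ in range(max_steps):
        action = policy.act(obs)              # image (+ proprioception) -> action
        obs, reward, done, info = env.step(action)
        if done:
            break
    return bool(info.get("stack_success", False))

def evaluate(policy, make_env):
    """Return per-triplet success rates and their average."""
    per_triplet = {}
    for triplet in TRIPLETS:
        env = make_env(triplet)               # real or simulated RGB-Stacking env
        successes = sum(run_rollout(env, policy) for _ in range(NUM_ROLLOUTS))
        per_triplet[triplet] = successes / NUM_ROLLOUTS
    average = sum(per_triplet.values()) / len(per_triplet)
    return per_triplet, average
```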

Figure 4 (below) shows results for one convolutional (NFNet) and one transformer (ViT) architecture. We see that ViT-based policies perform much worse than NFNet-based policies; for instance, a fully fine-tuned ViT policy is unable to solve any task. We believe one reason for this is the high learning capacity of transformer-based models, which lets them quickly fit the given task and discard prior information, making real-world transfer challenging. However, using adapters instead of fine-tuning does achieve non-zero performance: while the average success rate is a poor 10%, the adapter policy reaches 24% success on the easier setting (triplet 4, see paper).
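
For readers unfamiliar with the adapter setup, the sketch below illustrates the general idea of training small residual bottleneck modules while keeping the pretrained backbone frozen, so the pretrained representation is preserved (the adapter is initialized as an identity mapping). The module names, placement, and bottleneck size here are illustrative assumptions in PyTorch, not the paper's exact architecture; a real implementation would insert adapters inside every backbone block rather than only on the output features.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter; zero-initialized so it starts as the identity."""
    def __init__(self, dim, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, dim)
        nn.init.zeros_(self.up.weight)   # up-projection starts at zero ...
        nn.init.zeros_(self.up.bias)     # ... so features initially pass through unchanged

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

def add_adapter(backbone, feature_dim, bottleneck_dim=64):
    """Freeze the pretrained backbone and attach a trainable adapter on its features.

    Illustrative only: here a single adapter wraps the backbone output to keep
    the sketch short; only adapter (and policy-head) parameters get gradients.
    """
    for p in backbone.parameters():
        p.requires_grad = False          # keep pretrained weights fixed
    adapter = BottleneckAdapter(feature_dim, bottleneck_dim)
    return nn.Sequential(backbone, adapter)
```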

We also see that NFNet-based policies perform much better than ViTs across the different training settings. With full fine-tuning and with adapters, NFNet policies achieve 35% and 24% success rates respectively, while training from scratch achieves only 5% and fixed pretrained features do not produce any successes. Finally, we also evaluate NFNet policies trained with visual domain randomization data (DR, Figure 5). These policies perform much better (~53% for both our adapters and full fine-tuning) and closely match the policy performance in simulation.

Figure 4: Sim2Real results for RGB-Stacking without using any visual domain randomization data for learning the manipulation task policy.

Figure 5: Sim2Real results for RGB-Stacking using visual domain randomization (DR) data for learning the manipulation task policy.

Qualitative Results for NFNet-based Adapter Policies

Below we show qualitative sim2real results for our approach without using any visual domain randomization data.

Triplet 1

Triplet 2

Triplet 3

Triplet 4

Triplet 5