How to Adapt Your Large-Scale
Vision-and-Language Model

Abstract

Pre-training large-scale vision and language models (e.g. CLIP) has shown promising results in representation and transfer learning. We investigate how to efficiently adapt these models to downstream tasks. For image classification, linear probes have been the standard for ease of use and efficiency, while for language, other approaches such as prompt tuning have emerged. We analyze several fine-tuning methods on a diverse set of image classification tasks along two spectra: the amount of downstream data and its similarity to the pre-training data. We find that tuning only the LayerNorm parameters is a surprisingly effective baseline across the board. We further demonstrate a simple yet effective strategy that combines LayerNorm tuning with general fine-tuning methods to improve their performance, and we benchmark the combinations on few-shot adaptation and distribution shift tasks. Finally, we provide an empirical analysis and recommend general recipes for efficient transfer learning of vision and language models.

Fine-tuning Methods

We analyze a variety of fine-tuning methods: prompt tuning by prepending a learnable prompt, tuning Layer Normalization parameters, inserting Adapter and Compacter modules between the Transformer layers, and using a linear probe on top of visual features. Each approach can be used separately for fine-tuning, while the CLIP model can also be used for inference on a downstream task in a zero-shot manner.
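
As a rough illustration of LayerNorm tuning, the sketch below freezes every parameter of a Transformer encoder except the LayerNorm scales and biases. It is not our released code: it assumes PyTorch, uses a small randomly initialized `nn.TransformerEncoder` as a stand-in for a pretrained CLIP encoder, and the helper name `freeze_all_but_layernorm` and the learning rate are hypothetical choices made for the example.

```python
import torch
from torch import nn


def freeze_all_but_layernorm(model: nn.Module) -> list[str]:
    """Freeze all parameters except those owned by LayerNorm modules.

    Returns the names of the parameters that remain trainable.
    """
    trainable = []
    for module_name, module in model.named_modules():
        is_layernorm = isinstance(module, nn.LayerNorm)
        # recurse=False visits only parameters owned directly by this module,
        # so each parameter is handled exactly once.
        for param_name, param in module.named_parameters(recurse=False):
            param.requires_grad = is_layernorm
            if is_layernorm:
                trainable.append(f"{module_name}.{param_name}")
    return trainable


# Stand-in encoder (assumption): in practice this would be a pretrained model.
torch.manual_seed(0)
layer = nn.TransformerEncoderLayer(
    d_model=32, nhead=4, dim_feedforward=64, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=2)

trainable_names = freeze_all_but_layernorm(encoder)
n_trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in encoder.parameters())
print(f"trainable: {n_trainable} of {n_total} parameters")

# Only the LayerNorm parameters are handed to the optimizer.
# The learning rate is an arbitrary placeholder, not a recommended value.
optimizer = torch.optim.AdamW(
    [p for p in encoder.parameters() if p.requires_grad], lr=1e-3
)
```

In this toy encoder each of the two layers has two LayerNorm modules with a scale and a bias of size 32, so only a small fraction of the parameters is updated.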

Results

Our results show that LayerNorm tuning is a simple but highly effective baseline across four regimes based on two factors of downstream tasks: amount and distribution of training data.

We find that combining existing fine-tuning methods with LayerNorm tuning in different ways improves performance in all settings.

We recommend general recipes for fine-tuning across different settings: combining LayerNorm tuning with Compacter networks in the low-data regime, combining LayerNorm tuning with a linear probe in the high-data regime, and using LayerNorm tuning alone in a general setting.
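
The high-data recipe (LayerNorm tuning combined with a linear probe) could be set up along the following lines. This is a minimal sketch under stated assumptions, not our implementation: it assumes PyTorch, replaces the pretrained visual encoder with a small randomly initialized `nn.TransformerEncoder`, pools features by averaging over tokens, and uses random inputs and labels. The names `ProbedEncoder` and `mark_trainable`, the optimizer, and the learning rate are hypothetical.

```python
import torch
from torch import nn


class ProbedEncoder(nn.Module):
    """Encoder followed by a linear classification head (the linear probe)."""

    def __init__(self, encoder: nn.Module, feature_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Mean-pool token features (an assumption made for this sketch).
        features = self.encoder(tokens).mean(dim=1)
        return self.head(features)


def mark_trainable(model: ProbedEncoder) -> list[nn.Parameter]:
    """Freeze the encoder, then re-enable its LayerNorm parameters.

    The linear head is left trainable. Returns the trainable parameters.
    """
    for param in model.encoder.parameters():
        param.requires_grad = False
    for module in model.encoder.modules():
        if isinstance(module, nn.LayerNorm):
            for param in module.parameters():
                param.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]


torch.manual_seed(0)
layer = nn.TransformerEncoderLayer(
    d_model=32, nhead=4, dim_feedforward=64, dropout=0.0, batch_first=True
)
model = ProbedEncoder(nn.TransformerEncoder(layer, num_layers=2), 32, 5)
params = mark_trainable(model)
optimizer = torch.optim.SGD(params, lr=0.1)  # placeholder hyperparameters

# One training step on random data, only to show which weights move.
tokens = torch.randn(8, 10, 32)
labels = torch.randint(0, 5, (8,))
frozen_before = model.encoder.layers[0].linear1.weight.clone()

loss = nn.functional.cross_entropy(model(tokens), labels)
loss.backward()
optimizer.step()
```

After the step, the head and the LayerNorm parameters have received gradients, while the remaining encoder weights are unchanged.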

Source Code

Coming soon!
