Xuehai He 1, Chunyuan Li 2, Pengchuan Zhang 2, Jianwei Yang 2, Xin Eric Wang 1
1UC Santa Cruz, 2Microsoft Research, Redmond
In computer vision, great transfer learning performance has been achieved by adapting large-scale pretrained vision models (e.g., vision transformers) to downstream tasks. Common approaches for model adaptation either update all model parameters or leverage linear probes. In this paper, we study parameter-efficient model adaptation strategies for vision transformers on the image classification task. We formulate efficient model adaptation as a subspace training problem and perform a comprehensive benchmarking of different efficient adaptation methods. We conduct an empirical study of each efficient model adaptation method, focusing on its performance alongside its parameter cost. Furthermore, we propose a parameter-efficient model adaptation framework, which first selects submodules by measuring local intrinsic dimensions and then projects them into a subspace for further decomposition via a novel Kronecker Adaptation (KAdaptation) method. We analyze and compare our method with a diverse set of baseline model adaptation methods (including state-of-the-art methods for pretrained language models). Our method achieves the best tradeoff between accuracy and parameter efficiency across 20 image classification datasets under the few-shot setting and 7 image classification datasets under the full-shot setting.
Results
The figure shows the tradeoff between accuracy and the number of trainable parameters for various model adaptation methods. The results are measured with the vision transformer (ViT-B-224/32) pretrained via CLIP, averaged over 20 image classification datasets. Our method lies in the top-left corner and achieves the best tradeoff between accuracy and parameter efficiency. The color of the points and numbers denotes the performance-efficiency (PE) metric (higher is better), defined by:
PE = score × exp(−log₁₀(#trainable parameters / M₀ + 1))
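As a quick illustration, the PE metric can be computed as in the minimal sketch below; the normalization constant `m0` (set to 1e8 here) and the example numbers are assumptions for illustration only, not values taken from the paper.

```python
import math

def pe_metric(score: float, num_trainable_params: float, m0: float = 1e8) -> float:
    """Performance-efficiency (PE) metric: higher is better.

    `score` is the task accuracy; `m0` is a normalization constant for the
    trainable-parameter count (1e8 here is an illustrative assumption).
    """
    return score * math.exp(-math.log10(num_trainable_params / m0 + 1))

# Hypothetical example: 70% average accuracy with 80K trainable parameters.
print(pe_metric(0.70, 8e4))
```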
Our proposed Kronecker Adaptation (KAdaptation) is the state of the art on the Parameter-Efficiency track, with both ViT-B/32 and ViT-B/16, in the Image Classification in the Wild (ICinW) Challenge at the ECCV 2022 workshop.
To improve efficiency, we propose a novel method for adapting the weights of the attention modules in the vision transformer. Rather than directly adapting the pretrained weights, we decompose the weight updates into a sum of Kronecker products and further decompose the Kronecker factors into products of low-rank matrices. This decomposition reduces the parameter space while maintaining performance. Please refer to our paper for more details on this method.
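The sketch below is a minimal, illustrative PyTorch version of this idea, not the exact implementation from the paper: the update ΔW to a frozen attention weight is parameterized as a sum of Kronecker products A_i ⊗ (u_i v_i), where only the small factors A_i, u_i, v_i are trained. The factor shapes, number of terms, rank, and initializations are assumptions for illustration.

```python
import torch
import torch.nn as nn

class KroneckerAdapter(nn.Module):
    """Illustrative sketch of Kronecker Adaptation (KAdaptation).

    The frozen pretrained weight W (d_out x d_in) receives an update
    delta_W = sum_i kron(A_i, u_i @ v_i), where A_i, u_i, v_i are the only
    trainable parameters. Shapes and initializations here are assumptions
    for illustration, not the paper's exact configuration.
    """

    def __init__(self, weight: torch.Tensor, n_terms: int = 4, rank: int = 8):
        super().__init__()
        d_out, d_in = weight.shape
        # Kronecker factor shapes: (n_terms x n_terms) ⊗ (b_out x b_in) = (d_out x d_in).
        assert d_out % n_terms == 0 and d_in % n_terms == 0
        b_out, b_in = d_out // n_terms, d_in // n_terms

        self.weight = nn.Parameter(weight, requires_grad=False)        # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(n_terms, n_terms, n_terms) * 0.02)
        self.u = nn.Parameter(torch.zeros(n_terms, b_out, rank))       # zero init -> delta_W = 0 at start
        self.v = nn.Parameter(torch.randn(n_terms, rank, b_in) * 0.02)

    def delta_weight(self) -> torch.Tensor:
        # Low-rank products B_i = u_i v_i, then sum of Kronecker products A_i ⊗ B_i.
        B = self.u @ self.v                                            # (n_terms, b_out, b_in)
        return sum(torch.kron(self.A[i], B[i]) for i in range(self.A.shape[0]))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_in) -> (batch, d_out), using the adapted weight.
        return x @ (self.weight + self.delta_weight()).T
```

In practice, one would wrap the attention projection weights of the pretrained vision transformer with such an adapter and train only the adapter parameters, keeping the backbone frozen.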
Contact Xuehai He at xhe89@ucsc.edu for more information about the project.
@article{he2022parameter,
title={Parameter-efficient Fine-tuning for Vision Transformers},
author={He, Xuehai and Li, Chunyuan and Zhang, Pengchuan and Yang, Jianwei and Wang, Xin Eric},
journal={arXiv preprint arXiv:2203.16329},
year={2022}
}