ELVM
Efficient Large Vision Models
CVPR Workshop, June 17, 2024
Overview
Large vision models (LVMs) are becoming the foundation for many computer vision tasks. CLIP, DINO, SAM, and their variants are effectively used to solve various downstream tasks across different image distributions without any fine-tuning. Diffusion-based text-to-image generative models, e.g., DALL-E and Stable Diffusion, have shown great capabilities in various image generation, editing, and enhancement tasks. This positions LVMs as a building block for most computer vision solutions. The substantial representational capacity inherent in transformer architectures, coupled with their huge number of parameters, empowers LVMs to learn from massive datasets through self-supervised learning without requiring manual annotation. However, this comes at a high computational cost that restricts the adoption of LVMs in settings with limited computational resources.
This workshop focuses on enhancing the computational efficiency of LVMs, with the aim of broadening their accessibility to a wider community of researchers and practitioners. We believe that exploring ideas for efficient adaptation of LVMs to downstream tasks and domains, without the need for intensive model training and fine-tuning, empowers the community to conduct research on a limited compute budget. Moreover, accelerating the inference of LVMs enables their adoption in many real-time applications on low-compute platforms, including vehicles and phones.