Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners

Xuehai He 1, Weixi Feng 2, Tsu-Jui Fu 2, Varun Jampani 3, Arjun Akula 3, Pradyumna Narayana 3, Sugato Basu 3, William Yang Wang 2, Xin Eric Wang 1

 1UC Santa Cruz, 2UC Santa Barbara, 3Google

* Featured in Hugging Face Daily Papers

Abstract

Diffusion models, such as Stable Diffusion, have shown impressive performance on text-to-image generation. Since text-to-image generation often requires models to render visual concepts with the fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To answer this question, we propose a novel approach, Discriminative Stable Diffusion (Discffusion), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach uses the cross-attention scores of a Stable Diffusion model to capture the mutual influence between visual and textual information, and fine-tunes the model via efficient attention-based prompt learning to perform image-text matching. By comparing Discffusion with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of using pre-trained diffusion models for discriminative tasks, with superior results on few-shot image-text matching.

Method
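
As a rough illustration of the idea described in the abstract, the sketch below scores an image-text pair by how strongly text tokens interact with image features inside a cross-attention layer. This is a minimal conceptual sketch, not the authors' implementation: the class name, feature dimensions, randomly initialized projections, and the max-then-mean pooling are illustrative assumptions.

import torch
import torch.nn as nn

class CrossAttentionScorer(nn.Module):
    """Toy cross-attention matching score (hypothetical stand-in for Discffusion)."""

    def __init__(self, d_img=320, d_txt=768, d_attn=64):
        super().__init__()
        # Stand-ins for the U-Net's query/key projections (pre-trained in practice).
        self.to_q = nn.Linear(d_img, d_attn, bias=False)
        self.to_k = nn.Linear(d_txt, d_attn, bias=False)
        self.scale = d_attn ** -0.5

    def forward(self, image_feats, text_feats):
        # image_feats: (num_patches, d_img) flattened latent image features
        # text_feats:  (num_tokens, d_txt) text-encoder token embeddings
        q = self.to_q(image_feats)           # (num_patches, d_attn)
        k = self.to_k(text_feats)            # (num_tokens,  d_attn)
        logits = q @ k.t() * self.scale      # (num_patches, num_tokens)
        # One simple pooling choice: for each text token, take its strongest
        # response over spatial locations, then average over tokens.
        return logits.max(dim=0).values.mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    scorer = CrossAttentionScorer()
    image = torch.randn(64, 320)       # fake 8x8 latent grid, flattened
    caption_a = torch.randn(12, 768)   # fake token embeddings for caption A
    caption_b = torch.randn(12, 768)   # fake token embeddings for caption B
    # The caption with the higher score is taken as the better match.
    print(scorer(image, caption_a).item(), scorer(image, caption_b).item())

In Discffusion itself, the cross-attention layers are the pre-trained ones inside Stable Diffusion's U-Net rather than random projections, and only lightweight prompt parameters are tuned for few-shot image-text matching.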

Results


Questions?

Contact Xuehai He at xuehaihe2008@gmail.com for more information about the project.

@article{he2023discriminative,
  title={Discriminative Diffusion Models as Few-shot Vision and Language Learners},
  author={He, Xuehai and Feng, Weixi and Fu, Tsu-Jui and Jampani, Varun and Akula, Arjun and Narayana, Pradyumna and Basu, Sugato and Wang, William Yang and Wang, Xin Eric},
  journal={arXiv preprint arXiv:2305.10722},
  year={2023}
}