ComCLIP: Training-Free Compositional Image and Text Matching

Kenan Jiang* 1, Xuehai He* 2, Ruize Xu 3, Xin Eric Wang 2

1 UC Berkeley, 2 UC Santa Cruz, 3 Renmin University of China

*equal contribution

Abstract

Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for matching images and text. However, it is still challenging to adapt vision-language pretrained models like CLIP to compositional image and text matching, a harder variant of the task that requires the model to understand compositional word concepts and visual components. Toward better compositional generalization in zero-shot image and text matching, in this paper we study the problem from a causal perspective: the erroneous semantics of individual entities are essentially confounders that cause matching failures. We therefore propose a novel training-free compositional CLIP model (ComCLIP). ComCLIP disentangles input images into subject, object, and action subimages and composes CLIP's vision encoder and text encoder to perform evolving matching over compositional text embeddings and subimage embeddings. In this way, ComCLIP can mitigate spurious correlations introduced by the pretrained CLIP models and dynamically evaluate the importance of each component. Experiments on four compositional image-text matching datasets (Winoground, VL-checklist, SVO, and ComVG) and two general image-text retrieval datasets (Flickr30K and MSCOCO) demonstrate the effectiveness of our plug-and-play method, which boosts the zero-shot inference ability of CLIP, SLIP, and BLIP2 even without further training or fine-tuning.


Method

We disentangle the input image into subject, object, and predicate subimages using three independent encoding mechanisms, one for each role. The entity information is then introduced into the global embedding of the whole image. The module components from CLIP (vision encoder F(·) and text encoder G(·)) are kept frozen throughout, so the whole framework is training-free. A minimal sketch of this composition step is given below.
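As a rough illustration of the training-free composition with a frozen CLIP, here is a minimal sketch using the Hugging Face transformers CLIPModel. The upstream steps are assumed to have already happened: extracting subject/object/predicate subimage crops (e.g., via grounding or dense captioning) and parsing the sentence into its subject, predicate, and object words. The softmax weighting below is a simplified stand-in for the paper's evolving matching, and comclip_score is an illustrative name, not the repository's actual API.

import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
# Frozen CLIP vision encoder F(.) and text encoder G(.)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_images(images):
    """Encode a list of PIL images with the frozen CLIP vision encoder."""
    inputs = processor(images=images, return_tensors="pt").to(device)
    return F.normalize(model.get_image_features(**inputs), dim=-1)

@torch.no_grad()
def embed_texts(texts):
    """Encode a list of strings with the frozen CLIP text encoder."""
    inputs = processor(text=texts, return_tensors="pt", padding=True).to(device)
    return F.normalize(model.get_text_features(**inputs), dim=-1)

@torch.no_grad()
def comclip_score(full_image, subimages, sentence, components):
    """Illustrative composed matching score (assumption, not the official API).

    full_image: PIL image of the whole scene.
    subimages:  one crop per parsed component (subject, predicate, object).
    sentence:   the full caption.
    components: the parsed subject/predicate/object words, aligned with subimages.
    """
    img_embs = embed_images([full_image] + list(subimages))
    global_emb, sub_embs = img_embs[0], img_embs[1:]       # (d,), (k, d)
    txt_embs = embed_texts([sentence] + list(components))
    sent_emb, comp_embs = txt_embs[0], txt_embs[1:]        # (d,), (k, d)

    # Weight each subimage by how well it matches its text component,
    # then fold the weighted entity embeddings into the global image embedding.
    weights = torch.softmax((sub_embs * comp_embs).sum(-1), dim=0)   # (k,)
    composed = F.normalize(global_emb + (weights[:, None] * sub_embs).sum(0), dim=-1)
    return (composed @ sent_emb).item()

Because the encoders are never updated, the same routine can wrap other CLIP-style dual encoders: only the embedding calls change, while the composition step stays the same.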

Questions?

Contact Xuehai He at xhe89@ucsc.edu for more information on the project.

@article{jiang2022comclip,
  title={ComCLIP: Training-Free Compositional Image and Text Matching},
  author={Jiang, Kenan and He, Xuehai and Xu, Ruize and Wang, Xin Eric},
  journal={arXiv preprint arXiv:2211.13854},
  year={2022}
}