[CVPR 2023] Improving Zero-shot Generalization and Robustness of Multi-modal Models

Yunhao Ge*, Jie Ren*, Andrew Gallagher, Yuxiao Wang, Ming-Hsuan Yang, Hartwig Adam, Laurent Itti, Balaji Lakshminarayanan, Jiaping Zhao (* equal contribution)

Contribution 1: CLIP Zero-shot inference failure case analysis

We identify several failure modes of zero-shot ImageNet classification with multi-modal models such as CLIP. Our analysis suggests that the text encoder is highly sensitive to the choice of prompts: to improve prediction accuracy, prompts need to be carefully designed.
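As a concrete illustration, the minimal sketch below runs zero-shot CLIP classification under two prompt templates that differ only in phrasing. The model checkpoint, image path, class names, and templates are illustrative assumptions, not the exact configurations from our analysis.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative image and class names (assumptions for this sketch).
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
class_names = ["tabby cat", "golden retriever", "sports car"]

# Two prompt templates that differ only in wording.
for template in ["a photo of a {}.", "an image of the {}."]:
    text = clip.tokenize([template.format(c) for c in class_names]).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # Cosine similarity between normalized image and text embeddings.
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    print(template, dict(zip(class_names, probs[0].tolist())))

Even such small wording changes can flip the top-1 prediction on some images, which motivates the confidence score described next.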

Contribution 2: Self-consistent zero-shot confidence estimation

We propose a simple yet effective zero-shot confidence score that is better suited to multi-modal models, based on the self-consistency of predictions under different text prompts and image perturbations.
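The sketch below illustrates one way to compute such a score, assuming precomputed, L2-normalized text embeddings for each prompt template: confidence is the fraction of prompt/perturbation variants that agree with the majority prediction. The helper names and the majority-vote aggregation are illustrative assumptions, not the paper's exact recipe.

from collections import Counter

import torch

def zero_shot_predict(model, image_tensor, text_features):
    # Single zero-shot prediction: nearest text embedding by cosine similarity.
    # text_features is assumed to be L2-normalized, shape (num_classes, dim).
    with torch.no_grad():
        f = model.encode_image(image_tensor)
        f = f / f.norm(dim=-1, keepdim=True)
        return int((f @ text_features.T).argmax(dim=-1))

def self_consistency_confidence(model, image_tensor, text_features_variants,
                                perturbations):
    # Collect predictions over all prompt-template and perturbation variants,
    # then score confidence as the agreement rate with the majority vote.
    preds = []
    for text_features in text_features_variants:   # one entry per prompt template
        for perturb in perturbations:              # e.g. identity, flip
            preds.append(zero_shot_predict(model, perturb(image_tensor),
                                           text_features))
    majority, count = Counter(preds).most_common(1)[0]
    return majority, count / len(preds)

# Example perturbations on a (1, 3, H, W) tensor: identity and horizontal flip.
perturbations = [lambda x: x, lambda x: torch.flip(x, dims=[-1])]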

Contribution 3: Hierarchy-CLIP: Top-down and bottom-up label augmentation using WordNet hierarchy

Top-down: augment each class name with its WordNet parents (hypernyms).

Bottom-up: augment each class name with its WordNet children (hyponyms).


Our method is hyperparameter-free, requires no additional model training, and scales easily to other large multi-modal architectures.
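A minimal sketch of this label augmentation using NLTK's WordNet interface follows. Looking classes up by name and taking the first noun sense are simplifying assumptions (ImageNet classes are actually indexed by WordNet synset IDs), and the phrasing used to combine names with parents and children is illustrative.

import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def augment_label(class_name: str):
    # Look up the class name in WordNet (assumption: first noun sense).
    synsets = wn.synsets(class_name.replace(" ", "_"), pos=wn.NOUN)
    if not synsets:
        return [class_name]
    syn = synsets[0]
    parents = [l.name().replace("_", " ")
               for h in syn.hypernyms() for l in h.lemmas()]
    children = [l.name().replace("_", " ")
                for h in syn.hyponyms() for l in h.lemmas()]
    # Top-down: attach a parent; bottom-up: attach a child.
    # The combination phrasing here is an illustrative assumption.
    top_down = [f"{class_name}, a type of {p}" for p in parents]
    bottom_up = [f"{class_name}, such as {c}" for c in children]
    return [class_name] + top_down + bottom_up

print(augment_label("retriever"))

The augmented names can then be fed through the text encoder in place of the bare class names, with no change to the model or its weights.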

Contact / Cite

Got questions? We would love to answer them! Please reach out by email. You may cite us in your research as:


@inproceedings{ge2023improving,
  title={Improving Zero-shot Generalization and Robustness of Multi-modal Models},
  author={Ge, Yunhao and Ren, Jie and Gallagher, Andrew and Wang, Yuxiao and Yang, Ming-Hsuan and Adam, Hartwig and Itti, Laurent and Lakshminarayanan, Balaji and Zhao, Jiaping},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={11093--11101},
  year={2023}
}