Detecting Unreliable Responses in Generative Vision-Language Models via Visual Uncertainty
By: Kiana Avestimehr, Emily Aye, Zalan Fabian, and Erum Mushtaq
Published at the ICLR 2025 Workshop on Quantify Uncertainty and Hallucination in Foundation Models.
Link to Paper here.
Abstract: Building trust in vision-language models (VLMs) requires reliable uncertainty estimation (UE) to detect unreliable generations. Existing UE approaches often require access to internal model representations, which may not always be feasible. Black-box methods primarily rely on language-based augmentations, such as question rephrasings or sub-question modules, but the role of visual information in UE remains underexplored. To study this, we investigate a visual contrast approach that perturbs input images by removing visual evidence relevant to the question and measures changes in the output distribution. We hypothesize that for unreliable generations, the output distributions from augmented and unaugmented images remain similar despite the removal of key visual information. We evaluate this method on the A-OKVQA dataset using four pre-trained VLMs. Our results show that visual contrast, even when applied only at the first token, can be as effective as, or even outperform, state-of-the-art black-box methods.
Illustration of the proposed Visual Contrast method.
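The abstract above sketches the core mechanism: compare the model's first-token output distribution with and without the question-relevant visual evidence, and treat answers whose distribution barely changes as unreliable. The snippet below is a minimal, hypothetical sketch of that comparison; the Jensen-Shannon divergence, the normalization, and the helper names are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch of the visual-contrast idea described in the abstract.
# Assumptions (not from the paper): the divergence measure (Jensen-Shannon),
# the normalization to [0, 1], and all helper names are illustrative choices.
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence between two first-token distributions."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * (np.log(a + eps) - np.log(b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def visual_contrast_uncertainty(p_original: np.ndarray, p_perturbed: np.ndarray) -> float:
    """Higher when removing question-relevant visual evidence barely changes
    the output distribution, i.e. the answer likely did not rely on the image."""
    return 1.0 - js_divergence(p_original, p_perturbed) / np.log(2)  # JSD normalized to [0, 1]

# Toy usage with dummy first-token distributions over a 5-token vocabulary.
# In practice these would come from running the VLM on the original image and
# on the image with the question-relevant region removed, using the same question.
p_orig = np.array([0.70, 0.10, 0.10, 0.05, 0.05])   # with the full image
p_mask = np.array([0.68, 0.12, 0.10, 0.05, 0.05])   # question-relevant evidence removed
score = visual_contrast_uncertainty(p_orig, p_mask)
print(f"uncertainty score: {score:.3f}")  # near 1.0 -> likely unreliable generation
```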
By: Tuo Zhang, Tiantian Feng, Yibin Ni, Mengqin Cao, Ruying Liu, Kiana Avestimehr, Katharine Butler, Yanjun Weng, Mi Zhang, Shrikanth Narayanan, and Salman Avestimehr
Published at ACL 2025 Findings.
Link to Paper's GitHub here.
Abstract: Large vision-language models (VLMs) have demonstrated remarkable abilities in understanding everyday content. However, their performance in the domain of art, particularly culturally rich art forms, remains less explored. As a pearl of human wisdom and creativity, art encapsulates complex cultural narratives and symbolism. In this paper, we offer the Pun Rebus Art Dataset, a multimodal dataset for art understanding deeply rooted in traditional Chinese culture. We focus on three primary tasks: identifying salient visual elements, matching elements with their symbolic meanings, and explaining the conveyed messages. Our evaluation reveals that state-of-the-art VLMs struggle with these tasks, often providing biased and hallucinated explanations and showing limited improvement through in-context learning. By releasing the Pun Rebus Art Dataset, we aim to facilitate the development of VLMs that can better understand and interpret culturally specific content, promoting greater inclusiveness beyond English-based corpora.
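As a rough illustration of how the three tasks could be posed to a VLM, here is a hypothetical prompting sketch; the prompt wording, the `query_vlm` interface, and the output format are assumptions for illustration only and do not reflect the dataset's actual evaluation protocol (see the GitHub repository linked above).

```python
# Hedged sketch of posing the three Pun Rebus Art tasks to a VLM.
# TASK_PROMPTS, query_vlm, and the file path are illustrative assumptions,
# not the dataset's actual prompts or API.
from typing import Callable, Dict

TASK_PROMPTS: Dict[str, str] = {
    "visual_elements": "List the salient visual elements depicted in this artwork.",
    "symbolic_matching": "Match each visual element to the symbolic meaning it conveys.",
    "message_explanation": "Explain the overall message conveyed by this pun rebus artwork.",
}

def evaluate_artwork(image_path: str, query_vlm: Callable[[str, str], str]) -> Dict[str, str]:
    """Run all three tasks on one artwork using a caller-supplied VLM interface."""
    return {task: query_vlm(image_path, prompt) for task, prompt in TASK_PROMPTS.items()}

# Toy usage with a stub standing in for a real VLM call.
if __name__ == "__main__":
    stub = lambda img, prompt: f"[model answer to: {prompt}]"
    print(evaluate_artwork("artwork_001.jpg", stub))
```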