Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA

Yue Fan, Jing Gu, Kaiwen Zhou, Qianqi Yan, Shan Jiang, Ching-Chen Kuo, Yang Zhao‡, Xinze Guan, and Xin Eric Wang

†University of California, Santa Cruz; ‡eBay Inc.

Abstract

Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily lives. These images, characterized by their composition of multiple subfigures in distinct layouts, effectively convey information to people. Toward building advanced multimodal AI applications, such as agents that understand complex scenes and navigate webpages, the skill of multipanel visual reasoning is essential, and a comprehensive evaluation of models in this regard is important. Therefore, our paper introduces Multipanel Visual Question Answering (MultipanelVQA), a novel benchmark that specifically challenges models to comprehend multipanel images. The benchmark comprises 6,600 questions and answers related to multipanel images. While these questions are straightforward for average humans, who answer them with near-perfect accuracy, they pose significant challenges to the state-of-the-art Large Vision-Language Models (LVLMs) we tested. In our study, we used synthetically curated multipanel images specifically designed to isolate and evaluate the impact of diverse factors on model performance, revealing the sensitivity of LVLMs to various interferences in multipanel images, such as adjacent subfigures and layout complexity. As a result, MultipanelVQA highlights the need and direction for improving LVLMs' ability to understand complex visual-language contexts.

Figure 1. Examples of single-panel vs. multipanel image VQA. GPT-4V distinguishes muffins from chihuahuas in the single-panel image but struggles with the same content in the multipanel image.

Benchmark Overview

The benchmark consists of two subsets: one built from real-world multipanel images and one built from synthetically curated multipanel images.

Each image is paired with three distinct question styles: Q1, Q2, and Q3.
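To make this data format concrete, below is a minimal, hypothetical Python sketch of how such records (a multipanel image paired with a question, its style tag, and a ground-truth answer) might be scored with exact-match accuracy per question style. The field names, the answer_question model call, and the exact-match metric are illustrative assumptions, not the benchmark's released API or its official evaluation protocol.

from dataclasses import dataclass

@dataclass
class VQAExample:
    image_path: str      # path to a multipanel image (real-world or synthetic subset)
    question: str        # question text in one of the three styles
    question_style: str  # "Q1", "Q2", or "Q3"
    answer: str          # ground-truth short answer

def answer_question(image_path: str, question: str) -> str:
    """Placeholder for a call to an LVLM; swap in the model you want to evaluate."""
    raise NotImplementedError

def accuracy_by_style(examples: list[VQAExample]) -> dict[str, float]:
    """Exact-match accuracy per question style (a simplification for illustration)."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for ex in examples:
        prediction = answer_question(ex.image_path, ex.question)
        total[ex.question_style] = total.get(ex.question_style, 0) + 1
        if prediction.strip().lower() == ex.answer.strip().lower():
            correct[ex.question_style] = correct.get(ex.question_style, 0) + 1
    return {style: correct.get(style, 0) / count for style, count in total.items()}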

Figure 2. Overview of the MultipanelVQA benchmark data.

Leaderboard

Samples of Data and Model Outputs

Analysis Results

Error Analysis of MultipanelVQA

If you use our work, please cite our paper as follows.

@misc{fan2024muffin,
      title={Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA},
      author={Yue Fan and Jing Gu and Kaiwen Zhou and Qianqi Yan and Shan Jiang and Ching-Chen Kuo and Xinze Guan and Xin Eric Wang},
      year={2024},
      eprint={2401.15847},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}


Have any questions or suggestions? Feel free to reach out at yfan71@ucsc.edu.