Muffin or Chihuahua?
Challenging Large Vision-Language Models with Multipanel VQA
Yue Fan†, Jing Gu†, Kaiwen Zhou†, Qianqi Yan†, Shan Jiang‡, Ching-Chen Kuo‡, Yang Zhao‡, Xinze Guan‡, and Xin Eric Wang†
†University of California, Santa Cruz, ‡eBay Inc.
Abstract
Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily lives. These images, characterized by their composition of multiple subfigures in distinct layouts, effectively convey information to people. Toward building advanced multimodal AI applications, such as agents that understand complex scenes and navigate webpages, the skill of multipanel visual reasoning is essential, and a comprehensive evaluation of models in this regard is important. Therefore, our paper introduces Multipanel Visual Question Answering (MultipanelVQA), a novel benchmark that specifically challenges models to comprehend multipanel images. The benchmark comprises 6,600 questions and answers related to multipanel images. While these questions are straightforward for average humans, who achieve nearly perfect accuracy on them, they pose significant challenges to the state-of-the-art Large Vision-Language Models (LVLMs) we tested. In our study, we utilized synthetically curated multipanel images specifically designed to isolate and evaluate the impact of diverse factors on model performance, revealing the sensitivity of LVLMs to various interferences in multipanel images, such as adjacent subfigures and layout complexity. As a result, MultipanelVQA highlights the need and direction for improving LVLMs' ability to understand complex visual-language contexts.
Figure 1. Examples of single-panel vs. multipanel image VQA. GPT-4V distinguishes the muffin and the chihuahua in the single-panel image but struggles with the same content in the multipanel image.
Benchmark Overview
The benchmark consists of two subsets:
the Synthetic subset with artificially generated multipanel images, and
the Real-world subset featuring multipanel images sourced from actual posters and web screenshots.
Each image is paired with questions in three distinct styles: Q1, Q2, and Q3.
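To make the data organization concrete, below is a minimal sketch of how a benchmark record and a per-style accuracy score might be computed. The field names (`image_path`, `subset`, `questions`, `answers`) and the exact-match scoring are illustrative assumptions, not the benchmark's actual schema or official evaluation protocol.

```python
# Hypothetical sketch of a MultipanelVQA record and per-style accuracy.
# Field names and the exact-match metric are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class MultipanelVQARecord:
    image_path: str   # path to the multipanel image
    subset: str       # "synthetic" or "real-world"
    questions: dict   # {"Q1": ..., "Q2": ..., "Q3": ...}
    answers: dict     # gold answer per question style


def accuracy_by_style(records, predictions):
    """Average exact-match accuracy per question style (Q1/Q2/Q3)."""
    totals, correct = {}, {}
    for rec, pred in zip(records, predictions):
        for style, gold in rec.answers.items():
            totals[style] = totals.get(style, 0) + 1
            if pred[style].strip().lower() == gold.strip().lower():
                correct[style] = correct.get(style, 0) + 1
    return {s: correct.get(s, 0) / totals[s] for s in totals}
```

A usage example: with one record whose gold answers are `{"Q1": "muffin", "Q2": "2", "Q3": "top-left"}` and a prediction of `{"Q1": "Muffin", "Q2": "3", "Q3": "top-left"}`, the function returns accuracy 1.0 for Q1 and Q3 and 0.0 for Q2.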
Figure 2. Overview of MultipanelVQA Benchmark data.
Leaderboard
Average accuracy of LVLMs answering questions based on corresponding multipanel images.
Two proprietary models, GPT-4V and Gemini Pro Vision, demonstrate the best overall performance.
However, there is a notable gap between model and human performance.
Samples of Data and Model Outputs
Please cite our paper as follows if you use our work.
@misc{fan2024muffin,
title={Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA},
author={Yue Fan and Jing Gu and Kaiwen Zhou and Qianqi Yan and Shan Jiang and Ching-Chen Kuo and Yang Zhao and Xinze Guan and Xin Eric Wang},
year={2024},
eprint={2401.15847},
archivePrefix={arXiv},
primaryClass={cs.CV}
}