Yixin Yang, Zheng Li, Qingxiu Dong, Heming Xia, Zhifang Sui
Peking University, The Hong Kong Polytechnic University, Jiangsu Normal University
[Paper] [Dataset&Code] [Leaderboard]
Deep semantics of an image refer to the underlying meanings that extend beyond its surface-level interpretation and probe the essence of the image. Understanding the deep semantics of images is a manifestation of high-level human intelligence and an important step in the progression from perceptual intelligence to cognitive intelligence.
However, previous efforts in visual understanding mainly focus on surface-level aspects of images, such as object attributes, counting, and relationship reasoning. Earlier attempts at deep semantics are limited in scope, focusing solely on sarcasm or humor, and lack a systematic investigation of inherent deep semantics.
To address these limitations and fill the current research gap, we introduce DeepEval, a benchmark for understanding the deep semantics of cartoons across various categories.
Through DeepEval, we hope to promote research in model development, focusing on a deeper understanding of semantics in visual content.
To evaluate the ability of various Large Multimodal Models (LMMs) to understand the deep semantics of images, we introduce DeepEval, a comprehensive evaluation benchmark that includes a human-annotated dataset and three progressive subtasks.
The subtasks in DeepEval are designed with a hierarchical relationship: each task builds upon the previous one to progressively deepen the level of image comprehension. In all three tasks, each question consists of an image and a multiple-choice question with four options, and the model must select the option it believes best conveys the description, title, or deep semantics of the image (a minimal scoring sketch follows the list below). The three subtasks are as follows:
Fine-grained Description Selection Task: Evaluating the ability of models to accurately identify the surface-level details of images;
In-depth Title Matching Task: Assessing the capability of models to understand the overall significance of images;
Deep Semantics Understanding Task: Evaluating the ability of models to understand the detailed deep semantic meanings of images.
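To make the question format concrete, the sketch below shows how one such four-option item could be scored. The field names (`image_path`, `question`, `options`, `answer`) and the `query_model` callable are assumptions for illustration; see the released dataset and code for the actual schema and evaluation scripts.

```python
# Minimal sketch of scoring a DeepEval-style multiple-choice item.
# Field names and query_model are illustrative assumptions, not the released schema.

def build_prompt(question: str, options: list[str]) -> str:
    """Format one four-option multiple-choice question as a text prompt."""
    letters = ["A", "B", "C", "D"]
    lines = [question] + [f"{l}. {o}" for l, o in zip(letters, options)]
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)

def accuracy(items: list[dict], query_model) -> float:
    """Score items with query_model(image_path, prompt) -> letter string."""
    correct = 0
    for item in items:
        prompt = build_prompt(item["question"], item["options"])
        prediction = query_model(item["image_path"], prompt)
        correct += int(prediction.strip().upper().startswith(item["answer"]))
    return correct / len(items)
```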
The data and tasks in DeepEval are derived through a rigorous process involving image collection, annotator recruitment, pre-annotation instruction, a qualification test, cross-checking annotation, and option generation (detailed in Section 4 of the paper). During the cross-checking annotation step, multiple rounds of checking, annotation, and author verification ensure the quality of the annotations, as illustrated in the figure below.
Given the strong performance of LMMs on image comprehension tasks, we evaluate the following models: LLaVA-1.5, MiniGPT-4, mPLUG-Owl2, CogVLM, Qwen-VL, InstructBLIP, Fuyu, and GPT-4V.
The accuracy of all evaluated models in deep semantics understanding is significantly lower than their accuracy in image description, and nearly all models also score lower on deep semantics understanding than on in-depth title matching. This underscores that comprehending the deep semantics of images remains a significant challenge for these models, and that attending to the finer details of deep semantics adds further complexity, in line with our expectations.
Our evaluation also demonstrates a substantial gap between the deep semantic comprehension capabilities of existing LMMs and humans. For example, GPT-4V is 30% behind humans in understanding deep semantics, even though it achieves human-comparable performance in image description.
(1) By analyzing the models' understanding capabilities in different categories, we can pinpoint the strengths and weaknesses of models in specific categories. The performance of different models across categories is illustrated in the figure below, with three radar charts showing the models' ability to interpret image descriptions, titles, and deep semantics across categories. The models' understanding of image descriptions is relatively consistent across categories, whereas their comprehension of the deep semantics of images varies significantly.
(2) We also examine the impact of superficial descriptions on the models' deep semantic comprehension. Incorporating either a machine-generated superficial description or a human-annotated correct superficial description at inference time can inspire and enhance a model's deep semantics understanding capabilities. The results are shown in the table below, where "DS" stands for "Deep Semantics", "GeneDesc" denotes the integration of model-generated image descriptions, and "AnnoDesc" denotes the integration of annotated image descriptions.
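As a rough illustration of this setting, the sketch below prepends a surface-level description to the deep-semantics question before querying a model. The prompt template is an assumption, not the exact wording used in the paper.

```python
# Illustrative sketch of the description-augmented setting ("GeneDesc" / "AnnoDesc"):
# a surface-level description is prepended to the deep-semantics question.
# The prompt wording is an assumption, not the paper's exact template.

def build_augmented_prompt(description: str, question: str, options: list[str]) -> str:
    """Prepend an image description to a four-option multiple-choice prompt."""
    letters = ["A", "B", "C", "D"]
    body = [question] + [f"{l}. {o}" for l, o in zip(letters, options)]
    return (
        f"Image description: {description}\n"
        + "\n".join(body)
        + "\nAnswer with a single letter (A, B, C, or D)."
    )
```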
(3) We also find that increasing the number of parameters has a positive impact on the models' deep semantics understanding capabilities. The experimental results are shown in the figure below, which compares the average accuracy and variance of InstructBLIP-13B vs. InstructBLIP-7B and LLaVA-1.5-13B vs. LLaVA-1.5-7B.
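For reference, the toy snippet below shows the kind of mean-and-variance comparison involved; the per-category accuracy values are placeholders for illustration only, not numbers from the paper.

```python
# Toy illustration of the size comparison: averaging per-category accuracies
# and reporting their variance for a 7B vs. 13B pair.
# The values are placeholders, not results from the paper.
from statistics import mean, pvariance

per_category_acc = {
    "7B":  [0.41, 0.38, 0.45, 0.40],   # hypothetical per-category accuracies
    "13B": [0.48, 0.44, 0.50, 0.47],   # hypothetical per-category accuracies
}

for size, accs in per_category_acc.items():
    print(f"{size}: mean accuracy = {mean(accs):.3f}, variance = {pvariance(accs):.4f}")
```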
This paper is supported by the National Key Research and Development Program of China (No. 2020AAA0106700). The contact author is Zhifang Sui.