Comics are a uniquely compelling visual storytelling medium, blending images and text to convey intricate narratives. Unlike other visual media such as photographs or videos, comics rely on discrete panels, stylized characters, and implicit transitions that require readers to infer context and causality. The interplay between visual elements, speech bubbles, and captions enables rich, multimodal communication, making comics both a fascinating artistic domain and a challenging testbed for AI. Beyond entertainment, comics are used in education, journalism, and digital humanities, highlighting their broad cultural and communicative significance.
Despite rapid progress in vision-language models, AI systems continue to struggle with comic understanding. Unlike natural images, which depict real-world scenes, or structured documents, which follow rigid layouts, comics present highly abstract and diverse representations. Tasks such as panel sequencing, entity tracking, and cross-panel reasoning remain difficult even for state-of-the-art models. Current approaches often fail to handle character consistency across panels, implicit storytelling gaps, and the multimodal fusion of text and imagery. Whereas videos provide continuity through object motion, comics require AI to infer relationships between discrete, disconnected frames, rendering traditional visual reasoning techniques largely ineffective.
These challenges stem from fundamental limitations in existing AI methodologies. Most vision-language models are trained on large-scale datasets of real-world images and text, making comics an extreme case of domain shift. Data scarcity exacerbates the problem, as annotated comic datasets are limited due to copyright constraints and the high cost of manual labeling. Additionally, comics vary drastically in artistic style, panel arrangement, and cultural conventions, making it difficult to develop models that generalize across different formats. Without standardized benchmarks, progress in this domain has been slow and fragmented.
This workshop will bring together researchers from computer vision, cognitive science, and multimedia analysis to advance AI-driven comic understanding. Through invited talks, discussions, and presentations, we will explore new methodologies for multimodal reasoning, self-supervised learning, and dataset curation. A central component of the workshop will be the Comics Visual Question Answering (CVQA) Challenge, which introduces a benchmark for evaluating how well AI systems comprehend comics. By fostering collaboration across disciplines, this workshop aims to push the boundaries of multimodal AI and establish comics as a rich proving ground for next-generation vision-language models.