IKIWISI: Visual Debugging for Large Multi-Modal Model Inconsistencies
(DIS'25 Best Paper Honorable Mention)
Problem:
Recent open-vocabulary large multimodal models (LMMs) such as GPT-4V can process and generate text, images, and video. Unlike humans, however, these models often lack common sense, and their outputs can be inconsistent across modalities.
For example, in multi-label video object recognition, a model might identify an object in one frame but miss it in the next, even when the two frames are nearly identical. Aggregate metrics like F1 or average precision (AP) cannot capture such inconsistencies, making human evaluation essential. To support it, we need a simple visualization tool that lets anyone, regardless of machine learning or computer vision expertise, assess whether a model is behaving reliably.
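To make that gap concrete, here is a minimal Python sketch with made-up per-frame predictions (purely illustrative; the "flip count" is an ad-hoc consistency measure, not a metric from the paper). Two models miss the same number of frames and earn identical F1, yet one flickers badly:

```python
import numpy as np
from sklearn.metrics import f1_score

# Ground truth for one object across 12 frames: visible in every frame.
truth = np.ones(12, dtype=int)

# Model A misses the object in one contiguous block of frames.
model_a = np.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1])
# Model B misses the same number of frames, but flickers on and off.
model_b = np.array([1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1])

for name, pred in [("A", model_a), ("B", model_b)]:
    f1 = f1_score(truth, pred)
    flips = int(np.abs(np.diff(pred)).sum())  # changes between adjacent frames
    print(f"model {name}: F1 = {f1:.2f}, flips = {flips}")

# Both models print F1 = 0.86, but model B flips 6 times versus A's 2:
# indistinguishable by F1, yet immediately visible to a human eye.
```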
Solution:
IKIWISI (an acronym for I-Know-It-When-I-See-It), shown in the top-left figure, addresses the challenge of evaluating large multimodal models on multi-object video recognition tasks. It is a lightweight, intuitive tool that visualizes inconsistencies in model predictions across video frames and works as follows:
1. The user selects objects of interest from the object dropdown (E) and a video from the video dropdown (C).
2. Once a video is chosen, the image container (D) is populated with up to 16 keyframes from the video, which the user can zoom into and inspect.
3. IKIWISI renders the model's predictions as a binary heatmap (F), where each row corresponds to an object and each column to a video frame (a rendering sketch follows this list).
4. The heatmap lets users judge the model's performance at a glance by examining the high-level patterns it produces (shown in the bottom-left figure).
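As a rough illustration of step 3, here is a minimal Python sketch, not IKIWISI's actual implementation; the object names and random predictions are stand-ins. It samples up to 16 keyframe indices from a video and renders a binary object-by-frame heatmap:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical inputs: object names and a total frame count for the video.
objects = ["person", "dog", "bicycle", "car"]
n_frames = 240

# Uniformly sample up to 16 keyframe indices from the video.
n_key = min(16, n_frames)
keyframes = np.linspace(0, n_frames - 1, n_key).astype(int)

# Stand-in predictions: 1 = object detected in that keyframe, 0 = not.
rng = np.random.default_rng(0)
preds = rng.integers(0, 2, size=(len(objects), n_key))

# One row per object, one column per keyframe; dark cells mark detections.
fig, ax = plt.subplots(figsize=(8, 2.5))
ax.imshow(preds, cmap="Greys", aspect="auto", interpolation="nearest")
ax.set_yticks(range(len(objects)))
ax.set_yticklabels(objects)
ax.set_xticks(range(n_key))
ax.set_xticklabels(keyframes, fontsize=7)
ax.set_xlabel("video frame index")
ax.set_title("Binary prediction heatmap")
plt.tight_layout()
plt.show()
```

A flickering object shows up as a row of alternating dark and light cells, exactly the kind of pattern that aggregate metrics smooth over but a human eye catches instantly.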
Designed for low effort and high clarity, IKIWISI requires no prior machine learning or computer vision expertise. Its interface supports quick comparison and filtering, letting users spot unreliable prediction patterns and judge model reliability using only visual perception and common sense.
Outcome:
In a user study with 15 participants, the heatmap visualization enabled them to identify subtle inconsistencies (e.g., models flipping answers across frames or hallucinating details).
Participants' ratings of model performance correlated strongly with the models' actual F1 scores.
Participants inspected only a tiny fraction of the frames to judge a model, thanks to the high-level heatmap patterns.
Evaluations showed that even non-technical users could identify inconsistency patterns and contribute to model assessment workflows.
Impact:
IKIWISI is among the first tools to enable intuitive inconsistency diagnosis across LMM outputs without requiring code or model access. It supports:
Faster and more inclusive evaluation workflows for researchers, QA teams, and designers.
Integration into model audit pipelines and prompt refinement tools.
Reliable model selection and deployment decisions by enabling users to identify patterns and anomalies with minimal manual effort.