Xinyi Yang1 2 3 4, Chenheng Xu1 2 3 4, Weijun Hong1 2 3 4 5, Ce Mo6 ✉, Qian Wang2 4, Fang Fang2 4 ✉, Yixin Zhu2 1 3 4 ✉
✉corresponding authors
1Institute for Artificial Intelligence, Peking University
2School of Psychological and Cognitive Sciences, Peking University
3State Key Lab of General Artificial Intelligence, Peking University
4Beijing Key Laboratory of Behavior and Mental Health, Peking University
5Yuanpei College, Peking University
6Department of Psychology, Sun Yat-sen University
A vision–language model may produce diametrically opposed answers to the same moral dilemma depending on the input modality. For instance, when presented with a textual description of a scenario about whether to report a friend for cheating on an exam, Gemini responds affirmatively; when shown an image depicting the same scenario, it responds negatively. An identical dilemma can thus yield opposite verdicts depending solely on how it is presented.
Existing moral evaluation benchmarks are poorly equipped to investigate this modality gap. They predominantly present moral scenarios as text-only questionnaires, overlooking how visual cues fundamentally shape moral judgment. Moreover, they lack the systematic experimental control needed to isolate which variables drive model behavior. Controlled manipulation is standard in moral psychology, but such hand-crafted designs cannot scale to the diversity required for comprehensive AI evaluation.
We address both limitations by introducing the Moral Dilemma Simulation (MDS), a multimodal moral benchmark grounded in Moral Foundations Theory (MFT), which organizes moral cognition around five core dimensions: Care, Fairness, Loyalty, Authority, and Purity. Rather than a static dataset, MDS is a generative engine that presents each dilemma through both a textual description and a visual scene rendered in a sandbox-game style. Crucially, it supports orthogonal control over conceptual variables (intentionality, personal force, self-benefit) and character variables (demographic attributes, relationship factors), enabling causal-level analysis of moral decision-making at the scale modern AI evaluation demands.
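To make the factorial design concrete, the Python sketch below enumerates fully crossed dilemma specifications. All variable names and factor levels here are illustrative placeholders, not the exact factors shipped with MDS.

from dataclasses import dataclass
from itertools import product

# Illustrative factor levels only; the actual MDS levels are documented in the paper and repo.
FOUNDATIONS = ["Care", "Fairness", "Loyalty", "Authority", "Purity"]
CONCEPTUAL = {
    "intentionality": ["intended", "side_effect"],
    "personal_force": ["personal", "impersonal"],
    "self_benefit": ["yes", "no"],
}
CHARACTER = {
    "relationship": ["stranger", "friend", "family"],
    "demographic": ["child", "adult", "elderly"],
}

@dataclass(frozen=True)
class DilemmaSpec:
    """One fully specified dilemma: a moral foundation plus one level of every manipulated variable."""
    foundation: str
    intentionality: str
    personal_force: str
    self_benefit: str
    relationship: str
    demographic: str

def enumerate_specs():
    """Cross every factor level with every other, so each variable varies independently (orthogonally)."""
    keys = list(CONCEPTUAL) + list(CHARACTER)
    levels = list(CONCEPTUAL.values()) + list(CHARACTER.values())
    for foundation in FOUNDATIONS:
        for combo in product(*levels):
            yield DilemmaSpec(foundation, **dict(zip(keys, combo)))

print(sum(1 for _ in enumerate_specs()))  # 5 foundations x 2 x 2 x 2 x 3 x 3 = 360 controlled specs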
Applying a tri-modal diagnostic protocol spanning text, caption, and image modes, we identify a significant modality gap in current VLMs. Compared to text-only scenarios, visual inputs cause models to (a) lose sensitivity to numerical stakes in utilitarian trade-offs, responding indiscriminately regardless of the number of lives saved; (b) prioritize self-interest over loyalty to friends; and (c) collapse hierarchical social values, treating demographically distinct groups as equivalent. Together, these failures reveal (d) a fundamental vulnerability introduced by visual distraction: visual inputs bypass language-level safety filters, producing misaligned outputs that text-based alignment cannot prevent.
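A minimal sketch of how such a tri-modal probe could be run against a model under test is shown below. It assumes a caller-supplied ask(prompt, image=None) wrapper around whatever VLM API is being evaluated, and assumed field names (text_description, scene_caption, rendered_scene) on the generated dilemma; these are illustrative, not the released evaluation harness.

from typing import Callable

def tri_modal_probe(ask: Callable[..., str], dilemma) -> dict:
    """Pose the same dilemma in text, caption, and image modes and compare the yes/no verdicts."""
    question = "Should the agent take this action? Answer 'yes' or 'no'."
    responses = {
        "text": ask(dilemma.text_description + "\n" + question),
        "caption": ask(dilemma.scene_caption + "\n" + question),
        "image": ask(question, image=dilemma.rendered_scene),
    }
    verdicts = {mode: r.strip().lower().startswith("yes") for mode, r in responses.items()}
    # Disagreement between the text and image verdicts is direct evidence of a modality gap.
    verdicts["modality_gap"] = verdicts["text"] != verdicts["image"]
    return verdicts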
We hope MDS and the empirical findings it yields can inform the development of more robust, modality-agnostic alignment approaches.
More materials can be found in our paper and code repo. We have also released our generated dataset. If you find MDS useful, please cite us 🥹.
@article{xu2024preference,
  title={Learning to Plan with Personalized Preferences},
  author={Xu, Manjie and Yang, Xinyi and Liang, Wei and Zhang, Chi and Zhu, Yixin},
  year={2024}
}