Xinyi Yang1 2 3 4, Chenheng Xu1 2 3 4, Weijun Hong1 2 3 4 5, Ce Mo6 ✉, Qian Wang2 4, Fang Fang2 4 ✉, Yixin Zhu2 1 3 4 ✉
✉corresponding authors
1Institute for Artificial Intelligence, Peking University
2School of Psychological and Cognitive Sciences, Peking University
3State Key Lab of General Artificial Intelligence, Peking University
4Beijing Key Laboratory of Behavior and Mental Health, Peking University
5Yuanpei College, Peking University
6Department of Psychology, Sun Yat-sen University
A vision–language model may produce diametrically opposed answers to the same moral dilemma depending on the input modality. For instance, when presented with a textual description of a scenario about whether to report a friend for cheating on an exam, Gemini responds affirmatively; when shown an image depicting the same scenario, it responds negatively. An identical dilemma can thus yield opposite verdicts depending solely on how it is presented.
Existing moral evaluation benchmarks are poorly equipped to investigate this modality gap. They predominantly present moral scenarios as text-only questionnaires, overlooking how visual cues fundamentally shape moral judgment. Moreover, they lack the systematic experimental control needed to isolate which variables drive model behavior. Controlled manipulation is standard in moral psychology, but such hand-crafted designs cannot scale to the diversity required for comprehensive AI evaluation.
We address both limitations by introducing the Moral Dilemma Simulation (MDS), a multimodal moral benchmark grounded in Moral Foundations Theory (MFT), which organizes moral cognition around five core dimensions: Care, Fairness, Loyalty, Authority, and Purity. Rather than a static dataset, MDS is a generative engine that presents each dilemma through both a textual description and a visual scene rendered in a sandbox-game style. Crucially, it supports orthogonal control over conceptual variables (intentionality, personal force, self-benefit) and character variables (demographic attributes, relationship factors), enabling causal-level analysis of moral decision-making at the scale modern AI evaluation demands.
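To make the factorial design concrete, the Python sketch below enumerates fully crossed dilemma specifications. All variable names and factor levels here are illustrative placeholders, not the exact factors shipped with MDS.

from dataclasses import dataclass
from itertools import product

# Illustrative factor levels only; the actual MDS levels are documented in the paper and repo.
FOUNDATIONS = ["Care", "Fairness", "Loyalty", "Authority", "Purity"]
CONCEPTUAL = {
    "intentionality": ["intended", "side_effect"],
    "personal_force": ["personal", "impersonal"],
    "self_benefit": ["yes", "no"],
}
CHARACTER = {
    "relationship": ["stranger", "friend", "family"],
    "demographic": ["child", "adult", "elderly"],
}

@dataclass(frozen=True)
class DilemmaSpec:
    """One fully specified dilemma: a moral foundation plus one level of every manipulated variable."""
    foundation: str
    intentionality: str
    personal_force: str
    self_benefit: str
    relationship: str
    demographic: str

def enumerate_specs():
    """Cross every factor level with every other, so each variable varies independently (orthogonally)."""
    keys = list(CONCEPTUAL) + list(CHARACTER)
    levels = list(CONCEPTUAL.values()) + list(CHARACTER.values())
    for foundation in FOUNDATIONS:
        for combo in product(*levels):
            yield DilemmaSpec(foundation, **dict(zip(keys, combo)))

print(sum(1 for _ in enumerate_specs()))  # 5 foundations x 2 x 2 x 2 x 3 x 3 = 360 controlled specs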
Applying a tri-modal diagnostic protocol spanning text, caption, and image modes, we identify a significant modality gap in current VLMs. Compared to text-only scenarios, visual inputs cause models to (a) lose sensitivity to numerical stakes in utilitarian trade-offs, responding indiscriminately regardless of the number of lives saved; (b) prioritize self-interest over loyalty to friends; and (c) collapse hierarchical social values, treating demographically distinct groups as equivalent. Together, these failures reveal (d) a fundamental vulnerability introduced by visual distraction: visual inputs bypass language-level safety filters, producing misaligned outputs that text-based alignment cannot prevent.
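A minimal sketch of how such a tri-modal probe could be run against a model under test is shown below. It assumes a caller-supplied ask(prompt, image=None) wrapper around whatever VLM API is being evaluated, and assumed field names (text_description, scene_caption, rendered_scene) on the generated dilemma; these are illustrative, not the released evaluation harness.

from typing import Callable

def tri_modal_probe(ask: Callable[..., str], dilemma) -> dict:
    """Pose the same dilemma in text, caption, and image modes and compare the yes/no verdicts."""
    question = "Should the agent take this action? Answer 'yes' or 'no'."
    responses = {
        "text": ask(dilemma.text_description + "\n" + question),
        "caption": ask(dilemma.scene_caption + "\n" + question),
        "image": ask(question, image=dilemma.rendered_scene),
    }
    verdicts = {mode: r.strip().lower().startswith("yes") for mode, r in responses.items()}
    # Disagreement between the text and image verdicts is direct evidence of a modality gap.
    verdicts["modality_gap"] = verdicts["text"] != verdicts["image"]
    return verdicts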
We hope MDS and the empirical findings it yields can inform the development of more robust, modality-agnostic alignment approaches.
More materials can be found in our paper and code repo. We have also released our generated dataset. If you find MDS useful, please cite us 🥹.
@article{xu2024preference,
  title={Learning to Plan with Personalized Preferences},
  author={Xu, Manjie and Yang, Xinyi and Liang, Wei and Zhang, Chi and Zhu, Yixin},
  year={2024}
}