Evaluating and Reducing Deceptive Dialogue from Language Models
with Multi-turn RL
Marwa Abdulhai, Ryan Cheng, Aryansh Shrivastava, Natasha Jaques, Yarin Gal, Sergey Levine
Motivation
Large Language Models (LLMs) interact with hundreds of millions of people worldwide in applications such as customer support, education and healthcare. However, their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. The unpredictable nature of LLM behavior, combined with insufficient safeguards against hallucination, misinformation, and user manipulation, makes their misuse a serious, real-world risk. In this paper, we systematically investigate the extent to which LLMs engage in deception within dialogue.
Contribution
Our contributions are as follows:
1. Measuring deception in LLMs with five deception detection metrics across four dialogue scenarios
2. A novel deception metric—belief misalignment—which quantifies the divergence between a listener’s beliefs
and the true state of the speaker
3. Empirical results quantifying deception in eight widely-deployed LLMs
4. A multi-turn RL pipeline for mitigating deception in LLMs
Our findings reveal that this novel deception measure correlates more closely with human judgments than any of the existing metrics we test. Additionally, our benchmarking of eight state-of-the-art models indicates that LLMs naturally exhibit deceptive behavior in approximately 26% of dialogue turns, even when prompted with seemingly benign objectives. When prompted to deceive, LLMs are capable of increasing deceptiveness by as much as 31% relative to baselines. Unexpectedly, models trained with RLHF, the predominant approach for ensuring the safety of widely-deployed LLMs, still exhibit deception at a rate of 43% on average. Because deception in dialogue is a behavior that develops over an interaction history, effectively evaluating and mitigating it requires moving beyond single-utterance analyses. We introduce a multi-turn reinforcement learning methodology to fine-tune LLMs to reduce deceptive behaviors, leading to a 77.6% reduction compared to other instruction-tuned models.
Belief Misalignment
We benchmark four existing deception metrics and introduce a novel metric for deception in dialogue settings, belief misalignment, which quantifies how much a potentially deceptive action has shifted the listener's beliefs relative to the true state. Specifically, it measures the discrepancy between the listener's beliefs and the actual world state over all time-steps and across all features of the state. We leverage LLMs as evaluators to compute each deception metric [Zheng et al., 2023].
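One formulation consistent with this description (a sketch rather than the paper's exact definition; here b_t^{(i)} denotes the listener's belief about feature i after utterance t) averages the per-feature discrepancy over all utterances:

R_{misalignment} = \frac{1}{nk} \sum_{t=1}^{n} \sum_{i=1}^{k} \left| b_t^{(i)} - \phi^{(i)} \right|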
where R_{misalignment} is the belief misalignment, \phi^{(i)} is the value of feature i in the speaker's state, k is the number of features in the speaker's state, and n is the total number of utterances. The belief misalignment metric distinguishes itself from other deception measures by tracking how deceptive actions shift the listener's beliefs about the features it cares about, rather than simply counting falsehoods. Decomposing the state into individual features allows us to observe the specific impact of each deceptive action on the listener's beliefs about different aspects of the world. This decomposition is also a reasonable assumption, as it mirrors natural language communication, where speakers convey information about particular objects or concepts.
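As a minimal computational sketch, assuming the per-turn listener beliefs and the true feature values have already been elicited (e.g., by an LLM evaluator), the metric reduces to an average absolute discrepancy. The function name and example values below are illustrative, not the paper's implementation:

# Minimal sketch: average per-feature discrepancy between the listener's
# beliefs and the speaker's true state, over all utterances.
def belief_misalignment(listener_beliefs, true_state):
    """listener_beliefs: length-n list; entry t holds the listener's k feature
    beliefs after utterance t. true_state: the k true feature values (phi)."""
    n, k = len(listener_beliefs), len(true_state)
    total = 0.0
    for beliefs_t in listener_beliefs:  # one belief vector per utterance
        total += sum(abs(b - phi) for b, phi in zip(beliefs_t, true_state)) / k
    return total / n

# Illustrative example: 2 binary features, 3 utterances.
beliefs = [[0.5, 0.5], [0.9, 0.5], [1.0, 0.0]]  # listener drifts toward a false picture
truth = [0.0, 1.0]
print(belief_misalignment(beliefs, truth))  # -> 0.733...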
Experiments and Results
Belief misalignment correlates most strongly with human judgments.
We compute deception scores using the existing deception detection metrics and ask 20 human annotators to label a subset of dialogues. We then compute the Pearson correlation coefficient between each deception metric and the human labels, and find belief misalignment to be the metric most strongly correlated with human judgments across all four tasks.
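For illustration, the per-metric correlation with human labels can be computed as follows; the metric names and scores here are made-up placeholders, not the paper's data:

# Pearson correlation between automated deception scores and human labels.
# All values below are illustrative placeholders.
from scipy.stats import pearsonr

human_labels = [0.0, 1.0, 0.5, 1.0, 0.0, 0.5]  # human deception ratings per dialogue
metric_scores = {
    "belief_misalignment": [0.1, 0.9, 0.4, 0.8, 0.2, 0.6],
    "baseline_metric": [0.0, 0.5, 0.5, 0.3, 0.1, 0.2],
}

for name, scores in metric_scores.items():
    r, p = pearsonr(scores, human_labels)
    print(f"{name}: r={r:.2f} (p={p:.3f})")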
LLMs naturally exhibit deceptive behavior in approximately 26% of dialogue turns, even when prompted with seemingly benign objectives.
We evaluate the deception rate of widely used LLMs—both base and instruction-tuned—under default settings, with no explicit prompt to be deceptive. To quantify deception, we use belief misalignment, the metric that aligns most closely with human judgments. Understanding the default propensity for deception is critical for safe deployment. Many LLM-powered applications, such as chatbots or assistants, rely on default behaviors in the absence of explicit task constraints. If deceptive responses arise even without adversarial prompting, this poses a substantial risk for user trust, downstream decision-making, and responsible AI use.
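As an illustrative sketch (the per-turn scores and the decision threshold are assumptions, not the paper's protocol), a per-turn deception rate can be read off from per-turn belief-misalignment scores:

# Fraction of dialogue turns flagged as deceptive.
# Scores and threshold below are placeholder assumptions.
def deception_rate(per_turn_scores, threshold=0.0):
    flagged = [s > threshold for s in per_turn_scores]
    return sum(flagged) / len(flagged)

print(deception_rate([0.0, 0.3, 0.0, 0.1]))  # -> 0.5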
Models trained with RLHF, the predominant approach for ensuring the safety of widely-deployed LLMs, still exhibit deception, shifting listener beliefs as much as 43% away from the truth on average.
To reduce deception in LLMs, we fine-tune base models with RL on the Housing task, using belief misalignment as the deception signal. Specifically, we fine-tune Llama-3.1-8B to (i) maximize task reward, (ii) minimize belief misalignment, and (iii) jointly maximize task reward while minimizing belief misalignment. We train with KTO, REINFORCE, and PPO, and evaluate the effectiveness of these RL methods using task utility and belief misalignment, comparing against the following baselines: Llama-3.1-8B, Llama-3.1-8B-Instruct, and a model trained with supervised fine-tuning. Additionally, we compare the RL models against Llama-3-70B-Instruct and gemma-2-27b-it prompted to be truthful/cooperative, as another method of reducing deception in LLMs.
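The following is a minimal sketch of this idea reduced to a single-utterance REINFORCE update, not the paper's training pipeline: the policy samples a response, receives a combined reward of task utility minus a weighted belief-misalignment penalty, and is nudged toward higher-reward responses. The model name (gpt2 standing in for Llama-3.1-8B), the reward stubs, the example prompt, and the weight lam are placeholder assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper fine-tunes Llama-3.1-8B
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)

# Placeholder reward stubs; in the paper these signals come from LLM evaluators.
def task_reward(prompt, response):
    return 1.0  # e.g., task utility judged by an evaluator

def belief_misalignment_score(prompt, response):
    return 0.5  # e.g., belief misalignment induced by this utterance

def reinforce_step(prompt, lam=1.0):
    # Sample a response from the current policy.
    inputs = tok(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]
    out = model.generate(**inputs, do_sample=True, max_new_tokens=32,
                         pad_token_id=tok.eos_token_id)
    response = tok.decode(out[0, prompt_len:], skip_special_tokens=True)

    # Combined reward: encourage task success, penalize deception.
    reward = task_reward(prompt, response) - lam * belief_misalignment_score(prompt, response)

    # Recompute log-probabilities of the sampled response tokens with gradients.
    logits = model(out).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_logp = log_probs.gather(-1, out[:, 1:].unsqueeze(-1)).squeeze(-1)
    response_logp = token_logp[:, prompt_len - 1:].sum()

    # REINFORCE update: scale the response log-likelihood by its reward.
    loss = -reward * response_logp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward

reinforce_step("Buyer: Does the house have any issues I should know about?\nSeller:")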
Our multi-turn RL pipeline for LLMs reduces deceptive behaviors by 77.6%.
Conclusion
This work provides a framework for detecting and mitigating deceptive behavior in LLMs. Our results reveal that deception can occur even under default prompting, and that models often become more deceptive when doing so aligns with achieving task objectives. This suggests that deception is not merely an artifact of poor fine-tuning or adversarial prompts, but can emerge as a goal-directed behavior. One of our key contributions is the introduction of belief misalignment as a metric for deception, which shows the highest correlation with human judgments across tasks. This metric enables more reliable automated evaluation and may serve as a useful signal for future alignment efforts. We also demonstrate that deception can be substantially reduced through multi-turn RL — offering a practical pathway for mitigating undesirable behaviors without requiring manual oversight or adversarial filtering. We hope this framework contributes to broader efforts toward building more trustworthy, goal-aligned AI systems.
@misc{abdulhai2025evaluatingreducingdeceptive,
title={Evaluating \& Reducing Deceptive Dialogue From Language Models with Multi-turn RL},
author={Marwa Abdulhai and Ryan Cheng and Aryansh Shrivastava and Natasha Jaques and Yarin Gal and Sergey Levine},
year={2025},
eprint={2510.14318},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.14318},
}