Consistently Simulating Human Personas with Multi-turn Reinforcement Learning
Marwa Abdulhai, Ryan Cheng, Donovan Clay, Tim Althoff, Sergey Levine, Natasha Jaques
Motivation
Large Language Models (LLMs) are increasingly used to simulate human users in interactive settings such as therapy, education, and social role-play. While these simulations enable scalable training and evaluation of AI agents, off-the-shelf LLMs often drift from their assigned personas, contradict earlier statements, or abandon role-appropriate behavior. We introduce a unified framework for evaluating and improving persona consistency in LLM-generated dialogue. We define three automatic metrics—prompt-to-line consistency, line-to-line consistency, and Q&A consistency—that capture different types of persona drift and validate each against human annotations. Using these metrics as reward signals, we apply multi-turn reinforcement learning to fine-tune LLMs for three user roles: a patient, a student, and a social chat partner. Our method reduces inconsistency by over 55%, resulting in more coherent, faithful, and trustworthy simulated users.
What is consistency in dialogue?
A user's persona is defined as a stable set of traits and background expressed through behavior and speech over the course of a dialogue interaction
We quantify the consistency of LLM agents both locally, at the utterance level, and globally, across the entire dialogue
A persona unfolds over multiple expressions of behavior rather than a single factual check, motivating a multi-turn objective
(1) Prompt-to-Line Consistency: The ability of the user simulator U_sim to remain consistent with the persona, strategy, and task description defined in the prompt is the most general notion of consistency we assess. Given a base prompt P, model responses R = [r_1, r_2, …, r_T], and an LLM judge oracle J_LLM(x, y) ∈ {0, 1}, prompt-to-line consistency is the fraction of responses judged consistent with the prompt:
C_prompt = (1/T) Σ_{t=1}^{T} J_LLM(P, r_t)
(2) Line-to-Line Consistency: U_sim may introduce new information that aligns with the base prompt but conflicts with its own prior statements. A conversational agent must integrate new information without contradicting itself as the dialogue progresses. Given the dialogue history R_{<t} = [r_1, r_2, …, r_{t-1}], the model response r_t, and the LLM judge oracle J_LLM(x, y) ∈ {0, 1}, line-to-line consistency is the fraction of responses judged consistent with the history that precedes them:
C_line = (1/(T-1)) Σ_{t=2}^{T} J_LLM(R_{<t}, r_t)
(3) Q&A Consistency: This metric assesses whether the agent maintains a consistent representation of its persona and strategy throughout the dialogue. We use LLM-generated Q&A-style probes over both the initial persona prompt P and the evolving dialogue. Given diagnostic questions Q = {q_1, q_2, …, q_K} derived from the base prompt and the LLM judge oracle J_LLM(x, y) ∈ {0, 1}, let â_{t,k} be the answer to question q_k inferred from the full dialogue history up to turn t and a_k the reference answer derived from P. Q&A consistency is the fraction of probe answers the judge deems to match their references:
C_QA = (1/(T·K)) Σ_{t=1}^{T} Σ_{k=1}^{K} J_LLM(â_{t,k}, a_k)
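As a concrete illustration, the sketch below shows one way these three metrics could be computed over a finished dialogue. It is a minimal sketch, not the paper's implementation: `judge_consistent`-style callables stand in for the LLM judge J_LLM (e.g., a prompted Llama-70B-Instruct call), and `answer_from_history` stands in for the model that answers a diagnostic probe from the dialogue; both interfaces are assumptions.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for the LLM judge J_LLM(x, y) -> {0, 1} and the
# probe-answering model; in practice these would be prompted LLM calls.
Judge = Callable[[str, str], int]
Answerer = Callable[[str, str], str]

def prompt_to_line_consistency(prompt: str, responses: List[str], judge: Judge) -> float:
    """Fraction of simulator responses the judge deems consistent with the base prompt P."""
    return sum(judge(prompt, r) for r in responses) / len(responses)

def line_to_line_consistency(responses: List[str], judge: Judge) -> float:
    """Fraction of responses consistent with the concatenated history preceding them."""
    scores = []
    for t in range(1, len(responses)):
        history = "\n".join(responses[:t])
        scores.append(judge(history, responses[t]))
    return sum(scores) / len(scores) if scores else 1.0

def qa_consistency(responses: List[str],
                   probes: List[Tuple[str, str]],   # (question q_k, reference answer a_k from P)
                   answer_from_history: Answerer,
                   judge: Judge) -> float:
    """Fraction of probe answers, inferred from the dialogue so far, that match the reference."""
    scores = []
    for t in range(1, len(responses) + 1):
        history = "\n".join(responses[:t])
        for question, reference in probes:
            predicted = answer_from_history(history, question)
            scores.append(judge(predicted, reference))
    return sum(scores) / len(scores)
```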
Do our consistency metrics align with human judgment?
We find that our LLM annotator (Llama-70B-Instruct) demonstrates substantially higher reliability than human raters, achieving an average Fleiss' kappa of 0.400 across tasks, surpassing the human–human average Fleiss' kappa of 0.063 in all cases. Similarly, model–human percent agreement averaged 76.73%, exceeding the human–human average of 69.16%. The highest agreement occurs in the Education task (average Fleiss' kappa of 0.62), whereas Mental Health dialogues show a somewhat lower average Fleiss' kappa of 0.52, despite a high percent agreement rate of 85%. This suggests that in domains where emotional nuance and implied intent play a larger role, consistency is more subjective and difficult to determine. Notably, for the prompt-consistency metric, we find a Fleiss' kappa of 0.453 and pairwise agreement of 88.18%, outperforming human inter-rater agreement, which has a low Fleiss' kappa of 0.259 and pairwise agreement of 74.93%. This supports our decision to adopt prompt consistency as the primary signal for multi-turn RL fine-tuning: it captures the human intuition of consistency most reliably, while remaining computationally efficient compared to the line-to-line and Q&A consistency metrics.
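For reference, agreement statistics of this kind can be computed with standard tooling. The snippet below is a minimal sketch, not the paper's evaluation code: it assumes a small hypothetical matrix of binary consistency labels and derives Fleiss' kappa with statsmodels plus mean pairwise percent agreement.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings matrix: rows are dialogue lines, columns are raters
# (human annotators or the LLM judge), entries are binary consistency labels.
ratings = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
])

# Fleiss' kappa expects, for each item, the count of raters per category.
table, _ = aggregate_raters(ratings)
print("Fleiss' kappa:", fleiss_kappa(table))

# Pairwise percent agreement, averaged over all rater pairs.
n_raters = ratings.shape[1]
pair_agreements = [
    (ratings[:, i] == ratings[:, j]).mean()
    for i in range(n_raters) for j in range(i + 1, n_raters)
]
print("Mean pairwise agreement:", np.mean(pair_agreements))
```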
How inconsistent are LLM simulators?
Consistency varies substantially across both models and tasks. Mistral-7B-Instruct achieves the highest overall scores, particularly in open-ended dialogue. Llama-8B-Instruct shows lower consistency, especially on the prompt-to-line and Q&A metrics, though its generations are qualitatively more complex—suggesting a tradeoff between generation richness and stability. Task-wise, educational dialogues yield the highest Q&A consistency, likely due to their structured nature, while mental health dialogues show greater variability and prompt misalignment, reflecting the increased ambiguity and emotional nuance of the domain. Line-to-line consistency remains uniformly high across models and tasks, indicating strong local coherence. In contrast, the prompt-to-line and Q&A metrics reveal persistent failures in maintaining global persona and belief stability. As such, we prioritize improvements to prompt-to-line consistency in subsequent fine-tuning experiments. Figure 3 presents pairwise agreement between our consistency metrics averaged across models for each domain. In open-ended conversation, we observe strong agreement between prompt-to-line and line-to-line consistency, but lower alignment with Q&A consistency.
Can we improve consistency of dialogue with multi-turn RL?
Multi-turn RL >> SFT when behavioral stability across turns is required in more complex domains (e.g., therapy, education), not just factual accuracy (e.g., open-ended conversation)
Multi-turn RL substantially increases prompt-to-line consistency across all tasks. As shown above, PPO consistently outperforms the baseline Llama-8B-Instruct model, SFT, and KTO. Specifically, PPO outperforms the baseline model by +58.5% on the Open-Ended Conversation task, +20.6% on the Education task, and +37.6% on the Mental Health task. Human evaluation of conversations from the fine-tuned PPO model corroborates these improvements. Additionally, we find that prompt-to-line consistency remains high even as dialogue length increases post-PPO fine-tuning, indicating that reinforcement learning helps models preserve persona alignment over extended interactions.
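To make the training signal concrete, the sketch below outlines one way the multi-turn rollout and reward assignment could look: the user-simulator policy converses with a fixed agent, each simulator turn receives the binary prompt-to-line judge score as its reward, and the resulting trajectory would feed a PPO-style update. This is a sketch under assumptions, not the paper's training code; `policy_respond`, `agent_respond`, and `judge_consistent` are hypothetical placeholders.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: the user-simulator policy being trained, the fixed
# conversational agent it talks to, and the LLM judge J_LLM used as a reward model.
PolicyFn = Callable[[str, List[str]], str]   # (persona prompt, dialogue so far) -> user utterance
AgentFn = Callable[[List[str]], str]         # dialogue so far -> agent utterance
Judge = Callable[[str, str], int]            # (persona prompt, utterance) -> {0, 1}

def rollout_with_rewards(persona_prompt: str,
                         policy_respond: PolicyFn,
                         agent_respond: AgentFn,
                         judge_consistent: Judge,
                         num_turns: int = 8) -> Tuple[List[str], List[float]]:
    """Roll out a multi-turn dialogue and attach a per-turn consistency reward
    to every user-simulator utterance."""
    dialogue: List[str] = []
    rewards: List[float] = []
    for _ in range(num_turns):
        user_utterance = policy_respond(persona_prompt, dialogue)
        dialogue.append(user_utterance)
        # Reward = prompt-to-line consistency of this single utterance.
        rewards.append(float(judge_consistent(persona_prompt, user_utterance)))
        dialogue.append(agent_respond(dialogue))
    return dialogue, rewards

# A PPO trainer (e.g., from an RLHF library) would then consume `rewards` as the
# per-turn return signal when updating the user-simulator policy.
```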
BibTeX
@misc{abdulhai2025consistentlysimulatinghumanpersonas,
  title={Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning},
  author={Marwa Abdulhai and Ryan Cheng and Donovan Clay and Tim Althoff and Sergey Levine and Natasha Jaques},
  year={2025},
  eprint={2511.00222},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.00222}
}