Our work focuses on training personalized LLMs for multi-turn conversations. Standard LLM training methods treat all users as a homogeneous group, leading to suboptimal performance for different groups (top left). In contrast, an optimal LLM can actively learn about user preferences within the conversation and then adapt to them (top right).
We introduce intrinsic motivation via user modeling into multi-turn RLHF. Intuitively, rather than training an LLM only with the sparse end-of-conversation reward, we add a turn-based reward given by the improvement in the model's belief over the user type after it generates an utterance and receives a response. This guides the LLM to actively learn about the user type and then adapt to each user throughout the conversation.
Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward
Yanming Wan*^, Jiaxing Wu*, Marwa Abdulhai, Lior Shani, Natasha Jaques
Google DeepMind; University of Washington; Google Research; University of California, Berkeley
*Equal Contribution. ^Work done during an internship at Google DeepMind.
Curiosity-driven User-modeling Reward as Intrinsic Objective (CURIO)
Intrinsic Reward via User Modeling
Conventional methods for training LLMs struggle to achieve personalization for two reasons:
(1) Whether the LLM has successfully personalized the conversation to the user can typically only be evaluated at the end of the conversation, resulting in an extremely sparse reward signal. This sparsity hinders the model's ability to learn which early-stage actions lead to higher personalized rewards later on.
(2) There is a data imbalance among user groups within large corpora. As a result, the model tends to learn policies that perform well on the majority group, achieving relatively high rewards while settling into a local optimum. This discourages further exploration of behaviors suited to other users.
To address these issues, we propose introducing Intrinsic Motivation (IM) to train a language model that actively learns about the user type out of curiosity and then adapts to each user's preferences. This intrinsic reward is given by the policy's improvement in belief over the user type across turns. The intuition is that training the model to acquire information about the user type u better enables it to optimize the personalized reward R(s,a|u). We leverage a parameterized user model that predicts a probability distribution over user types based on the conversation rollout. This user model can be either trained or prompted, depending on the task.
Specifically, the user model takes in the current conversation rollout s_{t+1}, obtained after applying a_t and sampling the user response, and outputs a belief b_t over all user types. With this user model, we can define intrinsic rewards such as the improvement in accuracy with respect to the ground-truth user type, or the entropy reduction of the predicted distribution.
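As a concrete illustration, the sketch below shows one way such turn-based rewards could be computed from the user model's belief. It is a minimal sketch, not the released implementation; the `user_model.predict` interface and the reward names are assumptions made for clarity.

```python
import math

def belief(user_model, conversation, user_types):
    """Query the (trained or prompted) user model for a distribution over user types.
    `user_model.predict` is a hypothetical interface returning one probability per type."""
    probs = user_model.predict(conversation)
    return dict(zip(user_types, probs))

def curiosity_reward(b_prev, b_next, true_user=None, kind="diff_log_acc"):
    """Turn-based intrinsic reward from the change in belief across one turn."""
    if kind == "diff_acc":           # grounded: improvement in probability of the true type
        return b_next[true_user] - b_prev[true_user]
    if kind == "diff_log_acc":       # grounded: log-scale version of the above
        return math.log(b_next[true_user] + 1e-12) - math.log(b_prev[true_user] + 1e-12)
    if kind == "entropy_reduction":  # ungrounded: only uses the classifier's certainty
        entropy = lambda b: -sum(p * math.log(p + 1e-12) for p in b.values())
        return entropy(b_prev) - entropy(b_next)
    raise ValueError(f"unknown reward kind: {kind}")
```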
The CURIO framework is illustrated in the figure, with four different LLMs involved in training. In each episode, the current policy model (which we are training) engages in a multi-turn conversation with a fixed environment model that simulates a human user. The reward model employed in traditional RLHF evaluates the entire conversation, producing an extrinsic reward that is provided only at the end of the conversation. In contrast, the user model predicts a probability distribution over user types at each conversational turn, based on the dialogue context up to that point; these predictions are then used to compute the turn-based curiosity reward.
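A schematic rollout tying the four models together might look as follows, reusing `belief` and `curiosity_reward` from the sketch above. The method names on the policy, user simulator, and reward model are placeholders for illustration, not the actual training code.

```python
def rollout_episode(policy, user_simulator, reward_model, user_model,
                    user_types, true_user, num_turns, reward_kind="diff_log_acc"):
    """Collect one episode with dense intrinsic rewards and a sparse extrinsic reward."""
    conversation = []
    rewards = []
    b_prev = belief(user_model, conversation, user_types)
    for _ in range(num_turns):
        agent_turn = policy.generate(conversation)                        # a_t
        user_turn = user_simulator.respond(conversation + [agent_turn],   # simulated user reply,
                                           true_user)                     # conditioned on the user type
        conversation += [agent_turn, user_turn]                           # s_{t+1}
        b_next = belief(user_model, conversation, user_types)
        rewards.append(curiosity_reward(b_prev, b_next, true_user, kind=reward_kind))
        b_prev = b_next
    rewards[-1] += reward_model.score(conversation)   # extrinsic reward only at the end
    return conversation, rewards
```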
Experiments and Results
CURIO enhances personalization and reduces the generalization gap.
We first consider a case where personalization is the main objective of the conversation. We design a new task, Exercise Recommendation, in which the agent acts as a health advisor tasked with recommending personalized fitness strategies tailored to each user. The agent must elicit user information and preferences through multiple rounds of dialogue before choosing a strategy at the end of the conversation. The agent is rewarded only when its recommendation matches the user's ground-truth strategy, which is determined by the user's predefined profile.
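One way to formalize this sparse extrinsic reward is sketched below; the symbols (T for the final turn, \hat{g}_T for the recommended strategy, g^*(u) for the strategy determined by user u's profile) are illustrative notation rather than the paper's.

```latex
% Sparse extrinsic reward of Exercise Recommendation (illustrative formalization):
\[
  R^{\mathrm{ext}}_t(s_t, a_t \mid u) \;=\;
  \begin{cases}
    \mathbb{1}\!\left[\hat{g}_T = g^{*}(u)\right], & t = T \text{ (final turn)},\\
    0, & t < T.
  \end{cases}
\]
```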
The initial SFT model achieves a success rate of 54%. With further RL training, the Multi-Turn RLHF baseline increases the rate to 68.5%, while CURIO reaches up to 87.5%.
During training, we observe that traditional methods are significantly affected by a generalization gap. We hypothesize that this is because the baseline model personalizes by memorizing mappings from superficial user details to specific strategies seen during training. Our models generalize more effectively to novel users because they learn how to learn about the user during the conversation, asking informative questions that help distinguish between different user types.
CURIO remains effective when personalization is relevant but not the ultimate goal.
With a proper reward choice, CURIO preserves conversation quality.
In many other tasks, however, personalization improves performance but is only one component of the objective rather than the sole aim. These tasks benefit from accurate user modeling but usually have a more complicated reward function. We consider the Education Dialogue dataset introduced by Shani et al. [2024], which simulates an educational setting where an LLM agent teaches students a given topic. This dataset is particularly valuable because it incorporates individual student learning preferences, so we build a personalization evaluation protocol on top of it. Specifically, all models are evaluated on: (1) personalization, assessing the agent's ability to tailor conversations to the user's ground-truth preferences, and (2) conversation quality, determining whether personalization is achieved without compromising coherence and overall quality. Automated evaluation is performed using Gemini to compare pairs of conversations generated by two models.
All the accuracy-based intrinsic rewards significantly improve personalization ability within the conversations.
The CURIO models with PBRS have a relatively smaller negative impact on conversation quality. Among them, the DifflogAcc reward is rated as significantly higher in quality than the baseline and all other intrinsic rewards.
Overcoming Reward Hacking: Ungrounded rewards lead to "controlling behavior".
In order to train multi-turn RL models at scale, we use LLMs to simulate the user and to act as reward models. A limitation of this approach is that RL-based methods, which attempt to maximize rewards, can sometimes engage in "reward hacking": exploiting weaknesses in either the reward model or the user model to obtain higher rewards in ways that do not correspond to desirable behaviors. Interestingly, we find that this can include manipulative behavior, such as attempting to convince the user model to adopt particular preferences that are easier for the policy to cater to. For example, with an extrinsic reward model that is not user-conditioned, the baseline multi-turn RL model adopts a merged teaching style it calls "role-playing video", which is not one of the true learning styles but results in a spuriously high extrinsic reward.
Similarly, when using the entropy-based intrinsic rewards, which are not "grounded" (i.e., they are based only on the classifier's certainty and do not use a ground-truth user label), we observe that the models perform very well on one particular user type but poorly on the others. For example, even though the student has expressed a preference for story-telling, the teacher insists on a hands-on style. We attribute this to the emergence of "controlling behavior", where the policy attempts to convince the classifier that the user belongs to one particular type rather than actually adhering to the ground-truth type.
Generally speaking, by using our proposed accuracy-based rewards, which require predicting the actual user type rather than tricking the user classifier, we can resolve these issues and attain better performance.
Theoretical Details
Formulating Personalized Conversation as a User-Conditioned POMDP
In traditional RLHF, a conversational task is commonly formulated as a Markov Decision Process (MDP). To extend this formulation to personalized conversational tasks, we introduce the user type u, which we assume is fixed throughout the conversation. For each user, the transition dynamics and reward function are conditioned on u, meaning that different users may respond differently and provide different preference ratings.
However, the user type is unobservable in most real-world settings. Consequently, the problem can be modeled as a Partially Observable Markov Decision Process (POMDP). Although an LLM agent in this environment does not initially know the ground-truth user type, it can maintain a belief over the user type and update that belief as it receives more responses from the user. Therefore, we define the belief function at time step t as b_t, a probability distribution over all possible user types.
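For intuition, if the user model were an exact Bayesian filter, the belief would be updated from each new user response o_{t+1} as in the standard POMDP recursion sketched below; in practice, the parameterized user model approximates this mapping directly from the conversation.

```latex
% Standard Bayesian belief update over user types (conceptual; the learned
% user model approximates this mapping from the raw conversation):
\[
  b_{t+1}(u) \;\propto\; P\!\left(o_{t+1} \mid s_t, a_t, u\right)\, b_t(u),
  \qquad \sum_{u'} b_{t+1}(u') = 1 .
\]
```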
Relationship with Potential-based Reward Shaping
Potential-based Reward Shaping (PBRS) has been extensively studied in traditional RL. This line of work provides insights into how intrinsic rewards can be effectively designed. In particular, the following theorem offers fundamental justification for employing intrinsic rewards of specific forms.
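The result in question is the classical PBRS policy-invariance theorem (Ng et al., 1999), sketched here in our notation as a reminder; the potential is taken over states (here, conversation states or the beliefs derived from them).

```latex
% PBRS policy invariance (Ng et al., 1999), sketched: for any potential
% function \Phi, augmenting the reward with the shaping term
\[
  F(s_t, a_t, s_{t+1}) \;=\; \gamma\,\Phi(s_{t+1}) \;-\; \Phi(s_t)
\]
% leaves the set of optimal policies of the underlying (PO)MDP unchanged.
```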
Intuitively, with a better prediction of the user, the policy can better tailor its actions to achieve higher returns. We discuss the following reward shaping terms in this work (sketched below). For those that are potential-based, adding the auxiliary reward does not change the optimal policy; we hypothesize that it simply makes the policy easier to learn. The remaining reward functions do not carry this optimality guarantee, but they are still intuitively reasonable intrinsic motivations.
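One plausible instantiation of these terms, written with the belief b_t and ground-truth user type u^* (the exact forms used in the paper may differ in scaling or discounting), is:

```latex
\begin{align*}
\text{DiffAcc (PBRS, grounded):}\quad r^{\mathrm{int}}_t &= b_{t+1}(u^*) - b_t(u^*), \\
\text{DifflogAcc (PBRS, grounded):}\quad r^{\mathrm{int}}_t &= \log b_{t+1}(u^*) - \log b_t(u^*), \\
\text{Entropy reduction (ungrounded):}\quad r^{\mathrm{int}}_t &= H(b_t) - H(b_{t+1}),
\qquad H(b) = -\sum_{u} b(u)\log b(u).
\end{align*}
```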
Conclusion
This paper introduced CURIO, a novel framework for enhancing personalization of LLMs in multi-turn conversation tasks. By leveraging user modeling and integrating intrinsic rewards into multi-turn reinforcement learning, our approach encourages the LLM to actively learn user traits and adapt its responses accordingly. Experiments across two distinct domains demonstrate that CURIO improves personalization in multi-turn conversations across various scenarios, whether personalization is the ultimate goal or only part of it, while maintaining conversation quality.
@article{wan2025curio,
author = {Wan, Yanming and Wu, Jiaxing and Abdulhai, Marwa and Shani, Lior and Jaques, Natasha},
title = {Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward},
journal = {arXiv preprint arXiv:2504.03206},
eprint = {2504.03206},
year = {2025},
}