Preference Learning Using Summarization (PLUS)
In the real world, people have diverse and potentially conflicting preferences for how they interact with AI assistants. We propose an RLHF reward modeling approach that captures diverse user preferences from both labeled and unlabeled historical conversations using an RL-finetuned summarizer.
Abstract
As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align with different users' preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at making LLMs generally more helpful and fluent, it does not account for variability across users: it models the entire user population with a single reward model, implicitly assuming that everyone shares the same preferences.
We present a novel framework, Preference Learning Using Summarization (PLUS), that uses reinforcement learning (RL) to learn to produce text-based summaries of each user's preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. The user-summarization model and the reward model are trained simultaneously, creating an online co-adaptation loop. We show that, in contrast to the standard Bradley–Terry model, summaries produced by PLUS capture diverse aspects of user preferences, achieving an 11–77% improvement in reward model accuracy. Key strengths of PLUS are: (1) robust performance with new users and conversation topics, achieving a 25% improvement over the best personalized reward model technique used for RLHF; (2) personalization with strong proprietary models like GPT-4 without further fine-tuning (e.g., PLUS-summary-conditioned responses achieved a 72% win rate versus 28% for default GPT-4o); (3) learning from flexible user contexts beyond preference labels; and (4) interpretable representation of users, enabling greater transparency and user control in pluralistic LLM alignment.
PLUS achieves pluralistic alignment by learning a user-conditioned reward model.
The summarizer (π) is trained using the reward model's prediction signal.
The reward model (r) is trained on the generated summaries (z).
PLUS learns to summarize important information about the user using reinforcement learning fine-tuning and conditions the reward model on these user summaries. Both are co-adaptively trained online.
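The co-adaptation loop above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the reward model is trained with a summary-conditioned Bradley–Terry loss, and the summarizer's RL signal is the reward model's preference-prediction accuracy when conditioned on the generated summary z. The function names and the scalar toy rewards are assumptions for illustration.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Summary-conditioned Bradley-Terry preference loss (sketch):
    -log sigma( r(x, y_chosen, z) - r(x, y_rejected, z) )."""
    return -math.log(sigmoid(r_chosen - r_rejected))

def summarizer_reward(reward_fn, z, pairs):
    """RL signal for the summarizer (sketch): the reward model's
    accuracy at ranking (chosen, rejected) pairs given summary z."""
    correct = sum(reward_fn(z, y_w) > reward_fn(z, y_l) for y_w, y_l in pairs)
    return correct / len(pairs)

# Toy usage: a reward model that scores responses by agreement with z.
toy_reward = lambda z, y: z * y
print(summarizer_reward(toy_reward, 1.0, [(1.0, -1.0), (0.5, -0.2)]))
```

In the actual method, minimizing `bt_loss` updates the reward model on labeled pairs, while `summarizer_reward` is maximized with RL fine-tuning of the summarizer; running both updates online yields the co-adaptation loop shown in the figure.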
R1: How does PLUS compare to existing pluralistic reward models?
We evaluate PLUS against existing reward modeling baselines on three pluralistic benchmark datasets:
Pets (cat versus dog lovers),
UltraFeedback P2 (helpfulness versus honesty),
UltraFeedback P4 (helpfulness, honesty, truthfulness, and instruction-following)
(We used a 3B instruct model from the same model family as the summarizer.)
R2: Why is learning to summarize important?
A naive summarizer (PLUS-untrained) can miss which aspects of user preferences matter (e.g., a user likes either cats or dogs, but the untrained summarizer focuses on aspects of the user's preferences that are irrelevant to the downstream prediction task).
In-context learning (ICL) underperforms PLUS, especially when predicting for new users. In theory, ICL can enable personalization by using the entire conversation history as z, but in practice it underperforms due to the increased context length.
(Training data includes cat and dog lovers, but at test time, the models are evaluated on bird and rabbit lovers.)
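The context-length point above is easy to see concretely. A toy comparison, with made-up conversation data: the ICL baseline must pass the full history as the user context z, while PLUS passes only a short learned summary (the example summary text here is invented for illustration).

```python
# Hypothetical user history: 20 (user, assistant) exchanges.
history = [
    ("What breed is best for apartments?", "A British Shorthair is a calm choice."),
    ("Do they shed a lot?", "Less than most long-haired breeds."),
] * 10

# ICL baseline: z is the entire conversation history, serialized.
icl_context = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in history)

# PLUS: z is a short learned summary of the user (illustrative text).
plus_summary = "The user is a cat lover who asks practical care questions."

print(len(icl_context.split()), "words vs.", len(plus_summary.split()), "words")
```

The ICL context grows linearly with the number of past turns, while the summary stays compact, which is one reason ICL degrades as histories get longer.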
R3: Can PLUS-generated summaries enable personalization of existing models without further fine-tuning?
We evaluate PLUS on a heterogeneous real-user dataset, PRISM (Kirk et al., 2024), which contains open-ended conversations from 1,500 participants across 75 countries and 20 LLMs. Each user engages in up to six conversations on topics of their choice, loosely guided by one of three themes: unguided (free topic), value-guided, or controversy-guided.
We evaluate the effectiveness of PLUS-generated summaries in two ways: (1) conditioning an LLM-as-a-judge, and (2) guiding response generation. In both cases, we compare the performance of strong proprietary models (GPT-4o and GPT-4.1) on the given task with and without PLUS-generated summaries of the users.
The left figures show the accuracy of the LLM-as-a-judge in predicting each user's preferred response, with and without PLUS-generated summaries of the user.
The right figure shows the win rate of response generation with and without the summaries. GPT-4o and GPT-4.1 generate the responses, and the winning response is judged by an "oracle" PRISM reward model (a reward model conditioned on the user's self-stated preferences).