Hao Sun*, Yunyi Shen*, Jean-Francois Ton
University of Cambridge, Massachusetts Institute of Technology, ByteDance Research
hs789@cam.ac.uk, yshen99@mit.edu, jeanfrancois@bytedance.com
* equal contribution
Why can the Bradley-Terry Model be used for reward modeling? What are the underlying principles, assumptions, and logic behind it?
What is truly important when it comes to reward modeling? Besides the Bradley-Terry Model, what other options do we have?
After understanding the Bradley-Terry Model and reward modeling, what aspects of current practices might be improved upon when we revisit them?
Since the 1950s, the BT model and its various improvements have been used for skill evaluation and win rate prediction in board games and various sports. The most straightforward explanation is that, when a team A with ability score r_A faces a team B with ability score r_B, the probability of A defeating B is given by:
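P(A defeats B) = exp(r_A) / (exp(r_A) + exp(r_B)) = σ(r_A − r_B),

where σ is the logistic (sigmoid) function; this is the standard form of the BT model with the ability scores written on the log scale.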
In practical scenarios, due to differences in tournament structures, rules, and inherent randomness, raw scores alone are often insufficient for predicting a team's win rate; the scores need to be rescaled to account for the randomness specific to each type of competition. For example, randomness plays a large role in soccer matches, where even weaker teams can sometimes beat stronger ones. Similarly, in card games, which involve incomplete information, winning doesn't always reflect a player's true skill level. In contrast, games like chess and Go, where complete information is available, exhibit much less randomness, aside from players' momentary performance during the match.
The most direct application of the BT model is: given historical match data between different teams, can we assign scores to each team to evaluate their skill levels and then use these scores to accurately predict the outcomes of future matches? We call this parameter estimation, where the scores of each team are the parameters to be estimated.
In theory, even in the absence of randomness, we would need around N log N comparisons to rank N teams. Once randomness is taken into account, the best theoretical result, given by Han et al. [1], suggests that at least N(log N)^3 comparisons are required for reasonably accurate estimates.
Now, let’s look at two examples where the BT model is applied in the context of Large Language Models (LLMs).
The first example is LLM Chatbot Arena, also known as the LLM ladder. In this setup, different LLMs are considered the “players.” In each match, two LLMs compete, and users judge the quality of their responses to determine the winner. In LMSYS's Arena, there have been over 2,000,000 matches comparing more than 150 LLMs; on average, each LLM has taken part in over 26,000 matches. Here, N = 150, and 26,000 ≫ N(log N)^3 ≈ 1,500. As a result, LMSYS can provide 95% confidence intervals for each LLM's score.
On the other hand, in the alignment or reinforcement learning from human feedback (RLHF) process for LLMs, the BT model is used to convert pairwise preference annotations into scores. Here, each prompt-response pair is treated as a player (although one could also view two responses to the same prompt as two players). With N prompts and two responses per prompt, there are 2N responses. The annotators label these 2N responses pairwise, which is equivalent to generating the results of N matches. But here's the problem: with only N matches, can we reliably score 2N players? After all, N ≪ 2N(log 2N)^3.
In fact, this is not the core issue faced in LLM alignment. In sports, even if two teams haven’t played many matches, we can predict match outcomes using certain features (e.g., average height, age, major tournament experience, market value, or average sea cucumber consumption). The focus here shifts from parameter (ability) estimation to outcome prediction. This has been extensively studied in the history of the BT model, and in the literature, it's referred to as Bradley-Terry regression.
Following this line of thinking, our paper addresses the specific scenario of LLM alignment and provides a convergence proof for applying a Siamese MLP structure to implement BT regression.
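For concreteness, here is a minimal sketch of what such a shared-weight ("Siamese") MLP reward head on frozen prompt-response embeddings could look like, trained with the BT negative log-likelihood; all names and dimensions are illustrative rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class RewardMLP(nn.Module):
    """Shared-weight ("Siamese") MLP mapping a prompt-response embedding to a scalar reward."""
    def __init__(self, embed_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def bt_loss(model: nn.Module, emb_chosen: torch.Tensor, emb_rejected: torch.Tensor) -> torch.Tensor:
    """BT negative log-likelihood: -log sigma(r_chosen - r_rejected), with the same network scoring both sides."""
    return -torch.nn.functional.logsigmoid(model(emb_chosen) - model(emb_rejected)).mean()

# Illustrative training step on random stand-ins for real embeddings.
embed_dim = 1024
model = RewardMLP(embed_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
emb_chosen, emb_rejected = torch.randn(32, embed_dim), torch.randn(32, embed_dim)
loss = bt_loss(model, emb_chosen, emb_rejected)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The same scalar network scores both sides of every comparison, which is what gives the antisymmetry property discussed further below.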
When we talk about preference as a match and attempt to model the outcome using the Bradley-Terry (BT) model, what assumptions are we making?
In an answer I wrote a year ago, I partially addressed this question: different responses are players, and preferences represent the match outcomes. However, one aspect I hadn’t fully figured out at that time was: Where does the randomness in these matches come from?
One self-consistent explanation is the one I mentioned before. If we assume that each player's "performance" is drawn from a Gaussian distribution centered at their true skill level, and further assume that this Gaussian has the same variance for all players, then the players' scores can be standardized by this variance. This leads to an erf (error function) version of the (pseudo) BT model, rather than the logistic one. To arrive at the classical BT model, we instead assume that each player's performance follows a Gumbel distribution whose location parameter equals the player's skill level. Even so, there is still a key detail that needs clarification: where does this variance, or randomness, in the match come from?
In earlier seminal papers, this part was left unspecified. Here, we provide a complete explanation:
The BT model assumes that different annotators have biases in how they perceive each response, and that these biases follow a Gumbel distribution (hence the difference between two biases follows a logistic distribution). Similarly, we can assume these biases are Gaussian, in which case their difference is also Gaussian, leading to the erf version of the BT model.
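To spell the step out (a standard derivation rather than anything specific to our paper): suppose the annotator perceives response i's reward as r_i + ε_i with independent noise ε_i. Then

P(prefer response 1 over response 2) = P(ε_2 − ε_1 < r_1 − r_2).

If the ε_i are Gumbel, their difference follows a logistic distribution and this probability equals σ(r_1 − r_2), i.e., the BT model; if the ε_i are Gaussian with variance s^2, the difference is Gaussian with variance 2s^2 and the probability equals Φ((r_1 − r_2)/(√2·s)), i.e., the erf version.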
Additionally, we can provide another perspective (inspired by discussions in some cognitive psychology literature on cognitive bottlenecks). The correctness of annotations (i.e., whether they correctly rank the true reward values) depends on the absolute difference between the rewards. Intuitively, if the true scores of response 1 and response 2 are r1 and r2 respectively, and if r1 and r2 are very close to each other, then the probability of mislabeling the preference between these two responses increases, making the annotation more like a random guess. At the same time, different annotators have different abilities to distinguish these subtle differences. Ideally, a perfect annotator could differentiate any small reward difference, and this annotator would always rank preferences correctly with 100% accuracy according to the true rewards. On the other extreme, if an annotator is very poor, they might fail to correctly distinguish between responses even when the reward difference is large, and their labels would be random (with a 0.5 probability of correctness).
In the most general version of this, we use a function ξ, with range [0.5, 1], to represent the quality of the annotations. In Equation 15 of the paper, if ξ = σ(β⋅Δr), we essentially return to the BT model; Equation 15 is thus a more general description of annotations, with the BT model as a special case of it.
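A hedged paraphrase of that equation (the exact form is in the paper; the notation here is illustrative): with Δr = |r_1 − r_2|, the annotator ranks the pair in agreement with the true rewards with probability

P(annotation agrees with the true order) = ξ(Δr), where ξ maps [0, ∞) to [0.5, 1].

A perfect annotator has ξ ≡ 1, a purely random one has ξ ≡ 0.5, and substituting ξ(Δr) = σ(β⋅Δr) recovers the BT likelihood.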
Earlier, we discussed the assumptions behind the BT model and the logic of converting preference data into scores using the BT model. Since we are performing regression in the embedding space, where the ranking relationships between different prompt-response pairs can somewhat generalize to other prompt-response pairs, we don’t need to estimate parameters using as many samples as the classical BT model would require. Instead, we can make predictions for new prompt-response pairs using relatively fewer annotations.
Here, we can once again compare the differences between reward modeling and the traditional use of BT regression in sports events. In sports, we care about the probability of winning each match (at least, those betting on games do). In such cases, knowing the precise score for each team to make accurate predictions becomes crucial. However, in the context of reward modeling, we are not concerned with the exact probability of one prompt-response pair beating another. What we care about more is the rank order between them. When using a reward model (for example, during inference-time optimization), we receive multiple responses for a single prompt. We don’t need to precisely predict the probability that each response will beat the others; we just need to identify the best response.
(An interesting callback is to the book Algorithms to Live By, which I’ve recently been reading. In the second chapter, the author mentions that in many modern sports events, the only result that truly matters is the champion, while the second-place finisher isn’t always the actual second-best performer.)
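Continuing the hypothetical sketch above, best-of-n selection at inference time only needs the argmax of the predicted rewards, not calibrated win probabilities:

```python
import torch

@torch.no_grad()
def best_of_n(reward_model, candidate_embeddings: torch.Tensor) -> int:
    """Score every candidate response for a prompt and return the index of the best one.
    Only the rank order of the rewards matters, not their absolute values."""
    rewards = reward_model(candidate_embeddings)  # shape (n_candidates,)
    return int(torch.argmax(rewards).item())
```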
From this perspective, using the BT model for reward modeling seems to be overly focused on fine-grained details. By re-examining the data, we proposed a more general, higher-level optimization objective: Order Consistency, which can be stated as follows:
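In paraphrased form (the paper gives the exact formal definition; the notation here is illustrative): given two prompt-response pairs (x_i, y_i) and (x_j, y_j) with a human annotation H_{ij} indicating which is preferred, we want an ordering model \hat{H} that maximizes the order consistency

OC(\hat{H}) = E_{(i,j)} [ 1{ \hat{H}(x_i, y_i, x_j, y_j) = H_{ij} } ],

i.e., the probability of agreeing with the (possibly noisy) annotated order.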
When given an annotated dataset, all we can do is replicate the (imperfect) annotations in the dataset. Let's denote the ordering model by \hat{H}. As long as we optimize \hat{H} well enough, this \hat{H} will not deviate too far from the true annotations.
Clearly, the BT model is an approach that optimizes Order Consistency. It explicitly expresses \hat{H} as the difference between two reward estimations. This formulation inherently possesses antisymmetry—if we reverse the positions of the two reward estimators, the sign of the ordering estimate will flip accordingly.
In addition, we point out that directly performing binary classification on samples labeled as positive/negative is also an optimization of Order Consistency, but it optimizes an upper bound of Order Consistency. Specifically, it requires that the predictions for positive prompt-response samples are greater than 0, and the predictions for negative samples are less than 0. Here, the explicit requirement for antisymmetry is no longer present—we expect the classifier to learn this property from the data.
Theoretically, the BT model optimizes the probability that one prompt-response pair wins the comparison against the other. The classification model, on the other hand, optimizes the probability that a positive sample is a "good" sample. This direct classification approach essentially removes the dependence on the specific negative sample, treating the classification probability as the marginal probability of that sample winning a comparison.
In the paper, we present the following result: the BT reward and the classification reward can be connected by a constant independent of the sample index i. The classification reward, when this constant is added, upper-bounds the BT reward.
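As one illustrative instantiation (not the paper's exact setup) of the classification alternative on frozen embeddings: chosen responses are labeled 1, rejected responses 0, and the classifier's log-odds are used as the scalar reward. Any off-the-shelf classifier works here, e.g., an MLP or LightGBM/XGBoost; the sketch below uses LightGBM.

```python
import numpy as np
from lightgbm import LGBMClassifier

# Random arrays below stand in for real (n, d) embeddings of chosen / rejected prompt-response pairs.
n, d = 2048, 1024
rng = np.random.default_rng(0)
emb_chosen, emb_rejected = rng.normal(size=(n, d)), rng.normal(size=(n, d))

# Binary classification: chosen -> 1, rejected -> 0; no explicit pairing is needed at training time.
X = np.vstack([emb_chosen, emb_rejected])
y = np.concatenate([np.ones(n), np.zeros(n)])
clf = LGBMClassifier(n_estimators=200)
clf.fit(X, y)

def classification_reward(emb: np.ndarray) -> np.ndarray:
    """Use the classifier's log-odds of being a 'good' sample as the scalar reward."""
    p = clf.predict_proba(emb)[:, 1]
    return np.log(p / (1.0 - p + 1e-12))
```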
In the previous analysis, we observed that all the derivations do not require the prompt-response pairs to come from the same prompt. This is because, under the assumption of BT regression, the comparisons occur in the embedding space, which represents a joint embedding of both the prompt and response. Intuitively, performing order-consistency learning across different points in the embedding space is essentially assigning scores to these points. These scores are globally applicable rather than specific to certain prompts.
In fact, the ability to compare different prompt-response pairs is an implicit prerequisite for reward modeling. This capability allows us to learn a reward model that can predict scores for any prompt-response pair. The reward we are learning is a universal function approximator (inspired by the concept of Universal Value Function Approximators, UVFA, in multi-goal reinforcement learning 🫡).
A useful analogy can be drawn from sports competitions: imagine we have many young soccer players living in different cities. These players compete within their own city and receive rankings based on their local matches. However, without cross-city competitions, it’s difficult to assess how these players perform on a broader, global level. Similarly, a scoring model trained solely on local data will be limited to local comparisons. If we want to predict the performance of a new player from an unknown city, we need more diverse match data—ideally not just intra-city matches but also inter-city competitions.
In classical classification problems, the choice of data for annotation is crucial: we cannot select only obvious positive or negative examples, as the training task would become too simple, leading to poor performance when faced with more complex examples. Similarly, if we focus only on examples near the classification boundary, overfitting becomes a risk. For reward modeling, this challenge translates into the risk of reward hacking.
In prior work, such as RPO [3], the idea of comparing responses from different prompts has already been suggested. For example, in the context of evaluating a helpful chatbot, we can compare responses from an LLM to different prompts to label whether the response to one prompt is more helpful than the response to another prompt. While some prompts might be “easier,” making it more likely for the model to generate helpful answers, the key idea is that cross-prompt comparisons help us learn generalizable descriptions of what makes a response helpful.
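As a toy illustration (a hypothetical pairing scheme, not the exact protocol of RPO or of our paper), cross-prompt comparison data can be constructed by pairing prompt-response items across the whole pool rather than only within a single prompt:

```python
import random

def make_comparison_pairs(items, cross_prompt: bool, n_pairs: int, seed: int = 0):
    """items: list of dicts with keys 'prompt_id' and 'response'.
    Returns n_pairs (item_a, item_b) tuples to send to annotators."""
    rng = random.Random(seed)
    pairs = []
    if cross_prompt:
        # Any two items may be compared, regardless of which prompt they answer.
        while len(pairs) < n_pairs:
            pairs.append(tuple(rng.sample(items, 2)))
    else:
        # Only responses to the same prompt are compared.
        by_prompt = {}
        for item in items:
            by_prompt.setdefault(item["prompt_id"], []).append(item)
        groups = [g for g in by_prompt.values() if len(g) >= 2]
        while len(pairs) < n_pairs:
            pairs.append(tuple(rng.sample(rng.choice(groups), 2)))
    return pairs
```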
Moreover, cross-prompt comparisons can lead to higher-quality annotations. Intuitively, randomly selecting two prompts and their corresponding responses for comparison creates greater diversity, which makes it easier for annotators to identify the better response. This contrasts with the more restricted case of comparing two responses to the same prompt, where the differences may be subtler.
A more rigorous argument needs to take the accuracy of the annotations into account: in the paper, we derive a result showing that, in expectation, annotation quality improves when the compared prompts are selected at random.
Therefore, we were curious to see, experimentally, whether allowing annotations between different prompts could be more beneficial for reward modeling. In our experiments, we conducted a thorough exploration of this aspect.
This part of the results will be updated after the paper is posted on arXiv. We mainly conducted three sets of experiments:
(All experiments were repeated five times, and all embeddings, training code, models, the responses generated in our experiments, and the training and testing data will be open-sourced.)
To reproduce our results: each experiment takes less than ten minutes on a 128-core CPU machine (we conducted more than 12,000 experiments in total, cross-validating performance under various conditions).
1. We explored the performance differences between the BT Reward Model and the Classification Reward Model when using embeddings as inputs.
We found that in most of our experiments, the Classification Reward Model could achieve performance comparable to the BT Reward Model. However, the Classification Reward Model is much more flexible—it can be implemented using any existing classifier, such as MLP, or tree-based models like LightGBM/XGBoost.
2. We studied the performance changes of different reward modeling methods under varying annotation quality and quantity.
We found that as annotation quality decreases or data quantity reduces, the BT Reward Model underperforms compared to the Classification Reward Model.
3. We compared the performance of reward modeling between cross-prompt annotations and annotations on multiple responses from a single prompt, and we stress-tested cross-prompt scenarios, demonstrating the conditions under which cross-prompt annotation can bring greater performance improvements.
We found that cross-prompt annotations can lead to significant performance improvements when the diversity of responses is low. In practical RLHF workflows (which generally default to online methods because they yield significantly better results), where only two responses are randomly generated for annotators to label, cross-prompt annotation can significantly improve the effectiveness of reward modeling.
Prompt-OIRL: Using Offline Inverse RL for prompt optimization. It is cost-effective and efficient—a reward model trained in just five minutes can significantly improve mathematical capabilities.
DenseReward: Utilizing attention in RLHF to allocate rewards to tokens, aiming to address the credit assignment problem in RLHF.
RATP: Describes the use of LLMs within an MDP framework, modeling the thought process in a way that encompasses CoT (Chain of Thought), ToT (Tree of Thought), and Graph-of-Thought, all aimed at improving reasoning abilities. The idea of using dense rewards/progress feedback for search forms the technical core of inference-time optimization and search-based generation (e.g., the o1 model).
InverseRLignment: Approaches SFT (Supervised Fine-Tuning) from an Inverse RL perspective, highlighting the difficulties of alignment with preference-based data and proposing solutions for Alignment from Demonstration.
DataCOPE: Highlights that the essence of Reward Modeling is Off-Policy Evaluation (OPE). OPE does not necessarily improve with more data; what matters is the quality, not quantity—the key is the alignment between the data and the policy being evaluated. In reward modeling, less can be more!
Recently posted workshop position paper: Emphasizes that the Reward Model is the only and essential path for inference-time optimization. From an RL perspective, LLMs operate as online dynamics with an offline Reward Model, making the accuracy of the reward model a core issue for post-training and efficient, effective deployment.
References:
[1] Han, Ruijian, et al. "Asymptotic Theory of Sparse Bradley-Terry Model."
[2] Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs
[3] Yin, Yueqin, et al. "Relative Preference Optimization: Enhancing LLM Alignment Through Contrasting Responses Across Identical and Diverse Prompts." arXiv preprint arXiv:2402.10958 (2024).