Generative Verifiers:
Reward Modeling as Next-Token Prediction
Verifiers or reward models are often used to enhance the reasoning performance of large language models (LLMs). A common approach is the Best-of-N method, where N candidate solutions generated by the LLM are ranked by a verifier, and the best one is selected. While LLM-based verifiers are typically trained as discriminative classifiers to score solutions, they do not utilize the text generation capabilities of pretrained LLMs. To overcome this limitation, we instead propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation. Compared to standard verifiers, such generative verifiers (GenRM) can benefit from several advantages of LLMs: they integrate seamlessly with instruction tuning, enable chain-of-thought reasoning, and can utilize additional test-time compute via majority voting for better verification. We demonstrate that GenRM outperforms discriminative verifiers, DPO verifiers, and LLM-as-a-Judge, resulting in a 16–40% improvement in the number of problems solved with Best-of-N on algorithmic and math reasoning tasks. Furthermore, we find that training GenRM with synthetic verification rationales is sufficient to pick out subtle errors on math problems. Finally, we demonstrate that generative verifiers scale favorably with model size and inference-time compute.
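To make the Best-of-N setup concrete, the following minimal Python sketch ranks N candidate solutions by a verifier score and returns the top one. It is an illustration only: `verifier_score` is a placeholder for any verifier (discriminative or generative), and none of these names come from the paper's code.

```python
# Minimal sketch of Best-of-N selection with a verifier.
# `verifier_score(question, solution)` is an illustrative placeholder returning a scalar score.

def best_of_n(question, candidate_solutions, verifier_score):
    """Return the candidate solution ranked highest by the verifier."""
    return max(candidate_solutions, key=lambda sol: verifier_score(question, sol))
```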
An illustration of generative verifiers, namely GenRM and GenRM-CoT.
Given a question and a candidate solution, GenRM directly finetunes an LLM to answer the question ‘Is the answer correct (Yes/No)?’ via SFT on the next-token response corresponding to either ‘Yes’ or ‘No’. During inference, the verifier score is obtained by extracting the probability of the ‘Yes’ token.
In comparison, GenRM-CoT finetunes an LLM to produce a verification chain-of-thought (CoT) rationale before yielding the final Yes/No token. At test time, we sample multiple CoT rationales and use majority voting to compute the average probability of ‘Yes’, enabling GenRM-CoT to utilize additional inference compute for better verification.
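As a concrete illustration of the two scoring schemes described above, the sketch below shows how the verifier scores could be computed in practice. This is a minimal, hypothetical implementation assuming a Hugging Face-style causal LM; the checkpoint name, prompt format, and helper callables are assumptions for illustration and are not taken from the paper's code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hypothetical finetuned verifier checkpoint (illustrative name).
tokenizer = AutoTokenizer.from_pretrained("genrm-checkpoint")
model = AutoModelForCausalLM.from_pretrained("genrm-checkpoint")


def genrm_score(question: str, solution: str) -> float:
    """GenRM: verifier score = P('Yes') as the next token after the verification prompt."""
    prompt = f"{question}\n{solution}\nIs the answer correct (Yes/No)? "
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    yes_id = tokenizer.convert_tokens_to_ids("Yes")  # assumes 'Yes' maps to a single token
    return probs[yes_id].item()


def genrm_cot_score(question, solution, sample_rationale, prob_yes, num_votes=32):
    """GenRM-CoT: average P('Yes') over `num_votes` temperature-sampled verification rationales.

    `sample_rationale` and `prob_yes` are assumed callables built on the same verifier LLM.
    """
    scores = [
        prob_yes(question, solution, sample_rationale(question, solution))
        for _ in range(num_votes)
    ]
    return sum(scores) / len(scores)
```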
Example using generative CoT verifier on GSM8K test.
LLM-generated solutions often sound convincing even when they are wrong, making verification a challenging task. Here, the solution is incorrect because it has ignored the word ‘each’ in the problem. While the discriminative RM fails to recognize this subtle mistake in the solution, our GenRM-CoT verifier reliably detects the error. This is because GenRM-CoT was trained with next-token prediction on synthetic chain-of-thought rationales, enabling it to explicitly reason about the solution.
Generative verifiers outperform standard verification approaches in terms of Best-of-N performance on reasoning tasks, with a fixed generator. Here, Δ = (GenRM-CoT − Disc-RM)/(Pass@N − Disc-RM) measures how much better the generative CoT verifier performs than the discriminative RM, as a fraction of the maximum achievable gain over the discriminative RM from an oracle verifier (Pass@N). GenRM-CoT leverages the generation capabilities of LLMs, enabling a finetuned verifier to use chain-of-thought verification to detect subtle reasoning errors.
For algorithmic tasks, we report average performance using Gemma-2B verifiers on Last Letter Concatenation (Wei et al., 2022) and BBH Word Sorting (Suzgun et al., 2022). For math reasoning, we train Gemma2-9B verifiers on GSM8K and evaluate them on the GSM8K test set (middle) as well as easy-to-hard generalization on MATH500 (Lightman et al., 2023). For the math tasks, LLM-as-a-Judge uses Gemini 1.0 Pro, the same model we used to generate synthetic verification rationales for GSM8K training. The algorithmic reasoning tasks use programmatically generated oracle verification rationales as training data for GenRM-CoT, whereas the math tasks use model-generated verification rationales.
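For clarity, the relative-gain metric Δ defined in the caption above can be written as a one-line function; this is a trivial sketch and the variable names are ours, not the paper's.

```python
def relative_gain(genrm_cot: float, disc_rm: float, pass_at_n: float) -> float:
    """Delta = (GenRM-CoT - Disc-RM) / (Pass@N - Disc-RM): the fraction of the oracle
    verifier's headroom over the discriminative RM that GenRM-CoT recovers."""
    return (genrm_cot - disc_rm) / (pass_at_n - disc_rm)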
Scaling Inference-time Compute for Verification on GSM8K.
By posing reward modeling as next-token prediction, GenRM-CoT can utilize chain-of-thought reasoning and majority voting to turn additional test-time compute into a higher percentage of problems solved under Best-of-N. Here, the horizontal line corresponds to the performance of the GenRM-CoT verifier with greedy decoding.
Easy-to-Hard Generalization on MATH, with Gemma2-9B verifiers trained only on significantly easier grade-school math problems. Compared to the discriminative RM, GenRM-CoT performs especially well on Prealgebra, Algebra, and Precalculus, and obtains superior performance across all difficulty levels.
Another example where GenRM-CoT catches a subtle mistake that the discriminative verifier is unable to catch.
The candidate solution did not convert 90 minutes into 1.5 hours before dividing by 7.5. The discriminative verifier fails to detect this mistake, likely because the solution still appears to produce a valid-sounding percentage (90/7.5 = 12). Our proposed GenRM-CoT model identifies the error using step-by-step generative verification.
An example on MATH where GenRM-CoT (trained only on GSM8K) detects a reasoning error.
The solution made a mistake in simplifying an intermediate step. Both the discriminative RM and the GenRM-CoT model have only been trained on GSM8K. In this case, the discriminative RM fails to classify the solution as incorrect, whereas GenRM-CoT uses chain-of-thought verification to catch the mistake.