Aakriti Agrawal*, Minghui Liu*, and Furong Huang
* Equal Contribution
Paper: https://arxiv.org/abs/2605.30451
Code Repository: https://github.com/umd-huang-lab/VeriGate
Can dense step-level feedback from trained PRMs fix GRPO's zero-gradient and credit-assignment failures?
Abstract: Group Relative Policy Optimization (GRPO) is an effective recipe for training reasoning models with verifier-based outcome rewards, but its supervision is sparse: when all sampled trajectories for a prompt receive the same verifier reward, the group-relative advantage collapses to zero and learning stalls. Outcome-only rewards also provide no step-level credit assignment, limiting exploration and making it harder to learn robust reasoning. We present VeriGate (Verifier-Gated Step-Level GRPO), a verifier-gated extension of GRPO that addresses these limitations with three design choices. First, VeriGate keeps the verifier in charge whenever verifier rewards induce a meaningful preference among sampled trajectories, and uses process supervision only when verifier rewards are degenerate. Second, instead of collapsing Process Reward Model (PRM) step scores into a single trajectory reward, VeriGate converts them into future-cumulated rewards to assign continuation-aware credit. Third, VeriGate transforms these rewards into group-normalized token-level advantages, restoring informative gradients and fine-grained credit assignment while remaining less susceptible to reward hacking than methods that optimize aggregated PRM scores.
Empirically, training on MATH with 1.5B and 7B Qwen2.5-Instruct models and evaluating on six reasoning benchmarks, VeriGate improves average accuracy by about 20% and 12% for 1.5B and 7B models respectively, substantially reduces zero-gradient failures, decreases reward-hacking behavior, and improves reasoning quality relative to outcome-only GRPO and PRM-as-outcome baselines.
--------------------------------------
Problem with Standard Group Relative Policy Optimization (GRPO): It uses verifier-based outcome rewards, which are reliable but extremely sparse.
Reward Degeneracy: When all sampled trajectories for a prompt receive the same verifier reward, the advantage collapses to zero and learning stalls.
Outcome-only rewards provide no step-level credit assignment, which limits exploration and makes it harder to learn robust reasoning.
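The reward-degeneracy failure described above can be reproduced in a few lines. The sketch below assumes a common formulation of the group-relative advantage, (reward - group mean) / (group std + eps); the function name `grpo_advantages`, the use of the population standard deviation, and the `eps` value are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt (assumed formulation).

    Each trajectory's advantage is its reward minus the group mean,
    divided by the group standard deviation plus a small epsilon.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group where the verifier separates good from bad trajectories:
# advantages are nonzero, so the policy gradient carries signal.
mixed = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# Degenerate groups: every trajectory gets the same verifier reward,
# so every advantage is exactly zero and the prompt contributes no gradient.
all_wrong = grpo_advantages([0.0, 0.0, 0.0, 0.0])
all_right = grpo_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(all_wrong)
print(all_right)
```

In the degenerate cases the numerator is zero for every trajectory, so the result is zero regardless of how the denominator is regularized.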
One Solution: Use PRMs
But PRMs are noisy and imperfect, which leads to reward hacking.
Our Solution: VeriGate (Verifier-Gated Step-Level GRPO). VeriGate integrates process supervision into GRPO through three core design choices:
Smart Gating: VeriGate keeps the verifier in charge whenever verifier rewards induce a meaningful preference. It only uses PRM process supervision as a fallback when all sampled trajectories receive a zero verifier reward (degenerate groups).
Future Cumulation of Step-Rewards: Instead of collapsing PRM step scores into a single brittle trajectory reward, VeriGate converts them into future-cumulated rewards. This ensures that tokens are credited accurately based on the quality of the continuation they enable (continuation-aware credit assignment).
Novel Token-Level Advantage Calculation: VeriGate transforms these future-cumulated rewards into group-normalized token-level advantages. This restores informative, fine-grained gradients while remaining less susceptible to reward hacking than methods that optimize aggregated PRM scores.
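The three design choices can be sketched together as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the gate falls back to the PRM only when every verifier reward in the group is zero, that "future-cumulated" means an undiscounted suffix sum of PRM step scores, and that normalization pools all step values in the group. The names `future_cumulated`, `verigate_advantages`, and `step_lengths` are hypothetical; see the code repository for the actual method.

```python
from statistics import mean, pstdev


def future_cumulated(step_scores):
    """Suffix sums: the value at step t is the sum of PRM scores from t onward.

    ASSUMPTION: undiscounted sum; the paper may use a different cumulation.
    """
    out, running = [], 0.0
    for score in reversed(step_scores):
        running += score
        out.append(running)
    return out[::-1]


def verigate_advantages(verifier_rewards, prm_step_scores, step_lengths, eps=1e-6):
    """Return one list of per-token advantages per trajectory.

    verifier_rewards: one outcome reward per trajectory.
    prm_step_scores:  per-trajectory list of PRM scores, one per reasoning step.
    step_lengths:     per-trajectory list of token counts, one per reasoning step.
    """
    if any(r != 0 for r in verifier_rewards):
        # Gate open: the verifier stays in charge (standard GRPO advantage,
        # broadcast to every token of the trajectory).
        mu, sigma = mean(verifier_rewards), pstdev(verifier_rewards)
        return [
            [(r - mu) / (sigma + eps)] * sum(lens)
            for r, lens in zip(verifier_rewards, step_lengths)
        ]

    # Gate closed: all verifier rewards are zero, so fall back to the PRM.
    cumulated = [future_cumulated(scores) for scores in prm_step_scores]
    # ASSUMPTION: normalize with statistics pooled over every step in the group.
    pooled = [value for traj in cumulated for value in traj]
    mu, sigma = mean(pooled), pstdev(pooled)

    advantages = []
    for traj, lens in zip(cumulated, step_lengths):
        tokens = []
        for value, n_tokens in zip(traj, lens):
            # Every token in a step shares that step's normalized value.
            tokens.extend([(value - mu) / (sigma + eps)] * n_tokens)
        advantages.append(tokens)
    return advantages


# Two trajectories, two steps each; both fail the verifier,
# so the PRM fallback supplies a nonzero, step-dependent signal.
fallback = verigate_advantages(
    verifier_rewards=[0.0, 0.0],
    prm_step_scores=[[0.9, 0.8], [0.2, 0.1]],
    step_lengths=[[2, 1], [1, 2]],
)
print(fallback)
```

When the group is not degenerate, the same call reduces to the outcome-only advantage, so PRM noise never overrides a verifier preference in this sketch.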
@misc{agrawal2026verigateverifiergatedsteplevelsupervision,
title={VeriGate: Verifier-Gated Step-Level Supervision for GRPO},
author={Aakriti Agrawal and Minghui Liu and Furong Huang},
year={2026},
eprint={2605.30451},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.30451},
}