Section A: MRT Results for the Open-Ended Problem Setting (running MRT on top of DeepSeek-R1 distill base models)
Section B: Performance gains from MRT (RL) and MRT (STaR)
Section C: Clarification regarding Equation 2 (training objective for MRT)
Section D: Regret measurements for R1 models on OmniMATH
Section E: Total computation cost of MRT compared to outcome-reward GRPO
Section F: Standard outcome-reward RL vs MRT
Section G: Extrapolation to larger regrets with and without MRT
Section H: Analyzing DeepSeek-R1 on more problems
We fine-tuned base models that already produce traces with <think> markers using MRT. For STaR, we used two model sizes (7B and 1.5B) on 10,000 NuminaMath samples, comparing MRT (with the progress bonus) against vanilla STaR (outcome-only reward). For RL, we used 1.5B models, comparing MRT against vanilla GRPO (outcome-reward RL). One model was fine-tuned on 4,000 NuminaMath problems, while another (already fine-tuned on 40K MATH problems) was trained on 919 AIME problems. We also compared against an RL approach that penalizes token length. Evaluation used a 16K token budget, matching the maximum budget used during fine-tuning.
Following the protocol in DeepScaleR, we report the pass@1 performance of outcome-reward RL and MRT on multiple math reasoning datasets: AIME 2025, AIME 2024, and AMC 2023, using 20 samples per problem to reduce noise due to the limited size of these datasets.
We also run an additional comparison on top of the DeepScaleR-1.5B model, where we apply an explicit length penalty to improve the model's token efficiency, analogous to the approach taken in concurrent work. In agreement with the findings of that concurrent work, we find that incorporating a length penalty results in worse pass@1 accuracy.
We re-plot the main performance results in Figures 7 and 8 to highlight the token efficiency brought by MRT.
STaR with 7B base. We plot maj@K performance of models on AIME for K ∈ [1, 10] against the total tokens spent. We also run linearized search (dashed line) for MRT; the rest use parallel sampling.
RL with 1.5B base. Similar to the left plot, we report maj@K against the total tokens spent.
The term in red corresponds to the reward bonus, and it is evaluated under the distribution of contexts c_{j-1} consisting of prefixes produced by the previous LLM checkpoint, denoted π_old. The meta-prover policy μ can be any other LLM (e.g., an "-instruct" model that is told to use the episodes so far to guess the best answer) or the same LLM π itself after its thought block has terminated.
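To make the bonus concrete, below is a minimal Monte Carlo sketch of how such a progress bonus can be estimated from the meta-prover's success rate before and after a new episode. The callables `meta_prover_answer` and `is_correct`, and the sample count, are illustrative placeholders rather than our exact implementation.

```python
def estimate_progress_bonus(problem, prefix, episode,
                            meta_prover_answer, is_correct, n_samples=20):
    """Monte Carlo estimate of the progress made by one episode of thinking.

    The bonus is the change in the meta-prover's success rate when it is asked
    to terminate the thought block and guess an answer from the context before
    vs. after the new episode.
    """
    def success_rate(context):
        hits = [is_correct(problem, meta_prover_answer(problem, context))
                for _ in range(n_samples)]
        return sum(hits) / n_samples

    return success_rate(prefix + episode) - success_rate(prefix)
```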
Below is the R1 normalized regret curve for the 32B model. Each point represents the regret thus far, normalized by the number of episodes or tokens. As the number of episodes grows, direct and [maj@k]_j outperform maj@1.
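For intuition, here is a rough sketch of how a curve of this kind can be computed from per-episode success estimates; the comparator (`oracle_success`) and the exact estimator used for the plots are assumptions of this illustration, not the precise procedure.

```python
def normalized_regret(success_per_episode, oracle_success=1.0):
    """Cumulative regret after each episode, normalized by the episode count.

    success_per_episode[j] estimates the probability that forcing the model to
    answer right after episode j yields a correct solution; oracle_success
    plays the role of the best achievable success rate.
    """
    curve, cumulative = [], 0.0
    for j, p in enumerate(success_per_episode, start=1):
        cumulative += oracle_success - p   # regret incurred at this episode
        curve.append(cumulative / j)       # normalize by episodes so far
    return curve


# Example: a model that steadily improves its best guess over five episodes
print(normalized_regret([0.2, 0.4, 0.6, 0.8, 0.9]))
```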
Here is an example of a trajectory used for R1 analysis and generated in the open-ended setting. An episode is defined as a continuous segment of the model's thought (i.e., text enclosed between the '<think>' and '</think>' markers) uninterrupted by words such as "Wait" and "Alternatively", which break the current flow of logic. A "Time is up" sentence is appended after the model has spent multiple episodes, forcing it to generate a solution directly without further thinking.
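A simplified sketch of this segmentation is shown below. The marker words and the appended "Time is up" sentence come from the description above; the regexes and prompt formatting are illustrative assumptions.

```python
import re

MARKERS = ("Wait", "Alternatively")  # words treated as breaking the flow of logic
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
SPLIT_RE = re.compile(r"(?=\b(?:%s)\b)" % "|".join(MARKERS))

def split_into_episodes(trace: str):
    """Split the thought block of a trace into episodes at the marker words."""
    match = THINK_RE.search(trace)
    if match is None:
        return []
    # Zero-width split at each marker (Python 3.7+); text before the first
    # marker forms the first episode.
    return [seg.strip() for seg in SPLIT_RE.split(match.group(1)) if seg.strip()]

def force_direct_answer(episodes):
    """Rebuild a prefix whose thought block ends with a 'Time is up' sentence,
    so the model must produce a solution without further thinking."""
    return "<think>\n" + "\n".join(episodes) + "\nTime is up.\n</think>\n"
```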
We performed a detailed analysis of the computational costs associated with our proposed MRT method compared to classical approaches like STaR and GRPO. The analysis quantifies both forward generation and training costs using established FLOP estimation formulas.
To estimate computation costs, we employ two key formulas: the cost of forward generation is approximated as FLOPs_generation ≈ 2 · N · D_rollouts, and the cost of training as FLOPs_train ≈ 6 · N · D_train, where N is the number of model parameters, D_rollouts is the total number of tokens generated during inference, and D_train is the total number of tokens used during training.
For the STaR baseline, we sampled 200 full rollouts per problem and selected solutions that correctly solved each problem. In contrast, for MRT (STaR), we generated just 1 complete rollout per problem, then selected 10 prefixes from this rollout and sampled 20 continuations for each prefix to approximate the information gain.
When applied to the NuminaMath dataset containing 20,000 problems, using Llama-3.1-8B-Instruct with 4,000-token completions, this approach yielded 12,000 correct solutions for training (incorrect solutions were discarded). Training proceeded for three epochs.
The total FLOP count comes to 2.62912 × 10²⁰ for STaR and 2.64192 × 10²⁰ for MRT (STaR).
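As a sanity check, the STaR total follows directly from the 2 · N · D_rollouts and 6 · N · D_train estimates above. The short script below (treating the 8B model as N = 8 × 10⁹ parameters) reproduces the figure:

```python
N = 8e9  # approximate parameter count of Llama-3.1-8B-Instruct

# STaR baseline: 200 rollouts x 20,000 problems x 4,000-token completions
d_rollouts = 200 * 20_000 * 4_000
# Training data: 12,000 correct solutions x 4,000 tokens x 3 epochs
d_train = 12_000 * 4_000 * 3

flops = 2 * N * d_rollouts + 6 * N * d_train  # generation + training cost
print(f"{flops:.5e}")  # 2.62912e+20, matching the STaR total above
```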
For the GRPO baseline and MRT (RL), we used a different sampling strategy. In MRT (RL), we first generated 1 complete rollout per problem, then selected a prefix and sampled 10 rollouts to approximate the information gain. In both methods, given a prompt, we sampled 4 responses and optimized their group-relative advantage estimates.
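For reference, the group advantage estimation here follows the standard GRPO formulation. The sketch below (illustrative, not our training code) shows the computation for one prompt:

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantages for one prompt:
    each response's reward is normalized by the group mean and std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: 0/1 outcome rewards for the 4 responses sampled for one prompt
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # approx [ 1., -1., -1.,  1.]
```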
The resulting FLOP calculations are summarized in the comparison below.
Our analysis reveals that MRT (STaR) requires only 1.01× more FLOPs (2.64192 × 10²⁰ / 2.62912 × 10²⁰) than STaR to achieve comparable performance, while using 1.7× fewer tokens during inference. Similarly, MRT (RL) uses just 1.08× more FLOPs than GRPO to achieve equivalent performance, while requiring 1.6× fewer tokens during inference.
These results demonstrate that our MRT approach achieves a favorable trade-off between computational cost and token efficiency, making it particularly valuable for deployment scenarios where inference efficiency is critical.
Standard techniques for fine-tuning LLMs to use test-time compute optimize the outcome reward at the end of a long trace. This does not incentivize the model to use intermediate tokens to make progress (i.e., to increase the probability of eventual success), and leads to 1) unnecessarily long output traces and 2) an inability to make steady progress on new, hard problems, as shown in (a). MRT, shown in (b), trains the LLM to minimize cumulative regret over the entire output stream (red, shaded area) by optimizing a dense reward function in addition to the sparse 0/1 outcome reward, thus alleviating both challenges in (a).
The first four points are at budgets of 4096, 8192, 12288, and 16384 tokens. The next four points, shown with dashed lines, are extrapolations to C₀ = 20480, 24576, 28672, and 32768, corresponding to 2, 4, 6, and 8 extensions of the output trace, following the budget-forcing technique in s1. The left plot uses the STaR variant of MRT, and the right plot corresponds to the DeepScaleR-1.5B-Preview base model, where we run the RL variant. In both cases, we conduct this study on AIME 2025. Observe that MRT attains the smallest normalized regret, both when evaluating within the maximal budget and when extrapolating to larger budgets, even as outcome-reward training (e.g., Qwen-7B STaR) starts to plateau and collapse to the base model.
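For concreteness, a minimal sketch of how such extensions can be implemented is shown below. The `generate` callable, the 2048-token extension size (inferred from the budgets above), and the continuation word are assumptions of this illustration rather than the exact procedure.

```python
def extend_trace(generate, trace, n_extensions, tokens_per_extension=2048, word="Wait"):
    """Budget forcing in the style of s1: each extension suppresses the
    end-of-thinking delimiter and appends a continuation word, forcing the
    model to keep reasoning before it is allowed to answer."""
    for _ in range(n_extensions):
        if trace.endswith("</think>"):
            trace = trace[: -len("</think>")] + word  # suppress </think>
        trace += generate(trace, max_tokens=tokens_per_extension)
    return trace
```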
We extend Appendix C to further analyze DeepSeek R1 on a larger set of 293 AIME problems from the last 10 years (2015-2024). We show R1's scaling curves on these problems, and further include an analysis of the regret of these curves. As shown, on solutions with more episodes, direct and [maj@k]_j more often outperform maj@1 in accuracy and have lower regret.