Response to nSQF
Caption: In Fig 5a (left) we fix the data budget to n=1024, and in Fig 5a (right) we fix the compute budget to H=64. In a new experiment (above), we scale both n and H, and note that the performance gap between RL and SFT scales superlinearly (roughly as sqrt(H)), in agreement with our result in Theorem 5.1.
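For reference, a minimal sketch of how the sqrt(H) trend can be checked by fitting the scaling exponent of the gap against H. The budgets and gap values below are synthetic placeholders (not our measurements), used only to illustrate the fit:

```python
# Illustrative sketch only: estimate the scaling exponent of the RL-vs-SFT gap w.r.t. H.
# The gap values are synthetic placeholders generated to follow sqrt(H); they are NOT our results.
import numpy as np

rng = np.random.default_rng(0)
H = np.array([8, 16, 32, 64, 128, 256])                   # hypothetical compute budgets
gap = 0.01 * np.sqrt(H) * rng.uniform(0.9, 1.1, H.size)   # synthetic gap following ~sqrt(H)

# Fit log(gap) = a * log(H) + b; a slope a near 0.5 corresponds to a sqrt(H) trend.
slope, intercept = np.polyfit(np.log(H), np.log(gap), 1)
print(f"estimated exponent: {slope:.2f}")                  # ~0.5 under the sqrt(H) trend
```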
Response to eQEB
Caption: We extend Figure 7 to the single-prompt setting. In Fig 7, we bucket problems into easy/hard (base LLM success rate above/below 0.3) and plot the distribution of bi-level rewards on correct solution traces (higher reward means more test-time compute efficiency) for problems in each bucket. This already shows that Llama 3.1-8B is roughly 0.25-anti-concentrated on MATH prompts. To make this clearer, we also measure anti-concentration on individual problems. In a new experiment (above), we show that this roughly holds in practice as well. Since anti-concentration holds more trivially on easy problems, we randomly pick 16 hard problems (base LLM success rate < 0.25 on each) and visualize the reward distribution conditioned on each prompt separately; even on these hard problems, the anti-concentration parameter is not too small (> 0.23). We agree that there may be prompts where anti-concentration is indeed much smaller, but even in such cases it is possible for VB >> VF once we extend the anti-concentration definition to an average-case setting, as discussed in our rebuttal.
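To make the per-prompt measurement concrete, here is a minimal sketch of estimating such a parameter from sampled rewards. The thresholding rule (mass above the per-prompt mean reward) and the synthetic rewards are illustrative stand-ins, not the paper's formal definition or data:

```python
# Illustrative sketch only: estimate a per-prompt anti-concentration parameter from
# sampled bi-level rewards of correct traces. We assume, purely for illustration, that
# the parameter is the fraction of traces whose reward exceeds the per-prompt mean;
# the formal definition in the paper may differ.
import numpy as np

def anti_concentration(rewards_per_prompt: list[np.ndarray]) -> np.ndarray:
    """rewards_per_prompt[i] holds bi-level rewards of correct traces for prompt i."""
    return np.array([(r > r.mean()).mean() for r in rewards_per_prompt])

# Hypothetical usage with synthetic rewards for 16 hard prompts (placeholders):
rng = np.random.default_rng(0)
fake_rewards = [rng.beta(2, 5, size=64) for _ in range(16)]
params = anti_concentration(fake_rewards)
print(params.min())   # analogue of checking that the parameter stays > 0.23 across hard prompts
```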
Ablations for anti-concentration: Our experiments do consider ablations over heterogeneity, since that is a key distinguishing factor between VB and VF methods. We generally found that LLMs that are heterogeneous also satisfy the anti-concentration condition, though we agree that we should verify the same conditions on other base models. That said, fine-tuning with both VB and VF methods on 3B and 8B models from other families within the rebuttal period has been compute-intensive, and we plan to add these results in the final version of the paper.
Extending worst-case anti-concentration to an average-case notion while arriving at the same separation between VB & VF methods as Thm. 5.8
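For concreteness, one way such a relaxation could be written is sketched below; the symbols gamma, tau, rho and the exact form are our illustrative notation here, not necessarily the definitions used in the paper:

```latex
% Illustrative sketch only; the paper's formal definitions may differ.
% Worst-case (gamma, tau)-anti-concentration: the tail-mass condition holds for every prompt x,
%   \forall x:\;
%   \Pr_{y \sim \pi(\cdot \mid x)}\big[\, r(x,y) \ge \mathbb{E}_{y}[r(x,y)] + \tau \,\big] \ge \gamma .
% Average-case relaxation: the same tail mass need only be at least gamma in expectation over x ~ rho,
%   \mathbb{E}_{x \sim \rho}\Big[
%     \Pr_{y \sim \pi(\cdot \mid x)}\big[\, r(x,y) \ge \mathbb{E}_{y}[r(x,y)] + \tau \,\big]
%   \Big] \ge \gamma .
```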
Response to eQEB and qSM1