Response to Reviewer snRz
[Table A] Performance comparison of PARS with representative prior algorithms (EDAC, CQL, IQL, RIQL) under the reward-corruption scenario of [Ref.3], in which Uniform[−30, 30] noise is added to 30% of the original dataset rewards. PA was applied with α = 0.01, and the scores for EDAC, CQL, IQL, and RIQL were taken from [Ref.3]. RIQL is an algorithm proposed to make IQL robust to various types of corruption. We averaged the scores over five random seeds, with ± indicating the standard deviation. The results show that, even though RIQL was specifically designed for robustness, PARS outperforms it when the reward scale exceeds 100.
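For concreteness, below is a minimal sketch of the corruption protocol referenced above, assuming a D4RL-style dataset dictionary with a 1-D `rewards` array; the exact procedure is the one described in [Ref.3].

```python
import numpy as np

def corrupt_rewards(dataset, corrupt_ratio=0.3, noise_scale=30.0, seed=0):
    """Add Uniform[-noise_scale, noise_scale] noise to a random subset of rewards.

    `dataset` is assumed to be a D4RL-style dict with a 1-D 'rewards' array;
    the actual corruption setup follows [Ref.3].
    """
    rng = np.random.default_rng(seed)
    rewards = dataset["rewards"].copy()
    n_corrupt = int(corrupt_ratio * len(rewards))
    # Pick 30% of the transitions uniformly at random (without replacement).
    idx = rng.choice(len(rewards), size=n_corrupt, replace=False)
    # Add i.i.d. Uniform[-30, 30] noise to the selected rewards.
    rewards[idx] += rng.uniform(-noise_scale, noise_scale, size=n_corrupt)
    return {**dataset, "rewards": rewards}
```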
[Table B] Performance of PARS on the AntMaze tasks without critic ensembles. We averaged the scores over five random seeds, with ± indicating the standard deviation.
Response to Reviewer QMiG
[Fig A] Following the setup of [Ref.3], we plot the Q-values of actions across the feasible action range ([−1, 1]) at a fixed state in Inverted Double Pendulum, which has a 1-D action space. In-sample actions, i.e., actions that appear in the dataset for the corresponding state, are marked in blue; to identify them, we quantized the state following the method described in Appendix E of [Ref.3]. A sketch of this procedure is given below the legend.
Blue: in-sample actions, Red: OOD actions
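The sketch below outlines the plotting procedure for Fig A, assuming a PyTorch critic callable as `critic(states, actions)` and a D4RL-style dataset dict with `observations` and `actions` arrays; the uniform state binning here is only illustrative, and the actual quantization follows Appendix E of [Ref.3].

```python
import numpy as np
import matplotlib.pyplot as plt
import torch

def plot_q_slice(critic, dataset, state, n_grid=200, n_bins=20):
    """Plot Q(s, a) over the feasible 1-D action range [-1, 1] at a fixed state,
    marking in-sample actions identified via state quantization."""
    # Dense grid of candidate actions over the feasible region [-1, 1].
    action_grid = np.linspace(-1.0, 1.0, n_grid).reshape(-1, 1)
    s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0).repeat(n_grid, 1)
    a = torch.as_tensor(action_grid, dtype=torch.float32)
    with torch.no_grad():
        q = critic(s, a).squeeze(-1).cpu().numpy()  # assumed critic interface

    # Quantize all dataset states into uniform bins; actions whose state falls
    # into the same bin as the query state are treated as in-sample.
    obs = dataset["observations"]
    lo, hi = obs.min(axis=0), obs.max(axis=0)
    bins = np.floor((obs - lo) / (hi - lo + 1e-8) * n_bins)
    query_bin = np.floor((state - lo) / (hi - lo + 1e-8) * n_bins)
    in_sample = dataset["actions"][np.all(bins == query_bin, axis=1), 0]

    # Red curve: Q-values over the feasible range; blue dots: in-sample actions.
    plt.plot(action_grid[:, 0], q, color="red", label="Q over feasible actions")
    plt.scatter(in_sample, np.interp(in_sample, action_grid[:, 0], q),
                color="blue", label="in-sample actions", zorder=3)
    plt.xlabel("action")
    plt.ylabel("Q(s, a)")
    plt.legend()
    plt.show()
```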
[Table C] PARS performance comparison across different values of α. We averaged the scores over five random seeds, with ± indicating the standard deviation.
Response to Reviewer mnXg
[Fig B] Extended version of Figure 9 in the manuscript, with an isolated analysis of the effect of PA.
[Table D] Comparison of PARS with the prior SOTA on the NeoRL-2 benchmark (the algorithm in parentheses indicates the previous best-performing method). Each environment has distinct characteristics: Pipeline captures control with delayed action effects; Simglucose involves delayed responses and external variability in medical treatment; Fusion highlights extreme data scarcity in high-cost systems; and SafetyHalfCheetah enforces strict safety constraints with heavy penalties for violations. The baseline algorithms each require tuning 4 to 16 hyperparameters, whereas for PARS we fixed α = 0.1 and tuned only β over {0.01, 0.05, 0.5}.
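For reference, here is a minimal sketch of this tuning protocol. `train_and_evaluate` is a hypothetical placeholder for a full PARS training-plus-evaluation run on one NeoRL-2 task (not part of the released code), and selecting the best β per task is an assumption about how the sweep is reported.

```python
import numpy as np

def train_and_evaluate(task, alpha, beta, seed):
    """Hypothetical placeholder: a real run would train PARS on `task` with the
    given (alpha, beta) and seed, then return the resulting NeoRL-2 score."""
    return np.random.default_rng(seed).uniform(0.0, 100.0)  # dummy stand-in

TASKS = ["Pipeline", "Simglucose", "Fusion", "SafetyHalfCheetah"]
ALPHA = 0.1                   # alpha is fixed across all NeoRL-2 tasks
BETAS = [0.01, 0.05, 0.5]     # beta is the only hyperparameter swept
SEEDS = range(5)              # five random seeds, as in the table

results = {}
for task in TASKS:
    # Average the score over seeds for each beta, then keep the best beta.
    mean_scores = {
        beta: float(np.mean([train_and_evaluate(task, ALPHA, beta, s)
                             for s in SEEDS]))
        for beta in BETAS
    }
    best_beta = max(mean_scores, key=mean_scores.get)
    results[task] = (best_beta, mean_scores[best_beta])
print(results)
```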