Table A. Normalized scores after online fine-tuning for each environment in the MuJoCo domain. r=random, m=medium, m-r=medium-replay. All results are reported as the mean and 95% confidence interval over 10 random seeds.
Table B. Normalized scores after online fine-tuning for each environment in the Antmaze domain. All results are reported as the mean and 95% confidence interval over 10 random seeds.
Table C. Normalized scores after online fine-tuning for each environment in the Adroit domain. All results are reported as the mean and 95% confidence interval over 10 random seeds.
Table D. Normalized scores of OPT with different initialization methods in the Antmaze domain. "Pre-trained with B_off and B_on" refers to a simple alternative trained on both datasets. For further explanation, please refer to Rebuttal R3-3. All results are reported as the mean and 95% confidence interval over 5 random seeds.
Table E. Comparison of normalized scores after online fine-tuning for each environment on the D4RL benchmark. We denote the backbone algorithm as "Vanilla" and the result of the algorithm integrated with OPT as "Ours". All results are reported as the mean and 95% confidence interval over 10 random seeds.
Table F. Ablation study results on κ in the Antmaze domain. All values are reported as the mean and 95% confidence interval over 5 random seeds. For further explanation, please refer to Rebuttal R3-4 or R4-2.
Table G. Results of the ablation study on the addition of a new value function. All results are reported as the mean and 95% confidence interval over 5 random seeds.
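The tables above report each entry as a mean and 95% confidence interval over random seeds. The captions do not state how the interval is computed, so the following is only a minimal sketch under an assumed normal approximation (mean ± z · standard error). The function name `mean_with_ci` and the example scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev


def mean_with_ci(scores, level=0.95):
    """Return (mean, half_width) of a normal-approximation confidence interval.

    Assumption: the interval is mean +/- z * (sample std / sqrt(n)). The
    captions do not specify the method actually used for the reported tables.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two seeds to estimate spread")
    mean = fmean(scores)
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(scores) / sqrt(n)
    return mean, half_width


# Hypothetical normalized scores from five seeds of one environment.
seed_scores = [80.0, 82.0, 84.0, 86.0, 88.0]
mean, half_width = mean_with_ci(seed_scores)
print(f"{mean:.1f} ± {half_width:.1f}")
```

With only 5 or 10 seeds, a Student-t critical value would give a somewhat wider interval than this normal approximation.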
Figure A. Estimation bias of the value function for TD3 and TD3+OPT, measured against the optimal value function. The bias for TD3+OPT is flat at first because of the Online Pre-Training phase. For further explanation, please refer to Rebuttal R1-3.
Figure B. Learning curves for different initialization methods, as discussed in Section 5.1 of the main paper. Solid lines indicate mean performance, and shaded regions represent 95% confidence intervals over 5 random seeds. For further explanation, please refer to Rebuttal R2-1.
Figure C. Estimation bias of the value function, comparing Qᵒᶠᶠ⁻ᵖᵗ, Qᵒⁿ⁻ᵖᵗ, and the combined Q against the optimal value function. The combined Q corresponds to the formulation used in OPT, as defined in Equation 4 of the main paper. For further explanation, please refer to Rebuttal R2-2.
Figure D. Interquartile Mean (IQM) comparison of baseline methods on the MuJoCo, Antmaze, and Adroit domains. The x-axis represents normalized scores. All results are based on 10 random seeds. IQM summarizes performance by averaging scores within the 25th to 75th percentile, reducing the influence of outliers and better capturing consistent trends across tasks.
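The IQM described in the Figure D caption can be sketched as a 25% trimmed mean: sort the scores, drop the lowest and highest quarter, and average the rest. This is an illustrative sketch rather than the code used to produce the figure. The function name `interquartile_mean` and the example scores are hypothetical, and how scores are pooled across seeds and tasks is an assumption.

```python
from statistics import fmean


def interquartile_mean(scores):
    """Mean of the middle 50% of scores (25% trimmed from each tail)."""
    ordered = sorted(scores)
    if not ordered:
        raise ValueError("scores must be non-empty")
    cut = int(0.25 * len(ordered))  # number of values dropped from each tail
    return fmean(ordered[cut:len(ordered) - cut])


# Hypothetical normalized scores pooled across seeds and tasks,
# with one large outlier to show the effect of trimming.
pooled = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 1000.0]
print(interquartile_mean(pooled))  # 35.0
print(fmean(pooled))               # 151.25
```

The single outlier pulls the plain mean to 151.25, while the IQM stays at 35.0, which is the robustness property the caption refers to.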
[Figure panels: MuJoCo, Antmaze, Adroit]
Figure E. Aggregated learning curves for baseline methods on the MuJoCo, Antmaze, and Adroit domains. Solid lines represent mean performance, and shaded regions indicate 95% confidence intervals across 10 random seeds.
Figure F. Comparison of wall-clock training time for TD3 and TD3 integrated with OPT on the walker2d-random-v2 environment using a single NVIDIA L40 GPU. For further explanation, please refer to Rebuttal R3-5.