Our experiments are designed to answer four key questions:
Does STAR achieve strong performance with limited feedback in both online and offline settings?
Can preference regularization mitigate overfitting in the reward model?
Does policy regularization alleviate Q-value overestimation compared to prior methods?
How do the two regularization components complement each other to improve overall learning efficiency?
Overall Performance
We evaluate STAR on 18 tasks drawn from standard benchmarks. In the online setting, we use three locomotion tasks from the DeepMind Control Suite (DMControl) and three robotic manipulation tasks from Meta-World. For the offline setting, we include eight challenging control tasks from D4RL and four robotic manipulation tasks from Robosuite.
Accuracy of Reward Model
We compare the learned reward model with the ground-truth reward function on the Cheetah Run and Window Open tasks. As shown in the following figure, we plot time-series curves for our method (STAR), its ablated variant without PMR, and the ground-truth reward. The results demonstrate that STAR with PMR produces a reward function that more closely tracks the ground-truth reward across time steps.
Figure: Learned reward versus ground-truth reward on Cheetah Run (left) and Window Open (right).
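As a reading aid, the following minimal sketch shows how such a comparison can be generated: roll out a fixed policy, record the ground-truth reward and the learned reward model's prediction at each step, and plot both time series. The interfaces `env`, `policy`, and `reward_model` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def reward_time_series(env, policy, reward_model, horizon=500):
    """Collect ground-truth and predicted rewards along one rollout.

    `env`, `policy`, and `reward_model` are assumed interfaces:
    a gym-style environment, a state -> action callable, and a
    learned (state, action) -> scalar reward predictor.
    """
    true_rewards, predicted_rewards = [], []
    state = env.reset()
    for _ in range(horizon):
        action = policy(state)
        next_state, reward, done, _ = env.step(action)   # ground-truth reward
        true_rewards.append(reward)
        predicted_rewards.append(reward_model(state, action))  # learned reward
        state = next_state
        if done:
            break
    return np.array(true_rewards), np.array(predicted_rewards)
```

Because the learned reward is only identified up to scale and shift under preference feedback, both series are typically normalized before plotting.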
Value Estimation
We assess the accuracy of value estimation by tracking the value-estimate trajectory during learning on Cheetah Run. The left figure charts the average value estimate over 10,000 states and compares it to an estimate of the true value. The true value is computed as the average discounted return obtained by following the current policy under the ground-truth reward.
A clear overestimation bias is observed during learning, with SURF overestimating by 32% and MRN by 25%. When constrained by the policy regularizer, STAR significantly reduces overestimation bias to just 7%, leading to a more accurate Q-function.
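To make the comparison concrete, here is a hedged sketch (assumed interfaces, not the paper's code) of how the two quantities in the figure can be computed: the critic's average Q-estimate over 10,000 sampled states, and a Monte Carlo estimate of the true value obtained by rolling out the current policy under the ground-truth reward.

```python
import numpy as np
import torch

GAMMA = 0.99            # assumed discount factor
ROLLOUT_HORIZON = 1000  # assumed episode length

def estimated_value(critic, policy, states):
    """Average critic estimate Q(s, pi(s)) over a fixed batch of evaluation states.

    `critic` and `policy` are assumed torch callables; `states` is a tensor
    of the 10,000 sampled evaluation states.
    """
    with torch.no_grad():
        actions = policy(states)
        q_values = critic(states, actions)
    return q_values.mean().item()

def true_value(env, policy, num_episodes=10):
    """Monte Carlo estimate of the true value: average discounted return
    when following the current policy under the ground-truth reward."""
    returns = []
    for _ in range(num_episodes):
        state, discount, ret = env.reset(), 1.0, 0.0
        for _ in range(ROLLOUT_HORIZON):
            action = policy(state)
            state, reward, done, _ = env.step(action)  # ground-truth reward
            ret += discount * reward
            discount *= GAMMA
            if done:
                break
        returns.append(ret)
    return float(np.mean(returns))

def overestimation_bias(q_estimate, v_true):
    """Relative overestimation, e.g. 0.07 for a 7% bias."""
    return (q_estimate - v_true) / abs(v_true)
```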
Contribution of Each Component
To assess the individual contributions of each technique and their interaction effects, we conduct an ablation study on preference margin regularization (PMR) and policy regularization (PR).
The right table reports performance on three tasks: Cheetah Run, Window Open, and Sweep Into. Removing either PMR or PR leads to a notable performance drop across all tasks, with the absence of both causing the most severe degradation.
The left figure shows how varying the parameter λ influences the performance of STAR; a sketch of the ablation and λ-sweep setup follows this paragraph.
Performance on Cheetah Run is measured by episode return, while Window Open and Sweep Into are measured by success rate.
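For reference, a minimal sketch of how the ablation variants and the λ sweep can be organized is given below. The assumption that λ weights the policy regularizer, the sweep values, and the configuration names are illustrative only; the precise definition of λ is given in the method section, not here.

```python
from dataclasses import dataclass

@dataclass
class STARConfig:
    use_pmr: bool = True   # preference margin regularization on the reward model
    use_pr: bool = True    # policy regularization on the actor/critic update
    lambda_: float = 1.0   # assumed: coefficient of the policy regularizer

# Ablation variants evaluated in the table (each toggles one or both regularizers).
ABLATIONS = {
    "STAR":          STARConfig(use_pmr=True,  use_pr=True),
    "STAR w/o PMR":  STARConfig(use_pmr=False, use_pr=True),
    "STAR w/o PR":   STARConfig(use_pmr=True,  use_pr=False),
    "STAR w/o both": STARConfig(use_pmr=False, use_pr=False),
}

# Sensitivity study over λ; values are illustrative, not the paper's grid.
LAMBDA_SWEEP = [0.1, 0.5, 1.0, 2.0]
```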
Human Experiments
We demonstrate that our method enables agents to acquire novel and diverse skills from non-expert human feedback. Specifically, we showcase: (a) a Walker lowering its center of gravity and marching, (b) a Cheetah agent prowling, (c) a Hopper performing a backflip, and (d) a Cheetah executing a high jump—each trained with only 50 human preference queries. These results highlight the effectiveness of our approach in guiding complex behavior acquisition, particularly in scenarios where manual reward design is challenging.
Figure: Novel skills learned from 50 human preference queries: (a) Walker March, (b) Cheetah Prowl, (c) Hopper Backflip, (d) Cheetah High Jump.