Jeremy Tien, Jerry Zhi-Yang He, Zackory Erickson, Anca D. Dragan, Daniel S. Brown
@ ICLR 2023
We compare policies optimized with a reward learned from preferences (PREF) against policies optimized with the true reward (GT). The state features on which preferences are based are fully observable. Reward functions were trained on 52,326 unique pairwise preferences, and both PREF and GT are optimized with 1M RL iterations and averaged over 3 seeds. Despite high test accuracy on pairwise preference classification, the PREF policies achieve much lower performance under the true reward than the GT policies. However, the reward learned from preferences consistently ranks PREF above GT. This suggests that preference-based reward learning fails to recover a good reward for each of these tasks.
[Figure: results for Reacher, Feeding, and Itch Scratching, each comparing the GT and PREF policies.]
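The reward learning setup described above follows the standard approach of fitting a reward network to pairwise trajectory preferences with a Bradley-Terry (cross-entropy) objective. The sketch below is a minimal illustration of that objective, assuming a PyTorch reward model that sums per-step rewards into a trajectory return; the names (RewardNet, preference_loss) and the architecture are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed, not the authors' code) of Bradley-Terry
# preference-based reward learning on pairwise trajectory comparisons.
import torch
import torch.nn as nn


class RewardNet(nn.Module):
    """Assumed per-step reward model; trajectory return is the sum of step rewards."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (T, obs_dim) -> scalar predicted return for the trajectory
        return self.net(states).sum()


def preference_loss(reward_net: RewardNet,
                    traj_a: torch.Tensor,
                    traj_b: torch.Tensor,
                    label: float) -> torch.Tensor:
    """Cross-entropy on P(a preferred over b) = exp(R_a) / (exp(R_a) + exp(R_b)).
    label = 1.0 if trajectory a is preferred, 0.0 if trajectory b is preferred."""
    returns = torch.stack([reward_net(traj_a), reward_net(traj_b)])
    log_probs = torch.log_softmax(returns, dim=0)
    return -(label * log_probs[0] + (1.0 - label) * log_probs[1])
```

Training then minimizes this loss over the set of labeled preference pairs; the "pairwise preference classification test accuracy" reported above corresponds to how often the learned returns rank held-out pairs in agreement with the labels.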
Reacher, Half Cheetah, and Lunar Lander: https://github.com/jeremy29tien/gym
Feeding and Itch Scratching: https://github.com/jeremy29tien/assistive-gym