Listwise Reward Estimation for
Offline Preference-based Reinforcement Learning
ICML 2024
Heewoong Choi Sangwon Jung Hongjoon Ahn Taesup Moon
Seoul National University
TL;DR
We propose Listwise Reward Estimation (LiRE), a novel approach for offline PbRL.
LiRE leverages second-order preference information by constructing a Ranked List of Trajectories (RLT).
RLT can be efficiently constructed using the same ternary feedback type as traditional methods.
Motivation
There are two main approaches to utilizing second-order preference information.
☹️ More expensive to obtain than the ternary feedback type (i.e., more preferred, less preferred, or equally preferred).
☹️ Incomparable pairs may exist, and the short length of the lists limits the ability to fully utilize second-order information.
Q. Can we use the simple ternary feedback type while still utilizing the second-order preference effectively?
Listwise Reward Estimation (LiRE)
LiRE consists of three main steps.
We use binary search (any efficient sorting method could be used) to quickly find the position of a new sample in the RLT; see the first sketch after this list.
We use a pairwise loss to learn the reward model, yet the pairs enumerated from the RLT allow the reward model to learn second-order information.
Employing a linear score function helps amplify the reward differences in high-reward regions; see the second sketch after this list.
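To make the first step concrete, below is a minimal Python sketch of how an RLT could be built with binary-search insertion using only ternary feedback. The RLT is kept as a list of tie groups ordered from least to most preferred; the function names, the random choice of a group representative, and the data types are illustrative assumptions, not the exact implementation.

```python
import random
from typing import Callable, List

# Ternary feedback: returns 1 if traj_a is preferred, -1 if traj_b is preferred,
# and 0 if the annotator judges them equally preferable. (Hypothetical interface.)
FeedbackFn = Callable[[object, object], int]

def insert_into_rlt(rlt: List[List[object]], traj: object, feedback: FeedbackFn) -> int:
    """Insert `traj` into a Ranked List of Trajectories (RLT) via binary search.

    `rlt` is a list of groups ordered from least to most preferred; trajectories
    inside a group are treated as equally preferred. Returns the number of
    feedback queries used for this insertion.
    """
    lo, hi, queries = 0, len(rlt) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        anchor = random.choice(rlt[mid])  # compare against one group representative
        pref = feedback(traj, anchor)
        queries += 1
        if pref == 0:            # equally preferred: join the existing tie group
            rlt[mid].append(traj)
            return queries
        elif pref > 0:           # traj preferred over the anchor: search right half
            lo = mid + 1
        else:                    # anchor preferred: search left half
            hi = mid - 1
    rlt.insert(lo, [traj])       # no tie found: open a new group at position lo
    return queries

def build_rlt(trajectories: List[object], feedback: FeedbackFn) -> List[List[object]]:
    """Construct an RLT from scratch using ternary feedback only."""
    rlt: List[List[object]] = []
    for traj in trajectories:
        insert_into_rlt(rlt, traj, feedback)
    return rlt
```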
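For the second and third steps, the sketch below shows one plausible way to train the reward model with a pairwise cross-entropy loss whose preference probability uses a linear score (a ratio of summed positive rewards) instead of the exponential Bradley-Terry score. The network architecture, the softplus positivity constraint, and the handling of segments are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Per-step reward network; softplus keeps rewards positive so the linear
    score (a plain sum of rewards) yields a valid probability ratio below."""
    def __init__(self, obs_act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, length, obs_act_dim) -> (batch,) summed positive reward
        return F.softplus(self.net(segment)).squeeze(-1).sum(dim=-1)

def pairwise_loss_linear_score(model: RewardModel,
                               seg_less: torch.Tensor,
                               seg_more: torch.Tensor,
                               eps: float = 1e-8) -> torch.Tensor:
    """Cross-entropy over pairs where seg_more is preferred over seg_less.
    The preference probability uses a linear score, P = R_more / (R_less + R_more),
    rather than the exponential Bradley-Terry score."""
    r_less, r_more = model(seg_less), model(seg_more)
    p_more = r_more / (r_less + r_more + eps)
    return -torch.log(p_more + eps).mean()
```

The pairs fed to this loss are enumerated from the RLT (every trajectory in a lower group paired with every trajectory in a higher group), which is how a simple pairwise loss still absorbs the second-order information encoded by the list.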
Experiments
Feedback efficiency = (# of preference pairs) / (# of feedbacks)
Sample diversity = (# of sampled trajectories) / (# of feedbacks)
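As a hypothetical worked example of the feedback-efficiency metric (the group sizes and feedback count below are made up purely for illustration), the snippet counts how many labeled pairs an RLT induces relative to the feedbacks spent building it.

```python
from math import comb

def count_preference_pairs(group_sizes):
    """Pairs induced by an RLT whose tie groups have the given sizes:
    strictly ordered pairs (across groups) + equally preferred pairs (within groups)."""
    n = sum(group_sizes)
    within = sum(comb(g, 2) for g in group_sizes)
    across = comb(n, 2) - within
    return across, within

# Hypothetical RLT: 100 trajectories in 10 tie groups of 10, built with
# roughly 4 binary-search queries per insertion (~400 feedbacks in total).
across, within = count_preference_pairs([10] * 10)
feedbacks = 400
print(across, within)                 # 4500 strictly ordered pairs, 450 tied pairs
print((across + within) / feedbacks)  # feedback efficiency ~ 12.4
```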
With LiRE, each trajectory finds its place in ~2 comparisons.
The estimated rewards with LiRE correlate more strongly with the ground-truth rewards than those from the independent pairwise method.
LiRE performs much better by using second-order preference information.
Human experiments
We show that LiRE can be effective in real-world scenarios.
The figure below shows some trajectories from the RLT constructed by LiRE.
Least preferred < ... < Most preferred