Listwise Reward Estimation for
Offline Preference-based Reinforcement Learning
ICML 2024
Heewoong Choi Sangwon Jung Hongjoon Ahn Taesup Moon
Seoul National University
TL;DR
We propose Listwise Reward Estimation (LiRE), a novel approach for offline PbRL.
LiRE leverages second-order preference information by constructing a Ranked List of Trajectories (RLT).
RLT can be efficiently constructed using the same ternary feedback type as traditional methods.
Motivation
There are two main approaches to utilizing second-order preference information.
☹️ More expensive to obtain than the ternary feedback type (i.e., more preferred, less preferred, or equally preferred).
☹️ Incomparable pairs may exist, and the short length of the lists limits the ability to fully utilize second-order information.
Q. Can we use the simple ternary feedback type while still utilizing the second-order preference effectively?
Listwise Reward Estimation (LiRE)
LiRE consists of three main steps.
We use binary search (any efficient sorting method could be used) to quickly find the position of a new sample in the RLT; see the first sketch after this list.
We use a pairwise loss to learn the reward model, yet the pairs enumerated from the RLT allow the reward model to learn second-order information.
Employing a linear score function helps amplify the reward differences in high-reward regions; see the second sketch after this list.
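To make the first step concrete, below is a minimal Python sketch of how an RLT could be built with binary-search insertion using only ternary feedback. The RLT is kept as a list of tie groups ordered from least to most preferred; the function names, the random choice of a group representative, and the data types are illustrative assumptions, not the exact implementation.

```python
import random
from typing import Callable, List

# Ternary feedback: returns 1 if traj_a is preferred, -1 if traj_b is preferred,
# and 0 if the annotator judges them equally preferable. (Hypothetical interface.)
FeedbackFn = Callable[[object, object], int]

def insert_into_rlt(rlt: List[List[object]], traj: object, feedback: FeedbackFn) -> int:
    """Insert `traj` into a Ranked List of Trajectories (RLT) via binary search.

    `rlt` is a list of groups ordered from least to most preferred; trajectories
    inside a group are treated as equally preferred. Returns the number of
    feedback queries used for this insertion.
    """
    lo, hi, queries = 0, len(rlt) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        anchor = random.choice(rlt[mid])  # compare against one group representative
        pref = feedback(traj, anchor)
        queries += 1
        if pref == 0:            # equally preferred: join the existing tie group
            rlt[mid].append(traj)
            return queries
        elif pref > 0:           # traj preferred over the anchor: search right half
            lo = mid + 1
        else:                    # anchor preferred: search left half
            hi = mid - 1
    rlt.insert(lo, [traj])       # no tie found: open a new group at position lo
    return queries

def build_rlt(trajectories: List[object], feedback: FeedbackFn) -> List[List[object]]:
    """Construct an RLT from scratch using ternary feedback only."""
    rlt: List[List[object]] = []
    for traj in trajectories:
        insert_into_rlt(rlt, traj, feedback)
    return rlt
```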
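For the second and third steps, the sketch below shows one plausible way to train the reward model with a pairwise cross-entropy loss whose preference probability uses a linear score (a ratio of summed positive rewards) instead of the exponential Bradley-Terry score. The network architecture, the softplus positivity constraint, and the handling of segments are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Per-step reward network; softplus keeps rewards positive so the linear
    score (a plain sum of rewards) yields a valid probability ratio below."""
    def __init__(self, obs_act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, length, obs_act_dim) -> (batch,) summed positive reward
        return F.softplus(self.net(segment)).squeeze(-1).sum(dim=-1)

def pairwise_loss_linear_score(model: RewardModel,
                               seg_less: torch.Tensor,
                               seg_more: torch.Tensor,
                               eps: float = 1e-8) -> torch.Tensor:
    """Cross-entropy over pairs where seg_more is preferred over seg_less.
    The preference probability uses a linear score, P = R_more / (R_less + R_more),
    rather than the exponential Bradley-Terry score."""
    r_less, r_more = model(seg_less), model(seg_more)
    p_more = r_more / (r_less + r_more + eps)
    return -torch.log(p_more + eps).mean()
```

The pairs fed to this loss are enumerated from the RLT (every trajectory in a lower group paired with every trajectory in a higher group), which is how a simple pairwise loss still absorbs the second-order information encoded by the list.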
Experiments
Feedback efficiency = (# of preference pairs) / (# of feedbacks)
Sample diversity = (# of sampled trajectories) / (# of feedbacks)
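As a hypothetical worked example of the feedback-efficiency metric (the group sizes and feedback count below are made up purely for illustration), the snippet counts how many labeled pairs an RLT induces relative to the feedbacks spent building it.

```python
from math import comb

def count_preference_pairs(group_sizes):
    """Pairs induced by an RLT whose tie groups have the given sizes:
    strictly ordered pairs (across groups) + equally preferred pairs (within groups)."""
    n = sum(group_sizes)
    within = sum(comb(g, 2) for g in group_sizes)
    across = comb(n, 2) - within
    return across, within

# Hypothetical RLT: 100 trajectories in 10 tie groups of 10, built with
# roughly 4 binary-search queries per insertion (~400 feedbacks in total).
across, within = count_preference_pairs([10] * 10)
feedbacks = 400
print(across, within)                 # 4500 strictly ordered pairs, 450 tied pairs
print((across + within) / feedbacks)  # feedback efficiency ~ 12.4
```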
With LiRE, each trajectory finds its place in ~2 comparisons.
The estimated rewards with LiRE correlate more strongly with the ground-truth rewards than those from the independent pairwise method.
LiRE performs much better by using second-order preference information.
Human experiments
We show that LiRE can be effective in real-world scenarios.
The figure below shows some trajectories from the RLT constructed by LiRE.
Least preferred < ... < Most preferred