Beyond Reward: Offline Preference-guided Policy Optimization

TL;DR

We propose a one-step offline preference-based reinforcement learning paradigm that directly optimizes the policy from preference supervision, without separately learning a reward function.

Abstract

This study focuses on offline preference-based reinforcement learning (PbRL), a variant of conventional reinforcement learning that dispenses with the need for online interaction or specification of reward functions. Instead, the agent is provided with fixed offline trajectories and human preferences between pairs of trajectories, from which it extracts the dynamics and task information, respectively. Since the dynamics and task information are orthogonal, a naive approach would involve preference-based reward learning followed by an off-the-shelf offline RL algorithm. However, this requires the separate learning of a scalar reward function, which is assumed to act as an information bottleneck in the learning process. To address this issue, we propose the offline preference-guided policy optimization (OPPO) paradigm, which models offline trajectories and preferences in a one-step process, eliminating the need for separately learning a reward function. OPPO achieves this by introducing an offline hindsight information matching objective for optimizing a contextual policy and a preference modeling objective for finding the optimal context. By iteratively optimizing these two objectives, OPPO obtains a well-performing decision policy. Our empirical results demonstrate that OPPO effectively models offline preferences and outperforms prior competing baselines, including offline RL algorithms run with either the true or a pseudo reward function.

Offline Preference-guided Policy Optimization (OPPO)

We present the offline preference-guided policy optimization (OPPO) approach, a one-step paradigm that simultaneously models offline preferences and learns the optimal decision policy without separately learning a reward function. This is achieved through two objectives: an offline hindsight information matching objective and a preference modeling objective. By iteratively optimizing these objectives, we derive a contextual policy that models the offline data and an optimal context that models the preferences; a minimal sketch of one such training step is given below.
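To make the two objectives concrete, here is a minimal PyTorch-style sketch of one OPPO training step. Everything in it is an illustrative assumption rather than the paper's exact design: the network sizes, the mean-pooled MLP trajectory encoder, the behavior-cloning (MSE) form of the hindsight information matching loss, and the cosine-similarity Bradley-Terry form of the preference loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM, Z_DIM, HIDDEN = 17, 6, 64, 256   # hypothetical sizes

class TrajEncoder(nn.Module):
    """Maps a trajectory (B, T, state+action) to a context z of shape (B, Z)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, Z_DIM))
    def forward(self, traj):
        return self.net(traj).mean(dim=1)    # pool over the time dimension

class ContextualPolicy(nn.Module):
    """Contextual policy pi(a | s, z)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + Z_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, ACTION_DIM))
    def forward(self, states, z):            # states: (B, T, S), z: (B, Z)
        z = z.unsqueeze(1).expand(-1, states.shape[1], -1)
        return self.net(torch.cat([states, z], dim=-1))

encoder, policy = TrajEncoder(), ContextualPolicy()
z_star = nn.Parameter(torch.zeros(Z_DIM))    # learnable "optimal context"
optimizer = torch.optim.Adam(
    [*encoder.parameters(), *policy.parameters(), z_star], lr=3e-4)

def oppo_step(states, actions, traj_pos, traj_neg):
    # 1) Offline hindsight information matching: reconstruct the dataset
    #    actions when the policy is conditioned on the trajectory's own context.
    z = encoder(torch.cat([states, actions], dim=-1))
    loss_him = F.mse_loss(policy(states, z), actions)

    # 2) Preference modeling: a Bradley-Terry-style loss that pulls z_star
    #    toward the context of the preferred trajectory (traj_pos) and away
    #    from the rejected one (traj_neg), scored here by cosine similarity.
    z_pos = F.normalize(encoder(traj_pos), dim=-1)
    z_neg = F.normalize(encoder(traj_neg), dim=-1)
    z_opt = F.normalize(z_star, dim=0)
    loss_pref = -F.logsigmoid((z_pos * z_opt).sum(-1)
                              - (z_neg * z_opt).sum(-1)).mean()

    # The two objectives are optimized iteratively / jointly.
    loss = loss_him + loss_pref
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_him.item(), loss_pref.item()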

OPPO focuses on both learning a high-dimensional z-space and evaluating policies within that space. This high-dimensional z-space captures more task-related information than a scalar reward, making it better suited for policy optimization. The optimal policy is then obtained by conditioning the contextual policy on the learned optimal context, as sketched below.
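At evaluation time the learned optimal context is held fixed and the contextual policy is conditioned on it at every step. Continuing the hypothetical names from the sketch above:

@torch.no_grad()
def act(state):                              # state: (STATE_DIM,)
    s = state.view(1, 1, -1)                 # add batch and time dimensions
    a = policy(s, z_star.detach().unsqueeze(0))
    return a.squeeze(0).squeeze(0)           # action of shape (ACTION_DIM,)

# Rollout loop (environment API assumed):
# obs = env.reset()
# while not done:
#     action = act(torch.as_tensor(obs, dtype=torch.float32))
#     obs, reward, done, info = env.step(action.numpy())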

Cite

@article{kang2023beyond,
  title={Beyond reward: Offline preference-guided policy optimization},
  author={Kang, Yachen and Shi, Diyuan and Liu, Jinxin and He, Li and Wang, Donglin},
  journal={arXiv preprint arXiv:2305.16217},
  year={2023}
}