Preference Transformer: Modeling Human Preferences using Transformers for RL


Changyeon Kim*, Jongjin Park*, Jinwoo Shin, Honglak Lee, Pieter Abbeel, Kimin Lee

*Equal Contribution



TL;DR

We introduce a transformer-based architecture for preference-based RL that models non-Markovian rewards, capturing human preferences over more complex tasks.


Abstract

Preference-based reinforcement learning (RL) provides a framework to train agents using human preferences between two behaviors without pre-defined rewards. However, preference-based RL has been challenging to scale since it requires a large amount of human feedback to learn a reward function aligned with human intent. In this paper, we present Preference Transformer, a neural architecture that models human preferences using transformers. Unlike prior approaches that assume human judgment is based on Markovian rewards contributing to the decision equally, we introduce a new preference model based on a weighted sum of non-Markovian rewards. We then design the proposed preference model using a transformer architecture that stacks causal and bidirectional self-attention layers. We demonstrate that Preference Transformer can solve a variety of control tasks using real human preferences, while the prior approach fails to work. We also show that Preference Transformer induces a well-aligned reward and attends to critical events in the trajectory by capturing the temporal dependencies in human decisions.

Preference Transformer (PT)

We present Preference Transformer, a neural architecture for modeling human preferences based on a weighted sum of non-Markovian rewards. Preference Transformer takes a trajectory segment as input, which allows it to extract task-relevant historical information. By stacking bidirectional and causal self-attention layers, Preference Transformer generates non-Markovian rewards and importance weights as outputs. We use them to define the preference model and find that Preference Transformer can induce a better-shaped reward and attend to critical events from human-generated preferences.

Preference predictor
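Concretely, given two trajectory segments \(\sigma^0\) and \(\sigma^1\) of length \(H\), the weighted-sum preference model described above can be written as follows (notation here is ours and slightly simplified):

\[
P\big[\sigma^1 \succ \sigma^0\big]
= \frac{\exp\!\Big(\sum_{t=1}^{H} w^1_t\, \hat{r}\big(\mathbf{s}^1_{\le t}, \mathbf{a}^1_{\le t}\big)\Big)}
       {\sum_{j \in \{0,1\}} \exp\!\Big(\sum_{t=1}^{H} w^j_t\, \hat{r}\big(\mathbf{s}^j_{\le t}, \mathbf{a}^j_{\le t}\big)\Big)},
\]

where \(\hat{r}\) is a non-Markovian reward conditioned on the sub-trajectory up to step \(t\) and \(w_t\) is the importance weight of that step; both are produced by the transformer. The predictor is trained by minimizing the cross-entropy between \(P[\sigma^1 \succ \sigma^0]\) and the human preference label.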

Preference attention layer
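Below is a minimal NumPy sketch of how such a preference attention layer can turn per-step features from the transformer backbone into non-Markovian rewards and importance weights. The projections (W_q, W_k, W_r), the shapes, and the simple mean over queries are illustrative assumptions, not the exact architecture from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def preference_attention(hidden, W_q, W_k, W_r):
    """Illustrative sketch of a preference attention layer.

    hidden : (H, d) per-step features from the transformer backbone.
    W_q, W_k : (d, d) query/key projections (hypothetical parameters).
    W_r : (d, 1) projection producing a scalar non-Markovian reward per step.

    Returns the segment score sum_t w_t * r_t, the per-step rewards r_t,
    and the importance weights w_t derived from the attention scores.
    """
    H, d = hidden.shape
    q = hidden @ W_q                       # (H, d) queries
    k = hidden @ W_k                       # (H, d) keys
    r = (hidden @ W_r).squeeze(-1)         # (H,) non-Markovian rewards as values

    attn = softmax(q @ k.T / np.sqrt(d))   # (H, H) bidirectional attention
    w = attn.mean(axis=0)                  # (H,) importance weights (avg over queries)
    score = float(w @ r)                   # weighted sum of non-Markovian rewards
    return score, r, w


def preference_prob(score_1, score_0):
    """Bradley-Terry style probability that segment 1 is preferred over segment 0."""
    return float(np.exp(score_1) / (np.exp(score_1) + np.exp(score_0)))


# Toy usage with random features and parameters.
rng = np.random.default_rng(0)
H, d = 10, 16
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_r = rng.normal(size=(d, 1))
score_1, _, _ = preference_attention(rng.normal(size=(H, d)), W_q, W_k, W_r)
score_0, _, _ = preference_attention(rng.normal(size=(H, d)), W_q, W_k, W_r)
print(preference_prob(score_1, score_0))
```

In this sketch the attention scores over the per-step rewards double as importance weights, so the segment score is the weighted sum of non-Markovian rewards used by the preference predictor above; the learned per-step rewards can then be used to label transitions for the downstream RL algorithm.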

Attention Weight Analysis

We provide sampled visualizations supporting the results of Figure 3 in the paper. The video (left) shows success/failure trajectory segments from antmaze-large-play-v2. For better visibility, we represent the weight as a color map on the rim of the video, i.e., frames with higher weights are highlighted with brighter colors. The graph (right) shows the learned reward and importance weight over each segment. We observe that the learned importance weights are well-aligned with human intent.

Success Case



Failure Case



Learning Complex Novel Behaviors

We demonstrate that Preference Transformer can be used to learn complex novel behaviors (Hopper multiple backflips) for which a suitable reward function is difficult to design. This task is more challenging than a single backflip, as the reward function must capture non-Markovian context such as the number of rotations. We observe that the agent trained with Preference Transformer performs multiple backflips with a stable landing, while the agent trained with an MLP-based Markovian reward (MR) struggles to land. We emphasize that the training setup for each agent is identical: both reward functions are trained with the same 300 human queries, and trajectories are rolled out at the same iteration.


Preference Transformer (PT)

Markovian Reward (MR)

Cite

@inproceedings{
  kim2023preference,
  title={Preference Transformer: Modeling Human Preferences using Transformers for {RL}},
  author={Changyeon Kim and Jongjin Park and Jinwoo Shin and Honglak Lee and Pieter Abbeel and Kimin Lee},
  booktitle={International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=Peot1SFDX0}
}