PrefMMT: Modeling Human Preferences in Preference-based Reinforcement Learning with Multimodal Transformers
Dezhong Zhao*, Ruiqi Wang*, Dayoon Suh, Taehyeon Kim,
Ziqin Yuan, Byung-Cheol Min, and Guohua Chen
*:equal contribution
Abstract
Preference-based reinforcement learning (PbRL) shows promise in aligning robot behaviors with human preferences, but its success depends heavily on the accurate modeling of human preferences through reward models. Most methods adopt Markovian assumptions for preference modeling (PM), which overlook the temporal dependencies within robot behavior trajectories that impact human evaluations. While recent works have utilized sequence modeling to mitigate this by learning sequential non-Markovian rewards, they ignore the multimodal nature of robot trajectories, which consist of elements from two distinctive modalities: state and action. As a result, they often struggle to capture the complex interplay between these modalities that significantly shapes human preferences. In this paper, we propose a multimodal sequence modeling approach for PM by disentangling state and action modalities. We introduce a multimodal transformer network, named PrefMMT, which hierarchically leverages intra-modal temporal dependencies and inter-modal state-action interactions to capture complex preference patterns. We demonstrate that PrefMMT consistently outperforms state-of-the-art PM baselines on locomotion tasks from the D4RL benchmark and manipulation tasks from the Meta-World benchmark.
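For context, preference modeling in PbRL typically fits the reward model to pairwise human labels with a Bradley-Terry objective over segment returns. The snippet below is a minimal JAX sketch of this standard loss, included only to ground the discussion; it is not the exact PrefMMT training code, and the per-step rewards are assumed to come from whichever reward model is being trained.

```python
# Minimal JAX sketch of the standard Bradley-Terry preference loss used for
# preference modeling in PbRL (illustrative; not the exact PrefMMT training code).
import jax
import jax.numpy as jnp

def preference_loss(rewards_0: jnp.ndarray, rewards_1: jnp.ndarray, label: jnp.ndarray) -> jnp.ndarray:
    """rewards_*: per-step predicted rewards for two segments, shape [batch, T];
    label: human preference, 1.0 if segment 1 is preferred, 0.0 if segment 0 is, 0.5 if equal."""
    ret_0 = rewards_0.sum(axis=-1)        # segment return under the learned reward
    ret_1 = rewards_1.sum(axis=-1)
    p_1 = jax.nn.sigmoid(ret_1 - ret_0)   # Bradley-Terry probability that segment 1 wins
    eps = 1e-6                            # numerical stability for the log terms
    return -(label * jnp.log(p_1 + eps) + (1.0 - label) * jnp.log(1.0 - p_1 + eps)).mean()
```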
Comparison with Previous Preference Modeling Methods
Framework Overview
Illustration of the PrefMMT framework. Given a robot behavior trajectory as input, we first decouple the state and action modalities. Each unimodal sequence is then processed through an intra-modal encoder, where the temporal dependencies within the transitions of states and actions are explored. Subsequently, an inter-modal joint encoder captures the interactions between actions and states, outputting a series of non-Markovian rewards.
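As a rough illustration of this hierarchy, the sketch below shows one way the intra-modal and inter-modal stages could be composed in Flax/JAX. The module names, dimensions, and the interleaving-based joint encoder are hypothetical simplifications for illustration, not the released PrefMMT implementation.

```python
# Minimal Flax sketch of the hierarchy described above (hypothetical names, not the
# released PrefMMT code): per-modality causal self-attention followed by a joint
# encoder over interleaved state/action tokens, producing per-step rewards.
import jax.numpy as jnp
import flax.linen as nn

class CausalBlock(nn.Module):
    dim: int = 128
    heads: int = 4

    @nn.compact
    def __call__(self, x):  # x: [batch, seq, dim]
        mask = nn.make_causal_mask(jnp.ones(x.shape[:2]))
        h = nn.MultiHeadDotProductAttention(num_heads=self.heads)(x, x, mask=mask)
        x = nn.LayerNorm()(x + h)
        return nn.LayerNorm()(x + nn.Dense(self.dim)(nn.gelu(nn.Dense(4 * self.dim)(x))))

class PrefMMTSketch(nn.Module):
    dim: int = 128

    @nn.compact
    def __call__(self, states, actions):  # [batch, T, s_dim], [batch, T, a_dim]
        # Intra-modal encoders: temporal dependencies within each modality.
        s = CausalBlock(self.dim)(nn.Dense(self.dim)(states))
        a = CausalBlock(self.dim)(nn.Dense(self.dim)(actions))
        # Inter-modal joint encoder: interleave state/action tokens and attend jointly.
        tokens = jnp.concatenate([s[:, :, None], a[:, :, None]], axis=2)
        tokens = tokens.reshape(s.shape[0], -1, self.dim)   # [batch, 2T, dim]
        z = CausalBlock(self.dim)(tokens)
        # One non-Markovian reward per timestep (read from the action-token positions).
        return nn.Dense(1)(z[:, 1::2]).squeeze(-1)          # [batch, T]
```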
Experimental Demos
State intra-modal attention increases and remains high while the ant's movement direction (position change) is positive
Action intra-modal attention remains high while the ant stays flipped over
State-action inter-modal attention increases while the robot reaches for the handle and pushes the window closed
State-action inter-modal attention increases and remains high during the swinging motion
Experimental Details
For all experiments, we use the same JAX GPU implementation as the baselines for reproducibility, and we run them on an NVIDIA GeForce RTX 4090 GPU, following the hyperparameter configuration in Table 1.
During the experiments, we used a causal transformer with three layers and four attention heads, trained with the AdamW optimizer at a learning rate of 1e-4 and a linear warmup over the first 5% of the total gradient steps. The batch size was set to 256. For RL training, we followed the original open-source IQL implementation and used its default hyperparameter configuration, except for the hidden dimension.
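As an illustration of this optimizer setup, the snippet below is a minimal Optax sketch of AdamW with a linear warmup over the first 5% of gradient steps; `total_steps` is a placeholder for the actual number of gradient steps, not a value from the paper.

```python
# Minimal Optax sketch of AdamW with linear warmup over the first 5% of gradient
# steps, matching the settings described above (illustrative only).
import optax

total_steps = 100_000                     # placeholder; set to the actual number of gradient steps
warmup_steps = int(0.05 * total_steps)

schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=1e-4, transition_steps=warmup_steps),
        optax.constant_schedule(1e-4),
    ],
    boundaries=[warmup_steps],
)
optimizer = optax.adamw(learning_rate=schedule)
```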
For the tasks we constructed in Meta-World, we defined task-specific evaluation criteria used for human preference labeling, described below; a schematic sketch of the shared decision rule follows the task descriptions.
Window Close is a robotic arm motion control task where the objective is to randomly initialize the position of the window, drive the robotic arm to move, and control the end effector to push the window closed.
If either agent in the two trajectories exhibits behavior that meets human expectations, preference judgment is made based on the task objectives, followed by trajectory state evaluation. In another scenario, if both agents exhibit behavior that does not meet human expectations, such as unstable actions while moving the window, failing to close the window accurately, or leaving the window not fully closed, the trajectory that maintains a normal posture longer or closes the window more accurately or more completely is chosen as the preferred option according to the task objectives. If a clear preference judgment cannot be made in the above situations, both trajectories may be assigned the same preference value for comparison.
Drawer Open is a robotic arm motion control task where the objective is to randomly initialize a drawer's position, drive the robotic arm to move, and control the end effector to open the drawer.
If neither agent in the two trajectories exhibits behavior that falls short of human expectations, preference judgment is made based on the task objectives, followed by trajectory state evaluation. In another scenario, if both agents exhibit behavior that does not meet human expectations, such as unstable actions or failing to fully open the drawer, the trajectory that maintains a normal posture longer or opens the drawer wider is chosen as the preferred option according to the task objectives. If a clear preference judgment cannot be made in the above situations, both trajectories may be assigned the same preference value for comparison.
Sweep Into is a robotic arm motion control task where the objective is to randomly initialize the position of a target ball, drive the robotic arm to move, and control the end effector to grip the target ball and move it to the designated task position.
If neither agent in the two trajectories exhibits behavior that falls short of human expectations, preference judgment is made based on the task objectives, followed by trajectory state evaluation. In another scenario, if both agents exhibit behavior that does not meet human expectations, such as unstable actions or failing to move the target ball to the desired location, the trajectory that maintains a normal posture longer or moves the target ball closer to the target location is chosen as the preferred option according to the task objectives. If a clear preference judgment cannot be made in the above situations, both trajectories may be assigned the same preference value for comparison.
Button Press is a robotic arm motion control task where the objective is to randomly initialize the position of a button, drive the robotic arm to move, and control the end effector to press the button from top to bottom.
If neither agent in the two trajectories exhibits behavior that falls short of human expectations, preference judgment is made based on the task objectives, followed by trajectory state evaluation. In another scenario, if both agents exhibit behavior that does not meet human expectations, such as unstable actions while approaching the button, failing to press the button accurately, or not pressing the button fully, the trajectory that maintains a normal posture longer or presses the button more accurately or more deeply is chosen as the preferred option according to the task objectives. If a clear preference judgment cannot be made in the above situations, both trajectories may be assigned the same preference value for comparison.
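The four task rubrics above share the same decision structure. The snippet below is a schematic Python sketch of that shared rule, where `ok_0`/`ok_1` stand in for the human judgment of whether each trajectory meets expectations and `score_0`/`score_1` for the task-objective (or posture/closeness) comparison; the branch that prefers an acceptable trajectory over an unacceptable one is our reading of the rubric rather than an explicitly stated rule.

```python
# Schematic sketch of the shared preference-labeling rule (inputs stand in for
# human judgments; this is an illustration, not the actual labeling tool).
def label_preference(ok_0: bool, ok_1: bool, score_0: float, score_1: float) -> float:
    """Return 0.0 if trajectory 0 is preferred, 1.0 if trajectory 1 is, 0.5 if equal."""
    if ok_0 != ok_1:
        # Assumed reading: prefer the trajectory that meets human expectations.
        return 0.0 if ok_0 else 1.0
    # Both acceptable: scores compare task-objective achievement (then trajectory state).
    # Both unacceptable: scores compare normal-posture duration / closeness to completion.
    if score_0 > score_1:
        return 0.0
    if score_1 > score_0:
        return 1.0
    # No clear preference: assign the same preference value to both trajectories.
    return 0.5
```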
TABLE 1: Hyperparameter configuration used for PrefMMT training.