PrefMMT: Modeling Human Preferences in Preference-based Reinforcement Learning with Multimodal Transformers
Dezhong Zhao*, Ruiqi Wang*, Dayoon Suh, Taehyeon Kim,
Ziqin Yuan, Byung-Cheol Min, and Guohua Chen
*:equal contribution
Abstract
Preference-based reinforcement learning (PbRL) shows promise in aligning robot behaviors with human preferences, but its success depends heavily on the accurate modeling of human preferences through reward models. Most methods adopt Markovian assumptions for preference modeling (PM), which overlook the temporal dependencies within robot behavior trajectories that impact human evaluations. While recent works have utilized sequence modeling to mitigate this by learning sequential non-Markovian rewards, they ignore the multimodal nature of robot trajectories, which consist of elements from two distinctive modalities: state and action. As a result, they often struggle to capture the complex interplay between these modalities that significantly shapes human preferences. In this paper, we propose a multimodal sequence modeling approach for PM by disentangling state and action modalities. We introduce a multimodal transformer network, named PrefMMT, which hierarchically leverages intra-modal temporal dependencies and inter-modal state-action interactions to capture complex preference patterns. We demonstrate that PrefMMT consistently outperforms state-of-the-art PM baselines on locomotion tasks from the D4RL benchmark and manipulation tasks from the Meta-World benchmark.
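For context, preference modeling in PbRL typically fits the reward model to pairwise human labels with a Bradley-Terry objective over segment returns. The snippet below is a minimal JAX sketch of this standard loss, included only to ground the discussion; it is not the exact PrefMMT training code, and the per-step rewards are assumed to come from whichever reward model is being trained.

```python
# Minimal JAX sketch of the standard Bradley-Terry preference loss used for
# preference modeling in PbRL (illustrative; not the exact PrefMMT training code).
import jax
import jax.numpy as jnp

def preference_loss(rewards_0: jnp.ndarray, rewards_1: jnp.ndarray, label: jnp.ndarray) -> jnp.ndarray:
    """rewards_*: per-step predicted rewards for two segments, shape [batch, T];
    label: human preference, 1.0 if segment 1 is preferred, 0.0 if segment 0 is, 0.5 if equal."""
    ret_0 = rewards_0.sum(axis=-1)        # segment return under the learned reward
    ret_1 = rewards_1.sum(axis=-1)
    p_1 = jax.nn.sigmoid(ret_1 - ret_0)   # Bradley-Terry probability that segment 1 wins
    eps = 1e-6                            # numerical stability for the log terms
    return -(label * jnp.log(p_1 + eps) + (1.0 - label) * jnp.log(1.0 - p_1 + eps)).mean()
```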
Comparison with Previous Preference Modeling Methods
Framework Overview
Illustration of the PrefMMT framework. Given a robot behavior trajectory as input, we first decouple the state and action modalities. Each unimodal sequence is then processed through an intra-modal encoder, where the temporal dependencies within the transitions of states and actions are explored. Subsequently, an inter-modal joint encoder captures the interactions between actions and states, outputting a series of non-Markovian rewards.
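As a rough illustration of this hierarchy, the sketch below shows one way the intra-modal and inter-modal stages could be composed in Flax/JAX. The module names, dimensions, and the interleaving-based joint encoder are hypothetical simplifications for illustration, not the released PrefMMT implementation.

```python
# Minimal Flax sketch of the hierarchy described above (hypothetical names, not the
# released PrefMMT code): per-modality causal self-attention followed by a joint
# encoder over interleaved state/action tokens, producing per-step rewards.
import jax.numpy as jnp
import flax.linen as nn

class CausalBlock(nn.Module):
    dim: int = 128
    heads: int = 4

    @nn.compact
    def __call__(self, x):  # x: [batch, seq, dim]
        mask = nn.make_causal_mask(jnp.ones(x.shape[:2]))
        h = nn.MultiHeadDotProductAttention(num_heads=self.heads)(x, x, mask=mask)
        x = nn.LayerNorm()(x + h)
        return nn.LayerNorm()(x + nn.Dense(self.dim)(nn.gelu(nn.Dense(4 * self.dim)(x))))

class PrefMMTSketch(nn.Module):
    dim: int = 128

    @nn.compact
    def __call__(self, states, actions):  # [batch, T, s_dim], [batch, T, a_dim]
        # Intra-modal encoders: temporal dependencies within each modality.
        s = CausalBlock(self.dim)(nn.Dense(self.dim)(states))
        a = CausalBlock(self.dim)(nn.Dense(self.dim)(actions))
        # Inter-modal joint encoder: interleave state/action tokens and attend jointly.
        tokens = jnp.concatenate([s[:, :, None], a[:, :, None]], axis=2)
        tokens = tokens.reshape(s.shape[0], -1, self.dim)   # [batch, 2T, dim]
        z = CausalBlock(self.dim)(tokens)
        # One non-Markovian reward per timestep (read from the action-token positions).
        return nn.Dense(1)(z[:, 1::2]).squeeze(-1)          # [batch, T]
```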
Experimental Demos
State intra-modal attention increases and remains high while the ant's movement direction (position change) is positive
Action intra-modal attention remains high while the ant stays flipped over
State-action inter-modal attention increases while the robot reaches for the handle and pushes the window closed
State-action inter-modal attention increases and remains high during the swinging motion
Experimental Details
For all experiments, we use the same JAX GPU implementation as the baselines for reproducibility, and we run them on an NVIDIA GeForce RTX 4090 GPU, following the hyperparameter configuration in Table 1.
During the experiments, we used a causal transformer with three layers and four attention heads, trained with the AdamW optimizer at a learning rate of 1e-4 and a linear warmup over the first 5% of the total gradient steps. The batch size was set to 256. For RL training, we followed the original open-source IQL implementation and used its default hyperparameter configuration, except for the hidden dimension.
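As an illustration of this optimizer setup, the snippet below is a minimal Optax sketch of AdamW with a linear warmup over the first 5% of gradient steps; `total_steps` is a placeholder for the actual number of gradient steps, not a value from the paper.

```python
# Minimal Optax sketch of AdamW with linear warmup over the first 5% of gradient
# steps, matching the settings described above (illustrative only).
import optax

total_steps = 100_000                     # placeholder; set to the actual number of gradient steps
warmup_steps = int(0.05 * total_steps)

schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=1e-4, transition_steps=warmup_steps),
        optax.constant_schedule(1e-4),
    ],
    boundaries=[warmup_steps],
)
optimizer = optax.adamw(learning_rate=schedule)
```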
For the tasks we constructed in Meta-World, we defined task-specific evaluation criteria used for human preference labeling, described below; a schematic sketch of the shared decision rule follows the task descriptions.
Window Close is a robotic arm motion control task where the objective is to randomly initialize the position of the window, drive the robotic arm to move, and control the end effector to push the window closed.
If either agent in the two trajectories exhibits behavior that meets human expectations, preference judgment is made based on the task objectives, followed by trajectory state evaluation. In another scenario, if both agents exhibit behavior that does not meet human expectations, such as unstable actions while moving the window, failing to close the window accurately, or leaving the window not fully closed, the trajectory that maintains a normal posture longer or closes the window more accurately or more completely is chosen as the preferred option according to the task objectives. If a clear preference judgment cannot be made in the above situations, both trajectories may be assigned the same preference value for comparison.
Drawer Open is a robotic arm motion control task where the objective is to randomly initialize a drawer's position, drive the robotic arm to move, and control the end effector to open the drawer.
If neither agent in the two trajectories exhibits behavior that falls short of human expectations, preference judgment is made based on the task objectives, followed by trajectory state evaluation. In another scenario, if both agents exhibit behavior that does not meet human expectations, such as unstable actions or failing to fully open the drawer, the trajectory that maintains a normal posture longer or opens the drawer wider is chosen as the preferred option according to the task objectives. If a clear preference judgment cannot be made in the above situations, both trajectories may be assigned the same preference value for comparison.
Sweep Into is a robotic arm motion control task where the objective is to randomly initialize the position of a target ball, drive the robotic arm to move, and control the end effector to grip the target ball and move it to the designated task position.
If neither agent in the two trajectories exhibits behavior that falls short of human expectations, preference judgment is made based on the task objectives, followed by trajectory state evaluation. In another scenario, if both agents exhibit behavior that does not meet human expectations, such as unstable actions or failing to move the target ball to the desired location, the trajectory that maintains a normal posture longer or moves the target ball closer to the target location is chosen as the preferred option according to the task objectives. If a clear preference judgment cannot be made in the above situations, both trajectories may be assigned the same preference value for comparison.
Button Press is a robotic arm motion control task where the objective is to randomly initialize the position of a button, drive the robotic arm to move, and control the end effector to press the button from top to bottom.
If neither agent in the two trajectories exhibits behavior that falls short of human expectations, preference judgment is made based on the task objectives, followed by trajectory state evaluation. In another scenario, if both agents exhibit behavior that does not meet human expectations, such as unstable actions while approaching the button, failing to press the button accurately, or not pressing the button fully, the trajectory that maintains a normal posture longer or presses the button more accurately or more deeply is chosen as the preferred option according to the task objectives. If a clear preference judgment cannot be made in the above situations, both trajectories may be assigned the same preference value for comparison.
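The four task rubrics above share the same decision structure. The snippet below is a schematic Python sketch of that shared rule, where `ok_0`/`ok_1` stand in for the human judgment of whether each trajectory meets expectations and `score_0`/`score_1` for the task-objective (or posture/closeness) comparison; the branch that prefers an acceptable trajectory over an unacceptable one is our reading of the rubric rather than an explicitly stated rule.

```python
# Schematic sketch of the shared preference-labeling rule (inputs stand in for
# human judgments; this is an illustration, not the actual labeling tool).
def label_preference(ok_0: bool, ok_1: bool, score_0: float, score_1: float) -> float:
    """Return 0.0 if trajectory 0 is preferred, 1.0 if trajectory 1 is, 0.5 if equal."""
    if ok_0 != ok_1:
        # Assumed reading: prefer the trajectory that meets human expectations.
        return 0.0 if ok_0 else 1.0
    # Both acceptable: scores compare task-objective achievement (then trajectory state).
    # Both unacceptable: scores compare normal-posture duration / closeness to completion.
    if score_0 > score_1:
        return 0.0
    if score_1 > score_0:
        return 1.0
    # No clear preference: assign the same preference value to both trajectories.
    return 0.5
```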
TABLE 1: Hyperparameter configuration used for PrefMMT training.