Preference-based Policy Optimization for Multi-Objective Reinforcement Learning

Abstract

Multi-objective reinforcement learning (MORL) seeks high-performing trade-off policies that balance multiple conflicting objectives. However, most existing methods either require preferences to be quantified in advance or pursue policy diversity without considering the decision maker's (DM's) preferences, and so discover many policies of little practical significance. We propose a human-in-the-loop multi-policy optimization framework for preference-based MORL that interactively identifies policies of interest. Our approach actively learns the DM's implicit preferences in real time, without requiring any a priori knowledge. We further integrate preference learning into a parallel multi-policy optimization framework that balances exploration and exploitation to discover high-quality policies aligned with the DM's preferences. Evaluations on a complex decision-making task based on a simulation of a real-world quadrupedal robot show that our approach outperforms standard methods. We also conduct comprehensive comparisons against multiple state-of-the-art peer algorithms on the MuJoCo benchmark and a multi-microgrid system design task; the experimental results validate the effectiveness of the proposed method.

RESULTS: Quadruped Robot Control

Performance of policies obtained by PBMORL versus scalarized PPO under different preferences and reward-weight combinations.
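For readers unfamiliar with the scalarized baseline, the sketch below shows how a vector reward can be collapsed into the single scalar that a standard PPO learner optimizes. It assumes a linear weighted sum, which is the most common form of scalarization; the text here does not state the exact form used. The function name `scalarize`, the two objectives (speed and energy), and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def scalarize(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Collapse a vector reward into one scalar via a weighted sum.

    Assumption: linear scalarization. Each weight combination defines a
    different single-objective problem, so one policy is trained per
    weight vector.
    """
    if len(rewards) != len(weights):
        raise ValueError("rewards and weights must have the same length")
    return sum(w * r for w, r in zip(weights, rewards))


# Hypothetical per-step vector reward: (speed term, energy term).
step_reward = (2.0, -0.5)

# Two hypothetical reward-weight combinations.
speed_focused = scalarize(step_reward, (0.75, 0.25))
balanced = scalarize(step_reward, (0.5, 0.5))
```

Because each weight vector has to be fixed before training, covering a range of trade-offs this way means choosing and training on many weight combinations in advance, which is the requirement for quantified prior preferences that the abstract identifies as a limitation.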

PBMORL (OURS)

Figure 1. Training process under the high-speed preference in the Unitree GO2 Robot Control task. Subfigures A–J illustrate the evolution of the non-dominated policy set during training.
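The non-dominated policy set mentioned in the caption can be illustrated with a minimal Pareto filter. This is a sketch under stated assumptions, not the authors' implementation: every objective is assumed to be maximized, each policy is represented only by its vector of objective returns, and the function names and sample values are hypothetical.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` is no worse than `b` on every objective and
    strictly better on at least one (all objectives maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def non_dominated(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that no other point dominates."""
    return [
        p for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical (speed return, efficiency return) pairs for four policies.
returns = [(1.0, 0.2), (0.8, 0.5), (0.7, 0.4), (0.3, 0.9)]
front = non_dominated(returns)
```

In this example the policy with returns (0.7, 0.4) is dropped because (0.8, 0.5) is better on both objectives, while the other three remain since each is best on some trade-off. The pairwise check is quadratic in the number of policies, which is adequate for small policy sets.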