Multi-objective reinforcement learning (MORL) aims to discover performant trade-off policies that balance multiple conflicting objectives. However, most existing methods either require quantifiable prior preferences or pursue policy diversity without considering the decision maker's (DM's) preferences, producing many policies of little practical significance. We propose a human-in-the-loop multi-policy optimization framework for preference-based MORL that interactively identifies policies of interest. Our approach proactively learns the DM's implicit preference information in real time without requiring any a priori knowledge. Furthermore, we integrate preference learning into a multi-policy parallel optimization framework that balances exploration and exploitation to discover high-quality policies aligned with the DM's preferences. Evaluations on a complex decision-making task based on a real-world quadrupedal-robot simulation demonstrate the superiority of our approach over standard methods. Additionally, comprehensive comparisons against multiple state-of-the-art peer algorithms on the MuJoCo benchmark and a multi-microgrid system design task further validate the effectiveness of the proposed method.
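The excerpt above does not spell out how the DM's implicit preferences are elicited. As a hedged illustration only, the sketch below fits a Bradley-Terry-style linear utility to pairwise comparisons between policies' objective vectors, which is one common way to learn a preference model without quantifiable priors; the function names (`fit_preference_model`, `utility`) and the linear-utility form are assumptions, not the framework described in the paper.

```python
# Hypothetical sketch: learn a linear utility u(f) = w . f from pairwise
# feedback "policy A is preferred over policy B", Bradley-Terry style.
# All names and the linear-utility assumption are illustrative only.
import numpy as np

def fit_preference_model(pairs, n_objectives, lr=0.1, epochs=200):
    """pairs: list of (f_winner, f_loser) objective vectors (np.ndarray)."""
    w = np.zeros(n_objectives)
    for _ in range(epochs):
        for f_win, f_lose in pairs:
            diff = f_win - f_lose
            p = 1.0 / (1.0 + np.exp(-w @ diff))   # P(winner preferred under current w)
            w += lr * (1.0 - p) * diff            # gradient ascent on the log-likelihood
    return w

def utility(w, f):
    """Estimated desirability of an objective vector f under the learned weights w."""
    return float(w @ f)

# Usage: two objectives, e.g. (speed, negative torque); the DM favours speed.
pairs = [(np.array([5.5, -28.0]), np.array([2.6, -14.0])),
         (np.array([5.8, -33.0]), np.array([0.0, -6.3]))]
w = fit_preference_model(pairs, n_objectives=2)
print(w, utility(w, np.array([4.9, -47.0])))
```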
Performance of policies obtained by PBMORL versus scalarized PPO under different preferences and reward-weight combinations.
Figure 1. Training process under the high-speed preference in the Unitree GO2 Robot Control task. Subfigures A–J illustrate the evolution of the non-dominated policy set during training.
Figure 2. Comparison with MORL baselines under the "balanced preference" setting.
Per-policy performance (average speed and average torque) under each preference:

Policy | Preference | Average speed (m/s) | Average torque (N·m)
πp1 | High speed | 5.54 | 28.46
– | High speed | 5.78 | 32.82
πp2 | Moderate speed | 2.60 | 13.68
– | Moderate speed | 2.59 | 13.65
πp3 | Energy efficiency | 0.00 | 6.27
πb5 | High speed | 4.33 | 47.05
πb4 | High speed | 4.79 | 47.34
πb3 | Moderate speed | 0.14 | 31.25
πb2 | Energy efficiency | 0.48 | 31.24
πb1 | Energy efficiency | 0.07 | 29.92
πc5 | High speed | 5.68 | 45.43
πc4 | High speed | 4.94 | 46.90
πc3 | Moderate speed | 3.98 | 46.36
πc2 | Energy efficiency | 0.00 | 40.55
πc1 | Energy efficiency | 0.00 | 43.45
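For context on the "reward-weight combinations" used by the scalarized-PPO baseline in the table above, linear scalarization typically collapses the objectives into a single reward via a fixed weighted sum before standard single-objective training. The weights and reward terms below are illustrative placeholders, not the values used in the experiments.

```python
# Illustrative linear scalarization of a two-objective reward
# (speed vs. torque/energy cost); the weights are placeholders.
def scalarized_reward(speed_reward: float, torque_cost: float,
                      w_speed: float = 1.0, w_torque: float = 0.1) -> float:
    """Single scalar reward fed to a standard PPO learner."""
    return w_speed * speed_reward - w_torque * torque_cost

# Example: a fast but torque-hungry step versus a slow, efficient one.
print(scalarized_reward(speed_reward=5.7, torque_cost=45.4))
print(scalarized_reward(speed_reward=0.5, torque_cost=6.3))
```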
Performance of non-dominated policies obtained by PBMORL under different preference settings.
Panels: Ant-v2, HalfCheetah-v2, Hopper-v2, Humanoid-v2, Swimmer-v2, and Walker2d-v2 (each with f1 preferred and with f2 preferred), and Hopper-v3 (with f1, f2, and f3 preferred).
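The panel titles above each name the objective (f1, f2, or f3) treated as preferred in that run. As a rough sketch only, the snippet below shows one simple way such a preference could be used to pick a policy from a set of candidates: filter to the non-dominated set, then select the policy that is best on the preferred objective. Both the filtering and the selection rule are assumptions for illustration, not the paper's procedure.

```python
# Illustrative selection of a preferred policy from a candidate set.
# Assumes all objectives are maximized; the names are hypothetical.
import numpy as np

def non_dominated(F):
    """Return indices of non-dominated rows of F (n_policies x n_objectives)."""
    keep = []
    for i, fi in enumerate(F):
        dominated = any(np.all(fj >= fi) and np.any(fj > fi)
                        for j, fj in enumerate(F) if j != i)
        if not dominated:
            keep.append(i)
    return keep

def pick_preferred(F, preferred_obj):
    """Among non-dominated policies, pick the one best on the preferred objective."""
    idx = non_dominated(F)
    return max(idx, key=lambda i: F[i][preferred_obj])

# Example with two objectives (f1, f2) for five candidate policies.
F = np.array([[5.0, 1.0], [4.0, 2.5], [2.0, 4.0], [1.0, 5.0], [3.0, 3.0]])
print(pick_preferred(F, preferred_obj=0))  # f1 is preferred -> picks the first policy
```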