A Constrained Multi-Objective Reinforcement Learning Framework
Abstract
Many real-world problems, especially in robotics, require that reinforcement learning (RL) agents learn policies that not only maximize an environment reward, but also satisfy constraints. We propose a high-level framework for solving such problems that treats the environment reward and costs as separate objectives, and learns what preference over objectives the policy should optimize for in order to meet the constraints. We call this Learning Preferences and Policies in Parallel (LP3). By making different choices for how to learn the preference and how to optimize the policy given the preference, we can recover existing approaches (e.g., Lagrangian relaxation) and derive novel approaches that lead to better performance. One of these is an algorithm that learns a set of constraint-satisfying policies, useful when the exact constraint is not known a priori.
LP3 Framework
The LP3 framework consists of two modules: Module 1 runs a multi-objective RL algorithm that trains a policy to optimize the objectives according to a given preference, and Module 2 learns the preference itself so that the resulting policy satisfies the constraints. In this framework, the common approach of Lagrangian relaxation corresponds to using linear scalarization in Module 1 as the multi-objective RL algorithm, and learning a single preference in Module 2.
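To make this concrete, here is a minimal sketch of Lagrangian relaxation viewed through the two-module lens. The one-parameter policy, the toy `env_rollout` function, and all learning rates are hypothetical stand-ins (not the paper's setup): Module 1 ascends the linearly scalarized objective r − λ·c, and Module 2 learns the single preference λ by moving it toward constraint satisfaction.

```python
import numpy as np

cost_limit = 0.1   # constraint threshold d: require E[cost] <= d
lam = 0.0          # Lagrange multiplier, i.e. the single learned "preference"
lam_lr = 0.05

def env_rollout(policy_param):
    """Toy stand-in: returns (mean reward, mean cost) for a policy.
    A more aggressive policy earns reward faster but incurs more cost."""
    reward = 1.0 - (policy_param - 1.0) ** 2
    cost = 0.5 * policy_param
    return reward, cost

policy_param = 1.0
policy_lr = 0.1
for step in range(2000):
    reward, cost = env_rollout(policy_param)
    # Module 1: ascend the linearly scalarized objective r - lam * c
    # (finite-difference gradient, purely for the sketch)
    eps = 1e-4
    r2, c2 = env_rollout(policy_param + eps)
    grad = ((r2 - lam * c2) - (reward - lam * cost)) / eps
    policy_param += policy_lr * grad
    # Module 2: learn the preference lam; it grows while the constraint
    # is violated and shrinks (down to 0) once the cost is under the limit
    lam = max(0.0, lam + lam_lr * (cost - cost_limit))

reward, cost = env_rollout(policy_param)
# cost settles near cost_limit; lam encodes the learned trade-off
```

The key point the sketch illustrates is that λ plays exactly the role of a preference weight over the two objectives, which is what motivates swapping in other preference-learning or policy-optimization choices in each module.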
Based on the LP3 framework, we investigate two potential improvements over Lagrangian relaxation:
In Module 1, using MO-MPO instead of linear scalarization. This allows us to train policies that better optimize for the given preferences, since MO-MPO has been shown to outperform linear scalarization in continuous control domains.
In Module 2, learning a probability distribution over preferences rather than a single preference. This allows us to recover a set of constraint-satisfying policies that make different trade-offs, rather than just a single policy.
This leads us to two novel algorithms for constrained RL, LP3 [MO-MPO] and LP3 [MO-MPO-D]. The former incorporates the first improvement, and the latter incorporates both improvements.
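The second improvement can be illustrated with a deliberately simplified sketch. The discrete candidate set, the toy `evaluate` function, and the reweighting rule below are hypothetical illustrations of "learning a distribution over preferences" in Module 2; they are not the MO-MPO-D update itself.

```python
import numpy as np

rng = np.random.default_rng(1)
cost_limit = 0.1

def evaluate(preference):
    """Toy stand-in for 'train a policy for this preference, then measure it'.
    More weight on the cost objective -> lower reward, but also lower cost."""
    reward = 1.0 - preference
    cost = 0.5 * (1.0 - preference)
    return reward, cost

# Candidate preferences (weight on the cost objective) and a uniform prior.
prefs = np.linspace(0.0, 1.0, 11)
probs = np.ones_like(prefs) / len(prefs)

for _ in range(500):
    # Module 2: sample a preference, evaluate the policy it induces, and
    # down-weight preferences whose policies violate the constraint.
    i = rng.choice(len(prefs), p=probs)
    _, cost = evaluate(prefs[i])
    if cost > cost_limit:
        probs[i] *= 0.5
        probs /= probs.sum()

# Mass concentrates on constraint-satisfying preferences, so sampling from
# the learned distribution yields a *set* of valid trade-offs, not one policy.
satisfying = prefs[probs > 1.0 / len(prefs)]
```

Because the distribution retains mass on every constraint-satisfying preference rather than collapsing to a point, a preference-conditioned policy trained this way can realize different trade-offs at test time, which is the behavior the LP3 [MO-MPO-D] variant is designed for.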
Takeaways
Compared to existing approaches for constrained RL, LP3 [MO-MPO(-D)]:
- is more sample-efficient, both in terms of solving the task and satisfying the constraint
- finds higher-quality solutions, for concave Pareto fronts and difficult-to-satisfy constraints
- enables finding a set of constraint-satisfying policies, rather than just a single one
  - and can do this for a task with four objectives, two of which have constraints
- better solves problems with soft constraints