Solution Quality

LP3 [MO-MPO(-D)] outperforms existing approaches on 7 out of 9 continuous control tasks, across three task suites (Sections 5.1 and 5.3 in the paper). Our main baseline is LP3 [LS], a Lagrangian-based approach that relies on linear scalarization (LS). Here, we show the policies learned for humanoid walk and point goal.

In humanoid walk, the task reward is given for walking (in any direction) and the constraint is on the per-timestep action norm.
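To make this setup concrete, here is a minimal sketch of the per-timestep action-norm constraint, assuming an L2 norm and 1000-step episodes (which matches the per-timestep -2 vs. per-episode -2000 numbers below); the helper names are ours, not the paper's:

```python
import numpy as np

def action_norm_cost(action):
    """Per-timestep cost: the negative norm of the action vector
    (assuming an L2 norm), so smaller actions give higher cost reward."""
    return -np.linalg.norm(action)

def satisfies_threshold(actions, threshold):
    """True if the mean per-timestep cost of a rollout meets the
    constraint threshold, e.g., -2 (easy) or -0.9 (hard) below."""
    return np.mean([action_norm_cost(a) for a in actions]) >= threshold
```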

Each point in this plot corresponds to a policy trained for a different constraint threshold. For both axes, higher values are better; i.e., policies above and to the right have better performance.

For an easier per-timestep constraint threshold of -2 (i.e., -2000 over a 1000-step episode), both LP3 [MO-MPO] and LP3 [LS] find similarly good policies in terms of task reward and cost:

[Videos: rollouts of the policies learned by LP3 [MO-MPO] and LP3 [LS] for threshold -2]

But for a more challenging constraint threshold of -0.9 (i.e., -900 per episode, indicated by the red line in the plot), only LP3 [MO-MPO] finds a constraint-satisfying policy that can still walk:

[Videos: rollouts of the policies learned by LP3 [MO-MPO] and LP3 [LS] for threshold -0.9]

This could be because the ground-truth Pareto front has concave portions. Lagrangian relaxation relies on linear scalarization, which is fundamentally unable to find solutions on concave portions of a Pareto front: a weighted sum of objectives is always maximized on the front's convex hull.
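A small numerical illustration of this limitation (with made-up reward/cost pairs, not data from the paper): no non-negative weighting of the two objectives ever selects a point lying in a concave region of the front.

```python
import numpy as np

# Made-up Pareto-optimal (task reward, cost reward) pairs; both axes are
# "higher is better", as in the plots above. The two middle points lie
# below the chord joining the endpoints, i.e., on a concave portion.
policies = np.array([
    [0.0,  0.0],   # satisfies tight constraints, but does not walk
    [0.3, -0.7],   # concave region
    [0.5, -1.2],   # concave region
    [1.0, -2.0],   # walks well, but costly
])

# Sweep all linear scalarizations w * reward + (1 - w) * cost, w in [0, 1],
# and record which policy each weighting selects.
selected = set()
for w in np.linspace(0.0, 1.0, 1001):
    weights = np.array([w, 1.0 - w])
    selected.add(int(np.argmax(policies @ weights)))

# Only the endpoints {0, 3} are ever selected; the concave-region
# policies (indices 1 and 2) are unreachable by linear scalarization.
print(sorted(selected))
```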

In point goal, the task reward is given for reaching the goal and the constraint is on the total per-episode cost, accumulated from running into obstacles.
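As a hypothetical illustration of this per-episode constraint, the sketch below accumulates a fixed penalty per obstacle contact; the -1 penalty and the helper name are assumptions for illustration, not the environment's actual cost function:

```python
def episode_cost(contacts, contact_penalty=-1.0):
    """Total per-episode cost from obstacle contacts.

    contacts: one boolean per timestep, True if the agent touched an
    obstacle at that step. The -1 penalty per contact is an assumption.
    """
    return sum(contact_penalty for hit in contacts if hit)

# A rollout with 5 obstacle contacts has cost -5: it satisfies a
# threshold of -7 (-5 >= -7) but violates a threshold of -3 (-5 < -3).
rollout = [False] * 95 + [True] * 5
cost = episode_cost(rollout)
print(cost, cost >= -7, cost >= -3)  # -5.0 True False
```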

Even though this Pareto front is not concave, the Lagrangian relaxation baseline still struggles to find optimal policies for constraint thresholds that are difficult to satisfy (as shown in the plot); thresholds between 0 and -8 are considered difficult.

Below are rollouts of the policies that LP3 [MO-MPO-D] finds for different thresholds (the thresholds of -7 and -3 count as difficult).

[Video] threshold = -15: the agent takes the shortest path to the goal, often running into obstacles

[Video] threshold = -7: the agent avoids obstacles, but still takes efficient paths to the goal

[Video] threshold = -3: the agent is extra cautious, taking long detours around obstacles