Soft Constraints

Constraints should be treated as soft constraints [1] when:

  • violating the constraint is undesirable but not catastrophic, or

  • satisfying the constraint at all times is infeasible.
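In the standard constrained-RL formulation (the notation here is our own shorthand, not taken from the paper), a soft constraint replaces the requirement of zero violations with a tolerated violation rate. Writing J_R(π) for the expected task reward of policy π and V(π) for the fraction of timesteps on which π violates the constraint, the problem is

  \max_{\pi} \; J_R(\pi) \quad \text{subject to} \quad V(\pi) \le x,

and sweeping the tolerance x from 0 to 1 traces out the Pareto front over (J_R(π), −V(π)), i.e., the set of policies described next.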

LP3 [MO-MPO-D] can solve problems with soft constraints by finding a set of Pareto-optimal policies that violate the constraints at most X% of the time. This set will contain:

  • policies that achieve higher task reward by violating the constraint occasionally, and

  • policies with zero constraint violation, if the constraint is feasible.

[1] Calian et al. Balancing Constraints and Rewards with Meta-Gradient D4PG. ICLR 2021.
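To make "a set of Pareto-optimal policies" concrete, here is a minimal sketch (all names and numbers are hypothetical, not from the LP3 codebase) that filters a batch of evaluated policies, each summarized by its task reward and constraint-violation rate, down to the non-dominated ones:

  from typing import List, Tuple

  def pareto_front(policies: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
      # Each policy is summarized as (task_reward, violation_rate).
      # A policy is dominated if another one has reward >= and violation <=,
      # with at least one of the two strictly better.
      front = []
      for i, (r_i, v_i) in enumerate(policies):
          dominated = any(
              r_j >= r_i and v_j <= v_i and (r_j > r_i or v_j < v_i)
              for j, (r_j, v_j) in enumerate(policies)
              if j != i
          )
          if not dominated:
              front.append((r_i, v_i))
      return front

  # Hypothetical evaluation results.
  evaluated = [(850.0, 0.00), (700.0, 0.00), (900.0, 0.05), (950.0, 0.20)]
  print(pareto_front(evaluated))  # [(850.0, 0.0), (900.0, 0.05), (950.0, 0.2)]

The surviving set matches the two bullet points above: (950.0, 0.20) trades occasional violations for higher task reward, while (850.0, 0.00) satisfies the constraint outright.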

LP3 [MO-MPO-D] outperforms the state-of-the-art approach, MetaL [1], on tasks with soft constraints.

Across all four tasks, LP3 [MO-MPO-D] policies obtain the highest task reward and lowest cost.

When it is possible to meet the constraint (i.e., in quadruped walk), LP3 [MO-MPO-D] is the only algorithm that finds constraint-satisfying policies.

Below are examples of policies found by LP3 [MO-MPO-D].

Cartpole balance: constraint on angular velocity of pole when it is near the top

LP3 [MO-MPO-D] policies solve the task near-perfectly, with less constraint violation than any baseline.
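One plausible way to encode such a constraint as a per-step cost signal (a sketch with hypothetical thresholds; the paper's exact definition may differ):

  # Hypothetical thresholds, chosen only for illustration.
  UPRIGHT_ANGLE = 0.2  # rad: how close to vertical counts as "near the top"
  MAX_ANG_VEL = 1.0    # rad/s: tolerated angular velocity near the top

  def constraint_cost(pole_angle: float, pole_ang_vel: float) -> float:
      # Cost of 1.0 on any step where the pole is near the top but
      # spinning faster than the tolerated angular velocity.
      near_top = abs(pole_angle) < UPRIGHT_ANGLE
      too_fast = abs(pole_ang_vel) > MAX_ANG_VEL
      return 1.0 if (near_top and too_fast) else 0.0

The violation rate V(π) is then this cost averaged over an episode. The joint-velocity (walker) and joint-angle (quadruped, humanoid) constraints below can be encoded the same way: a cost of 1 on any step where some joint exceeds its threshold.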

Walker walk: joint velocity constraint

LP3 [MO-MPO-D] policies discover different walking styles, with minimal constraint violation.

Videos are from different random seeds.

Quadruped walk: joint angle constraint

LP3 [MO-MPO-D] policies discover shuffling styles with zero constraint violation. None of the baselines is able to find constraint-satisfying policies.

Videos are from different random seeds.

Humanoid walk: joint angle constraint

LP3 [MO-MPO-D] policies discover walking styles that keep all joint angles near zero. Their constraint violation is lower than that of all baselines.

Videos are from different random seeds.