Soft Constraints
Constraints should be treated as soft constraints [1] when:
violating the constraint is undesirable but not catastrophic, or
satisfying the constraint at all times is infeasible.
LP3 [MO-MPO-D] can solve problems with soft constraints by finding a set of Pareto-optimal policies that violate the constraints at most X% of the time. This set will contain:
policies that achieve higher task reward by violating the constraint occasionally, and
policies with zero constraint violation, if the constraint is feasible.
[1] Calian et al. Balancing Constraints and Rewards with Meta-Gradient D4PG. ICLR 2021.
LP3 [MO-MPO-D] outperforms the state-of-the-art approach (MetaL) on tasks with soft constraints.
Across all four tasks, LP3 [MO-MPO-D] policies obtain the highest task reward and lowest cost.
When it is possible to meet the constraint (i.e., for the quadruped task), LP3 [MO-MPO-D] is the only algorithm that finds constraint-satisfying policies.
Below are examples of policies found by LP3 [MO-MPO-D].
Cartpole balance: constraint on angular velocity of pole when it is near the top
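A constraint of this form can be expressed as a per-step cost that is incurred only in the relevant region of the state space. The sketch below is a hypothetical formulation (the function name, thresholds, and units are assumptions, not the task's actual definition): cost is incurred whenever the pole is near upright and spinning faster than a limit.

```python
# Hypothetical per-step cost for the cartpole constraint: penalize high
# angular velocity only when the pole is near the top. Thresholds here
# are illustrative, not the task's actual values.
def step_cost(pole_angle, angular_velocity,
              upright_threshold=0.1, velocity_limit=1.0):
    near_top = abs(pole_angle) < upright_threshold  # radians from vertical
    return 1.0 if near_top and abs(angular_velocity) > velocity_limit else 0.0
```

Averaging this cost over an episode gives the violation rate that the soft constraint bounds.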
LP3 [MO-MPO-D] policies solve the task near-perfectly, with less constraint violation than any baseline.
Walker walk: joint velocity constraint
LP3 [MO-MPO-D] policies discover different walking styles, with minimal constraint violation.
Videos are from different random seeds.
Quadruped walk: joint angle constraint
LP3 [MO-MPO-D] policies discover shuffling styles with zero constraint violation. No baseline finds constraint-satisfying policies.
Videos are from different random seeds.
Humanoid walk: joint angle constraint
LP3 [MO-MPO-D] policies discover walking styles that keep all joint angles near zero, with less constraint violation than any baseline.
Videos are from different random seeds.