Set of Policies

Often an RL practitioner does not have an exact constraint threshold in mind, but instead would like to find all "good" policies with costs within a certain range.

LP3 [MO-MPO-D] solves this by finding a set of constraint-satisfying policies that are Pareto optimal.
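To make "a set of constraint-satisfying policies that are Pareto optimal" concrete, here is a minimal sketch, not the LP3 [MO-MPO-D] algorithm itself: given a batch of already-evaluated policies, keep the ones that satisfy every constraint and are not Pareto-dominated. It assumes, for simplicity, that every objective is framed so that higher is better and that each constraint is a per-objective lower bound.

```python
import numpy as np

def constrained_pareto_front(returns, thresholds):
    """Indices of policies that satisfy all constraints and are not Pareto-dominated.

    returns:    array of shape (num_policies, num_objectives); higher is better for every objective.
    thresholds: per-objective lower bounds; use -np.inf for unconstrained objectives.
    """
    returns = np.asarray(returns, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # Keep only the policies that satisfy every constraint.
    feasible = np.flatnonzero(np.all(returns >= thresholds, axis=1))
    front = []
    for i in feasible:
        dominated = any(
            np.all(returns[j] >= returns[i]) and np.any(returns[j] > returns[i])
            for j in feasible
            if j != i
        )
        if not dominated:
            front.append(int(i))
    return front

# Example: three policies, two objectives (reward, negated cost), second objective constrained.
print(constrained_pareto_front(
    [[900.0, -2.5], [950.0, -4.0], [800.0, -1.0]],
    thresholds=[-np.inf, -3.0]))  # -> [0, 2]: the middle policy violates the constraint
```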

Scroll down to the bottom to see results for humanoid walk with four objectives, two of which have constraints.

Humanoid run

We used LP3 [MO-MPO-D] to find a set of policies for humanoid run, with a per-timestep action norm cost of less than -3. Below are performance plots and videos for three separate training runs with different random initializations. Each plot shows the performance of a single trained policy, conditioned on different quantiles of the learned preference distribution, and each pair of videos is from that same policy.
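As a rough illustration of what conditioning on a quantile of the learned preference distribution looks like at evaluation time, here is a hedged sketch; `policy`, `preference_distribution`, and `environment` (a dm_env-style environment with a vector-valued reward, one entry per objective) are placeholder interfaces, not the actual LP3 [MO-MPO-D] code.

```python
import numpy as np

def evaluate_at_quantile(policy, preference_distribution, environment,
                         quantile, num_episodes=10):
    """Roll out one trained policy conditioned on a single preference quantile.

    Placeholder interfaces (assumptions, not the real API):
      preference_distribution.quantile(q) -> preference vector for the policy to condition on
      policy.act(observation, preference) -> action
      environment: dm_env-style, with a vector-valued reward (one entry per objective)
    """
    preference = preference_distribution.quantile(quantile)
    per_objective_returns = []
    for _ in range(num_episodes):
        timestep = environment.reset()
        episode_return = 0.0
        while not timestep.last():
            action = policy.act(timestep.observation, preference)
            timestep = environment.step(action)
            episode_return = episode_return + np.asarray(timestep.reward)
        per_objective_returns.append(episode_return)
    # Average per-objective return at this quantile: one point on a performance plot.
    return np.mean(per_objective_returns, axis=0)

# e.g. evaluate_at_quantile(policy, preference_distribution, environment, quantile=0.05)
# vs.  evaluate_at_quantile(policy, preference_distribution, environment, quantile=0.75)
```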

[Videos, training run 1: quantile = 0.05 (fast; flails arms around) vs. quantile = 0.75 (same running style, but slower & more controlled).]

[Videos, training runs 2 and 3: quantile = 0.05 vs. quantile = 0.75 for each run.]

Humanoid walk, with four objectives

We used LP3 [MO-MPO-D] to find a set of policies for humanoid walk, with an action norm cost of less than -1.5 per timestep and a move-left reward of greater than 400 per episode. The other two objectives are unconstrained: the original task reward (for walking in any direction) and a move-forward reward.
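Written out as a rough configuration sketch, the setup looks roughly as follows; the objective names and the constraint encoding are assumptions, and only the thresholds come from the description above.

```python
# Illustrative sketch of the four-objective humanoid walk setup described above.
# Names and structure are assumptions; the thresholds match the text.
humanoid_walk_objectives = {
    "task_reward":      {"constraint": None},  # original walk reward (any direction), unconstrained
    "move_forward":     {"constraint": None},  # unconstrained
    "action_norm_cost": {"constraint": {"per": "timestep", "threshold": -1.5}},  # cost < -1.5 per timestep
    "move_left":        {"constraint": {"per": "episode",  "threshold": 400.0}},  # reward > 400 per episode
}
```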

Below are performance plots and videos. Each plot shows the performance of a single trained policy, conditioned on different quantiles of the learned preference distribution, and each pair of videos is from that same policy. Unlike humanoid run above, in this task the solutions look quite similar across random seeds, so we show videos for only one policy.

As we increase the preference for the move-left objective (while keeping the preferences for all other objectives the same), the agent smoothly interpolates from walking forwards, to walking diagonally, to walking sideways to the left. This is shown in the videos below, from left to right. Note that the direction of movement is based on the torso velocity within the agent's egocentric frame.
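The sweep shown in the videos below can be sketched roughly as follows; the objective ordering, the fixed quantile value for the unconstrained objectives, and the `preference_distribution.quantile` call are assumptions, not the actual implementation.

```python
import numpy as np

# Assumed objective order: (task reward, move-forward, action norm cost, move-left).
MOVE_LEFT_QUANTILES = [0.2, 0.35, 0.9]  # swept, as in the videos below
COST_QUANTILE = 0.5                     # held fixed, matching the captions below
OTHER_QUANTILE = 0.5                    # unconstrained objectives held fixed; this exact value is an assumption

def quantile_vector(move_left_q):
    """Per-objective quantiles, varying only the move-left entry."""
    return np.array([OTHER_QUANTILE, OTHER_QUANTILE, COST_QUANTILE, move_left_q])

# for q in MOVE_LEFT_QUANTILES:
#     preference = preference_distribution.quantile(quantile_vector(q))  # hypothetical API
#     rollout(policy, preference)  # walking direction shifts from forward toward left
```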

[Videos, left to right: move-left quantile = 0.2, 0.35, and 0.9, each with cost quantile = 0.5.]