Sample Efficiency

In robot push, the task reward is given for pushing the green block to the target position (in the far left corner, from the robot's perspective), and the constraint is on the total per-episode cost, which accumulates whenever either the gripper or the green block is close to the yellow block.
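To make the constraint concrete, here is a minimal sketch of how such a per-episode cost could be computed. The proximity radius, the per-step penalty of -1, and the function names are assumptions for illustration, not the environment's actual implementation.

```python
import numpy as np

# Hypothetical sketch of the per-step cost described above (not the actual
# environment code). We assume a penalty is incurred whenever the gripper
# or the green block is within some proximity radius of the yellow block.
PROXIMITY_RADIUS = 0.05  # assumed value

def step_cost(gripper_pos, green_pos, yellow_pos, radius=PROXIMITY_RADIUS):
    """Return -1 if either the gripper or the green block is near the yellow block."""
    near_gripper = np.linalg.norm(gripper_pos - yellow_pos) < radius
    near_green = np.linalg.norm(green_pos - yellow_pos) < radius
    return -1.0 if (near_gripper or near_green) else 0.0

def episode_cost(trajectory):
    """Total per-episode cost, summed over a trajectory of
    (gripper_pos, green_pos, yellow_pos) tuples."""
    return sum(step_cost(g, b, y) for g, b, y in trajectory)
```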

On this task, policies trained with LP3 [MO-MPO-D] are quicker at both improving task performance and reducing the cost, compared to policies trained with the Lagrangian-based baseline LP3 [LS]. The plot shows the learning curves, with standard error across five seeds. Notice that LP3 [MO-MPO-D] learns more quickly and has consistent learning performance across seeds.

After 100 million actor steps of training, we can qualitatively see differences between the trained policies, deployed in the same evaluation environments. The two policies shown below were trained for a constraint threshold of -10 (expected cost per episode). The policy trained with LP3 [MO-MPO-D] is able to succeed at the task, even in challenging scenarios (i.e., when the yellow block is blocking the path to the goal). It has learned how to carefully maneuver the green block around the yellow block to reach the target.

LP3 [LS]: cannot push the green block to the target in the more challenging initial setups (e.g., the second and third).

LP3 [MO-MPO-D]: pushes the green block to the target with minimal cost.
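As a rough illustration of the constraint threshold of -10 mentioned above, the sketch below checks whether a policy's expected per-episode cost satisfies the threshold. It assumes, as the negative threshold suggests, that costs are penalties and the constraint requires the expected per-episode cost to stay above the threshold; `run_episode` and the number of evaluation episodes are hypothetical placeholders.

```python
import numpy as np

COST_THRESHOLD = -10.0  # assumed: expected per-episode cost must stay above this

def satisfies_constraint(policy, env, run_episode, num_episodes=100):
    """Estimate the expected per-episode cost of `policy` by Monte Carlo rollouts
    and compare it against the threshold. `run_episode` is a hypothetical helper
    that returns the accumulated cost of a single episode."""
    costs = [run_episode(policy, env) for _ in range(num_episodes)]
    return np.mean(costs) >= COST_THRESHOLD
```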