Success at any cost: value constrained model-free continuous control

Abstract

The naive application of Reinforcement Learning (RL) algorithms to continuous control problems -- such as locomotion and manipulation -- often results in policies which rely on high-amplitude, high-frequency control signals, known colloquially as bang-bang control. Although such solutions may indeed maximize task reward, they can be unsuitable for real-world systems, where bang-bang control may lead to increased wear and tear or energy consumption and tends to excite undesired second-order dynamics. To counteract this issue, multi-objective optimization can be used to simultaneously optimize both the reward and some auxiliary cost that discourages undesired (e.g. high-amplitude) control. In principle, such an approach can yield the sought-after smooth control policies. It can, however, be hard to find the correct trade-off between cost and return that results in the desired behavior.

In this paper we propose a new constraint-based approach which defines a lower bound on the return while minimizing one or more costs (such as control effort). We employ Lagrangian relaxation to learn both (a) the parameters of a control policy that satisfies the desired constraints and (b) the Lagrangian multipliers for the optimization. Moreover, we demonstrate policy optimization which satisfies constraints either in expectation or in a per-step fashion, and we learn a single conditional policy that is able to dynamically trade off between return and cost. We demonstrate the efficiency of our approach on a number of continuous control benchmark tasks, a realistic, energy-optimized quadruped locomotion task, and reaching tasks on a real robot arm.
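To make the setup concrete, the kind of constrained problem described in the abstract can be written, in generic notation, as minimizing an expected cost subject to a lower bound on the expected return. The sketch below is only illustrative: the symbols (cost c, reward r, return bound R_min, multiplier \lambda) are generic placeholders and the paper's exact formulation may differ.

\min_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_t \gamma^t c(s_t, a_t)\right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[\sum_t \gamma^t r(s_t, a_t)\right] \ge R_{\min}

Its Lagrangian relaxation,

\mathcal{L}(\pi, \lambda) = \mathbb{E}_{\pi}\!\left[\sum_t \gamma^t c(s_t, a_t)\right] - \lambda \left( \mathbb{E}_{\pi}\!\left[\sum_t \gamma^t r(s_t, a_t)\right] - R_{\min} \right), \qquad \lambda \ge 0,

can then be optimized by alternating updates: the policy parameters are trained to minimize \mathcal{L} (equivalently, to maximize the effective reward \lambda r - c), while \lambda is increased whenever the return falls below R_{\min} and decreased (down to zero) once the constraint is satisfied.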

Minitaur locomotion

The videos below show policies trained on the locomotion task for the Minitaur quadruped robot. The robot is asked to move forward (left to right in the videos) across bumpy terrain as efficiently as possible. The plots in the top right show the lower bound on the velocity that the policy is trying to achieve (Target) as well as the actual instantaneous velocity (Current). The green rods extending from the Minitaur's main body indicate the magnitude of random external perturbations.

minitaur_fixed_0.1.mp4

v = 0.1 m/s

minitaur_fixed_0.3.mp4

v = 0.3 m/s

minitaur_fixed_0.5.mp4

v = 0.5 m/s

These policies are trained to minimize electrical energy consumption subject to a (fixed) lower bound on the velocity, as listed below each video. Each policy tracks its lower bound closely, and some diversity in gaits can be seen across velocities.
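As an illustration of the underlying mechanism, the Python sketch below shows a single Lagrange-multiplier update for such a velocity lower bound, in the spirit of the Lagrangian relaxation described in the abstract. The function name, step size, and penalized reward are hypothetical and not taken from the paper.

def update_multiplier(lmbda, mean_velocity, velocity_bound, step_size=1e-3):
    """One dual gradient step on the constraint E[velocity] >= velocity_bound.

    The multiplier grows when the measured velocity falls below the bound
    and shrinks (down to zero) once the constraint is satisfied.
    """
    lmbda = lmbda - step_size * (mean_velocity - velocity_bound)
    return max(0.0, lmbda)  # project back onto lambda >= 0

# The policy itself is then trained on a penalized reward of roughly the form
#   r_t = -energy_t + lmbda * velocity_t,
# so a larger multiplier puts more weight on meeting the velocity bound.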

minitaur_varying.mp4

The policy in the above video has been trained across a range of lower bounds on the velocity, which it observes as part of its input. This allows us to dynamically change the target velocity during the episode: as we increase or decrease the lower bound, the speed of the Minitaur follows.
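A minimal sketch of what such a bound-conditioned rollout might look like is given below, assuming a Gym-style environment interface; policy, env, and bounds_schedule are hypothetical placeholders rather than the actual Minitaur training code.

import numpy as np

def augment_observation(obs, velocity_bound):
    """Append the current velocity lower bound to the raw observation."""
    return np.concatenate([obs, [velocity_bound]])

def rollout_with_varying_bound(env, policy, bounds_schedule):
    """Run one episode while changing the velocity lower bound on the fly.

    bounds_schedule(t) returns the lower bound to use at timestep t,
    e.g. stepping from 0.1 m/s up to 0.5 m/s and back down again.
    """
    obs = env.reset()
    done, t = False, 0
    while not done:
        v_min = bounds_schedule(t)                      # current target bound
        action = policy(augment_observation(obs, v_min))
        obs, reward, done, info = env.step(action)      # Gym-style step
        t += 1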

Sawyer reaching

We train a policy to perform a reaching task with a visibility constraint: the Sawyer robot arm has to move the cube in its gripper to a virtual target location within its workspace, while keeping the AR tags on the cube visible to the camera so that the cube can be tracked.

sawyer_occlusion.mp4