Julien Roy, Roger Girgis, Joshua Romoff, Pierre-Luc Bacon, Christopher Pal
Polytechnique Montréal, Université de Montréal, Ubisoft La Forge, Mila
Presented at: ICML 2022
Paper: https://proceedings.mlr.press/v162/roy22a.html
Code: https://github.com/ubisoft/DirectBehaviorSpecification
The standard formulation of Reinforcement Learning lacks a practical way of specifying which behaviors are admissible and which are forbidden. Most often, practitioners go about the task of behavior specification by manually engineering the reward function, a counter-intuitive process that requires several iterations and is prone to reward hacking by the agent. In this work, we argue that constrained RL, which has almost exclusively been used for safe RL, also has the potential to significantly reduce the amount of work spent on reward specification in applied RL projects. To this end, we propose to specify behavioral preferences in the CMDP framework and to use Lagrangian methods to automatically weigh each of these behavioral constraints. Specifically, we investigate how CMDPs can be adapted to solve goal-based tasks while adhering to several constraints simultaneously. We evaluate this framework on a set of continuous control tasks relevant to the application of Reinforcement Learning for NPC design in video games.
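For reference, the constrained-MDP formulation referred to throughout is the standard one, written here in generic textbook notation rather than copied from the paper: J_R denotes the expected return, J_{C_k} the expected cost under the k-th behavioral constraint, and d_k its tolerated level.

```latex
% Generic CMDP with K behavioral constraints (standard notation):
%   maximize expected return subject to bounded expected constraint costs
\max_{\pi}\; J_R(\pi)
\quad \text{s.t.} \quad J_{C_k}(\pi) \le d_k, \qquad k = 1, \dots, K

% Lagrangian relaxation, with one multiplier per constraint, typically solved
% by alternating updates on the policy and on the multipliers:
\min_{\lambda \ge 0}\; \max_{\pi}\; J_R(\pi) - \sum_{k=1}^{K} \lambda_k \big( J_{C_k}(\pi) - d_k \big)
```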
We explore how CMDPs can be used to directly shape behavior in a first set of 3D environments consisting of an enclosed "arena" in which the agent learns to navigate from a starting point (gray tile) to a goal (green tile) while being subject to a set of constraints. The locations of the start and goal tiles, look-at marker and lava ponds are randomized at every episode.
In particular, we are interested in how the number of constraints contributes to making this task increasingly challenging.
The ArenaEnv can contain up to 5 different constraints (a sketch of how they translate into indicator costs follows the list):
Battery constraint: the agent is equipped with an energy bar. The energy level goes down at every time-step. The agent has a "recharge" action which immobilizes it in place while recharging its battery. The constraint is that the agent shouldn't find itself with an empty battery more than 1% of the time.
Look-at constraint: the agent is equipped with a field-of-view and the environment contains a look-at marker which must be maintained inside its field-of-view. The constraint is that the marker should not be outside of its field-of-view more than 10% of the time.
Lava constraint: the arena is filled with small procedurally generated lava ponds. The constraint is that the agent should not find itself in lava more than 1% of the time.
Jump constraint: the constraint is that the agent cannot jump more than 40% of the time.
Speed constraint: the constraint is that the agent cannot find itself above the speed limit (set to 75% of its maximum speed) more than 1% of the time.
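To make the thresholds above concrete, here is a minimal sketch, not taken from the released code, of how each ArenaEnv constraint could be written as a per-step indicator cost (1 if the undesired event occurs at that step, 0 otherwise) paired with the tolerated frequency of that event; the state fields and constraint names are hypothetical.

```python
# Illustrative sketch only: the `state` fields and constraint names are
# hypothetical placeholders, not the environment's actual API.

ARENA_CONSTRAINTS = {
    # name:    (per-step indicator cost,                            tolerated frequency)
    "battery": (lambda s: float(s["energy"] <= 0.0),                  0.01),  # empty battery <= 1% of steps
    "look_at": (lambda s: float(not s["marker_in_fov"]),              0.10),  # marker out of view <= 10%
    "lava":    (lambda s: float(s["in_lava"]),                        0.01),  # standing in lava <= 1%
    "jump":    (lambda s: float(s["jumped"]),                         0.40),  # jump action <= 40% of steps
    "speed":   (lambda s: float(s["speed"] > 0.75 * s["max_speed"]),  0.01),  # above the speed limit <= 1%
}

def step_costs(state):
    """Vector of indicator costs for the current time-step."""
    return {name: cost_fn(state) for name, (cost_fn, _) in ARENA_CONSTRAINTS.items()}
```

Each threshold encodes the plain-language requirement ("no more than X% of the time") directly, which is what makes this specification style more intuitive than tuning reward weights.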
In this first set of experiments, our SAC-Lagrangian agents are trained to solve this navigation task while being subject to only one of the constraints illustrated above. After 3M steps of training, the agent is generally able to solve the task while respecting any of these constraints individually.
Look-at constraint
Jump constraint
Lava constraint
Speed constraint
Battery constraint
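Mechanically, a Lagrangian wrapper of this kind maintains one multiplier per constraint, updated by gradient ascent on the measured constraint violation, while the policy is trained on the reward minus the multiplier-weighted costs. The snippet below is a simplified sketch of that generic scheme; the softplus parameterization, learning rate and update cadence are assumptions rather than the paper's exact recipe (the paper further normalizes the multipliers, which this sketch omits).

```python
import torch

class LagrangeMultipliers:
    """Simplified per-constraint multipliers, kept non-negative via softplus."""

    def __init__(self, num_constraints, lr=1e-3):
        # Unconstrained parameters; softplus maps them to lambda_k >= 0.
        self.params = torch.zeros(num_constraints, requires_grad=True)
        self.opt = torch.optim.Adam([self.params], lr=lr)

    def lambdas(self):
        return torch.nn.functional.softplus(self.params)

    def update(self, avg_costs, thresholds):
        """Gradient ascent on lambda_k * (J_Ck - d_k): a multiplier grows while
        its constraint is violated and shrinks once the constraint is satisfied."""
        avg_costs = torch.as_tensor(avg_costs, dtype=torch.float32)
        thresholds = torch.as_tensor(thresholds, dtype=torch.float32)
        # Negate because the optimizer minimizes; costs are treated as fixed here
        # (the policy itself is updated elsewhere, e.g. by SAC).
        loss = -(self.lambdas() * (avg_costs - thresholds)).sum()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()

def penalized_reward(reward, costs, lambdas):
    """Reward signal handed to the actor-critic update: task reward minus the
    multiplier-weighted indicator costs of the current step."""
    return reward - sum(lam.item() * c for lam, c in zip(lambdas, costs))
```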
Here we demonstrate that when training the agent to respect all of the constraints simultaneously, the task becomes much more challenging.
When training a SAC agent on that task with the main reward function only, the problem is simple: the agent rushes to its goal, but in doing so it also runs out of energy, ignores the look-at marker, walks into lava, jumps incessantly and moves at maximum speed. All of the behavioral requirements are unknown to the agent and, as these violations show, none of them are trivially satisfied in this environment.
By adding a Lagrangian Wrapper around the SAC algorithm, the agent now tries to satisfy all the constraints simultaneously. Unfortunately, this is a much more challenging problem, and here the agent learns a trivial policy that respects the five constraints but, in most episodes, simply stays immobile to avoid any constraint violation.
Trying to enforce the 5 constraints without the bootstrap constraint.
This is because the exploration problem has become much more difficult, as many constraints work against one another. For example, the agent needs to jump to get over lava, but is prevented from jumping too often by another constraint. Learning to respect the jumping constraint thus significantly slows down the discovery that one can jump over lava to progress toward the goal while respecting the lava constraint. Another source of difficulty is that, due to the complex learning dynamics of modern deep RL methods and the phenomenon of (catastrophic) forgetting of past behaviors, the agent cycles its attention over each of the constraints without being able to maintain a feasible policy and explore the feasible space for long enough to actually start making progress on the main task (that of navigating towards the goal).
By using a bootstrap constraint, the agent is able to start making progress on the task while learning to respect the other constraints, resulting in the desired behavior.
Because the constraints were intuitive to define using indicator-cost-constraints, the entire process of shaping the agent's behavior can be executed in just a few tries rather than requiring endless tuning of the reward function!
Enforcing the 5 constraints with an additional bootstrap constraint.
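The exact form of the bootstrap constraint is given in the paper and released code; as a rough, hedged illustration of the idea described above, one possible reading is to treat progress on the main task itself as one more indicator-cost constraint, so that its multiplier grows whenever the agent stops reaching the goal and pulls the policy away from the stay-immobile solution. The threshold and field name below are made-up placeholders.

```python
# Hedged illustration, not the paper's exact formulation.

BOOTSTRAP_TOLERANCE = 0.90  # placeholder: tolerate failing to reach the goal in at most 90% of episodes

def bootstrap_cost(episode_info):
    """Per-episode indicator cost: 1 if the goal was not reached, 0 otherwise."""
    return float(not episode_info["goal_reached"])

# Handled by the same Lagrangian machinery as the behavioral constraints: when
# the failure rate exceeds the tolerance, the associated multiplier increases,
# which keeps weight on task progress even while the other constraints are
# being learned.
```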
In the second experimental setup, we verify the scalability of this approach. To do so, we experiment with enforcing some of the constraints demonstrated in the ArenaEnv on a much larger 3D environment which we call the OpenWorld.
In this task, the agent still tries to navigate to a goal. However, the map now also contains buildings, hills, plateaus and jump-pads.
Our experiments on the OpenWorld show that our proposed approach scales well to larger, more complicated environments. In these experiments, the agent is tasked with navigating towards the goal while satisfying four behavioral constraints (transcribed as thresholds after the list):
Looking at the marker at least 90% of the time.
Not using the jump action more than 40% of the time.
Not stepping in lava more than 0.1% of the time.
Being above the minimum energy level 99% of the time.
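Expressed as tolerated violation frequencies in the same style as the earlier ArenaEnv sketch, this OpenWorld constraint set reads roughly as follows (a paraphrase of the list above, not the repository's actual configuration format):

```python
# Maximum tolerated frequency of each undesired event, transcribed from the list above.
OPENWORLD_CONSTRAINTS = {
    "look_at": 0.10,   # marker outside the field of view at most 10% of the time
    "jump":    0.40,   # jump action used at most 40% of the time
    "lava":    0.001,  # standing in lava at most 0.1% of the time
    "battery": 0.01,   # below the minimum energy level at most 1% of the time
}
```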
We can see qualitatively (right) and quantitatively (below) that the resulting agent performs the task adequately. This experiment also used the bootstrap constraint as in the final Arena experiment.
In this work, we argue for the use of Lagrangian methods for behavior specification. This approach allows the RL practitioner to quickly specify behavior using indicator functions instead of having to perform a vast hyper-parameter search over the weights of reward components. We evaluate this framework on the many-constraints case in two different environments. Our experiments show that simultaneously satisfying a large number of constraints is difficult and can perpetually prevent the agent from improving on the main task. We first propose to normalize the constraint multipliers, which results in improved stability during training, and then suggest to bootstrap the learning on the main objective to avoid getting trapped by the full constraint set. Our overall method is easy to implement on top of any existing policy gradient system and scales across domains with minimal effort from the RL practitioner. Moreover, since the CMDP framework naturally reduces to a regular MDP when no constraints are specified, it can be used as a single, by-default method for both constrained and unconstrained problems. We hope that these insights can contribute to a wider use of Constrained RL methods in industrial application projects, and that such adoption can be mutually beneficial to the industrial and research RL communities.