Projection-Based Constrained Policy Optimization (PCPO)
Tsung-Yen Yang*, Justinian Rosca**, Karthik Narasimhan*, Peter J. Ramadge*
*Princeton University, **Siemens Corporation, Corporate Technology
Introduction
Many autonomous systems, such as self-driving cars and industrial robots, are complex. To deal with this complexity, researchers are increasingly using reinforcement learning (RL) to design control policies. However, one issue limits the widespread deployment of RL in real systems: how can autonomous systems learn robustly, throughout their exploration of the environment, without violating constraints that encode safety and other costs?
Challenges. Learning constraint-satisfying policies is challenging because the policy optimization landscape is no longer smooth. Further, the constraints often conflict with the best direction for reward-maximizing policy updates. We therefore need an algorithm that can make progress on policy improvement without being shackled by the constraints and getting stuck in local minima. A further challenge is that if we do end up with an infeasible (i.e., constraint-violating) policy, we need an efficient way to recover to a constraint-satisfying policy.
Contributions. To this end, we develop PCPO, a trust region method that performs a policy update toward reward improvement, followed by a projection onto the constraint set. Formally, PCPO, inspired by projected gradient descent, is composed of two steps for each policy update: a reward improvement step and a projection step. We provide theoretical guarantees on the performance change of each policy update, and demonstrate that our method achieves state-of-the-art performance in terms of constraint violation and reward improvement on several benchmarks.
We also provide the code for reproducibility.
PCPO update
PCPO is a two-step approach. In the first step, PCPO follows the reward improvement direction within the trust region. In the second step, PCPO projects the policy onto the constraint set.
This approach gives the agent freedom to improve the reward while remaining constraint-satisfying, unlike prior methods that use line search to slowly enforce constraint satisfaction, as sketched below.
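Concretely, when the objective and constraint are linearized and the KL divergence is approximated with the Fisher information matrix, both steps admit closed-form solutions. Below is a minimal numpy sketch of one PCPO update with a KL-divergence projection; the function name pcpo_update and the variables g, a, b, H, and delta are our notation for illustration, not the paper's released code.

    import numpy as np

    def pcpo_update(theta, g, a, b, H, delta):
        # One PCPO update (illustrative sketch, our notation):
        #   theta : current policy parameters (1-D array)
        #   g     : gradient of the reward objective at theta
        #   a     : gradient of the constraint function at theta
        #   b     : current constraint value minus its limit (positive = violated)
        #   H     : Fisher information matrix approximating the KL divergence
        #   delta : trust region size for the reward improvement step
        H_inv = np.linalg.inv(H)

        # Step 1: reward improvement -- maximize the linearized reward
        # inside the KL trust region (closed-form natural gradient step).
        step = np.sqrt(2.0 * delta / (g @ H_inv @ g)) * (H_inv @ g)
        theta_mid = theta + step

        # Step 2: projection -- if the intermediate policy violates the
        # linearized constraint, project it back onto the constraint set,
        # here measured in the KL (Fisher) metric.
        violation = a @ (theta_mid - theta) + b
        if violation > 0:
            theta_mid = theta_mid - (violation / (a @ H_inv @ a)) * (H_inv @ a)
        return theta_mid

    # Toy usage with made-up numbers:
    theta = np.zeros(2)
    g = np.array([1.0, 0.0])   # reward prefers moving along the first axis
    a = np.array([1.0, 1.0])   # constraint gradient
    b = -0.05                  # current policy has a little constraint slack
    H = np.eye(2)              # identity Fisher matrix for illustration
    print(pcpo_update(theta, g, a, b, H, delta=0.01))

An L2-norm projection is obtained by replacing H with the identity matrix in the projection step; only the metric used for the projection changes.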
Video demonstration for the bottleneck task
The agent controls a set of autonomous vehicles (shown in red) in a traffic-merge situation and is rewarded for achieving high throughput, but it is constrained to ensure that human-driven vehicles (shown in white) travel at low speed for no more than 10 seconds. (This task is from Eugene Vinitsky, Aboudy Kreidieh, Luc Le Flem, Nishant Kheterpal, Kathy Jang, Fangyu Wu, Richard Liaw, Eric Liang, and Alexandre M. Bayen, "Benchmarks for reinforcement learning in mixed-autonomy traffic," in Proceedings of the Conference on Robot Learning, 2018.)
Our algorithm learns a policy that maximizes throughput while ensuring fairness among all drivers.