Kai-Chieh Hsu*, Allen Z. Ren*, Duy Phuong Nguyen, Anirudha Majumdar+, Jaime F. Fisac+
Princeton University
*equal contribution in alphabetical order; +equal advising
Accepted to Special Issue on Risk-aware Autonomous Systems: Theory and Practice, Artificial Intelligence Journal
Progress and Challenges in Building Trustworthy Embodied AI Workshop, NeurIPS 2022
Oral, Generalizable Policy Learning in the Physical World Workshop, ICLR 2022
We leverage an intermediate training stage, Lab, between Sim and Real to safely bridge the Sim-to-Real gap in ego-vision indoor navigation tasks. Compared to Sim training, Lab training is (1) more realistic and (2) more safety-critical.
For safe Sim-to-Lab transfer, we learn a safety critic with Hamilton-Jacobi reachability RL and apply a supervisory control scheme (shielding) to filter out unsafe actions during exploration.
For safe Lab-to-Real transfer, we use the Probably Approximately Correct (PAC)-Bayes Control framework to provide lower bounds (70-90%) on the expected performance and safety of policies in unseen environments.
2D slices of safety critic when the robot is facing to the right
We compare our learned safety critic with those learned by SQRL [Srinivasan'20] and Recovery RL [Thananjeyan'21], which rely on sparse (binary) safety indicators. Reachability RL allows the safety critic to learn from near-failure events through dense signals; as a result, it recovers a thicker unsafe set and significantly reduces the number of safety violations during Lab training and Real deployment.
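As a reference, one common form of the discounted safety Bellman backup for reachability-based critics (following Fisac et al., ICRA 2019) is sketched below; the sign convention shown here, with margin g(s) positive when safe and larger when farther from failure, is an assumption and may be flipped relative to the implementation:

    V(s) = (1 - \gamma)\, g(s) + \gamma \min\big\{\, g(s),\ \max_{a} V\big(f(s, a)\big) \big\}

Because g is a dense margin (e.g., a signed distance to the nearest obstacle) rather than a binary failure flag, near-misses still produce informative learning targets.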
We train a dual policy, a performance policy (\pi^p) and a backup (safety) policy (\pi^b), both conditioned on latent variables sampled from a distribution. The performance policy guides the robot toward the goal, while the backup policy intervenes minimally, only when the safety critic deems the robot close to danger. With a distribution of policies parameterized by the latent variables, the robot exhibits diverse trajectories around obstacles.
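A minimal sketch of this value-based shielding rule (illustrative only; the function names, critic interface, and threshold are placeholders, and the sign convention assumes higher critic values indicate greater danger):

def shielded_action(obs, z, q_safe, pi_perf, pi_backup, threshold=0.0):
    """Value-based shielding: act with the performance policy unless the
    safety critic predicts the proposed action brings the robot too close
    to failure (higher critic value = more dangerous)."""
    a_perf = pi_perf(obs, z)              # latent-conditioned performance action
    if q_safe(obs, a_perf) > threshold:   # proposed action deemed unsafe
        return pi_backup(obs, z), True    # backup (safety) policy takes over
    return a_perf, False                  # no intervention needed

The shield is queried at every step, so the backup policy overrides the performance policy only in the (hopefully rare) states where the critic flags danger, keeping interventions minimal.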
We test the dual policy in 10 different real indoor spaces, including the one shown on the left. The colored trajectory indicates the safety critic value (red for higher, blue for lower) at each location. When the value exceeds the threshold, shielding activates and the backup policy (green arrow) overrides the performance policy (red arrow) to steer the robot away from obstacles.
@article{hsuren2022slr,
title = {Sim-to-Lab-to-Real: Safe Reinforcement Learning with Shielding and Generalization Guarantees},
journal = {Artificial Intelligence},
pages = {103811},
year = {2022},
issn = {0004-3702},
doi = {10.1016/j.artint.2022.103811},
url = {https://www.sciencedirect.com/science/article/pii/S0004370222001515},
author = {Kai-Chieh Hsu and Allen Z. Ren and Duy P. Nguyen and Anirudha Majumdar and Jaime F. Fisac},
keywords = {Reinforcement Learning, Sim-to-Real Transfer, Safety Analysis, Generalization}
}
In this work, we generalize the reinforcement learning formulation to handle all optimal control problems in the reach-avoid category. We derive a time-discounted reach-avoid Bellman equation with contraction mapping properties and prove that the resulting reach-avoid Q-learning algorithm converges, yielding an arbitrarily tight conservative approximation to the reach-avoid set. We further demonstrate the use of this formulation with deep reinforcement learning methods, retaining zero-violation guarantees by treating the approximate solutions as untrusted oracles in a model-predictive supervisory control framework.
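As a reference, the discounted reach-avoid backup can be sketched as follows, under the convention that the target margin \ell(s) is positive inside the target set and the safety margin g(s) is positive outside the failure set (the exact form and sign conventions in the paper may differ):

    V(s) = (1 - \gamma)\,\min\{\ell(s),\, g(s)\} + \gamma \min\Big\{\, g(s),\ \max\big\{\, \ell(s),\ \max_{a} V\big(f(s, a)\big) \big\} \Big\}

Intuitively, the value is positive only if the system can eventually make \ell positive (reach the target) while keeping g positive along the way (avoid failure); with \gamma < 1 the induced set \{s : V(s) > 0\} is a conservative approximation of the reach-avoid set that tightens as \gamma \to 1.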
We utilize the Probably Approximately Correct (PAC)-Bayes framework, which allows us to obtain upper bounds that hold with high probability on the expected cost of (stochastic) control policies across novel environments. We propose policy learning algorithms that explicitly seek to minimize this upper bound. Our examples demonstrate the potential of our approach to provide strong generalization guarantees for robotic systems with continuous state and action spaces, complicated (e.g., nonlinear) dynamics, rich sensory inputs (e.g., depth images), and neural network-based policies.
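For intuition, one standard form of the bound (a McAllester-style PAC-Bayes inequality; the paper may use a tighter variant such as the KL-inverse form) reads: with probability at least 1 - \delta over the draw of N training environments, and for costs normalized to [0, 1],

    C_{\mathcal{D}}(P) \le C_S(P) + \sqrt{ \frac{ \mathrm{KL}(P \,\|\, P_0) + \log\frac{2\sqrt{N}}{\delta} }{ 2N } },

where C_S(P) is the average training cost of the policy distribution P and P_0 is a prior fixed before seeing the training environments. Training minimizes the right-hand side; since success rate is one minus cost, an upper bound on expected cost is equivalent to a lower bound on expected success in novel environments.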