SLES: Safe Distributional-Reinforcement-Learning-Enabled Systems: Theories, Algorithms, Experiments

We would like to acknowledge the support from the NSF CISE award "Collaborative Research: SLES: Safe Distributional-Reinforcement-Learning-Enabled Systems: Theories, Algorithms, Experiments". This project involves the following institutions: the University of Michigan (lead institution), Arizona State University, and The Ohio State University.

Overview: Reinforcement learning (RL), with its success in gaming and robotics, has been widely viewed as one of the most important technologies for next-generation, learning-enabled systems such as 6G, autonomous driving, digital healthcare, and smart cities. However, despite significant advances over the last few decades, a major obstacle to applying RL in practice is the lack of safety guarantees such as robustness, resilience to tail risks, operational constraints, and fairness. This is because traditional RL aims only at maximizing the expected cumulative reward. While it is possible to add penalties to the rewards in a traditional RL algorithm to discourage unsafe actions, many safety constraints, such as chance constraints, cannot simply be treated as penalties. This project will develop foundational research for safe RL-enabled systems based on Distributional Reinforcement Learning (DRL), which learns the optimal policy from the value distribution rather than from its mean alone.
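To make this distinction concrete, the following minimal sketch (our own illustration, not an algorithm from this project) contrasts the mean return that traditional RL optimizes with quantities that require the full return distribution, such as a chance-constraint probability and CVaR; the sample returns, threshold c, and risk level alpha are assumed purely for illustration.

    import numpy as np

    # Hypothetical sample of discounted returns collected under some policy.
    returns = np.array([9.5, 10.2, 8.7, -3.0, 10.8, 9.9, -2.5, 10.1])

    # Traditional RL optimizes only the mean of the return.
    mean_value = returns.mean()

    # Distributional quantities need the whole return distribution:
    # 1) Chance constraint: P(return < c) must stay below a tolerance.
    c = 0.0
    violation_prob = np.mean(returns < c)

    # 2) Conditional Value-at-Risk (CVaR): mean of the worst alpha-fraction of returns.
    alpha = 0.25
    k = max(1, int(np.ceil(alpha * len(returns))))
    cvar = np.sort(returns)[:k].mean()

    print(f"mean={mean_value:.2f}, P(return<{c})={violation_prob:.2f}, CVaR_{alpha}={cvar:.2f}")

Two policies can have the same mean return while differing sharply in violation probability or CVaR, which is why such constraints cannot, in general, be folded into the reward as penalties.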

Mathematically, this project considers a risk-sensitive constrained distributional reinforcement learning (CDRL) formulation. Within this formulation, policy safety concerns the safety of the policy obtained by solving the problem, exploration safety concerns safety while learning that safe policy, and environmental safety concerns model misspecification and nonstationarity during both learning and implementation.
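For concreteness, one common way to write such a risk-sensitive constrained objective is sketched below; the notation (return distribution Z^pi, cost distributions C_i^pi, risk measures rho and rho_i, budgets b_i) is our own illustration rather than the project's exact formulation:

    \max_{\pi} \ \rho\left(Z^{\pi}\right)
    \quad \text{subject to} \quad
    \rho_i\left(C_i^{\pi}\right) \le b_i, \qquad i = 1, \dots, m,

where Z^{\pi} and C_i^{\pi} denote the random cumulative reward and costs under policy \pi, \rho and \rho_i are risk measures (e.g., expectation, CVaR, or a chance-constraint functional), and b_i are safety budgets. Policy safety then corresponds to the feasibility and risk level of the solution, exploration safety to the constraint violations incurred while searching for it, and environmental safety to the robustness of the solution when the underlying model is misspecified or drifts.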

Thrust 1: Foundation of CDRL

Lead: PI Yu

This thrust aims to establish the theoretical foundations of CDRL and focuses on policy and environmental safety. Having full information on the reward distribution provides a unified framework for handling a variety of risk measures. However, the fundamental properties of this risk-sensitive DRL framework have yet to be fully understood. For example, what safety and risk guarantees can DRL provide beyond those of traditional RL? How can we guarantee global convergence under DRL, and what is the fundamental iteration/sample complexity (i.e., the minimum number of iterations/samples required for learning the optimal policy)?
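As one standard formalization (stated here only for illustration), the (\varepsilon, \delta) sample complexity of a learning algorithm can be written as

    N(\varepsilon, \delta) = \min \left\{ N : \Pr\left[ \hat{\pi}_N \text{ is } \varepsilon\text{-optimal} \right] \ge 1 - \delta \right\},

i.e., the smallest number of samples N after which the learned policy \hat{\pi}_N is within \varepsilon of the optimal (safe) policy with probability at least 1 - \delta; in the constrained setting, one may additionally require \hat{\pi}_N to be (near-)feasible.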

Thrust 2: Online CDRL

Lead: PI Ying

Our focus in Thrust 1 is to learn a safe policy using DRL under a variety of risk measures and constraint types. Since reinforcement learning is an iterative process in which an agent interacts with an environment, takes actions, and receives signals of rewards, penalties, and constraint violations, an equally important consideration for safe RL-enabled systems is safety during learning. In other words, we need to safely learn a safe policy. Therefore, this thrust is devoted to safe online learning and decision-making under CDRL, i.e., the safety when learning the safe policy.
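Two quantities commonly used to quantify safety during learning in online constrained RL (given here as standard illustrations, not necessarily the project's metrics) are the reward regret and the cumulative constraint violation over T episodes:

    \mathrm{Regret}(T) = \sum_{t=1}^{T} \left( V^{\pi^*} - V^{\pi_t} \right),
    \qquad
    \mathrm{Violation}(T) = \sum_{t=1}^{T} \left[ C^{\pi_t} - b \right]_{+},

where \pi^* is the optimal safe policy, \pi_t is the policy played in episode t, V and C denote the (risk-sensitive) value and constraint value, b is the safety budget, and [x]_+ = \max(x, 0). Informally, an algorithm safely learns a safe policy when both quantities grow sublinearly in T (or the violation remains bounded).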

Thrust 3: Physics-Enhanced CDRL

Lead: PI Liu

This thrust focuses on embedding physics in CDRL to improve the policy, exploration, and environmental safety studied in Thrusts 1 and 2. One unique contribution of this thrust is to augment CDRL with physics and domain knowledge for enhanced predictive power and risk awareness, especially in unseen and unknown scenarios. This thrust will develop methods to embed physics into end-to-end learning, including augmenting input data, augmenting learning models, and augmenting loss functions.
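As a minimal sketch of the loss-augmentation direction (our own illustration, with a hypothetical double-integrator dynamics model and weight lam assumed), a physics-informed training loss can add a penalty for violating known dynamics to the ordinary data-fitting loss:

    import numpy as np

    def data_loss(pred_next_state, true_next_state):
        # Ordinary fitting error on observed transitions.
        return np.mean((pred_next_state - true_next_state) ** 2)

    def physics_residual(state, action, pred_next_state, dt=0.1):
        # Hypothetical known dynamics (double integrator): x' = x + v*dt, v' = v + a*dt.
        x, v = state[..., 0], state[..., 1]
        a = action[..., 0]
        expected = np.stack([x + v * dt, v + a * dt], axis=-1)
        return np.mean((pred_next_state - expected) ** 2)

    def physics_augmented_loss(state, action, pred_next_state, true_next_state, lam=1.0):
        # Physics-enhanced loss = data-fitting loss + lam * physics-consistency penalty.
        return (data_loss(pred_next_state, true_next_state)
                + lam * physics_residual(state, action, pred_next_state))

The physics term shapes predictions in regions with little or no data, which is exactly where risk awareness in unseen scenarios matters.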

Thrust 4: Validation and Experiments

Lead: PI Zhang

This thrust thoroughly evaluates the proposed CDRL framework using autonomous drones in three different missions: 1) environment monitoring, 2) tracking moving objects, and 3) payload delivery. A set of risk factors will be identified to evaluate the performance of the CDRL when 1) the environment remains consistent between training and testing (policy safety), 2) there are dynamic changes in the environment (exploration safety), and 3) obstacles and disturbances exist in the environment (environmental safety). We will start with high-fidelity simulation to collect training data and evaluate CDRL performance before implementing the CDRL on quadrotors for physical experiments.