This thrust aims to establish the theoretical foundations of CDRL, focusing on policy and environmental safety. Access to the full reward distribution provides a unified framework for handling a variety of risk measures. However, the fundamental properties of this risk-sensitive DRL framework are not yet fully understood. For example, what safety and risk guarantees can DRL provide beyond those of traditional RL? How can we guarantee global convergence under DRL, and what is the fundamental iteration/sample complexity (i.e., the minimum number of iterations/samples required to learn the optimal policy)?
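To make concrete how the full return distribution enables a variety of risk measures, the Python sketch below estimates the conditional value-at-risk (CVaR) from a quantile representation of the return, in the style of quantile-regression DRL, and uses it for risk-sensitive action selection. This is a minimal illustration, not the proposed method; the quantile representation, risk level, and function names are assumptions for exposition.

```python
import numpy as np

def cvar_from_quantiles(quantiles: np.ndarray, alpha: float = 0.1) -> float:
    """Estimate CVaR_alpha (mean of the worst alpha-fraction of returns)
    from N equally weighted quantile estimates of the return distribution,
    as produced by a quantile-regression DRL learner. Illustrative sketch.
    """
    q = np.sort(quantiles)
    k = max(1, int(np.ceil(alpha * len(q))))  # quantiles in the lower tail
    return float(q[:k].mean())

def risk_sensitive_action(quantiles_per_action: np.ndarray, alpha: float = 0.1) -> int:
    """Pick the action whose return distribution has the highest CVaR,
    rather than the highest mean; other risk measures plug in the same way.
    quantiles_per_action: shape (num_actions, N) of learned return quantiles.
    """
    return int(np.argmax([cvar_from_quantiles(qa, alpha) for qa in quantiles_per_action]))
```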
Our focus in Thrust 1 is to learn a safe policy using DRL under a variety of risk measures and constraint types. Since reinforcement learning is an iterative process in which an agent interacts with an environment, takes actions, and receives signals of rewards, penalties, and constraint violations, an equally important consideration for safe RL-enabled systems is safety during learning itself. In other words, we need to safely learn a safe policy. This thrust is therefore devoted to safe online learning and decision-making under CDRL, i.e., maintaining safety while the safe policy is being learned.
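As one concrete (hypothetical) way to handle constraints during online learning, the sketch below implements a standard primal-dual Lagrangian scheme for a constrained MDP: the learner optimizes reward penalized by a cost multiplier, and the multiplier grows whenever observed cost exceeds its budget. The class, budget, and learning rate are illustrative assumptions, not the proposal's specific algorithm.

```python
class LagrangianSafetyWrapper:
    """Primal-dual constraint handling for a constrained MDP (illustrative sketch).

    The wrapped agent maximizes the shaped reward r - lambda * c, while the
    dual variable lambda increases whenever the per-episode cost estimate
    exceeds the safety budget d, tightening the constraint during learning.
    """

    def __init__(self, cost_budget: float, dual_lr: float = 1e-3):
        self.d = cost_budget      # per-episode cost budget (assumed given)
        self.lmbda = 0.0          # Lagrange multiplier, kept nonnegative
        self.dual_lr = dual_lr

    def shaped_reward(self, reward: float, cost: float) -> float:
        # Primal signal fed to any off-the-shelf (distributional) RL learner.
        return reward - self.lmbda * cost

    def dual_update(self, episode_cost: float) -> None:
        # Projected gradient ascent on the dual: grow lambda when unsafe.
        self.lmbda = max(0.0, self.lmbda + self.dual_lr * (episode_cost - self.d))
```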
This thrust focuses on embedding physics in CDRL to improve the policy, exploration, and environmental safety studied in Thrusts 1 and 2. A unique contribution of this thrust is to augment CDRL with physics and domain knowledge for enhanced predictive power and risk awareness, especially in unseen and unknown scenarios. This thrust will develop methods to embed physics into end-to-end learning, including augmenting input data, augmenting learning models, and augmenting loss functions.
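Of the three embedding routes, loss augmentation is the easiest to show compactly. The hypothetical PyTorch sketch below adds a physics-residual penalty to a standard prediction loss, penalizing learned transitions that disagree with a known dynamics model (e.g., quadrotor kinematics); the dynamics function, weight, and tensor shapes are assumptions for illustration.

```python
import torch

def physics_augmented_loss(pred_next_state: torch.Tensor,
                           true_next_state: torch.Tensor,
                           state: torch.Tensor,
                           action: torch.Tensor,
                           dynamics_fn,                 # known physics model f(s, a) -> s'
                           physics_weight: float = 0.1) -> torch.Tensor:
    """Data loss plus a physics-residual term (illustrative sketch).

    The residual regularizes the learner toward the physics prior,
    which is most valuable in unseen scenarios where data alone is sparse.
    """
    data_loss = torch.nn.functional.mse_loss(pred_next_state, true_next_state)
    physics_residual = torch.nn.functional.mse_loss(pred_next_state,
                                                    dynamics_fn(state, action))
    return data_loss + physics_weight * physics_residual
```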
This thrust thoroughly evaluates the proposed CDRL framework using autonomous drones in three missions: 1) environment monitoring, 2) tracking moving objects, and 3) payload delivery. A set of risk factors will be identified to evaluate CDRL performance when 1) the environment remains consistent between training and testing (policy safety), 2) the environment changes dynamically (exploration safety), and 3) obstacles and disturbances are present in the environment (environmental safety). We will start with high-fidelity simulation to collect training data and evaluate CDRL performance before implementing CDRL on quadrotors for physical experiments.
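To illustrate the evaluation protocol, the sketch below rolls out a trained policy in a gym-style simulator under the three regimes above: the nominal training environment, dynamically changed environments, and environments with injected obstacles and disturbances. The environment factory, perturbation labels, and metrics are hypothetical placeholders, not the actual simulation setup.

```python
import numpy as np

def evaluate_policy(policy, make_env, perturbations, episodes: int = 50):
    """Roll out a trained policy under each evaluation regime and report
    mean return and constraint-violation rate (illustrative sketch).

    `make_env(perturbation)` is an assumed factory building a gym-style
    simulator; perturbation=None gives the nominal training environment.
    """
    results = {}
    for name, perturbation in perturbations.items():
        env = make_env(perturbation)
        returns, violations = [], 0
        for _ in range(episodes):
            obs, done, ep_return = env.reset(), False, 0.0
            while not done:
                obs, reward, done, info = env.step(policy(obs))
                ep_return += reward
                violations += int(info.get("constraint_violated", False))
            returns.append(ep_return)
        results[name] = {"mean_return": float(np.mean(returns)),
                         "violation_rate": violations / episodes}
    return results

# Regimes mirror the three safety settings: policy, exploration, environmental.
regimes = {"nominal": None, "dynamic_change": "shift", "disturbance": "wind_and_obstacles"}
```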