Guided Online Distillation:
Promoting Safe Reinforcement Learning by Offline Demonstrations
Jinning Li, Xinyi Liu, Banghua Zhu, Jiantao Jiao, Masayoshi Tomizuka, Chen Tang, Wei Zhan
Overview
Abstract
Online safe RL algorithms often train policies from scratch, which can lead to overly conservative policies that impede exploration. Recent advances in deep offline Reinforcement Learning (RL) and Imitation Learning have enabled autonomous agents to acquire competent decision-making policies from offline datasets, which can mitigate this conservativeness and guide online exploration. Large-capacity models, e.g., Decision Transformers, have proven competent at modeling the behaviors underlying offline demonstrations. However, these bulky policy networks cannot meet the inference-time computation requirements of real-world tasks such as autonomous vehicle planning. Moreover, data collected in real-world scenarios rarely contain dangerous cases (e.g., collisions), which makes it difficult for the policies to learn safety concepts. We therefore propose an offline-to-online training scheme for safe RL settings with improved performance and computation efficiency. We leverage the information obtained in the offline stage to guide exploration during online RL distillation, which produces a lightweight policy network that surpasses the performance of the offline-trained guide policy. Experiments on both benchmark safe RL tasks and real-world driving tasks based on the Waymo Open Motion Dataset (WOMD) demonstrate that the method can successfully distill lightweight policies and solve decision-making problems in challenging safety-critical scenarios.
We propose a training scheme, Guided Online Distillation (GOLD), for safety-critical scenarios where offline expert demonstrations are available. It addresses both the scarcity of high-risk cases in offline datasets and the overly conservative exploration of safe RL policies trained from scratch.
A Decision Transformer (DT) is first trained on the offline demonstrations and later serves as the guide policy during online distillation. A lightweight exploration policy network is then trained (distilled) interactively in the task environment with safe RL methods.
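One plausible way to picture this two-stage scheme is the guide-then-explore loop below. This is a minimal Python sketch, not the actual implementation: gold_distill, guide_policy, explore_policy, and safe_rl_update are placeholder names, the annealing schedule for the guidance horizon is an assumption, and the pre-Gymnasium gym step API is assumed.

```python
def gold_distill(env, guide_policy, explore_policy, safe_rl_update,
                 n_iters=1000, max_horizon=1000, init_guide_steps=500):
    """Guide-then-explore online distillation loop (illustrative only).

    env             -- gym-style environment whose per-step info dict exposes a 'cost' signal
    guide_policy    -- frozen offline-trained guide (e.g., a Decision Transformer), obs -> action
    explore_policy  -- lightweight policy being distilled, obs -> action
    safe_rl_update  -- one update of the chosen safe RL algorithm on a batch of transitions
    """
    for it in range(n_iters):
        # Anneal the guidance horizon: the guide controls the first h steps
        # of each episode, and h shrinks as training proceeds so the
        # lightweight policy gradually takes over the whole episode.
        h = int(init_guide_steps * (1.0 - it / n_iters))
        obs = env.reset()                      # pre-Gymnasium gym API assumed
        batch = []
        for t in range(max_horizon):
            act = guide_policy(obs) if t < h else explore_policy(obs)
            next_obs, reward, done, info = env.step(act)
            batch.append((obs, act, reward, info.get("cost", 0.0), next_obs, done))
            obs = next_obs
            if done:
                break
        # Every transition, including those gathered under the guide's
        # control, is used to update the lightweight exploration policy.
        explore_policy = safe_rl_update(explore_policy, batch)
    return explore_policy
```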
GOLD in Safety Gym
Overview in Safety Gym
The following demonstrations are in safety-gym environments.
Goal: the agent car aims to reach the green goal area while touching the purple dots as little as possible.
Button: the goal of the red car is to reach and press the yellow button without touching purple boxes or dots.
Push: the goal of the red car is to push the yellow box to the green goal point.
The difficulty increases from Goal to Button to Push. The demos show that GOLD completes the benchmark tasks with high reward while keeping the cost below the threshold.
Goal
Button
Push
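To make the reward/cost bookkeeping in these demos concrete, here is a minimal evaluation sketch with a random policy standing in for the trained one. It assumes the original safety_gym package, its Safexp-Car{Task}{Level}-v0 environment IDs, and the pre-Gymnasium gym API; the cost limit of 25 is a commonly used default, not a claim about our exact experimental settings.

```python
import gym
import safety_gym  # noqa: F401 -- registers the Safexp-* environments

COST_LIMIT = 25.0  # commonly used per-episode cost budget (assumed)

def evaluate(env, policy, episodes=10):
    """Average episodic return and cost of a policy in a Safety Gym task."""
    returns, costs = [], []
    for _ in range(episodes):
        obs, done = env.reset(), False
        ep_ret, ep_cost = 0.0, 0.0
        while not done:
            obs, reward, done, info = env.step(policy(obs))
            ep_ret += reward
            ep_cost += info.get("cost", 0.0)   # per-step constraint-violation signal
        returns.append(ep_ret)
        costs.append(ep_cost)
    return sum(returns) / episodes, sum(costs) / episodes

for env_id in ["Safexp-CarGoal1-v0", "Safexp-CarButton1-v0", "Safexp-CarPush1-v0"]:
    env = gym.make(env_id)
    random_policy = lambda _obs: env.action_space.sample()  # stand-in for the trained policy
    avg_ret, avg_cost = evaluate(env, random_policy)
    print(f"{env_id}: return {avg_ret:.1f}, cost {avg_cost:.1f} (limit {COST_LIMIT})")
```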
Guide Policy Influences on Performance
In the following demos, we compare GOLD with guide policies of different quality. Although both guide policies are trained on the same offline dataset, DT performs better than BC as a guide policy.
With the guidance of a pre-trained offline policy, GOLD avoids the many failed explorations that typically precede success and can discover highly rewarding solutions. The better the guide policy, the better the performance GOLD reaches.
GOLD (BC-IQL)
GOLD (DT-IQL): the Proposed Method
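For reference, the weaker of the two guides can be summarized in a few lines: behavior cloning is plain supervised regression from observations to actions on the offline dataset. The sketch below is illustrative only (network size, epochs, and the assumption that actions are normalized to [-1, 1] are ours, not the paper's). The DT guide is trained on the same data but conditions on return-to-go and trajectory context, which is what lets it recover higher-quality behavior than this average-action regressor.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_bc_guide(observations, actions, epochs=50, lr=3e-4, batch_size=256):
    """Behavior-cloning guide: supervised regression from observations to actions.

    observations, actions -- float tensors of shape (N, obs_dim) and (N, act_dim)
    drawn from the same offline dataset used to train the DT guide.
    """
    obs_dim, act_dim = observations.shape[1], actions.shape[1]
    policy = nn.Sequential(
        nn.Linear(obs_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, act_dim), nn.Tanh(),   # actions assumed normalized to [-1, 1]
    )
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(observations, actions),
                        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for obs, act in loader:
            loss = nn.functional.mse_loss(policy(obs), act)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```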
GOLD in MetaDrive
The demonstrations on MetaDrive with the Waymo Open Motion Dataset (WOMD) show that our method is applicable to and effective in realistic scenarios. These experiments are close to real-world scenes because MetaDrive replays vehicle trajectories from WOMD. The observations fed to the ego agent, including Lidar point clouds, navigation information, and ego states, also resemble the real-world setting. The goal of the ego vehicle is to arrive at a specific target position defined in WOMD.
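As a rough sketch of how the distilled policy is evaluated in these replayed scenarios: the function below assumes a gym-style MetaDrive replay environment is passed in, with flat observations and the pre-Gymnasium step API; the 'cost' and 'arrive_dest' info fields are illustrative names and may not match MetaDrive's exact keys.

```python
def run_scenario(env, policy, max_steps=1000):
    """Roll out the distilled lightweight policy in one replayed WOMD scenario.

    env    -- gym-style MetaDrive environment replaying logged WOMD traffic;
              the observation is assumed to be a flat vector of Lidar readings,
              navigation information, and ego states.
    policy -- distilled policy, obs -> [steering, throttle].
    """
    obs = env.reset()
    ep_reward, ep_cost = 0.0, 0.0
    info = {}
    for _ in range(max_steps):
        obs, reward, done, info = env.step(policy(obs))
        ep_reward += reward
        ep_cost += info.get("cost", 0.0)        # e.g., collision / off-road events
        if done:
            break
    # Whether the ego vehicle reached the WOMD-defined target position;
    # the field name is illustrative.
    success = bool(info.get("arrive_dest", False))
    return ep_reward, ep_cost, success
```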
The following comparison is between GOLD (DT-IQL) and CVPO trained from scratch. It validates that bringing in prior skills during online distillation is necessary for learning a high-quality policy in real-world safety-critical scenarios. CVPO trained from scratch hardly learns to turn safely at intersections. It explores the environment with the primary goal of dodging human-driven cars, so its trajectory is either too jerky (e.g., in the left turn) or too conservative (e.g., in the right turn). In contrast, GOLD maintains steady progress along the navigation points and successfully reaches the goal.
Case 1 -- Left turn
CVPO
GOLD (DT-IQL): the Proposed Method
Case 2 -- Right turn
CVPO
GOLD (DT-IQL): the Proposed Method
Case 3 -- Straight lane
CVPO
GOLD (DT-IQL): the Proposed Method