The rapid advancement of autonomous vehicle technology has brought unprecedented challenges in training robust driving agents. While real-world training data is essential, it presents significant limitations in terms of safety, cost, and exposure to rare but critical driving scenarios. This research explores an innovative approach to autonomous vehicle training that leverages simulation environments to their full potential, building upon the "Learn by Cheating" framework [1] while introducing novel methods to address the long-tail distribution of challenging driving scenarios through systematic data collection and augmentation.
Recent work has demonstrated the effectiveness of using privileged information in simulated environments to train expert agents, which then serve as teachers for more practical vision-based driving agents [1,2]. However, even these expert agents struggle with rare but critical edge cases that occur in real-world driving. Our work proposes a systematic approach to identifying, augmenting, and learning from these challenging scenarios through targeted simulation resets and parametric variations, effectively mining and expanding the long tail of driving situations to create more robust autonomous driving agents.
The fundamental challenge in autonomous vehicle development lies not in handling routine driving scenarios, but in managing the rare, challenging situations that can lead to accidents. Traditional training approaches face two significant limitations. First, collecting real-world data for these edge cases is prohibitively expensive and potentially dangerous. Second, even in simulations, these scenarios occur too infrequently under standard training conditions to significantly influence agent behaviour.
Our research attempts to address these limitations by introducing a novel crash-scenario mining and augmentation approach that takes full advantage of simulation capabilities. The proposed approach draws inspiration from "Sample Efficient Deep Reinforcement Learning via Local Planning" [5], which demonstrated the value of strategically revisiting states based on value uncertainty. While their work focused on general RL efficiency, we apply a similar principle to autonomous driving by identifying and revisiting critical states, specifically those leading to crashes. By systematically collecting challenging scenarios, applying parametric variations to them, and replaying the resulting variants, we create a focused training curriculum that ensures our agents receive sufficient exposure to critical edge cases. This augmentation process includes adjusting the ego vehicle heading and the temporal distance from the crash point, allowing us to create a spectrum of scenarios with different difficulty levels. For instance, scenarios captured closer to the moment of collision present more challenging cases for the agent to solve, while those captured earlier provide opportunities to learn preventative behaviours. This approach is particularly powerful when combined with the "Learn by Cheating" framework, as it allows us to improve the upper bound of what the expert agent can achieve, ultimately leading to better performance in the student agent that operates under realistic constraints.
The proposed method offers several key advantages. Beyond eliminating the safety concerns and logistical challenges of collecting real-world data for dangerous scenarios, it provides a systematic way to not only identify failure cases but also to multiply their training value through augmentation. This approach creates a scalable solution to addressing the long-tail distribution problem in autonomous driving, potentially leading to more robust and reliable autonomous vehicles.
Our work employs the framework of Reinforcement Learning with Local Simulator Access (RLLS), as formalized in "The Power of Resets in Online Reinforcement Learning" [6]. In standard online RL, each episode τ begins from an initial state x₁ sampled from some initial state distribution T₀. RLLS extends this definition by allowing the agent to reset to any previously encountered state xₕ and continue the episode from that state.
We leverage this RLLS framework with a specific focus on safety-critical states. Formally, given a crash at time t, we store the state x_{t-dt} where dt is our chosen temporal offset. We then augment these saved states, creating a set of challenging scenarios.
Similar to how recent work has shown the benefits of revisiting uncertain states in RL training, our method identifies and returns to critical driving scenarios. However, rather than using value uncertainty as the criterion for state selection, we focus specifically on crash outcomes as indicators of states requiring additional learning. Furthermore, we extend this concept by not just revisiting these states, but actively augmenting them to create a richer training distribution. Our research methodology consists of two main phases: expert agent training (base-training) and scenario-augmented training (post-training). This section details the technical implementation of each phase and the key design decisions that enable our approach.
We begin by training an expert agent using reinforcement learning in a simulated environment. The agent leverages privileged information in the form of semantic bird's-eye view observations, which provide complete information about the driving scene. We do this for a fixed number of time-steps and call this phase "base training".
During base training, we implement a systematic crash scenario collection process. During agent rollouts, we continuously maintain a rolling buffer of recent states. When a crash occurs, we backtrack by a variable time interval dt to capture the pre-crash scenario. Each scenario record contains the state information (position and heading) for all of the vehicles in the scene.
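The collection step above can be sketched as follows. This is a minimal illustration rather than our exact implementation: the names `Scenario`, `CrashScenarioRecorder`, `record` and `on_crash` are hypothetical, the pose layout (x, y, heading) is an assumption, and clamping to the oldest available state when fewer than dt states exist is one possible choice.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional, Tuple

# (x, y, heading) for one vehicle; this field layout is an assumption.
Pose = Tuple[float, float, float]


@dataclass(frozen=True)
class Scenario:
    """Snapshot of every vehicle in the scene, keyed by a vehicle id."""
    poses: Dict[str, Pose]


class CrashScenarioRecorder:
    """Keeps the most recent states and saves the one dt steps before a crash."""

    def __init__(self, history_len: int) -> None:
        # deque(maxlen=...) silently drops the oldest state once it is full.
        self._recent: Deque[Scenario] = deque(maxlen=history_len)
        self.crash_buffer: List[Scenario] = []

    def record(self, scenario: Scenario) -> None:
        """Call once per environment step with the current scene state."""
        self._recent.append(scenario)

    def on_crash(self, dt: int) -> Optional[Scenario]:
        """Store the state dt steps before the latest recorded state."""
        if not self._recent:
            return None
        # Clamp so that very short episodes still yield their earliest state.
        index = max(len(self._recent) - 1 - dt, 0)
        saved = self._recent[index]
        self.crash_buffer.append(saved)
        self._recent.clear()  # the episode terminates on collision
        return saved
```

In this sketch `history_len` must be larger than the largest dt in use, otherwise the requested pre-crash state has already been evicted and the oldest retained state is saved instead.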
To maximise the value of collected crash scenarios, we apply an augmentation step whenever we sample from our buffer. Specifically, for each base scenario we can generate multiple variants by applying small perturbations to the ego heading.
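A minimal sketch of such a heading perturbation is shown below. The function names, the uniform noise model and the default `max_delta` of 0.1 rad are assumptions made for illustration; the text above only states that the perturbations are small.

```python
import math
import random
from typing import Dict, Optional, Tuple

Pose = Tuple[float, float, float]  # (x, y, heading in radians), assumed layout


def wrap_angle(angle: float) -> float:
    """Map an angle to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def augment_ego_heading(
    poses: Dict[str, Pose],
    ego_id: str = "ego",
    max_delta: float = 0.1,  # hypothetical bound on the perturbation, in radians
    rng: Optional[random.Random] = None,
) -> Dict[str, Pose]:
    """Return a copy of the scenario with a small random offset on the ego heading."""
    rng = rng or random.Random()
    x, y, heading = poses[ego_id]
    delta = rng.uniform(-max_delta, max_delta)
    augmented = dict(poses)  # shallow copy; the stored base scenario is untouched
    augmented[ego_id] = (x, y, wrap_angle(heading + delta))
    return augmented
```

Because the base scenario is never modified, calling the function repeatedly on the same buffer entry yields as many distinct variants as needed.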
After base training, we start a "post-training" phase. We integrate these collected and augmented scenarios into the training process using a hybrid approach:
With probability p, we initialise the training episode from a scenario picked at random from our crash buffer, to which we then apply our augmentation function.
With probability (1-p), we sample from the initial state distribution, as in the original training.
The training process continues iteratively, with new crash scenarios being collected as training progresses. This can be interpreted as creating a dynamic curriculum that evolves with the agent's capabilities.
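The hybrid initialisation described above can be sketched as a single selection function. The name `choose_initial_state` and its callback parameters are hypothetical, and falling back to the default distribution while the crash buffer is still empty is an assumption rather than something stated in the text.

```python
import random
from typing import Callable, List, Optional, TypeVar

S = TypeVar("S")


def choose_initial_state(
    crash_buffer: List[S],
    sample_default: Callable[[], S],
    augment: Callable[[S], S],
    p: float,
    rng: Optional[random.Random] = None,
) -> S:
    """Pick the start state for the next training episode."""
    rng = rng or random.Random()
    # With probability p, restart from an augmented crash scenario.
    # Assumption: an empty buffer falls through to the default branch.
    if crash_buffer and rng.random() < p:
        return augment(rng.choice(crash_buffer))
    # With probability 1 - p, use the simulator's usual initial state distribution.
    return sample_default()
```

Setting p to 0 recovers ordinary training, since the crash branch is then never taken.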
Although our approach is general, we implement this system using a policy gradient method, Proximal Policy Optimisation (PPO) [3]. All of our training and evaluation is carried out in the CARLA driving simulator [4]. The reward structure is based on gym-carla and is composed of additive terms: a negative reward for collisions, a negative reward for steering, a negative reward for going out of lane, a positive reward proportional to longitudinal speed, a negative reward for going too fast, and a negative reward for lateral acceleration. An episode terminates if there is a collision, the maximum number of time-steps is reached, the agent reaches the destination, or the agent strays too far out of lane. All of our results are based on Town 3 from CARLA and assume that the town is launched with 40 other vehicles in it.
Our expert agent receives observations in the form of a semantic bird's-eye view (BEV) representation combined with relevant state information. The BEV representation is a 128 x 128 RGB image centered on the agent. The image encodes 9 semantic classes with different colors: road, lanes, centerlines, waypoints, vehicles, agent, green lights, yellow lights and red lights. Additionally, we provide the agent with state information that includes the lateral distance and heading error between the ego vehicle and the target lane center line (in meters and radians, respectively), the ego vehicle's speed (in meters per second), and an indicator of whether there is a front vehicle within a safety margin. The agent produces continuous actions in two dimensions: acceleration in [-1.0, 1.0] and steering in [-0.3, 0.3].
Following [2], we implement the expert agent using a CNN-based architecture that outputs a Gaussian distribution over actions. The network processes the BEV input through convolutional layers while the state information is processed through fully connected layers. These features are then combined to predict the mean and variance of the action distribution.
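To make the action head concrete, the sketch below draws one action from a diagonal Gaussian given a predicted mean and variance, and keeps it within the acceleration and steering ranges of our action space. The function name `sample_action` is hypothetical, and clipping out-of-range samples is an assumption; other choices, such as squashing with tanh, are equally plausible.

```python
import math
import random
from typing import Optional, Sequence, Tuple

# (low, high) bounds for acceleration and steering.
ACTION_BOUNDS = ((-1.0, 1.0), (-0.3, 0.3))


def sample_action(
    mean: Sequence[float],
    variance: Sequence[float],
    rng: Optional[random.Random] = None,
) -> Tuple[float, ...]:
    """Sample (acceleration, steering) from independent Gaussians, then clip."""
    rng = rng or random.Random()
    action = []
    for mu, var, (low, high) in zip(mean, variance, ACTION_BOUNDS):
        raw = rng.gauss(mu, math.sqrt(var))  # gauss takes a standard deviation
        # Assumption: samples outside the valid range are clipped to the bounds.
        action.append(min(max(raw, low), high))
    return tuple(action)
```

At evaluation time one could pass a variance of zero to act on the predicted mean alone.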
In order to evaluate agent performance with and without our resetting strategy, we adopt two evaluation schemes. The first is "global robustness", which measures cumulative reward after post-training with and without resets. The second is "crash focussed", which measures cumulative reward on a sample of crash scenarios that the base-trained policy (i.e. the policy before post-training) encounters during rollouts performed outside of training. We expand on these schemes below. Before that, we list the hyper-parameters we used (at least those needed to launch similar experiments, excluding defaults that we do not override).
Base training
reset = false
learning rate = 3e-5
num minibatches = 32
update epochs = 10
total timesteps = 524288
num steps (max steps in an episode) = 1024
Post training
Same as base training, with the exception of:
reset = true (for the reset agent)
p = 0.5 (for the reset agent)
reset = true (for the no-resets agent)
p = 0.0 (for the no-resets agent)
total timesteps = 16384
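For readers who want to reproduce these settings, the sketch below collects them in one configuration object and derives the number of PPO updates they imply. The class name `TrainConfig` is hypothetical. The derived quantities rest on two assumptions that are not stated above: a single environment (`num_envs = 1`) and the common PPO convention in which num steps is also the rollout length per update. The value of p for base training is a placeholder, since it is unused when reset is false.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    reset: bool
    p: float
    learning_rate: float = 3e-5
    num_minibatches: int = 32
    update_epochs: int = 10
    total_timesteps: int = 524288
    num_steps: int = 1024
    num_envs: int = 1  # assumption: the number of environments is not listed

    @property
    def batch_size(self) -> int:
        return self.num_envs * self.num_steps

    @property
    def minibatch_size(self) -> int:
        return self.batch_size // self.num_minibatches

    @property
    def num_updates(self) -> int:
        return self.total_timesteps // self.batch_size


# p is a placeholder for base training because resets are disabled there.
BASE = TrainConfig(reset=False, p=0.0)
# Post-training overrides, mirroring the list above as written.
POST_RESET = replace(BASE, reset=True, p=0.5, total_timesteps=16384)
POST_NO_RESET = replace(BASE, reset=True, p=0.0, total_timesteps=16384)

if __name__ == "__main__":
    for name, cfg in [("base", BASE), ("post/reset", POST_RESET),
                      ("post/no-reset", POST_NO_RESET)]:
        print(name, cfg.num_updates, cfg.minibatch_size)
```

Under these assumptions base training corresponds to 512 policy updates and post-training to 16, which gives a sense of how short the post-training phase is relative to base training.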
We run a PPO agent for around 500K steps and then we save the base training policy. We then start post-training from this policy. In post-training we run a PPO agent for around 16K steps with our reset strategy and without our reset strategy (just normal PPO). We save the post-training policy for the reset and no-reset agents. Then, we simply run the resulting frozen policies for 30 episodes while recording the total episodic reward each of them accumulates. We do this for three differently seeded post training policies.
As we can see, results can vary wildly between differently seeded post-training policies (that start from the same base training policy). The main observation here, however, is that resets do not seem to result in more episodic reward in general. In fact, they seem to be harming general performance. One issue with this evaluation scheme is that it does not specifically evaluate on crash scenarios; it just evaluates on random scenarios (most of which are non-crashing). We address this in the next scheme to gain more insight into crash scenario behaviour.
We run a PPO agent for around 500K steps and then we save the base training policy. We then start post-training from this policy. In post-training we run a PPO agent for around 16K steps with our reset strategy and without our reset strategy (just normal PPO). We save the post-training policy for the reset and no-reset agents. Using the saved base training policy, we perform rollouts and collect 10 crash scenarios. We think of these as crashes that the base training policy is likely to encounter but that the agent never specifically sees in post-training (i.e. an isolated scenario test set). Then, we use each crash scenario to start an episode in which we record the total episodic reward obtained by each post-training policy: with resets and without resets.
This came as a surprise. Resets seem to be harming performance in crash focussed scenarios too. There are several hypotheses that we can investigate going from here, but they were outside the scope of this class project. We discuss them in the next section.
Note raw files containing all of our base-training and post-training policies and results shown in plots are attached here.
In our results, we observe that resets in post-training were actually harming both global performance and crash-focussed performance. This is counterintuitive at first, but it might not be on closer inspection. We think that what we are seeing may stem from plasticity issues. Only once such alternative hypotheses are eliminated can we conclude something more reliable about resets. Plasticity issues arise when a neural network fails to learn new things after it has been trained for a while. This can potentially be alleviated by periodically resetting the final-layer weights. Also, in the interest of simplifying the training process, we split it into base training and post-training, but there might be potential in mixing in crash data much earlier, before the network starts saturating. Evaluation would be slightly more complicated, since there would be no clear common policy starting point for post-training, but we can design other evaluation schemes that account for this. While the results are not as we expected, we believe that with simple changes there might be real potential for resets to improve robustness, especially given new emerging theoretical results that support this. We view our current work as one step in our role as empiricists, which is to rigorously characterize the empirical conditions under which resetting algorithms fail or succeed.
There are several avenues for future research that could improve our proposed approach.
Firstly, our method could benefit from a more comprehensive augmentation pipeline. Some ideas for additional dimensions for scenario augmentation:
In addition to perturbations in heading, the same could be applied to position and velocity.
Perturbations can be applied to all of the vehicles in the scene. The ego vehicle can also be spawned in the place of any other vehicle. For example, given a collision scenario involving two vehicles, the ego vehicle can now take on the place of either vehicle in the scenario.
Creating a curriculum around varying the temporal distance (dt) from the crash point to create scenarios of different difficulty levels. Currently the temporal distance dt is set once for the entire training run.
Ensuring all augmented scenarios maintain physical feasibility and traffic rule compliance.
Secondly, in the current implementation the sampling probability p remains constant throughout training, which could be improved. One idea is to dynamically adjust p based on the agent's current performance on crash scenarios.
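One possible form of such a schedule is sketched below. It is not something we implemented or evaluated: the function name, the success-rate signal, the target of 0.8, the step size and the bounds on p are all hypothetical values chosen for illustration.

```python
def update_reset_probability(
    p: float,
    crash_success_rate: float,
    target: float = 0.8,  # hypothetical desired success rate on crash scenarios
    step: float = 0.05,   # hypothetical adjustment per evaluation round
    p_min: float = 0.1,
    p_max: float = 0.9,
) -> float:
    """Raise p while the agent still fails crash scenarios, lower it otherwise."""
    if crash_success_rate < target:
        p += step  # struggling on crash scenarios: sample them more often
    else:
        p -= step  # doing well: shift back toward the default distribution
    # Keep both sources of initial states represented in training.
    return min(max(p, p_min), p_max)
```

Bounding p away from 0 and 1 keeps the agent exposed to both ordinary driving and crash scenarios, whichever way the schedule drifts.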
[1] Chen, Dian, Brady Zhou, Vladlen Koltun, and Philipp Krähenbühl. 2019. “Learning by Cheating.” arXiv [Cs.RO]. arXiv. https://arxiv.org/abs/1912.12294
[2] Zhang, Zhejun, Alexander Liniger, Dengxin Dai, Fisher Yu, and Luc Van Gool. 2021. “End-to-End Urban Driving by Imitating a Reinforcement Learning Coach.” arXiv [Cs.CV]. arXiv. https://arxiv.org/abs/2108.08265
[3] Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. “Proximal Policy Optimization Algorithms.” arXiv [Cs.LG]. arXiv. https://arxiv.org/abs/1707.06347
[4] Dosovitskiy, Alexey, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. 2017. “CARLA: An Open Urban Driving Simulator.” In Proceedings of the Conference on Robot Learning (CoRL), 1–16.
[5] Yin, Dong, Sridhar Thiagarajan, Nevena Lazic, Nived Rajaraman, Botao Hao, and Csaba Szepesvari. 2023. “Sample Efficient Deep Reinforcement Learning via Local Planning.” arXiv [Cs.LG]. arXiv. https://arxiv.org/pdf/2301.12579
[6] Mhammedi, Zakaria, Dylan J. Foster, and Alexander Rakhlin. 2024. “The Power of Resets in Online Reinforcement Learning.” arXiv [Cs.LG]. arXiv. https://arxiv.org/abs/2404.15417