Human-compatible driving partners through
data-regularized self-play reinforcement learning
Daphne Cornelisse, Eugene Vinitsky
NYU
Abstract
A central challenge for autonomous vehicles is coordinating with humans. Therefore, incorporating realistic human agents is essential for scalable training and evaluation of autonomous driving systems in simulation. Simulation agents are typically developed by imitating large-scale, high-quality datasets of human driving. However, pure imitation learning agents empirically have high collision rates when executed in a multi-agent closed-loop setting.
To build agents that are realistic and effective in closed-loop settings, we propose Human-Regularized PPO (HR-PPO), a multi-agent approach where agents are trained through self-play with a small penalty for deviating from a human reference policy. In contrast to prior work, our approach is RL-first and only uses 30 minutes of imperfect human demonstrations.
We evaluate agents in a large set of multi-agent traffic scenes. Results show our HR-PPO agents are highly effective in achieving goals, with a success rate of 93%, an off-road rate of 3.5%, and a collision rate of 3%. At the same time, the agents drive in a human-like manner, as measured by their similarity to existing human driving logs. We also find that HR-PPO agents show considerable improvements on proxy measures for coordination with human driving, particularly in highly interactive scenarios.
Method
Step 1: Imitation learning
Obtain a human reference policy τ through imitation learning on human driving demonstrations.
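As a rough sketch of this step (not the actual training code), the snippet below shows one behavioral-cloning update for a reference policy over a discretized action space; the network architecture, observation format, and helper names are placeholder assumptions.

```python
import torch
import torch.nn as nn

class ReferencePolicy(nn.Module):
    """Placeholder reference policy tau: maps an observation vector to a
    categorical distribution over a discretized action grid."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.Tanh(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def bc_update(policy, optimizer, obs_batch, expert_actions):
    """One behavioral-cloning step: maximize the log-likelihood of the
    logged human actions under the reference policy."""
    dist = policy(obs_batch)
    loss = -dist.log_prob(expert_actions).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```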
Step 2: Guided self-play
We use the KL-divergence between τ and π as a regularization term for the standard Proximal Policy Optimization (PPO) objective. The hyperparameter λ balances the two objectives.
We train agents in self-play using the objective below. Agents are trained in multi-agent settings with up to 50 agents per scenario.
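As a schematic sketch of that objective (notation simplified; see the paper for the exact formulation), the maximized objective combines the standard clipped PPO term with a KL penalty toward the human reference policy:

$$
J_{\text{HR-PPO}}(\theta) \;=\; J_{\text{PPO}}(\theta) \;-\; \lambda \, \mathbb{E}_{s}\left[ D_{\mathrm{KL}}\left( \pi_\theta(\cdot \mid s) \,\|\, \tau(\cdot \mid s) \right) \right]
$$

where $\tau$ is the human reference policy from Step 1 and $\lambda \ge 0$ controls the strength of the human regularization; setting $\lambda = 0$ recovers plain PPO self-play.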
Partially observable multi-agent navigation tasks
We train and evaluate HR-PPO and baseline agents in a challenging multi-agent benchmark: Partially observable navigation in Nocturne.
👈 Here is an example scenario.
The goal of every vehicle is to reach its assigned target position 🎯 (colored circles on the left) without colliding or going off the road. Agents receive a sparse reward: +1 if they reach their goal position before the end of the 80-step episode, and 0 otherwise. The reward function is intentionally simplified.
A challenging aspect of this task is that vehicles have partial visibility: objects or other vehicles may lie outside their field of view. In the example, the view of the controlled blue agent is shown on the right.
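As a hedged illustration of the sparse reward described above (the goal-radius tolerance and function signature are assumptions, not taken from the paper):

```python
import numpy as np

GOAL_RADIUS = 2.0  # placeholder tolerance in meters; the exact threshold is an assumption

def sparse_goal_reward(agent_pos, goal_pos, collided, off_road):
    """Sparse reward sketch: +1 when the agent reaches its goal within the
    80-step episode, 0 otherwise. Collisions and off-road events yield no
    reward here; how they terminate the episode is left to the environment."""
    if collided or off_road:
        return 0.0
    dist = np.linalg.norm(np.asarray(agent_pos) - np.asarray(goal_pos))
    return 1.0 if dist < GOAL_RADIUS else 0.0
```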
Results
1. Can agents drive in a human-like way? Is there a trade-off between performance and realism?
We find that effectiveness (being able to navigate to a goal without colliding) and realism (driving in a human-like way) can be achieved simultaneously.
HR-PPO agents achieve similar performance to PPO while experiencing substantial improvements in human likeness across four different realism metrics.
(Details in paper Section 3.4)
2. Are HR-PPO agents more compatible with human driving?
We examine whether agents are compatible with the human driving logs as a proxy for the ability to coordinate with human drivers.
HR-PPO agents perform better when paired with human driving logs, outperforming BC and PPO on both the train and test datasets. The advantage is especially pronounced in interactive scenarios, where HR-PPO outperforms PPO by 20-40%. The figure on the left shows that the PPO collision rate increases with the level of interactivity in a scenario, whereas the collision rate of HR-PPO agents remains low.
(Details in paper Section 3.5)
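For concreteness, the sketch below shows how such a log-replay evaluation could be wired up; env.reset, env.step, policy.act, and scenario.ego_id are hypothetical interfaces standing in for the actual simulator API.

```python
def evaluate_with_human_logs(env, policy, scenarios, episode_len=80):
    """Roll out the policy for one controlled vehicle per scenario while all
    other vehicles replay their recorded human trajectories, then report the
    fraction of episodes ending in goal achievement, collision, or off-road."""
    totals = {"goal_achieved": 0, "collided": 0, "off_road": 0}
    for scenario in scenarios:
        obs = env.reset(scenario, control_ids=[scenario.ego_id])
        info = {}
        for _ in range(episode_len):
            action = policy.act(obs)  # the policy controls only the ego vehicle
            obs, _, done, info = env.step({scenario.ego_id: action})
            if done:
                break
        for key in totals:
            totals[key] += int(info.get(key, False))
    return {key: count / len(scenarios) for key, count in totals.items()}
```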
3. How can we explain Human-Regularized PPO agents' low collision rates in interactive scenarios?
We conduct a qualitative analysis 🕵️ to find out why HR-PPO agents perform better with human driving logs. After analyzing the driving behavior of PPO and HR-PPO agents in 50 randomly sampled scenarios, we conclude that the lower collision rates can be attributed to two main factors:
HR-PPO agents' driving styles resemble those of human drivers, making it easier to anticipate the behavior of the log-replayed vehicles.
HR-PPO agents maintain more distance from other vehicles, thereby reducing ⬇️ the risk of collisions.
To illustrate the difference in driving styles between PPO and HR-PPO agents, we include a subset of videos below 👇
Example driving behaviors
Single-agent control 🚗💨 in a closed-loop environment with human driver logs
The policy-controlled vehicle is highlighted in red. The grey vehicles are stepped using the static human driving logs.
Failure to wait for another vehicle.
Successful coordination.
Zigzagging on the highway.
The vehicle stays in its lane.
Failure to go around a vehicle.
Keeping a safe distance.
Failure to coordinate at an intersection.
The vehicle stops and waits for the other to pass.
Effective but problematic goal-reaching behavior.
(we control the green vehicle)
Realistic and effective goal-reaching behavior.
(we control the same vehicle as on the left, but here it is colored blue)
Multi-agent control (self-play) 🚗💨 🚕💨 🚙💨
All vehicles in the scene are policy-controlled.
Roundabout
Roundabout
Highway
Highway
All vehicles reach their destinations, but the yellow car cuts off the pink one.
All vehicles arrive at their destinations in a more human-like way.
HR-PPO failure cases
We analyzed 100 scenarios each from the train and test datasets in log-replay mode and identified 3 types of failure cases of the HR-PPO agents. We show 3-6 examples for each category, where the red vehicle is policy-controlled.
1 - Sharp turns
Off-road events due to kinematically challenging turns or target positions. These make up approximately 25% of failures.
Sampled from train dataset
Sampled from train dataset
Sampled from train dataset
Sampled from test dataset
Sampled from test dataset
Sampled from test dataset
2 - Coordination
Collisions due to failure to anticipate human driving log behavior. These make up approximately 35% of failures.
Sampled from train dataset
Sampled from train dataset
Sampled from test dataset
Sampled from train dataset
Sampled from train dataset
3 - Setting-related bugs/failures
Unreachable target positions or errors related to the fixed driving logs. These make up approximately 35% of failures.
Sampled from train dataset
Sampled from train dataset
Sampled from test dataset
Sampled from train dataset
Sampled from test dataset
Cite
@article{cornelisse2024human,
  title={Human-compatible driving partners through data-regularized self-play reinforcement learning},
  author={Cornelisse, Daphne and Vinitsky, Eugene},
  journal={Reinforcement Learning Journal},
  volume={1},
  number={1},
  year={2024}
}