Human-compatible driving partners through
data-regularized self-play reinforcement learning
Daphne Cornelisse, Eugene Vinitsky
NYU
Abstract
A central challenge for autonomous vehicles is coordinating with humans. Incorporating realistic human agents is therefore essential for scalable training and evaluation of autonomous driving systems in simulation. Simulation agents are typically developed by imitating large-scale, high-quality datasets of human driving. However, agents trained purely through imitation learning empirically exhibit high collision rates when deployed in multi-agent closed-loop settings.
To build agents that are realistic and effective in closed-loop settings, we propose Human-Regularized PPO (HR-PPO), a multi-agent approach where agents are trained through self-play with a small penalty for deviating from a human reference policy. In contrast to prior work, our approach is RL-first and uses only 30 minutes of imperfect human demonstrations.
We evaluate agents in a large set of multi-agent traffic scenes. Results show our HR-PPO agents are highly effective in achieving goals, with a success rate of 93%, an off-road rate of 3.5%, and a collision rate of 3%. At the same time, the agents drive in a human-like manner, as measured by their similarity to existing human driving logs. We also find that HR-PPO agents show considerable improvements on proxy measures for coordination with human driving, particularly in highly interactive scenarios.
PPO
Human-Regularized PPO (ours)
Method
Step 1: Imitation learning
Obtain a human reference policy τ through imitation learning on human driving demonstrations.
Step 2: Guided self-play
We add the KL-divergence between τ and π as a regularization term to the standard Proximal Policy Optimization (PPO) objective. The hyperparameter λ balances the two objectives.
We train agents in self-play using the objective below. Agents are trained in multi-agent settings with up to 50 agents per scenario.
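The regularized objective above can be sketched in a few lines. This is a minimal illustrative sketch over a discrete action distribution: the function names, the KL direction, and the value of λ are our assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two categorical action distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def hr_ppo_loss(ppo_loss, pi_probs, tau_probs, lam=0.1):
    """Standard PPO loss plus a small penalty (weighted by lam)
    for deviating from the human reference policy tau."""
    return ppo_loss + lam * kl_divergence(pi_probs, tau_probs)
```

With λ = 0 this reduces to plain PPO; larger λ pulls the learned policy toward the imitation-learned human reference.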
Partially observable multi-agent navigation tasks
We train and evaluate HR-PPO and baseline agents in a challenging multi-agent benchmark: Partially observable navigation in Nocturne.
👈 Here is an example scenario.
The goal of every vehicle is to reach its assigned target position 🎯 (colored circles on the left) without colliding or going off the road. Rewards are sparse: an agent receives +1 if it reaches its goal before the end of the 80-step episode and 0 otherwise. The reward function is intentionally simplified.
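As a rough sketch of this sparse reward (the `goal_radius` threshold and the function signature are illustrative assumptions, not Nocturne's exact values):

```python
import numpy as np

EPISODE_LEN = 80  # steps per scenario

def sparse_reward(position, goal, step, goal_radius=1.0):
    """Return +1 if the agent is within goal_radius of its target
    before the episode ends, and 0 otherwise."""
    reached = np.linalg.norm(np.asarray(position) - np.asarray(goal)) <= goal_radius
    return 1.0 if (reached and step < EPISODE_LEN) else 0.0
```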
A challenging aspect of this task is that vehicles have partial visibility; there can be objects or vehicles outside their field of view. In the example, you can see the visible view of the controlled blue agent on the right.
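A cone-of-view visibility check of this kind might look as follows; the view distance and angle here are illustrative placeholders, not Nocturne's actual sensor parameters.

```python
import numpy as np

def visible(ego_pos, ego_heading, obj_pos, view_dist=80.0, view_angle=2 * np.pi / 3):
    """Illustrative cone-of-view check: an object is observed only if it
    lies within view_dist and within +/- view_angle/2 of the ego heading."""
    rel = np.asarray(obj_pos, dtype=float) - np.asarray(ego_pos, dtype=float)
    if np.linalg.norm(rel) > view_dist:
        return False
    angle = np.arctan2(rel[1], rel[0]) - ego_heading
    angle = (angle + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi]
    return abs(angle) <= view_angle / 2
```

Objects behind the ego vehicle or beyond the view distance are simply dropped from the observation, which is what makes the task partially observable.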
Results
1. Can agents drive in a human-like way? Is there a trade-off between performance and realism?
We find that effectiveness (being able to navigate to a goal without colliding) and realism (driving in a human-like way) can be achieved simultaneously.
HR-PPO agents match PPO's performance while substantially improving human likeness across four different realism metrics.
(Details in paper Section 3.4)
2. Are HR-PPO agents more compatible with human driving?
We examine whether agents are compatible with the human driving logs as a proxy for the ability to coordinate with human drivers.
HR-PPO agents perform better when paired with human driving logs, outperforming BC and PPO on both the train and test datasets. The advantage of HR-PPO is especially visible in interactive scenarios, where it outperforms PPO by 20-40%. The figure on the left shows that the PPO collision rate increases with the level of interactivity in a scenario, while the collision rate of HR-PPO agents remains low.
(Details in paper Section 3.5)
3. How can we explain Human-Regularized PPO agents' low collision rates in interactive scenarios?
We conduct a qualitative analysis 🕵️ to find out why HR-PPO agents perform better with human driving logs. After analyzing the driving behavior of PPO and HR-PPO agents in 50 randomly sampled scenarios, we conclude that the lower collision rates can be attributed to two main factors:
HR-PPO agents' driving styles resemble those of human drivers, making it easier for them to anticipate the actions of the log-replayed vehicles.
HR-PPO agents maintain more distance from other vehicles, thereby reducing ⬇️ the risk of collisions.
To illustrate the difference in driving styles between PPO and HR-PPO agents, we include a subset of videos below 👇
Example driving behaviors
Single-agent control 🚗💨 in a closed-loop environment with human driver logs
The policy-controlled vehicle is highlighted in red. The grey vehicles are replayed from the static human driving logs.
PPO
Failure to wait for another vehicle.
Human-Regularized PPO
Successful coordination.
PPO
Zigzagging on the highway.
Human-Regularized PPO
The vehicle stays in line.
PPO
Failure to go around a vehicle.
Human-Regularized PPO
Keeping a safe distance.
PPO
Failure to coordinate at an intersection.
Human-Regularized PPO
The vehicle stops and waits for the other vehicle to pass.
PPO
Effective but problematic goal-reaching behavior.
(we control the green vehicle)
Human-Regularized PPO
Realistic and effective goal-reaching behavior.
(we control the same vehicle as on the left, but here it is colored blue)
Multi-agent control (self-play) 🚗💨 🚕💨 🚙💨
All vehicles in the scene are policy-controlled.
PPO
Roundabout
Human-Regularized PPO
Roundabout
PPO
Highway
Human-Regularized PPO
Highway
PPO
All vehicles reach their destinations, but the yellow car cuts off the pink one.
Human-Regularized PPO
All vehicles arrive at their destinations in a more human-like way.
Cite
@misc{cornelisse2024humancompatible,
title={Human-compatible driving partners through data-regularized self-play reinforcement learning},
author={Daphne Cornelisse and Eugene Vinitsky},
year={2024},
eprint={2403.19648},
archivePrefix={arXiv},
primaryClass={cs.RO}
}