Human-compatible driving partners through
data-regularized self-play reinforcement learning
Daphne Cornelisse, Eugene Vinitsky
NYU
Abstract
A central challenge for autonomous vehicles is coordinating with humans. Therefore, incorporating realistic human agents is essential for scalable training and evaluation of autonomous driving systems in simulation. Simulation agents are typically developed by imitating large-scale, high-quality datasets of human driving. However, pure imitation learning agents empirically have high collision rates when executed in a multi-agent closed-loop setting.
To build agents that are realistic and effective in closed-loop settings, we propose Human-Regularized PPO (HR-PPO), a multi-agent approach where agents are trained through self-play with a small penalty for deviating from a human reference policy. In contrast to prior work, our approach is RL-first and only uses 30 minutes of imperfect human demonstrations.
We evaluate agents in a large set of multi-agent traffic scenes. Results show our HR-PPO agents are highly effective in achieving goals, with a success rate of 93%, an off-road rate of 3.5%, and a collision rate of 3%. At the same time, the agents drive in a human-like manner, as measured by their similarity to existing human driving logs. We also find that HR-PPO agents show considerable improvements on proxy measures for coordination with human driving, particularly in highly interactive scenarios.
Method
Step 1: Imitation learning
Obtain a human reference policy τ through imitation learning on human driving demonstrations.
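As a rough sketch of this step (not the actual training code), the snippet below shows one behavioral-cloning update for a reference policy over a discretized action space; the network architecture, observation format, and helper names are placeholder assumptions.

```python
import torch
import torch.nn as nn

class ReferencePolicy(nn.Module):
    """Placeholder reference policy tau: maps an observation vector to a
    categorical distribution over a discretized action grid."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.Tanh(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def bc_update(policy, optimizer, obs_batch, expert_actions):
    """One behavioral-cloning step: maximize the log-likelihood of the
    logged human actions under the reference policy."""
    dist = policy(obs_batch)
    loss = -dist.log_prob(expert_actions).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```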
Step 2: Guided self-play
We use the KL-divergence between τ and π as a regularization term for the standard Proximal Policy Optimization (PPO) objective. The hyperparameter λ balances the two objectives.
We train agents in self-play using the objective below. Agents are trained in multi-agent settings with up to 50 agents per scenario.
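As a schematic sketch of that objective (notation simplified; see the paper for the exact formulation), the maximized objective combines the standard clipped PPO term with a KL penalty toward the human reference policy:

$$
J_{\text{HR-PPO}}(\theta) \;=\; J_{\text{PPO}}(\theta) \;-\; \lambda \, \mathbb{E}_{s}\left[ D_{\mathrm{KL}}\left( \pi_\theta(\cdot \mid s) \,\|\, \tau(\cdot \mid s) \right) \right]
$$

where $\tau$ is the human reference policy from Step 1 and $\lambda \ge 0$ controls the strength of the human regularization; setting $\lambda = 0$ recovers plain PPO self-play.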
Partially observable multi-agent navigation tasks
We train and evaluate HR-PPO and baseline agents in a challenging multi-agent benchmark: Partially observable navigation in Nocturne.
👈 Here is an example scenario.
The goal of every vehicle is to reach its assigned target position 🎯 (colored circles on the left) without colliding or going off the road. Agents receive a sparse reward: +1 if they reach their goal position before the end of the 80-step episode, and 0 otherwise. The reward function is intentionally simplified.
A challenging aspect of this task is that vehicles have partial visibility: objects or other vehicles may lie outside their field of view. In the example, the view of the controlled blue agent is shown on the right.
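As a hedged illustration of the sparse reward described above (the goal-radius tolerance and function signature are assumptions, not taken from the paper):

```python
import numpy as np

GOAL_RADIUS = 2.0  # placeholder tolerance in meters; the exact threshold is an assumption

def sparse_goal_reward(agent_pos, goal_pos, collided, off_road):
    """Sparse reward sketch: +1 when the agent reaches its goal within the
    80-step episode, 0 otherwise. Collisions and off-road events yield no
    reward here; how they terminate the episode is left to the environment."""
    if collided or off_road:
        return 0.0
    dist = np.linalg.norm(np.asarray(agent_pos) - np.asarray(goal_pos))
    return 1.0 if dist < GOAL_RADIUS else 0.0
```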
Results
1. Can agents drive in a human-like way? Is there a trade-off between performance and realism?
We find that effectiveness (being able to navigate to a goal without colliding) and realism (driving in a human-like way) can be achieved simultaneously.
HR-PPO agents achieve similar performance to PPO while experiencing substantial improvements in human likeness across four different realism metrics.
(Details in paper Section 3.4)
2. Are HR-PPO agents more compatible with human driving?
We examine whether agents are compatible with the human driving logs as a proxy for the ability to coordinate with human drivers.
HR-PPO agents perform better when paired with human driving logs, outperforming BC and PPO on both the train and test datasets. The advantage is especially pronounced in interactive scenarios, where HR-PPO outperforms PPO by 20-40%. The figure on the left shows that the PPO collision rate increases with the level of interactivity in a scenario, whereas the collision rate of HR-PPO agents remains low.
(Details in paper Section 3.5)
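For concreteness, the sketch below shows how such a log-replay evaluation could be wired up; env.reset, env.step, policy.act, and scenario.ego_id are hypothetical interfaces standing in for the actual simulator API.

```python
def evaluate_with_human_logs(env, policy, scenarios, episode_len=80):
    """Roll out the policy for one controlled vehicle per scenario while all
    other vehicles replay their recorded human trajectories, then report the
    fraction of episodes ending in goal achievement, collision, or off-road."""
    totals = {"goal_achieved": 0, "collided": 0, "off_road": 0}
    for scenario in scenarios:
        obs = env.reset(scenario, control_ids=[scenario.ego_id])
        info = {}
        for _ in range(episode_len):
            action = policy.act(obs)  # the policy controls only the ego vehicle
            obs, _, done, info = env.step({scenario.ego_id: action})
            if done:
                break
        for key in totals:
            totals[key] += int(info.get(key, False))
    return {key: count / len(scenarios) for key, count in totals.items()}
```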
3. How can we explain Human-Regularized PPO agents' low collision rates in interactive scenarios?
We conduct a qualitative analysis 🕵️ to find out why HR-PPO agents perform better with human driving logs. After analyzing the driving behavior of PPO and HR-PPO agents in 50 randomly sampled scenarios, we conclude that the lower collision rates can be attributed to two main factors:
HR-PPO agents' driving styles resemble those of human drivers, making it easier to anticipate the behavior of the log-replayed vehicles.
HR-PPO agents maintain more distance from other vehicles, thereby reducing ⬇️ the risk of collisions.
To illustrate the difference in driving styles between PPO and HR-PPO agents, we include a subset of videos below 👇
Example driving behaviors
Single-agent control 🚗💨 in a closed-loop environment with human driver logs
The policy-controlled vehicle is highlighted in red. The grey vehicles are stepped using the static human driving logs.
Failure to wait for another vehicle.
Successful coordination.
Zigzagging on the highway.
The vehicle stays in its lane.
Failure to go around a vehicle.
Keeping a safe distance.
Failure to coordinate at an intersection.
The vehicle stops and waits for the other to pass.
Effective but problematic goal-reaching behavior.
(we control the green vehicle)
Realistic and effective goal-reaching behavior.
(we control the same vehicle as on the left, but here it is colored blue)
Multi-agent control (self-play) 🚗💨 🚕💨 🚙💨
All vehicles in the scene are policy-controlled.
Roundabout
Roundabout
Highway
Highway
All vehicles reach their destinations, but the yellow car cuts off the pink one.
All vehicles arrive at their destinations in a more human-like way.
HR-PPO failure cases
We analyzed 100 scenarios each from the train and test datasets in log-replay mode and identified 3 types of failure cases of the HR-PPO agents. We show 3-6 examples for each category, where the red vehicle is policy-controlled.
1 - Sharp turns
Off-road events due to kinematically challenging turns or target positions. These make up approximately 25% of failures.
Sampled from train dataset
Sampled from train dataset
Sampled from train dataset
Sampled from test dataset
Sampled from test dataset
Sampled from test dataset
2 - Coordination
Collisions due to failure to anticipate human driving log behavior. These make up approximately 35% of failures.
Sampled from train dataset
Sampled from train dataset
Sampled from test dataset
Sampled from train dataset
Sampled from train dataset
3 - Setting-related bugs/failures
Unreachable target positions or errors related to the fixed driving logs. These make up approximately 35% of failures.
Sampled from train dataset
Sampled from train dataset
Sampled from test dataset
Sampled from train dataset
Sampled from test dataset
Cite
@article{cornelisse2024human,
  title={Human-compatible driving partners through data-regularized self-play reinforcement learning},
  author={Cornelisse, Daphne and Vinitsky, Eugene},
  journal={Reinforcement Learning Journal},
  volume={1},
  number={1},
  year={2024}
}