Human-compatible driving partners through data-regularized self-play reinforcement learning

Daphne Cornelisse, Eugene Vinitsky

NYU

Abstract

A central challenge for autonomous vehicles is coordinating with humans. Incorporating realistic human agents is therefore essential for scalable training and evaluation of autonomous driving systems in simulation. Simulation agents are typically developed by imitating large-scale, high-quality datasets of human driving. However, pure imitation learning agents empirically have high collision rates when rolled out in a multi-agent closed-loop setting.

To build agents that are realistic and effective in closed-loop settings, we propose Human-Regularized PPO (HR-PPO), a multi-agent approach where agents are trained through self-play with a small penalty for deviating from a human reference policy. In contrast to prior work, our approach is RL-first and only uses 30 minutes of imperfect human demonstrations. 

We evaluate agents in a large set of multi-agent traffic scenes. Results show our HR-PPO agents are highly effective in achieving goals, with a success rate of 93%, an off-road rate of 3.5%, and a collision rate of 3%. At the same time, the agents drive in a human-like manner, as measured by their similarity to existing human driving logs. We also find that HR-PPO agents show considerable improvements on proxy measures for coordination with human driving, particularly in highly interactive scenarios.

ArXiv / Code / Tweet

PPO

Human-Regularized PPO (ours)

Method

Partially observable multi-agent navigation tasks 

We train and evaluate HR-PPO and baseline agents in a challenging multi-agent benchmark: partially observable navigation in Nocturne.

👈 Here is an example scenario. 

The goal of every vehicle is to reach its assigned target position 🎯 (colored circles on the left) without colliding or going off the road. Rewards are sparse: an agent receives a reward of +1 if it reaches its goal before the end of the 80-step episode, and 0 otherwise. The reward function is intentionally kept simple.
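
As a minimal illustration, the sparse reward described above can be sketched as follows. The helper name and the goal-radius tolerance are assumptions for illustration, not the actual Nocturne reward code.

EPISODE_LENGTH = 80   # steps per episode
GOAL_TOLERANCE = 1.0  # assumed distance (meters) at which the goal counts as reached

def sparse_goal_reward(agent_xy, goal_xy, step):
    # +1 the step the agent reaches its goal within the 80-step episode, 0 otherwise.
    dx, dy = agent_xy[0] - goal_xy[0], agent_xy[1] - goal_xy[1]
    reached_goal = (dx * dx + dy * dy) ** 0.5 < GOAL_TOLERANCE
    return 1.0 if reached_goal and step < EPISODE_LENGTH else 0.0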

A challenging aspect of this task is that vehicles have partial visibility; there can be objects or vehicles outside their field of view. In the example, the field of view of the controlled blue agent is shown on the right.
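
To make the training objective concrete, here is a minimal sketch of the human-regularized policy loss described in the abstract: the standard PPO surrogate loss plus a small penalty for deviating from a frozen, imitation-learned human reference policy. The function names, the discrete action space, and the KL form of the penalty are illustrative assumptions, not the exact implementation from the paper.

import torch

def hr_ppo_policy_loss(ppo_surrogate_loss, policy_logits, human_logits, reg_weight=0.01):
    # PPO surrogate loss plus a small penalty for deviating from a frozen,
    # imitation-learned human reference policy (penalty form and weight are illustrative).
    policy_dist = torch.distributions.Categorical(logits=policy_logits)
    human_dist = torch.distributions.Categorical(logits=human_logits)
    deviation_penalty = torch.distributions.kl_divergence(policy_dist, human_dist).mean()
    return ppo_surrogate_loss + reg_weight * deviation_penalty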

Results

1. Can agents drive in a human-like way? Is there a trade-off between performance and realism?

We find that effectiveness (being able to navigate to a goal without colliding) and realism (driving in a human-like way) can be achieved simultaneously.


HR-PPO agents achieve performance comparable to PPO while showing substantial improvements in human likeness across four different realism metrics.



(Details in paper Section 3.4)

2. Are HR-PPO agents more compatible with human driving?

We examine whether agents are compatible with replayed human driving logs as a proxy for their ability to coordinate with human drivers.

HR-PPO agents exhibit better performance when paired with human driving logs, outperforming BC and PPO on both the train and test datasets. The effectiveness of HR-PPO agents is especially visible in interactive scenarios, where they outperform PPO by 20-40%. The figure on the left shows that the collision rate of PPO increases with the level of interactivity in a scenario, whereas the collision rate of HR-PPO agents remains low.


(Details in paper Section 3.5)
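
For intuition, below is a minimal sketch of this log-replay evaluation: one vehicle is controlled by the trained policy while all other vehicles follow the recorded human driving logs, and we track collisions and goal completions per scene. The environment interface and info keys are hypothetical, not the actual Nocturne API.

def evaluate_against_human_logs(env, policy, num_scenes):
    # One policy-controlled vehicle per scene; every other vehicle replays its human log.
    collisions, goals = 0, 0
    for _ in range(num_scenes):
        obs = env.reset()  # hypothetical: non-controlled vehicles follow recorded logs
        done, info = False, {}
        while not done:
            action = policy.act(obs)
            obs, _, done, info = env.step(action)
        collisions += int(info.get("collided", False))
        goals += int(info.get("goal_achieved", False))
    return {"collision_rate": collisions / num_scenes,
            "goal_rate": goals / num_scenes}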

3. How can we explain Human-Regularized PPO agents' low collision rates in interactive scenarios? 

We conduct a qualitative analysis 🕵️ to find out why HR-PPO agents perform better with human driving logs. After analyzing the driving behavior of PPO and HR-PPO agents in 50 randomly sampled scenarios, we conclude that the lower collision rates can be attributed to two main factors, both visible in the videos below: HR-PPO agents yield and wait for other vehicles in interactive situations, and they keep a safe distance while following stable, lane-aligned trajectories.



To illustrate the difference in driving styles between PPO and HR-PPO agents, we include a subset of videos below 👇

Example driving behaviors 

Single-agent control 🚗💨 in a closed-loop environment with human driver logs

The policy-controlled vehicle is highlighted in red. The grey vehicles are replayed from the recorded human driving logs.

PPO

Failure to wait for another vehicle.

Human-Regularized PPO 

Successful coordination.

PPO

Zigzagging on the highway.

Human-Regularized PPO 

The vehicle stays in its lane.

PPO

Failure to go around a vehicle.

Human-Regularized PPO

Keeping a safe distance.

PPO

Failure to coordinate at an intersection.

Human-Regularized PPO

The vehicle stops and waits for the other to pass.

PPO

Effective but problematic goal-reaching behavior. 

(We control the green vehicle.)

Human-Regularized PPO

Realistic and effective goal-reaching behavior.

(We control the same vehicle as on the left, but here it is colored blue.)

Multi-agent control (self-play)  🚗💨 🚕💨 🚙💨

All vehicles in the scene are policy-controlled. 
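
As a rough sketch, self-play control here means a single shared policy produces an action for every vehicle in the scene at each step. The dict-based interface below is a hypothetical illustration, not the actual Nocturne API.

def self_play_step(env, policy, observations):
    # A single shared policy selects an action for every vehicle in the scene.
    actions = {vehicle_id: policy.act(obs) for vehicle_id, obs in observations.items()}
    return env.step(actions)  # hypothetical dict-in / dict-out multi-agent interface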

PPO

Roundabout

Human-Regularized PPO

Roundabout

PPO

Highway

Human-Regularized PPO

Highway

PPO

All vehicles reach their destinations, but the yellow car cuts off the pink one.

Human-Regularized PPO

All vehicles arrive at their destinations in a more human-like way.

Cite

@misc{cornelisse2024humancompatible,
  title={Human-compatible driving partners through data-regularized self-play reinforcement learning},
  author={Daphne Cornelisse and Eugene Vinitsky},
  year={2024},
  eprint={2403.19648},
  archivePrefix={arXiv},
  primaryClass={cs.RO}
}