Simulation to scaled city: zero-shot policy transfer for traffic control via autonomous vehicles

Abstract | Using deep reinforcement learning, we train control policies for autonomous vehicles leading a platoon of vehicles onto a roundabout. Using Flow, a library for deep reinforcement learning in micro-simulators, we train two policies: one with noise injected into the state and action spaces, and one without any injected noise. In simulation, the autonomous vehicle learns an emergent metering behavior under both policies, in which it slows down to allow for smoother merging. We then directly transfer these policies, without any tuning, to the University of Delaware Scaled Smart City (UDSSC), a 1:25 scale testbed for connected and automated vehicles, and characterize the performance of both on the scaled city. We show that the noise-free policy winds up crashing and only occasionally metering. However, the noise-injected policy consistently performs the metering behavior and remains collision-free, suggesting that the noise helps with the zero-shot policy transfer. Additionally, the transferred noise-injected policy leads to a 5% reduction in average travel time and a 22% reduction in maximum travel time in the UDSSC.

Link to arXiv article here: https://arxiv.org/abs/1812.06120

Link to the proceedings will be added once they are made available.


Below is a gallery of images and videos that takes you through our policy transfer process, including media of the policies we transferred successfully and behind-the-scenes media of the ones that did not. Enjoy!

Scaling down with UDSSC

Videos of the policy deployed on UDSSC

Below you can view videos of the baseline on UDSSC with no RL control, side by side with videos of UDSSC with RL control.

Traffic Control In Simulation

Videos of the roundabout policy in simulation

Below you can view 3 videos of the roundabout in simulation.

*** All perturbations are drawn from a Gaussian distribution with a standard deviation of 0.1.

This is the baseline of the roundabout with all RL vehicles replaced by IDM vehicles.

The conflict at the critical section at the northwestern end of the roundabout shows the two inflows overlapping, which leads to jerky accelerations and decelerations.
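For readers curious about the baseline, IDM (the Intelligent Driver Model) is a standard car-following model. Here is a minimal Python sketch of the IDM acceleration law; the parameter values are typical textbook defaults and are not necessarily the exact values used in our experiments.

```python
import numpy as np

def idm_acceleration(v, v_lead, headway,
                     v0=8.0, T=1.0, a_max=1.0, b=1.5, delta=4, s0=2.0):
    """Intelligent Driver Model acceleration.

    v        -- ego vehicle speed (m/s)
    v_lead   -- speed of the leading vehicle (m/s)
    headway  -- bumper-to-bumper gap to the leader (m)

    The parameter values above are illustrative defaults, not necessarily
    the ones used in the UDSSC/Flow experiments.
    """
    # Desired dynamic gap: minimum gap + time-headway term + braking term.
    s_star = s0 + max(0.0, v * T + v * (v - v_lead) / (2 * np.sqrt(a_max * b)))
    # IDM law: free-road acceleration minus the interaction (gap) term.
    return a_max * (1 - (v / v0) ** delta - (s_star / headway) ** 2)
```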

This is the policy we used for policy transfer onto UDSSC.

RL actions, denoting accelerations, are chosen from the continuous interval [-1, 1] and perturbed with Gaussian noise using the standard deviation above.

All entries in the state space are normalized to lie in [0, 1] and are then perturbed with Gaussian noise using the standard deviation above.
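To make the noise injection concrete, here is a minimal sketch of what perturbing the state and action might look like; the function names and the policy interface are hypothetical, but the [0, 1] state normalization, the [-1, 1] acceleration range, and the 0.1 standard deviation match the description above.

```python
import numpy as np

NOISE_STD = 0.1  # standard deviation of the Gaussian perturbations

def noisy_observation(raw_state, state_low, state_high, rng):
    """Normalize each state entry to [0, 1], then add Gaussian noise."""
    normalized = (raw_state - state_low) / (state_high - state_low)
    return normalized + rng.normal(0.0, NOISE_STD, size=normalized.shape)

def noisy_action(policy_action, rng):
    """Perturb the commanded acceleration in [-1, 1] with Gaussian noise."""
    perturbed = policy_action + rng.normal(0.0, NOISE_STD, size=np.shape(policy_action))
    # Clipping back into the valid range is an assumption of this sketch.
    return np.clip(perturbed, -1.0, 1.0)

# Hypothetical usage with a trained policy:
# rng = np.random.default_rng(0)
# obs = noisy_observation(raw_state, state_low, state_high, rng)
# accel = noisy_action(policy.compute_action(obs), rng)
```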

This is an earlier version of the policy which we attempted to transfer onto UDSSC, but failed.

While this policy performs very similarly to the noisy policy above in simulation, its precise requirements on the state space and its expectations about the RL accelerations prevented a successful transfer.

On the Existence of Noise

Convergence of the noisy vs. noiseless experiments

As shown by the blue line, the noise-injected experiment starts training at a significantly lower reward, takes longer to train, and converges to a slightly less effective policy. Is this worth it? The difference shows up when the policy is deployed. See below for a comparison of a deployment of the noisy policy versus a deployment of the noiseless one.
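For anyone who wants to recreate this comparison from their own training logs, here is a rough plotting sketch. It assumes the policies were trained with RLlib (one of the RL libraries Flow supports) and that each run left behind a progress.csv with an episode_reward_mean column; the file paths are placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder paths to the RLlib training logs for the two experiments.
runs = {
    "noise-injected": "results/roundabout_noisy/progress.csv",
    "noiseless": "results/roundabout_noiseless/progress.csv",
}

for label, path in runs.items():
    progress = pd.read_csv(path)
    # episode_reward_mean is RLlib's per-iteration average return.
    plt.plot(progress["training_iteration"],
             progress["episode_reward_mean"],
             label=label)

plt.xlabel("Training iteration")
plt.ylabel("Mean episode reward")
plt.legend()
plt.show()
```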

Noisy vs. noiseless deployments

What we lose in reward we make up for in successful transfer sans 6-car pileups.

One of the earlier policies we attempted to transfer to UDSSC, identical to the successfully transferred policy except that it was trained without the added state-space and action noise, ran into a number of issues that prevented a smooth transfer.

Over multiple tests, the transfer of the noiseless policy performed much worse. We had to redo the tests several times due to collisions between the RL vehicle from the western inflow and vehicles from the northern inflow. In another test, the RL vehicle nearly exhibited the high-level ramp-metering behavior, but let only 2/3 of the vehicles from the northern inflow pass.

The two images below portray what happens when policy transfer is done sans noise:

Compare this to the transfer of the noisy policy, which was consistent in its performance and led to no collisions. This suggests that including noise in training produces a policy that is more robust to the problem of zero-shot policy transfer.

This work is a product of the Mobile Sensing Lab and its members: Kathy Jang, Eugene Vinitsky, and Prof. Alexandre Bayen; and the University of Delaware Information and Decision Science Laboratory: Behdad Chalaki, Benjamin Remer, Logan Beaver, and Prof. Andreas Malikopoulos.

For more info on Flow, check out https://flow-project.github.io.

For the individual lab websites, see: