Model-Predictive Policy Learning with Uncertainty Regularization for Driving in Dense Traffic


Mikael Henaff*¹², Alfredo Canziani*¹ and Yann LeCun¹

(*equal contribution)

¹Courant Institute, New York University

²Microsoft Research, NYC


[Download Paper] [Run Code]




Abstract

Learning a policy using only observational data is challenging because the distribution of states it induces at execution time may differ from the distribution observed during training. We propose to train a policy by unrolling a learned model of the environment dynamics over multiple time steps while explicitly penalizing two costs: the original cost the policy seeks to optimize, and an uncertainty cost which represents its divergence from the states it is trained on. We measure this second cost by using the uncertainty of the dynamics model about its own predictions, using recent ideas from uncertainty estimation for deep networks. We evaluate our approach using a large-scale observational dataset of driving behavior recorded from traffic cameras, and show that we are able to learn effective driving policies from purely observational data, with no environment interaction.

Model-Predictive Policy Learning with Uncertainty Regularization

Our approach consists of first learning an action-conditional stochastic forward model from a large observational dataset, and then using it to train a parameterized policy network. To do this, the forward model is unrolled for several time steps and the policy network is trained to minimize a loss over this trajectory.
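As a rough illustration, the training procedure described above can be sketched as follows. The names (forward_model, policy, task_cost) and the unroll length are assumptions made for exposition, not the exact interfaces used in our code.

```python
# Minimal sketch of policy training by unrolling a learned forward model.
# All names (forward_model, policy, task_cost) are illustrative assumptions.
import torch

def train_policy_step(forward_model, policy, task_cost, optimizer, init_state,
                      unroll_steps=20):
    """Unroll the frozen forward model for several steps and backpropagate
    the accumulated trajectory cost into the policy network only."""
    state = init_state
    total_cost = torch.zeros(())
    for _ in range(unroll_steps):
        action = policy(state)                       # policy proposes an action
        state = forward_model(state, action)         # model predicts the next state
        total_cost = total_cost + task_cost(state)   # e.g. proximity and lane costs
    optimizer.zero_grad()
    total_cost.backward()    # gradients flow back through the unrolled model
    optimizer.step()         # only the policy's parameters are in the optimizer
    return total_cost.item()
```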

Learning from purely observational data is challenging, however, because the data may only cover a small region of the state space. The learned dynamics model may then make arbitrary predictions outside the domain it was trained on, and these predictions may wrongly be associated with low cost (or high reward). The policy network may then exploit these errors in the dynamics model and produce actions which lead to wrongly optimistic states.

To address this, we propose an additional cost which measures the uncertainty of the dynamics model about its own predictions. This can be calculated by passing the same input and action through several different dropout masks, and computing the variance across the different outputs. A key point is that all the operations used for calculating this uncertainty measure are differentiable, which means we can pass gradients back to the policy network. This encourages the policy network to only produce actions for which the forward model is confident.
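A minimal sketch of this uncertainty cost is given below, assuming a PyTorch forward model that contains dropout layers; the function and argument names are illustrative assumptions.

```python
# Hedged sketch of the dropout-based uncertainty cost described above:
# run the same (state, action) pair under several dropout masks and
# penalize the variance of the predictions. Every operation is
# differentiable, so gradients can flow back to the policy that
# produced the action.
import torch

def uncertainty_cost(forward_model, state, action, n_masks=10):
    forward_model.train()   # keep dropout active so each pass uses a new mask
    preds = torch.stack([forward_model(state, action) for _ in range(n_masks)])
    return preds.var(dim=0).mean()   # variance across masks, averaged over dims
```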

Dataset

We apply our approach to learn driving policies using a large-scale dataset of driving videos taken from traffic cameras. After recording, a viewpoint transformation is applied to rectify the perspective, and vehicles are identified and tracked throughout the video. We then extract RGB images representing the neighborhood around each car at each time step as well as action information representing acceleration and steering. These state-action pairs can then be used to train forward models.
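For concreteness, one extracted state-action pair can be thought of as the following record; the field names, shapes, and units are our own illustration rather than the exact format of the dataset.

```python
# Illustrative layout of one extracted state-action pair; field names,
# shapes, and units are assumptions made for exposition only.
from dataclasses import dataclass
import torch

@dataclass
class DrivingSample:
    image: torch.Tensor     # RGB crop of the neighborhood around the ego car, (3, H, W)
    position: torch.Tensor  # ego car position in the rectified view, (2,)
    velocity: torch.Tensor  # ego car velocity, (2,)
    action: torch.Tensor    # (acceleration, steering), (2,)
```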

Prediction Results

Here we show predictions using the deterministic and stochastic forward models trained on the above dataset. All predictions are given the same sequence of initial states as inputs, and then generate future states for a fixed number of time steps. Each state is an RGB image and a vector representing position and velocity (marked as x and dx). We show predictions 40 and 200 time steps into the future.
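The prediction videos are produced with a simple open-loop rollout along the lines of the sketch below; helper names such as sample_prior, and the exact model interface, are assumptions for illustration. For the stochastic model a latent variable is drawn at every step.

```python
# Sketch of how the prediction videos are generated: condition on the same
# initial states, then roll the forward model out for a fixed horizon
# (e.g. 40 or 200 steps). `sample_prior` is an assumed helper method.
import torch

def rollout(forward_model, init_state, actions, horizon, stochastic=True):
    state = init_state
    predictions = []
    for t in range(horizon):
        z = forward_model.sample_prior() if stochastic else None  # latent per step
        state = forward_model(state, actions[t], z)                # predicted next state
        predictions.append(state)
    return predictions
```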

Learned Policies

Below we show learned policies evaluated on 10 random trajectories from the test set. The white dot represents the distribution over actions output by the policy network at each time step (its location is the mean, its size the variance). If the dot is in front of the ego car, the policy is accelerating; if it is behind, the policy is braking; if it is to the left or right, the policy is turning.

Action Sensitivity

Despite yielding more visually realistic predictions, the standard stochastic forward model did not give significant gains in policy performance over a deterministic model, due to its decreased sensitivity to the input actions. Having a forward model which responds well to actions is necessary for training a policy network. We found that a key modification was to change the posterior distribution from a single Gaussian (whose parameters are output by a learned network) to a mixture of two Gaussians, where one component is fixed to the prior. This forces the prediction model to extract as much information as possible from the input states and actions, by making the latent variable independent of the output a fraction of the time. This can be seen as applying a form of global dropout to the latent variables, and we refer to this approach as z-dropout.
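A minimal sketch of the z-dropout sampling step is shown below, assuming a standard-normal prior and a mixing probability of 0.5; both choices, and the variable names, are illustrative.

```python
# Sketch of z-dropout: the posterior is a mixture of the learned Gaussian
# and a component fixed to the prior, so with probability p_drop the latent
# carries no information about the target frame. Names are assumptions.
import torch

def sample_z_dropout(mu_post, logvar_post, p_drop=0.5):
    z_post = mu_post + torch.randn_like(mu_post) * (0.5 * logvar_post).exp()
    z_prior = torch.randn_like(mu_post)            # component fixed to the prior
    use_prior = (torch.rand(mu_post.size(0), 1) < p_drop).float()
    return use_prior * z_prior + (1.0 - use_prior) * z_post
```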


Below we visually compare predictions by the forward model, using different combinations of actions and latent variables.

Sampled Z means that the latent variables are sampled from the prior, independently at each time step.

Inferred Z means that the latent variables are inferred from the ground truth state trajectory, using the posterior network.

True A means that we input the true actions which occur in the ground truth state trajectory into the forward model.

Sampled A means that we input a different action sequence observed in the training set.

Stochastic Forward Model with diagonal Gaussian posterior

Observe the two rightmost predictions in each video. "Inferred Z - Sampled A" and "Sampled Z - Sampled A" both take the same action sequence as input, but the first uses latent variables which are inferred from the ground truth sequence using the posterior network. Even though they are fed the same sequence of actions, the ego car appears to execute different action sequences (namely, it turns differently). In fact, the turning behavior in "Inferred Z - Sampled A" appears to match that in the ground truth from which the inferred latent variables were extracted. This suggests that the latent variables encode action information, which makes the forward model less sensitive to the input actions. This can hurt the performance of the policy network, as it is trained using a forward model which does not accurately reflect the effects of the actions it produces.

Note also that in the second prediction from the left, the car does not turn as sharply as in the ground truth.

Stochastic Forward Model with z-dropout

Here we see that in the second prediction from the left, the car's turning better matches the ground truth. Our modified posterior helps maintain the sensitivity of the forward model to the actions by discouraging the model from encoding action information in the latent variables: factors of variation in the output which are due to the actions cannot be encoded in the latent variables on the steps when these are sampled from the prior, so the model can lower its loss further by predicting those factors from the actions instead.