Biasing forecasts in a real-world setting

For this experiment, we used an architecture that resembles state-of-the-art forecasting architectures deployed in challenging, real-world scenarios. It is designed to model the interactions of the agent to be predicted with the surrounding agents and map elements. It also takes the form of a CVAE and uses two MLP encoders and an MLP decoder similar to those described above, but with additional context inputs and larger hidden dimensions. We chose a latent space dimension of 16 because it gave satisfactory results in terms of final displacement error. The social and map interactions are accounted for with a modified multi-context gating block, denoted MMCG, inspired by MultiPath++. Each MMCG is composed of 3 context-gating (CG) blocks, and each CG block contains three MLP modules with a hidden dimension of 256 (twice the input dimension). Each MLP module has three layers with ReLU activations.
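As a concrete illustration, here is a minimal PyTorch sketch of one such context-gating block, assuming the element-wise gating formulation of MultiPath++. The names (ContextGating, agent_mlp, etc.) are chosen here for illustration; the input dimension of 128 follows from the hidden dimension of 256 being twice the input dimension.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_hidden, d_out, n_layers=3):
    """Three-layer MLP with ReLU activations, matching the description above."""
    layers, d = [], d_in
    for _ in range(n_layers - 1):
        layers += [nn.Linear(d, d_hidden), nn.ReLU()]
        d = d_hidden
    layers.append(nn.Linear(d, d_out))
    return nn.Sequential(*layers)

class ContextGating(nn.Module):
    """One CG block: three MLP modules, with an element-wise product as the gate."""
    def __init__(self, d_feat=128, d_hidden=256):
        super().__init__()
        self.agent_mlp = mlp(d_feat, d_hidden, d_hidden)
        self.context_mlp = mlp(d_feat, d_hidden, d_hidden)
        self.output_mlp = mlp(d_hidden, d_hidden, d_feat)

    def forward(self, agent_feat, context_feat):
        # Gate the agent features with the context features (social or map).
        return self.output_mlp(self.agent_mlp(agent_feat) * self.context_mlp(context_feat))
```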

Our modified context-gating block is represented in the figure below. We stack these modified CG blocks, maintaining a running average of their outputs, exactly as in MultiPath++.
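The stacking with a running average can be sketched as follows. This is a hedged reading of the MultiPath++ scheme, in which each block consumes the running mean of the previous outputs; the class name MMCG and the feeding of the running average into the next block are assumptions, not taken from the released code.

```python
class MMCG(nn.Module):
    """Stack of CG blocks combined through a running average of their outputs."""
    def __init__(self, d_feat=128, d_hidden=256, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(ContextGating(d_feat, d_hidden) for _ in range(n_blocks))

    def forward(self, agent_feat, context_feat):
        x, avg = agent_feat, None
        for k, block in enumerate(self.blocks, start=1):
            x = block(x, context_feat)
            # Incremental running average of the block outputs.
            avg = x if avg is None else avg + (x - avg) / k
            x = avg
        return avg
```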

Computational considerations:

The overall model in the figure below contains 15.8M parameters. The CVAE model was trained for 6 hours and 20 minutes on a single Nvidia Titan Xp GPU. Its parameters were then frozen, and the biased encoder was trained for 4 days and 10 hours on the same GPU. This second training is time-consuming because it involves estimating the risk with 64 and then 256 samples, which multiplies the tensor dimensions and requires a smaller batch size to fit in GPU memory. This is exactly the computational overhead that our proposed method reduces at inference.
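To make this overhead concrete, the sketch below shows how drawing many latent samples inflates every downstream tensor. The shapes and the sampling scheme are hypothetical; only the latent dimension of 16 and the sample count of 256 come from the text.

```python
import torch

batch, n_agents, d_latent, n_samples = 4, 8, 16, 256  # 256 samples as in the second training stage

# Posterior parameters produced by the encoder (hypothetical shapes).
mean = torch.zeros(batch, n_agents, d_latent)
std = torch.ones(batch, n_agents, d_latent)

# Monte-Carlo risk estimation draws n_samples latent codes per agent, adding a
# sample dimension: (batch, n_samples, n_agents, d_latent).
z = mean.unsqueeze(1) + std.unsqueeze(1) * torch.randn(batch, n_samples, n_agents, d_latent)

# Every tensor that flows through the decoder now carries this extra dimension,
# so the batch size must shrink by roughly the same factor to fit in GPU memory.
```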

Because we only use fully connected layers, the overall complexity of the model is O(b×s×a×(t×f)×h) + O(b×s×a×h²) + O(b×o×(ms×mf)×h), with batch size b, number of samples s, number of agents a, hidden feature dimension h, time sequence length t, input feature dimension f, number of map elements o, map element sequence length ms, and map input feature dimension mf. With our choice of hyperparameters, s×a×(t×f) > o×(ms×mf) and (t×f) > h, so the complexity is O(b×s×a×(t×f)×h). At inference time, the number of samples s can be kept small using our method, and the batch size is 1. The most expensive operation is the first matrix multiplication, but this operation is easily parallelized and often well optimized. The most limiting aspect might be the memory footprint: at test time with 20 samples, the allocated GPU memory reaches almost 2 GiB.
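A quick back-of-the-envelope computation illustrates why the first term dominates. Only h = 256, s = 20 at test time, and b = 1 are stated above; the remaining values are illustrative guesses chosen to satisfy the stated inequalities.

```python
b, s, a, h = 1, 20, 8, 256      # batch, samples, agents, hidden dim
t, f = 50, 10                   # time steps and input features (t*f = 500 > h)
o, ms, mf = 100, 20, 8          # map elements (o*ms*mf = 16000 < s*a*t*f = 80000)

term_input  = b * s * a * (t * f) * h   # first layer over agent trajectories
term_hidden = b * s * a * h * h         # subsequent hidden layers
term_map    = b * o * (ms * mf) * h     # first layer over map elements

# With these values term_input is the largest, matching O(b×s×a×(t×f)×h).
print(term_input, term_hidden, term_map)  # 20480000 10485760 4096000
```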

The videos below represent a fixed set of 6 predicted samples while the risk-level input is gradually increased. This shows the effect of the risk level on the pessimism of the forecast, as well as some of the failure cases discussed in the paper: assuming the wrong future trajectory for the ego vehicle or producing somewhat unrealistic paths.

output-299.mp4
output-210.mp4
output-106.mp4
output-109.mp4
output-116.mp4
output-107.mp4
output-113.mp4
output-111.mp4
output-117.mp4
output-118.mp4
output-108.mp4
output-212.mp4
output-110.mp4
output-120.mp4
output-201.mp4
output-200.mp4
output-206.mp4
output-7.mp4
output-22.mp4
output-16.mp4
output-21.mp4
output-11.mp4
output-102.mp4
output-12.mp4