This work introduces a novel and adaptable architecture designed for real-time occupancy forecasting that outperforms existing state-of-the-art models on the Waymo Open Motion Dataset in Soft IoU. The proposed model uses recursive latent state estimation with learned transformer-based functions to effectively update and evolve the state. This enables highly efficient real-time inference on embedded systems, as profiled on an Nvidia Xavier AGX. Our model, MotionPerceiver, achieves this by encoding a scene into a latent state that evolves in time through self-attention mechanisms. Additionally, it incorporates relevant scene observations, such as traffic signals, road topology and agent detections, through cross-attention mechanisms. This forms an efficient data-streaming architecture that contrasts with the expensive, fixed-sequence inputs common in existing models. The architecture also offers the distinct advantage of generating occupancy predictions through localized querying at positions of interest, as opposed to generating fixed-size occupancy images that include potentially irrelevant regions.
@ARTICLE{10417132,
author={Ferenczi, Bryce and Burke, Michael and Drummond, Tom},
journal={IEEE Robotics and Automation Letters},
title={MotionPerceiver: Real-Time Occupancy Forecasting for Embedded Systems},
year={2024},
volume={9},
number={3},
pages={2822-2829},
keywords={Forecasting;Predictive models;Trajectory;Real-time systems;Encoding;Tracking;Prediction algorithms;Computer vision for transportation;deep learning for visual perception;representation learning},
doi={10.1109/LRA.2024.3360811}}
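To make the abstract concrete, below is a minimal PyTorch sketch of the three attention roles it describes: cross-attention to fuse scene observations into a learned latent state, self-attention to evolve that state through time, and cross-attention decoding to read out occupancy only at queried positions. This is an illustrative sketch, not the released implementation; the class name, dimensions, and layer counts are assumptions.

```python
import torch
import torch.nn as nn


class MotionPerceiverSketch(nn.Module):
    """Toy stand-in for the encode / propagate / query pattern (not the released model)."""

    def __init__(self, n_latents: int = 128, dim: int = 256, n_heads: int = 8):
        super().__init__()
        # Learned initial latent state that summarises the scene.
        self.latent_init = nn.Parameter(torch.randn(n_latents, dim))
        # Cross-attention: fuse scene observations (agents, signals, map) into the state.
        self.observe = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Self-attention: evolve ("time propagate") the latent state by one step.
        self.propagate = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        # Cross-attention decoder: occupancy is read out only at queried positions.
        self.decode = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.occ_head = nn.Linear(dim, 1)

    def init_state(self, batch_size: int) -> torch.Tensor:
        """Broadcast the learned initial latent state across a batch."""
        return self.latent_init.unsqueeze(0).expand(batch_size, -1, -1)

    def update(self, latent: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        """Cross-attend from the latent state to observation/context tokens."""
        attended, _ = self.observe(latent, tokens, tokens)
        return latent + attended

    def step(self, latent: torch.Tensor) -> torch.Tensor:
        """Advance the latent state one forecast step into the future."""
        return self.propagate(latent)

    def query(self, latent: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
        """Occupancy probability at encoded positions of interest (queries: B x Q x dim)."""
        attended, _ = self.decode(queries, latent, latent)
        return torch.sigmoid(self.occ_head(attended)).squeeze(-1)
```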
Below is a diagram representing how this model streams observations from the environment and predicts into the future. The rows of the diagram are the inference sequence at a real timestep, whereas the columns are forecasting timesteps. At T=0 (the first row), the latent state is initialized with the very first observation of the scene; this is then propagated (horizontally) to the T=1 prediction column. Contextual information that doesn't change over time, such as road topology, can be transferred into the latent state at every forecasted timestep (Static Context). After this contextual update, the latent state can be queried for occupancy prediction. This Propagate->Context->Predict chain can be performed recursively (shown as T=1->3...). When we are at real timestep T=1 (row T=1), we can update the previously estimated latent state (row T=0, column T=1) with a new scene observation, shown in the latent state transition from TimePropagate (0,1) to SceneObservation (1,1). We can then re-run the forecast chain, which will be more accurate than the previous one since it incorporates the newer observation from T=1. This process can be repeated indefinitely, shown in the continuation from TimePropagate (1,2) to SceneObservation (2,2).
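The streaming pattern above can be sketched as a driver loop over the hypothetical MotionPerceiverSketch class from the previous snippet. The ordering of the Propagate->Context->Predict chain and the hand-off of the propagated state between real timesteps follow the diagram description, but sensor_stream, the horizon, and all token shapes are illustrative placeholders.

```python
import torch


def sensor_stream(n_frames=3, batch=1, dim=256):
    """Placeholder for real sensor input: yields (observation, static-context, query) tokens."""
    for _ in range(n_frames):
        yield (torch.randn(batch, 16, dim),   # agent observation tokens
               torch.randn(batch, 32, dim),   # static context tokens (road topology)
               torch.randn(batch, 64, dim))   # encoded positions of interest


model = MotionPerceiverSketch()
model.eval()
horizon = 3     # forecast steps per real timestep
latent = None   # recursive latent state estimate

with torch.no_grad():
    for obs_tokens, static_tokens, query_tokens in sensor_stream():
        if latent is None:
            # T=0: initialise the latent state from the very first scene observation.
            latent = model.update(model.init_state(obs_tokens.shape[0]), obs_tokens)
        else:
            # Later real timesteps: fuse the newest observation into the propagated state,
            # which makes every subsequent forecast more accurate.
            latent = model.update(latent, obs_tokens)

        rollout, forecasts = latent, []
        for k in range(horizon):
            rollout = model.step(rollout)                         # Time Propagate
            if k == 0:
                latent = rollout                                  # state carried to the next real timestep
            rollout = model.update(rollout, static_tokens)        # Static Context update
            forecasts.append(model.query(rollout, query_tokens))  # occupancy at queried positions
```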
A two-phase inference pattern uses two "Time Propagate" modules (TP 2 and TP 1 below) to evolve the latent state at two different rates. This is ideal in situations where the sensor sample time differs from the desired prediction periodicity. A simplified inference example is shown below. TP 2 predicts at a period of two time steps, forecasting further into the future with each iteration, whereas TP 1 predicts one time step into the future, matching the period of the sensor sample time. The diagram below follows the previous convention (rows are "real time" steps and columns are "future" steps) but omits "Static Context" for brevity.
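A rough sketch of how two propagation modules with different step sizes might be scheduled is shown below; the module names, the scheduling, and the step counts are assumptions for illustration rather than the trained model's configuration.

```python
import torch
import torch.nn as nn

dim, n_heads = 256, 8
tp1 = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)  # advances one sensor period
tp2 = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)  # advances two sensor periods


def two_phase_rollout(latent: torch.Tensor, n_fine: int, n_coarse: int):
    """Advance the latent state n_fine single steps with TP 1, then n_coarse double steps with TP 2."""
    states = []
    for _ in range(n_fine):      # stay aligned with the sensor sample time
        latent = tp1(latent)
        states.append(latent)
    for _ in range(n_coarse):    # stride two steps per iteration to reach further into the future
        latent = tp2(latent)
        states.append(latent)
    return states


# e.g. one fine step to the next observation time, then coarse strides beyond it
future_states = two_phase_rollout(torch.randn(1, 128, dim), n_fine=1, n_coarse=4)
```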
Images are color coded (a small numpy sketch of this mapping follows the examples below):
- Green for True Positive (Pr >= 0.5)
- Red for False Negative (Pr < 0.5)
- Blue for False Positive (Pr > 0; you should notice it fade more than the others)
No Context - Only has vehicle state information as input (position, heading, velocity, size); observes the past at 2 Hz and predicts at 10 Hz
All Context - Road topology as well as traffic signals (in the past) are added as context; observes the past at 2 Hz and predicts at 10 Hz
Two Phase - All context included; observes the past at 10 Hz and predicts the future at 1 Hz
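For reference, the small numpy sketch below reproduces the colour mapping described above; the thresholds come from the legend, while rendering false-positive intensity as the predicted probability (so it fades) is an assumption about how the images were generated.

```python
import numpy as np


def colourise(pred: np.ndarray, gt: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """pred: HxW occupancy probabilities in [0, 1]; gt: HxW boolean ground-truth occupancy."""
    img = np.zeros((*pred.shape, 3), dtype=np.float32)
    tp = gt & (pred >= threshold)          # occupied and predicted occupied
    fn = gt & (pred < threshold)           # occupied but predicted free
    fp = (~gt) & (pred > 0.0)              # free but assigned some occupancy probability
    img[tp] = [0.0, 1.0, 0.0]              # green: true positive
    img[fn] = [1.0, 0.0, 0.0]              # red: false negative
    img[fp, 2] = pred[fp]                  # blue: false positive, fading with probability
    return img
```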
Example outputs of joint training with occupancy flow can be found here, in the JointFlow folder. Examples of the difference between our occupancy flow and the Waymo-generated occupancy flow are in FlowDiff. More examples from the occupancy-only models are available in the other folders at the same link.