This work introduces a novel and adaptable architecture designed for real-time occupancy forecasting that outperforms existing state-of-the-art models on the Waymo Open Motion Dataset in Soft IoU. The proposed model uses recursive latent state estimation with learned transformer-based functions to effectively update and evolve the state. This enables highly efficient real-time inference on embedded systems, as profiled on an Nvidia Xavier AGX. Our model, MotionPerceiver, achieves this by encoding a scene into a latent state that evolves in time through self-attention mechanisms. Additionally, it incorporates relevant scene observations, such as traffic signals, road topology and agent detections, through cross-attention mechanisms. This forms an efficient data-streaming architecture that contrasts with the expensive, fixed-sequence inputs common in existing models. The architecture also offers the distinct advantage of generating occupancy predictions through localized querying at positions of interest, as opposed to generating fixed-size occupancy images that include potentially irrelevant regions.
@ARTICLE{10417132,
author={Ferenczi, Bryce and Burke, Michael and Drummond, Tom},
journal={IEEE Robotics and Automation Letters},
title={MotionPerceiver: Real-Time Occupancy Forecasting for Embedded Systems},
year={2024},
volume={9},
number={3},
pages={2822-2829},
keywords={Forecasting;Predictive models;Trajectory;Real-time systems;Encoding;Tracking;Prediction algorithms;Computer vision for transportation;deep learning for visual perception;representation learning},
doi={10.1109/LRA.2024.3360811}}
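To make the abstract concrete, below is a minimal PyTorch sketch of the three attention roles it describes: cross-attention to fuse scene observations into a learned latent state, self-attention to evolve that state through time, and cross-attention decoding to read out occupancy only at queried positions. This is an illustrative sketch, not the released implementation; the class name, dimensions, and layer counts are assumptions.

```python
import torch
import torch.nn as nn


class MotionPerceiverSketch(nn.Module):
    """Toy stand-in for the encode / propagate / query pattern (not the released model)."""

    def __init__(self, n_latents: int = 128, dim: int = 256, n_heads: int = 8):
        super().__init__()
        # Learned initial latent state that summarises the scene.
        self.latent_init = nn.Parameter(torch.randn(n_latents, dim))
        # Cross-attention: fuse scene observations (agents, signals, map) into the state.
        self.observe = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Self-attention: evolve ("time propagate") the latent state by one step.
        self.propagate = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        # Cross-attention decoder: occupancy is read out only at queried positions.
        self.decode = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.occ_head = nn.Linear(dim, 1)

    def init_state(self, batch_size: int) -> torch.Tensor:
        """Broadcast the learned initial latent state across a batch."""
        return self.latent_init.unsqueeze(0).expand(batch_size, -1, -1)

    def update(self, latent: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        """Cross-attend from the latent state to observation/context tokens."""
        attended, _ = self.observe(latent, tokens, tokens)
        return latent + attended

    def step(self, latent: torch.Tensor) -> torch.Tensor:
        """Advance the latent state one forecast step into the future."""
        return self.propagate(latent)

    def query(self, latent: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
        """Occupancy probability at encoded positions of interest (queries: B x Q x dim)."""
        attended, _ = self.decode(queries, latent, latent)
        return torch.sigmoid(self.occ_head(attended)).squeeze(-1)
```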
Below is a diagram representing how this model streams observations from the environment and predicts into the future. The rows of the diagram are the inference sequence at a real timestep, whereas the columns are forecasting timesteps. At T=0 (the first row), the latent state is initialized with the very first observation of the scene; this is then propagated (horizontally) to the T=1 prediction column. Contextual information that doesn't change over time, such as road topology, can be transferred into the latent state at every forecasted timestep (Static Context). After this contextual update, the latent state can be queried for occupancy prediction. This Propagate->Context->Predict chain can be performed recursively (shown as T=1->3...). When we are at real timestep T=1 (row T=1), we can update the previously estimated latent state (row T=0, column T=1) with a new scene observation, shown in the latent state transition from TimePropagate (0,1) to SceneObservation (1,1). We can then re-run the forecast chain, which will be more accurate than the previous one since it incorporates the newer observation from T=1. This process can be repeated indefinitely, shown in the continuation from TimePropagate (1,2) to SceneObservation (2,2).
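The streaming pattern above can be sketched as a driver loop over the hypothetical MotionPerceiverSketch class from the previous snippet. The ordering of the Propagate->Context->Predict chain and the hand-off of the propagated state between real timesteps follow the diagram description, but sensor_stream, the horizon, and all token shapes are illustrative placeholders.

```python
import torch


def sensor_stream(n_frames=3, batch=1, dim=256):
    """Placeholder for real sensor input: yields (observation, static-context, query) tokens."""
    for _ in range(n_frames):
        yield (torch.randn(batch, 16, dim),   # agent observation tokens
               torch.randn(batch, 32, dim),   # static context tokens (road topology)
               torch.randn(batch, 64, dim))   # encoded positions of interest


model = MotionPerceiverSketch()
model.eval()
horizon = 3     # forecast steps per real timestep
latent = None   # recursive latent state estimate

with torch.no_grad():
    for obs_tokens, static_tokens, query_tokens in sensor_stream():
        if latent is None:
            # T=0: initialise the latent state from the very first scene observation.
            latent = model.update(model.init_state(obs_tokens.shape[0]), obs_tokens)
        else:
            # Later real timesteps: fuse the newest observation into the propagated state,
            # which makes every subsequent forecast more accurate.
            latent = model.update(latent, obs_tokens)

        rollout, forecasts = latent, []
        for k in range(horizon):
            rollout = model.step(rollout)                         # Time Propagate
            if k == 0:
                latent = rollout                                  # state carried to the next real timestep
            rollout = model.update(rollout, static_tokens)        # Static Context update
            forecasts.append(model.query(rollout, query_tokens))  # occupancy at queried positions
```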
A two-phase inference pattern uses two "Time Propagate" modules (TP 2 and TP 1 below) to evolve the latent state at two different rates. This is ideal in situations where the sensor sample time differs from the desired prediction periodicity. A simplified inference example is shown below. TP 2 predicts at a period of two time steps, forecasting further into the future with each iteration, whereas TP 1 predicts one time step into the future, matching the period of the sensor sample time. The diagram below follows the previous convention (rows are "real time" steps and columns are "future" steps) but omits "Static Context" for brevity.
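A rough sketch of how two propagation modules with different step sizes might be scheduled is shown below; the module names, the scheduling, and the step counts are assumptions for illustration rather than the trained model's configuration.

```python
import torch
import torch.nn as nn

dim, n_heads = 256, 8
tp1 = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)  # advances one sensor period
tp2 = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)  # advances two sensor periods


def two_phase_rollout(latent: torch.Tensor, n_fine: int, n_coarse: int):
    """Advance the latent state n_fine single steps with TP 1, then n_coarse double steps with TP 2."""
    states = []
    for _ in range(n_fine):      # stay aligned with the sensor sample time
        latent = tp1(latent)
        states.append(latent)
    for _ in range(n_coarse):    # stride two steps per iteration to reach further into the future
        latent = tp2(latent)
        states.append(latent)
    return states


# e.g. one fine step to the next observation time, then coarse strides beyond it
future_states = two_phase_rollout(torch.randn(1, 128, dim), n_fine=1, n_coarse=4)
```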
Images are color coded (a small numpy sketch of this mapping follows the examples below):
- Green for True Positive (Pr >= 0.5)
- Red for False Negative (Pr < 0.5)
- Blue for False Positive (Pr > 0; you should notice it fade more than the others)
No Context - Only has vehicle state information as input (position, heading, velocity, size); observes the past at 2 Hz and predicts at 10 Hz
All Context - Road topology as well as traffic signals (in the past) are added as context; observes the past at 2 Hz and predicts at 10 Hz
Two Phase - All context included; observes the past at 10 Hz and predicts the future at 1 Hz
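For reference, the small numpy sketch below reproduces the colour mapping described above; the thresholds come from the legend, while rendering false-positive intensity as the predicted probability (so it fades) is an assumption about how the images were generated.

```python
import numpy as np


def colourise(pred: np.ndarray, gt: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """pred: HxW occupancy probabilities in [0, 1]; gt: HxW boolean ground-truth occupancy."""
    img = np.zeros((*pred.shape, 3), dtype=np.float32)
    tp = gt & (pred >= threshold)          # occupied and predicted occupied
    fn = gt & (pred < threshold)           # occupied but predicted free
    fp = (~gt) & (pred > 0.0)              # free but assigned some occupancy probability
    img[tp] = [0.0, 1.0, 0.0]              # green: true positive
    img[fn] = [1.0, 0.0, 0.0]              # red: false negative
    img[fp, 2] = pred[fp]                  # blue: false positive, fading with probability
    return img
```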
Example outputs of joint training with occupancy flow can be found here, in the JointFlow folder. Examples of the difference between our occupancy flow and the Waymo-generated occupancy flow are in FlowDiff. More examples from the occupancy-only models are available in the other folders at the same link.