Hyper-STTN: Hypergraph Augmented Spatial-Temporal Transformer for Trajectory Prediction
Anonymous Authors
Anonymous University
Under review at ICRA
Abstract
Predicting crowd intents and trajectories is crucial in various real-world applications, including service robots and autonomous vehicles. Understanding environmental dynamics is challenging, not only due to the complexity of modeling pair-wise spatial and temporal interactions but also due to the diverse influence of group-wise interactions. To decode the comprehensive pair-wise and group-wise interactions in crowded scenarios, we introduce Hyper-STTN, a Hypergraph-based Spatial-Temporal Transformer Network for crowd trajectory prediction. In Hyper-STTN, crowd group-wise correlations are constructed using a set of multi-scale hypergraphs with varying group sizes, captured through random-walk-probability-based hypergraph spectral convolution. Additionally, a spatial-temporal transformer is adapted to capture pedestrians' pair-wise latent interactions in the spatial and temporal dimensions. These heterogeneous group-wise and pair-wise features are then fused and aligned through a multimodal transformer network. Hyper-STTN outperforms other state-of-the-art baselines and ablation models on five real-world pedestrian motion datasets.
Architecture of Hyper-STTN
Hyper-STTN neural network framework: (a) the Spatial Transformer leverages a multi-head attention layer and a graph convolution network along the time dimension to represent spatial attention features and spatial relational features; (b) the Temporal Transformer utilizes multi-head attention layers to capture each individual agent's long-term temporal attention dependencies; and (c) the Multi-Modal Transformer fuses heterogeneous spatial and temporal features via a multi-head cross-modal transformer block and a self-transformer block to abstract the uncertainty of multimodal crowd movements.
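To make the caption concrete, the sketch below shows one way the spatial and temporal attention passes can be organized in PyTorch: attention across agents within each frame versus attention across frames for each agent. The module name, tensor layout, and the omission of the graph-convolution branch are illustrative assumptions, not the released Hyper-STTN implementation.

```python
import torch
import torch.nn as nn

class SpatialTemporalAttentionSketch(nn.Module):
    """Apply multi-head attention across agents (spatial) and across
    frames (temporal) on the same embedded trajectory tensor."""
    def __init__(self, d_model=64, n_heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, time, agents, d_model) embedded trajectory features
        b, t, n, d = x.shape
        s = x.reshape(b * t, n, d)                       # tokens = agents within one frame
        s, _ = self.spatial_attn(s, s, s)
        spatial_feats = s.reshape(b, t, n, d)
        m = x.permute(0, 2, 1, 3).reshape(b * n, t, d)   # tokens = frames of one agent
        m, _ = self.temporal_attn(m, m, m)
        temporal_feats = m.reshape(b, n, t, d).permute(0, 2, 1, 3)
        return spatial_feats, temporal_feats
```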
Framework of HGNN and STTN blocks
Group-wise HHI Representation: i) We construct group-wise HHI with a set of multiscale hypergraphs, where each agent is queried in the feature space with varying ‘k’ in KNN to link multiscale hyperedges. ii) After constructing HHI hypergraphs, group-wise dependencies are captured by point-to-edge and edge-to-point phases with hypergraph spectral convolution operations.
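As a rough illustration of steps i) and ii), the PyTorch sketch below links k-nearest-neighbour hyperedges in feature space for several group sizes and performs one simplified point-to-edge / edge-to-point aggregation. The function names, group sizes, and plain degree normalization are assumptions made for readability; the actual Hyper-STTN layer uses random-walk-probability-based hypergraph spectral convolution.

```python
import torch

def build_multiscale_hypergraphs(feats, group_sizes=(2, 4, 8)):
    """Link multiscale hyperedges by querying each agent's k nearest
    neighbours in feature space (one incidence matrix per group size).
    feats: (N, D) per-agent embeddings; returns incidence matrices of
    shape (N, N) where entry [i, e] = 1 iff agent i lies in hyperedge e."""
    n = feats.shape[0]
    dists = torch.cdist(feats, feats)                       # pairwise feature distances
    incidences = []
    for k in group_sizes:
        k = min(k, n)                                       # guard against small crowds
        knn_idx = dists.topk(k, largest=False).indices      # (N, k) neighbour indices
        H = torch.zeros(n, n)
        H.scatter_(1, knn_idx, 1.0)                         # hyperedge e = agent e's k-NN group
        incidences.append(H.T)                              # rows: agents, cols: hyperedges
    return incidences

def hypergraph_conv(x, H, W):
    """One point-to-edge / edge-to-point message-passing step.
    x: (N, D) node features, H: (N, E) incidence, W: (D, D_out) weights."""
    De_inv = torch.diag(1.0 / H.sum(dim=0).clamp(min=1))    # inverse hyperedge degrees
    Dv_inv = torch.diag(1.0 / H.sum(dim=1).clamp(min=1))    # inverse node degrees
    edge_feats = De_inv @ H.T @ x                           # point-to-edge aggregation
    node_feats = Dv_inv @ H @ edge_feats                    # edge-to-point aggregation
    return torch.relu(node_feats @ W)
```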
Hybrid Spatial-Temporal Transformer Framework: Pedestrians' motion intents and dependencies are abstracted as spatial and temporal attention maps by the multi-head attention mechanism of the spatial-temporal transformer. Additionally, a multi-head cross-attention mechanism is employed to align the heterogeneous spatial-temporal features.
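A minimal sketch of the cross-modal alignment step, assuming temporal features act as queries and spatial features as keys/values in PyTorch's nn.MultiheadAttention; the module name, residual/LayerNorm layout, and dimensions are illustrative rather than the exact Hyper-STTN layers.

```python
import torch.nn as nn

class CrossModalFusionSketch(nn.Module):
    """Align temporal features with spatial features via multi-head
    cross-attention, then refine the fused tokens with self-attention."""
    def __init__(self, d_model=64, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, temporal_feats, spatial_feats, pad_mask=None):
        # temporal_feats, spatial_feats: (batch, tokens, d_model)
        # pad_mask: (batch, tokens), True marks padded (invalid) agents
        fused, _ = self.cross_attn(temporal_feats, spatial_feats, spatial_feats,
                                   key_padding_mask=pad_mask)
        fused = self.norm1(temporal_feats + fused)
        refined, _ = self.self_attn(fused, fused, fused, key_padding_mask=pad_mask)
        return self.norm2(fused + refined)
```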
Experiments and Results on ETH-UCY Datasets
Note: Stochastic conditions with ADE-20 and FDE-20 evaluation metrics.
Note: The ETH-UCY dataset consists of the ETH, Hotel, UNIV, Zara01, and Zara02 subsets.
Note: Deterministic conditions with ADE and FDE evaluation metrics.
Project Supplemental Materials
[Supplemental Materials]: All supplemental materials of Hyper-STTN.
Dataset Pre-Processing
The mask output of the pre-processing program is used to support the masked attention mechanism; a sketch of how it can be consumed follows the list below.
1. [The dataset Pre-Processing program]; The complete code of the pre-processing procedure.
2. [Original ETH-UCY Dataset]; The original dataset, used as the input to the pre-processing program.
3. [Validation Dataset]; The human-location output of the pre-processing program.
4. [Validation's Mask Dataset]; The mask-information output of the pre-processing program.
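As mentioned above, the exported mask supports the masked attention mechanism. The sketch below shows one plausible way to turn a 0/1 validity mask into the boolean and additive masks expected by standard attention layers; the layout of the pre-processed files assumed here is illustrative, not a description of the released data format.

```python
import torch

def to_attention_masks(valid_mask):
    """Convert a 0/1 validity mask (e.g. frames x agents) into attention masks.
    Returns a boolean key-padding mask (True = ignore) and an additive float
    mask (-inf on ignored slots) for scaled dot-product attention."""
    valid = torch.as_tensor(valid_mask, dtype=torch.bool)
    key_padding_mask = ~valid                                    # True marks padded slots
    additive_mask = torch.zeros(valid.shape)
    additive_mask.masked_fill_(key_padding_mask, float("-inf"))  # block attention to padding
    return key_padding_mask, additive_mask
```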
Code Illustration
1. [HyperSTTN: Main Framework]; The code illustration of the Hyper-STTN main function.
2. [HyperSTTN: Hypergraph Construction]; The code illustration of crowd hypergraph generation.
3. [HyperSTTN: Hypergraph Neural Network]; The code illustration of the HGNN (Hypergraph Neural Network) employed to capture group-wise human-human interaction features.
4. [HyperSTTN: Masked Spatial-Temporal Attention]; The code illustration of the masked spatial-temporal attention mechanism.
5. [HyperSTTN: Spatial-Temporal Transformer]; The code illustration of the masked spatial-temporal attention transformer leveraged to capture pair-wise human-human interaction features.
6. [HyperSTTN: Encoder]; The code illustration of the Hyper-STTN encoder.
7. [HyperSTTN: Decoder]; The code illustration of the Hyper-STTN decoder.
Validation Trajectory Visualization
1. [ETH Dataset]; The trajectory prediction result of Hyper-STTN on the ETH sub-dataset.
2. [HOTEL Dataset]; The trajectory prediction result of Hyper-STTN on the HOTEL sub-dataset.
3. [Univ Dataset]; The trajectory prediction result of Hyper-STTN on the UNIV sub-dataset.
Trajectories Visualization from ETH-UCY Dataset
More Trajectories Instances from ETH-UCY Datasets
More Trajectories Instances from NBA Datasets
More Trajectory Visualization Illustration on ETH
More Trajectory Visualization Illustration on HOTEL
More Trajectory Visualization Illustration on UNIV
More Trajectory Visualization Illustration on ZARA01
More Trajectory Visualization Illustration on ZARA02
Spatial-Temporal Attention Matrix Visualization
For more spatial-temporal attention visualizations, refer to the link: [Attention Matrix]
The rows and columns correspond to individual agents, and each element represents the pair-wise spatial attention correlation between two agents. (Note: diagonal elements outside the valid agent range hold the mask value.)
The columns correspond to agent indices, and the rows correspond to temporal steps. Each element represents the correlation between an individual agent and its historical trajectory data.
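For reference, heat maps such as those linked above can be reproduced with a few lines of matplotlib; the helper below is a generic sketch (function name and arguments are illustrative), in which masked entries are blanked out before plotting.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_attention(attn, mask=None, title="", xlabel="", ylabel=""):
    """Render an attention matrix as a heat map; masked cells are left blank."""
    attn = np.array(attn, dtype=float)
    if mask is not None:
        attn = np.where(mask, np.nan, attn)   # NaN cells render as blank
    plt.imshow(attn, cmap="viridis")
    plt.colorbar(label="attention weight")
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.tight_layout()
    plt.show()

# Spatial map:  plot_attention(spatial_attn, mask, "Spatial attention", "agent", "agent")
# Temporal map: plot_attention(temporal_attn, mask, "Temporal attention", "agent", "time step")
```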
Crowd Multiscale Hypergraph Construction Illustration
For more crowd multiscale hypergraph construction visualizations, refer to the link: [Hypergraph Construction]
Optimal Testing Results
The results of the optimal Hyper-STTN model with the CVAE setting (stochastic experiments with ADE-20 and FDE-20). (Note: some results may outperform those reported in the paper due to additional multi-turn training epochs or hyperparameter tuning.)
The testing procedure of Hyper-STTN on ETH dataset.
The testing procedure of Hyper-STTN on HOTEL dataset.
The testing procedure of Hyper-STTN on UNIV dataset.
Project Video