Abstract
Predicting crowd intents and trajectories is crucial in various real-world applications, including service robots and autonomous vehicles. Understanding environmental dynamics is challenging, not only due to the complexity of modeling pair-wise spatial and temporal interactions but also due to the diverse influence of group-wise interactions. To decode the comprehensive pair-wise and group-wise interactions in crowded scenarios, we introduce Hyper-STTN, a Hypergraph-based Spatial-Temporal Transformer Network for crowd trajectory prediction. In Hyper-STTN, crowded group-wise correlations are constructed using a set of multi-scale hypergraphs with varying group sizes, captured through random-walk-probability-based hypergraph spectral convolution. Additionally, a spatial-temporal transformer is adapted to capture pedestrians' pair-wise latent interactions in the spatial and temporal dimensions. These heterogeneous group-wise and pair-wise features are then fused and aligned through a multimodal transformer network. Hyper-STTN outperforms other state-of-the-art baselines and ablation models on five real-world pedestrian motion datasets.
Architecture of Hyper-STTN
Hyper-STTN neural network framework: (a) Spatial Transformer leverages a multi-head attention layer and a graph convolution network along the time dimension to represent spatial attention features and spatial relational features; (b) Temporal Transformer utilizes multi-head attention layers to capture each individual agent’s long-term temporal attention dependencies; and (c) Multi-Modal Transformer fuses heterogeneous spatial and temporal features via a multi-head cross-modal transformer block and a self-transformer block to abstract the uncertainty of multimodal crowd movements.
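The spatial and temporal attention passes in (a) and (b) can be sketched as follows. This is a minimal NumPy illustration of the idea (attention across agents at each timestep vs. attention across timesteps for each agent), not the paper's implementation: the single-head `attention` helper without learned query/key/value projections is a simplification.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention, batched over all leading axes
    d = q.shape[-1]
    scores = softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d))
    return scores @ v

T, N, D = 8, 5, 16                 # timesteps, agents, embedding dim
x = np.random.randn(T, N, D)

# (a) spatial attention: agents attend to each other at every timestep
spatial = attention(x, x, x)       # (T, N, D)

# (b) temporal attention: each agent attends over its own history
xt = x.swapaxes(0, 1)              # (N, T, D)
temporal = attention(xt, xt, xt).swapaxes(0, 1)   # back to (T, N, D)
```

The same tensor is thus processed along two different axes, which is what lets the two branches specialize in spatial versus temporal dependencies before fusion.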
Framework of HGNN and STTN blocks
Group-wise HHI Representation: i) We construct group-wise HHI with a set of multiscale hypergraphs, where each agent is queried in the feature space with varying ‘k’ in KNN to link multiscale hyperedges. ii) After constructing HHI hypergraphs, group-wise dependencies are captured by point-to-edge and edge-to-point phases with hypergraph spectral convolution operations.
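The two phases above can be sketched in NumPy as follows, assuming an HGNN-style degree normalization; the helper names (`knn_hyperedges`, `hypergraph_conv`) are illustrative, not the paper's code. Each agent queries its k nearest neighbours in feature space to form one hyperedge, features are aggregated node-to-hyperedge and then scattered hyperedge-to-node, and repeating with different k yields the multiscale view.

```python
import numpy as np

def knn_hyperedges(feats, k):
    # one hyperedge per agent, containing its k nearest neighbours
    # (self included, since self-distance is zero)
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :k]
    N = feats.shape[0]
    H = np.zeros((N, N))            # incidence: rows = nodes, cols = hyperedges
    for e, members in enumerate(idx):
        H[members, e] = 1.0
    return H

def hypergraph_conv(X, H, W):
    # point-to-edge: aggregate member-node features into each hyperedge;
    # edge-to-point: scatter hyperedge features back to nodes, then project.
    # Degree normalization follows the common HGNN spectral-convolution form.
    Dv_inv = np.diag(1.0 / H.sum(1))        # vertex degrees
    De_inv = np.diag(1.0 / H.sum(0))        # hyperedge degrees
    edge_feats = De_inv @ H.T @ X           # point-to-edge phase
    return Dv_inv @ H @ edge_feats @ W      # edge-to-point phase

N, D = 6, 4
feats = np.random.randn(N, D)
W = np.random.randn(D, D)
# multiscale hypergraphs: sum the outputs over several group sizes k
out = sum(hypergraph_conv(feats, knn_hyperedges(feats, k), W) for k in (2, 3))
```

Varying k directly controls the hyperedge (group) size, which is how hypergraphs of different scales capture small- and large-group interactions.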
Hybrid Spatial-Temporal Transformer Framework: Pedestrians’ motion intents and dependencies are abstracted as spatial and temporal attention maps by the multi-head attention mechanisms of the spatial-temporal transformer. Additionally, a multi-head cross-attention mechanism is employed to align the heterogeneous spatial-temporal features.
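The alignment step can be sketched as cross-attention between the two feature streams: queries come from one modality and keys/values from the other, so each stream is re-expressed in terms of the other before fusion. This is a minimal single-head NumPy sketch without learned projections, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv):
    # queries from one stream, keys/values from the other stream
    d = q.shape[-1]
    return softmax(q @ kv.T / np.sqrt(d)) @ kv

L, D = 10, 16
spatial_feats  = np.random.randn(L, D)
temporal_feats = np.random.randn(L, D)

fused_s = cross_attention(spatial_feats, temporal_feats)  # spatial queries temporal
fused_t = cross_attention(temporal_feats, spatial_feats)  # temporal queries spatial
fused = np.concatenate([fused_s, fused_t], axis=-1)       # (L, 2D) aligned features
```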
Experiments and Results on ETH-UCY Datasets
Note: Stochastic setting, evaluated with the ADE-20 and FDE-20 metrics (best of 20 sampled trajectories).
Note: Deterministic setting, evaluated with the ADE and FDE metrics.
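For reference, ADE averages the per-step L2 error between the predicted and ground-truth trajectories, FDE takes the error at the final step, and the "-20" variants report the best score over 20 stochastic samples. A minimal sketch (the function names are illustrative):

```python
import numpy as np

def ade_fde(pred, gt):
    # pred, gt: (T, 2) trajectories in world coordinates
    err = np.linalg.norm(pred - gt, axis=-1)   # per-step L2 error
    return err.mean(), err[-1]                 # ADE, FDE

def min_ade_fde(samples, gt):
    # ADE-K / FDE-K: best-of-K over K stochastic predictions
    ades, fdes = zip(*(ade_fde(s, gt) for s in samples))
    return min(ades), min(fdes)

gt = np.zeros((12, 2))
samples = [np.full((12, 2), 0.1 * k) for k in range(1, 4)]
ade20, fde20 = min_ade_fde(samples, gt)        # best-of-3 here for brevity
```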
Trajectory Visualization from the ETH-UCY Dataset
More Trajectory Instances from the ETH Dataset
More Trajectory Instances from the NBA Dataset
Attention Map and Attention Matrix Visualization
Spatial Attention Matrix
Temporal Attention Matrix
Testing Procedure
The testing procedure of Hyper-STTN on the HOTEL dataset.