Hyper-STTN: Social Group-aware Spatial-Temporal Transformer Network for Human Trajectory Prediction with Hypergraph Reasoning


Weizheng Wang,  Chaowei Wang, Baijian Yang, Guohua Chen, and Byung-Cheol Min


 SMART Lab, Purdue University

Under review at RA-L

[Paper]


Abstract

Predicting crowd intents and trajectories is crucial in various real-world applications, including service robots and autonomous vehicles. Understanding environmental dynamics is challenging, not only due to the complexities of modeling pair-wise spatial and temporal interactions but also due to the diverse influence of group-wise interactions. To decode the comprehensive pair-wise and group-wise interactions in crowded scenarios, we introduce Hyper-STTN, a Hypergraph-based Spatial-Temporal Transformer Network for crowd trajectory prediction. In Hyper-STTN, crowded group-wise correlations are constructed using a set of multiscale hypergraphs with varying group sizes, captured through random-walk probability-based hypergraph spectral convolution. Additionally, a spatial-temporal transformer is adapted to capture pedestrians' pair-wise latent interactions in the spatial and temporal dimensions. These heterogeneous group-wise and pair-wise features are then fused and aligned through a multimodal transformer network. Hyper-STTN outperforms other state-of-the-art baselines and ablation models on five real-world pedestrian motion datasets.

Architecture of Hyper-STTN

Hyper-STTN neural network framework: (a) the Spatial Transformer leverages a multi-head attention layer and a graph convolutional network along the time dimension to represent spatial attention features and spatial relational features; (b) the Temporal Transformer utilizes multi-head attention layers to capture each individual agent's long-term temporal attention dependencies; and (c) the Multi-Modal Transformer fuses heterogeneous spatial and temporal features via a multi-head cross-modal transformer block and a self-attention transformer block to abstract the uncertainty of multimodal crowd movements.
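To make the split between spatial and temporal attention concrete, below is a minimal PyTorch-style sketch of one spatial-temporal block, where attention is taken over agents at each timestep and over timesteps for each agent. The module name, tensor shapes, and layer sizes are illustrative assumptions, not the released implementation.

```python
# A minimal sketch of the spatial/temporal attention split described above.
# Shapes, layer sizes, and module names are assumptions for illustration.
import torch
import torch.nn as nn


class SpatialTemporalBlock(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # Attention over agents at each timestep (spatial) and over
        # timesteps for each agent (temporal).
        self.spatial_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(d_model)
        self.norm_t = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, N, D) -- T observed frames, N agents, D embedding dim.
        # Spatial attention: each timestep is a batch of N agent tokens.
        s_out, _ = self.spatial_attn(x, x, x)
        x = self.norm_s(x + s_out)

        # Temporal attention: each agent is a batch of T time tokens.
        t_in = x.transpose(0, 1)                      # (N, T, D)
        t_out, _ = self.temporal_attn(t_in, t_in, t_in)
        x = self.norm_t(x + t_out.transpose(0, 1))
        return x


if __name__ == "__main__":
    feats = torch.randn(8, 5, 64)   # 8 observed frames, 5 pedestrians
    print(SpatialTemporalBlock()(feats).shape)   # torch.Size([8, 5, 64])
```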

Framework of HGNN and STTN blocks

Group-wise HHI Representation: i) We construct group-wise HHI with a set of multiscale hypergraphs, where each agent is queried in the feature space with varying 'k' in KNN to link multiscale hyperedges. ii) After constructing the HHI hypergraphs, group-wise dependencies are captured through point-to-edge and edge-to-point phases with hypergraph spectral convolution operations.
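As an illustration of steps i) and ii), the following is a minimal NumPy sketch of KNN-based hyperedge construction and an HGNN-style spectral convolution (point-to-edge, then edge-to-point aggregation). The function names, uniform hyperedge weights, and choice of k values are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch: multiscale KNN hyperedges + one hypergraph spectral convolution.
import numpy as np


def knn_hyperedges(feats: np.ndarray, k: int) -> np.ndarray:
    """Incidence matrix H (N x N): hyperedge j links agent j's k nearest neighbors."""
    n = feats.shape[0]
    dists = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    H = np.zeros((n, n))
    for j in range(n):
        nbrs = np.argsort(dists[j])[:k]   # includes the query agent itself
        H[nbrs, j] = 1.0
    return H


def hypergraph_conv(X: np.ndarray, H: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """One HGNN-style layer: X' = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta."""
    w = np.ones(H.shape[1])                       # uniform hyperedge weights
    Dv = H @ w                                    # vertex degrees
    De = H.sum(axis=0)                            # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(Dv + 1e-8))
    De_inv = np.diag(1.0 / (De + 1e-8))
    A = Dv_inv_sqrt @ H @ np.diag(w) @ De_inv @ H.T @ Dv_inv_sqrt
    return A @ X @ theta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 16))              # 6 agents, 16-d features
    theta = rng.standard_normal((16, 16))
    # Multiscale hypergraphs: vary k to capture small and large groups.
    outs = [hypergraph_conv(X, knn_hyperedges(X, k), theta) for k in (2, 3, 4)]
    print(sum(outs).shape)                        # (6, 16)
```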

Hybrid Spatial-Temporal Transformer Framework: Pedestrians' motion intents and dependencies are abstracted into spatial and temporal attention maps by the multi-head attention mechanism of the spatial-temporal transformer. Additionally, a multi-head cross-attention mechanism is employed to align heterogeneous spatial-temporal features.
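A rough sketch of such cross-modal alignment is given below, assuming equal embedding sizes for the spatial and temporal token sequences; the module and variable names are hypothetical.

```python
# Sketch: cross-attention between spatial and temporal features, followed by
# self-attention over the fused tokens. Names and shapes are assumptions.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.s_from_t = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.t_from_s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, spatial: torch.Tensor, temporal: torch.Tensor) -> torch.Tensor:
        # spatial, temporal: (B, L, D) token sequences from the two branches.
        s, _ = self.s_from_t(spatial, temporal, temporal)   # spatial attends to temporal
        t, _ = self.t_from_s(temporal, spatial, spatial)    # temporal attends to spatial
        fused = torch.cat([spatial + s, temporal + t], dim=1)
        out, _ = self.self_attn(fused, fused, fused)         # self-attention over fused tokens
        return self.norm(fused + out)


if __name__ == "__main__":
    sp = torch.randn(1, 5, 64)   # e.g., 5 agent tokens from the spatial branch
    te = torch.randn(1, 8, 64)   # e.g., 8 time tokens from the temporal branch
    print(CrossModalFusion()(sp, te).shape)   # torch.Size([1, 13, 64])
```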

Experiments and Results on ETH-UCY Datasets

Note: Stochastic conditions with ADE-20 and FDE-20 evaluation metrics.

Note: Deterministic conditions with ADE and FDE evaluation metrics.
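For reference, below is a short sketch of how ADE/FDE and their best-of-K variants (e.g., ADE-20/FDE-20 with K = 20 sampled trajectories) are commonly computed; the array shapes are assumptions for illustration.

```python
# Sketch: ADE/FDE for a single prediction, and best-of-K over sampled futures.
import numpy as np


def ade_fde(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """pred, gt: (T, 2) predicted and ground-truth positions over T future steps."""
    dist = np.linalg.norm(pred - gt, axis=-1)   # per-step Euclidean error
    return dist.mean(), dist[-1]                # ADE: mean error; FDE: final-step error


def best_of_k(samples: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """samples: (K, T, 2) stochastic predictions; report the best ADE and FDE over samples."""
    metrics = [ade_fde(s, gt) for s in samples]
    return min(m[0] for m in metrics), min(m[1] for m in metrics)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.standard_normal((12, 2))                       # 12 predicted frames
    samples = gt + 0.1 * rng.standard_normal((20, 12, 2))   # 20 sampled futures
    print(best_of_k(samples, gt))
```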



Trajectory Visualization from the ETH-UCY Dataset

More Trajectory Instances from the ETH Dataset

More Trajectory Instances from the NBA Dataset

Attention Map and Attention Matrix Visualization

Spatial Attention Matrix

Temporal Attention Matrix

Testing Procedure

The testing procedure of Hyper-STTN on the HOTEL dataset.