Hyper-STTN: Social Group-aware Spatial-Temporal Transformer Network for Human Trajectory Prediction with Hypergraph Reasoning


Weizheng Wang*, Le Mao*, Baijian Yang, Guohua Chen, and Byung-Cheol Min

(* equal contribution)

 SMART Lab, Purdue University

Under review at RA-L

[Video] [Paper]


Abstract

Predicting crowd intents and trajectories is crucial in various real-world applications, including service robots and autonomous vehicles. Understanding environmental dynamics is challenging, not only due to the complexity of modeling pair-wise spatial and temporal interactions but also due to the diverse influence of group-wise interactions. To decode both pair-wise and group-wise interactions in crowded scenarios, we introduce Hyper-STTN, a Hypergraph-based Spatial-Temporal Transformer Network for crowd trajectory prediction. In Hyper-STTN, group-wise crowd correlations are constructed with a set of multiscale hypergraphs of varying group sizes and captured through random-walk probability-based hypergraph spectral convolution. Additionally, a spatial-temporal transformer is adapted to capture pedestrians' pair-wise latent interactions in the spatial and temporal dimensions. These heterogeneous group-wise and pair-wise features are then fused and aligned through a multimodal transformer network. Hyper-STTN outperforms other state-of-the-art baselines and ablation models on five real-world pedestrian motion datasets.

Architecture of Hyper-STTN

Hyper-STTN neural network framework: (a) the Spatial Transformer leverages a multi-head attention layer and a graph convolution network along the time dimension to represent spatial attention features and spatial relational features; (b) the Temporal Transformer utilizes multi-head attention layers to capture each individual agent’s long-term temporal attention dependencies; and (c) the Multi-Modal Transformer fuses heterogeneous spatial and temporal features via a multi-head cross-modal transformer block and a self-transformer block to abstract the uncertainty of multimodal crowd movements.
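The spatial vs. temporal split in (a) and (b) amounts to running attention over different axes of the same (time, agent, feature) tensor. The following is a minimal NumPy sketch of that idea only (single head, no learned projections, toy sizes), not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention; batched over leading axes
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

# X: (T timesteps, N agents, D embedding dims) -- illustrative sizes
T, N, D = 8, 5, 16
X = np.random.default_rng(0).normal(size=(T, N, D))

# spatial attention: within each timestep, agents attend to each other
spatial_out = attention(X, X, X)                         # (T, N, D)

# temporal attention: each agent attends over its own history
Xt = X.transpose(1, 0, 2)                                # (N, T, D)
temporal_out = attention(Xt, Xt, Xt).transpose(1, 0, 2)  # (T, N, D)
```

In the actual network each branch would add learned query/key/value projections, multiple heads, and residual/feed-forward layers; the sketch only shows which axis each transformer attends over.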

Framework of HGNN and STTN blocks

Group-wise HHI Representation: i) We construct group-wise HHI with a set of multiscale hypergraphs, where each agent is queried in the feature space with varying ‘k’ in KNN to link multiscale hyperedges. ii) After constructing HHI hypergraphs, group-wise dependencies are captured by point-to-edge and edge-to-point phases with hypergraph spectral convolution operations. 
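The two steps above (KNN hyperedge construction, then point-to-edge / edge-to-point propagation) can be sketched in NumPy as follows. This is a hypothetical illustration under common conventions, not the paper's code: each agent seeds one hyperedge linking its k nearest feature-space neighbours, and the convolution uses the standard symmetrically normalized hypergraph propagation rather than the paper's random-walk probability weighting:

```python
import numpy as np

def knn_hypergraph_incidence(features: np.ndarray, k: int) -> np.ndarray:
    """N x N incidence matrix H: hyperedge j links agent j and its k
    nearest neighbours in feature space (self included, distance 0)."""
    n = features.shape[0]
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    H = np.zeros((n, n))
    for j in range(n):
        idx = np.argsort(d2[:, j])[: k + 1]   # agent j plus k neighbours
        H[idx, j] = 1.0
    return H

def multiscale_incidences(features, ks=(1, 2, 4)):
    # one incidence matrix per group scale 'k'
    return [knn_hypergraph_incidence(features, k) for k in ks]

def hypergraph_conv(X, H, w=None):
    """One propagation step: point-to-edge (H^T X) then edge-to-point
    (H ...), with degree normalization: Dv^-1/2 H W De^-1 H^T Dv^-1/2 X."""
    n, m = H.shape
    w = np.ones(m) if w is None else w        # hyperedge weights
    Dv = (H * w).sum(axis=1)                  # vertex degrees
    De = H.sum(axis=0)                        # hyperedge degrees
    Dv_is = np.diag(1.0 / np.sqrt(Dv))
    return Dv_is @ H @ np.diag(w / De) @ H.T @ Dv_is @ X
```

A learnable layer would follow the propagation with a weight matrix and nonlinearity; running `hypergraph_conv` once per scale and concatenating the outputs mirrors the multiscale design described above.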

Hybrid Spatial-Temporal Transformer Framework: Pedestrians’ motion intents and dependencies are abstracted into spatial and temporal attention maps by the multi-head attention mechanisms of the spatial-temporal transformer. Additionally, a multi-head cross-attention mechanism is employed to align the heterogeneous spatial and temporal features.
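The cross-attention alignment step can be sketched as follows: queries come from one feature stream (e.g. spatial) while keys and values come from the other (e.g. temporal), so each stream is re-expressed in terms of the other before fusion. A minimal NumPy sketch, assuming per-head slicing of a shared embedding and omitting learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_cross_attention(q_feat, kv_feat, num_heads=4):
    """q_feat, kv_feat: (N, D) features from two streams.
    Each head attends from the query stream into the key/value stream."""
    n, d = q_feat.shape
    assert d % num_heads == 0
    hd = d // num_heads
    out = np.empty_like(q_feat)
    for h in range(num_heads):
        sl = slice(h * hd, (h + 1) * hd)
        q, k, v = q_feat[:, sl], kv_feat[:, sl], kv_feat[:, sl]
        scores = q @ k.T / np.sqrt(hd)        # (N, N) cross-stream map
        out[:, sl] = softmax(scores, axis=-1) @ v
    return out

rng = np.random.default_rng(0)
spatial_feat = rng.normal(size=(5, 16))       # toy per-agent features
temporal_feat = rng.normal(size=(5, 16))

# align each stream against the other (both directions), then fuse
s_to_t = multihead_cross_attention(spatial_feat, temporal_feat)
t_to_s = multihead_cross_attention(temporal_feat, spatial_feat)
fused = s_to_t + t_to_s                       # e.g. sum before self-attention
```

In the described framework, a self-transformer block would follow this fusion to model the multimodality of crowd movements; the bidirectional cross-attention plus sum here is one plausible fusion choice, not necessarily the authors'.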

Experiments and Results on ETH-UCY Datasets

Note: Stochastic conditions with ADE-20 and FDE-20 evaluation metrics.



Note: Deterministic conditions with ADE and FDE evaluation metrics.



Trajectory Visualization from the ETH-UCY Dataset

More Trajectory Instances from the ETH Dataset

Attention Map and Attention Matrix Visualization

Spatial Attention Matrix

Temporal Attention Matrix

Testing Procedure

The testing procedure of Hyper-STTN on the HOTEL dataset.