Hyper-STTN: Social Group-aware Spatial-Temporal Transformer Network for Human Trajectory Prediction with Hypergraph Reasoning


Weizheng Wang,  Chaowei Wang, Baijian Yang, Guohua Chen, and Byung-Cheol Min


 SMART Lab, Purdue University

Under review at RA-L

[Paper]


Abstract

Predicting crowd intents and trajectories is crucial in various real-world applications, including service robots and autonomous vehicles. Understanding environmental dynamics is challenging, not only due to the complexities of modeling pair-wise spatial and temporal interactions but also due to the diverse influence of group-wise interactions. To decode the comprehensive pair-wise and group-wise interactions in crowded scenarios, we introduce Hyper-STTN, a Hypergraph-based Spatial-Temporal Transformer Network for crowd trajectory prediction. In Hyper-STTN, crowded group-wise correlations are constructed using a set of multiscale hypergraphs with varying group sizes, captured through random-walk probability-based hypergraph spectral convolution. Additionally, a spatial-temporal transformer is adapted to capture pedestrians' pair-wise latent interactions in the spatial and temporal dimensions. These heterogeneous group-wise and pair-wise features are then fused and aligned through a multimodal transformer network. Hyper-STTN outperforms other state-of-the-art baselines and ablation models on five real-world pedestrian motion datasets.

Architecture of Hyper-STTN

Hyper-STTN neural network framework: (a) the Spatial Transformer leverages a multi-head attention layer and a graph convolutional network along the time dimension to represent spatial attention features and spatial relational features; (b) the Temporal Transformer utilizes multi-head attention layers to capture each individual agent's long-term temporal attention dependencies; and (c) the Multi-Modal Transformer fuses heterogeneous spatial and temporal features via a multi-head cross-modal transformer block and a self-attention transformer block to abstract the uncertainty of multimodal crowd movements.
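To make the split between spatial and temporal attention concrete, below is a minimal PyTorch-style sketch of one spatial-temporal block, where attention is taken over agents at each timestep and over timesteps for each agent. The module name, tensor shapes, and layer sizes are illustrative assumptions, not the released implementation.

```python
# A minimal sketch of the spatial/temporal attention split described above.
# Shapes, layer sizes, and module names are assumptions for illustration.
import torch
import torch.nn as nn


class SpatialTemporalBlock(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # Attention over agents at each timestep (spatial) and over
        # timesteps for each agent (temporal).
        self.spatial_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(d_model)
        self.norm_t = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, N, D) -- T observed frames, N agents, D embedding dim.
        # Spatial attention: each timestep is a batch of N agent tokens.
        s_out, _ = self.spatial_attn(x, x, x)
        x = self.norm_s(x + s_out)

        # Temporal attention: each agent is a batch of T time tokens.
        t_in = x.transpose(0, 1)                      # (N, T, D)
        t_out, _ = self.temporal_attn(t_in, t_in, t_in)
        x = self.norm_t(x + t_out.transpose(0, 1))
        return x


if __name__ == "__main__":
    feats = torch.randn(8, 5, 64)   # 8 observed frames, 5 pedestrians
    print(SpatialTemporalBlock()(feats).shape)   # torch.Size([8, 5, 64])
```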

Framework of HGNN and STTN blocks

Group-wise HHI Representation: i) We construct group-wise HHI with a set of multiscale hypergraphs, where each agent is queried in the feature space with varying 'k' in KNN to link multiscale hyperedges. ii) After constructing the HHI hypergraphs, group-wise dependencies are captured through point-to-edge and edge-to-point phases with hypergraph spectral convolution operations.
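As an illustration of steps i) and ii), the following is a minimal NumPy sketch of KNN-based hyperedge construction and an HGNN-style spectral convolution (point-to-edge, then edge-to-point aggregation). The function names, uniform hyperedge weights, and choice of k values are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch: multiscale KNN hyperedges + one hypergraph spectral convolution.
import numpy as np


def knn_hyperedges(feats: np.ndarray, k: int) -> np.ndarray:
    """Incidence matrix H (N x N): hyperedge j links agent j's k nearest neighbors."""
    n = feats.shape[0]
    dists = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    H = np.zeros((n, n))
    for j in range(n):
        nbrs = np.argsort(dists[j])[:k]   # includes the query agent itself
        H[nbrs, j] = 1.0
    return H


def hypergraph_conv(X: np.ndarray, H: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """One HGNN-style layer: X' = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta."""
    w = np.ones(H.shape[1])                       # uniform hyperedge weights
    Dv = H @ w                                    # vertex degrees
    De = H.sum(axis=0)                            # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(Dv + 1e-8))
    De_inv = np.diag(1.0 / (De + 1e-8))
    A = Dv_inv_sqrt @ H @ np.diag(w) @ De_inv @ H.T @ Dv_inv_sqrt
    return A @ X @ theta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 16))              # 6 agents, 16-d features
    theta = rng.standard_normal((16, 16))
    # Multiscale hypergraphs: vary k to capture small and large groups.
    outs = [hypergraph_conv(X, knn_hyperedges(X, k), theta) for k in (2, 3, 4)]
    print(sum(outs).shape)                        # (6, 16)
```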

Hybrid Spatial-Temporal Transformer Framework: Pedestrians' motion intents and dependencies are abstracted into spatial and temporal attention maps by the multi-head attention mechanism of the spatial-temporal transformer. Additionally, a multi-head cross-attention mechanism is employed to align heterogeneous spatial-temporal features.
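A rough sketch of such cross-modal alignment is given below, assuming equal embedding sizes for the spatial and temporal token sequences; the module and variable names are hypothetical.

```python
# Sketch: cross-attention between spatial and temporal features, followed by
# self-attention over the fused tokens. Names and shapes are assumptions.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.s_from_t = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.t_from_s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, spatial: torch.Tensor, temporal: torch.Tensor) -> torch.Tensor:
        # spatial, temporal: (B, L, D) token sequences from the two branches.
        s, _ = self.s_from_t(spatial, temporal, temporal)   # spatial attends to temporal
        t, _ = self.t_from_s(temporal, spatial, spatial)    # temporal attends to spatial
        fused = torch.cat([spatial + s, temporal + t], dim=1)
        out, _ = self.self_attn(fused, fused, fused)         # self-attention over fused tokens
        return self.norm(fused + out)


if __name__ == "__main__":
    sp = torch.randn(1, 5, 64)   # e.g., 5 agent tokens from the spatial branch
    te = torch.randn(1, 8, 64)   # e.g., 8 time tokens from the temporal branch
    print(CrossModalFusion()(sp, te).shape)   # torch.Size([1, 13, 64])
```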

Experiments and Results on ETH-UCY Datasets

Note: Stochastic conditions with ADE-20 and FDE-20 evaluation metrics.

Note: Deterministic conditions with ADE and FDE evaluation metrics.
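For reference, below is a short sketch of how ADE/FDE and their best-of-K variants (e.g., ADE-20/FDE-20 with K = 20 sampled trajectories) are commonly computed; the array shapes are assumptions for illustration.

```python
# Sketch: ADE/FDE for a single prediction, and best-of-K over sampled futures.
import numpy as np


def ade_fde(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """pred, gt: (T, 2) predicted and ground-truth positions over T future steps."""
    dist = np.linalg.norm(pred - gt, axis=-1)   # per-step Euclidean error
    return dist.mean(), dist[-1]                # ADE: mean error; FDE: final-step error


def best_of_k(samples: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """samples: (K, T, 2) stochastic predictions; report the best ADE and FDE over samples."""
    metrics = [ade_fde(s, gt) for s in samples]
    return min(m[0] for m in metrics), min(m[1] for m in metrics)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.standard_normal((12, 2))                       # 12 predicted frames
    samples = gt + 0.1 * rng.standard_normal((20, 12, 2))   # 20 sampled futures
    print(best_of_k(samples, gt))
```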



Trajectory Visualization from the ETH-UCY Dataset

More Trajectory Instances from the ETH Dataset

More Trajectory Instances from the NBA Dataset

Attention Map and Attention Matrix Visualization

Spatial Attention Matrix

Temporal Attention Matrix

Testing Procedure

The testing procedure of Hyper-STTN on the HOTEL dataset.