NaviSTAR: Socially Aware Robot Navigation with Hybrid Spatio-Temporal Graph Transformer and Preference Learning  


Weizheng Wang, Ruiqi Wang, Le Mao, and Byung-Cheol Min

 SMART Lab, Purdue University

To appear in IROS 2023

[Code] [Video] [Paper]


Abstract

Developing robotic technologies for use in human society requires ensuring that robots' navigation behaviors are safe while adhering to pedestrians' expectations and social norms. However, maintaining real-time communication between robots and pedestrians to avoid collisions is challenging. To address these challenges, we propose NaviSTAR, a novel socially aware navigation planner that utilizes a hybrid Spatio-Temporal grAph tRansformer (STAR) to understand interactions in human-rich environments by fusing potential multi-modal information of crowds. We leverage an off-policy reinforcement learning algorithm with preference learning to train the policy and a reward function network under supervisor guidance. Additionally, we design a social score function to evaluate the overall performance of social navigation. For comparison, we train and test our algorithm and other state-of-the-art methods in both simulated and real-world scenarios. Our results show that NaviSTAR outperforms previous methods.
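The reward network above is trained from supervisor preferences rather than a hand-designed reward. As a rough illustration of how such a network can be learned, here is a minimal sketch of the standard Bradley-Terry preference loss used in preference-based RL. All names (`RewardNet`, `preference_loss`), shapes, and hyperparameters are assumptions for illustration, not NaviSTAR's actual implementation.

```python
# Minimal sketch of preference-based reward learning (Bradley-Terry style).
# Names and shapes are illustrative, not from the NaviSTAR codebase.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps a state-action pair to a scalar reward estimate."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def preference_loss(reward_net, seg_a, seg_b, label):
    """Cross-entropy under the Bradley-Terry preference model.

    seg_a, seg_b: (obs, act) tensors of shape (T, obs_dim) / (T, act_dim)
    for two trajectory segments; label: 1.0 if the supervisor prefers
    segment A, else 0.0.
    """
    ret_a = reward_net(*seg_a).sum()   # predicted return of segment A
    ret_b = reward_net(*seg_b).sum()   # predicted return of segment B
    log_probs = torch.log_softmax(torch.stack([ret_a, ret_b]), dim=0)
    return -(label * log_probs[0] + (1.0 - label) * log_probs[1])
```

Minimizing this loss pushes the learned reward to assign higher return to whichever segment the supervisor preferred, after which the policy is trained against the learned reward with an off-policy RL algorithm.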

Architecture of NaviSTAR

NaviSTAR is composed of two parts: 1) a spatio-temporal graph transformer block, and 2) a multi-modal transformer block. Together, these blocks abstract environmental dynamics and human-robot interactions into an ST-graph for safe path planning in crowd-filled environments. The spatial transformer is designed to capture hybrid spatial interactions and generate spatial attention maps, while the temporal transformer captures long-term temporal dependencies and creates temporal attention maps. The multi-modal transformer adapts to the uncertainty of multi-modal crowd movements by aggregating all heterogeneous spatial and temporal features. Finally, a decoder generates the action for the next timestep.
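As a rough picture of this pipeline, the sketch below wires a spatial encoder (attending over agents at each timestep), a temporal encoder (attending over timesteps for each agent), a fusion encoder, and a linear action decoder. Module names, dimensions, and layer counts are illustrative assumptions, not the released NaviSTAR architecture.

```python
# Toy sketch of the two-block design described above; not the real model.
import torch
import torch.nn as nn

class NaviSTARSketch(nn.Module):
    """Spatial and temporal transformers feed a fusion transformer,
    then a linear decoder emits the next-timestep action."""
    def __init__(self, feat_dim=64, n_heads=4, act_dim=2):
        super().__init__()
        def make_encoder():
            layer = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)
        self.spatial = make_encoder()    # attends over agents per timestep
        self.temporal = make_encoder()   # attends over timesteps per agent
        self.fusion = make_encoder()     # aggregates both feature streams
        self.decoder = nn.Linear(feat_dim, act_dim)

    def forward(self, x):
        # x: (T, N, D) -- T timesteps, N agents, D features (batch omitted).
        s = self.spatial(x)                    # sequence dim = agents: (T, N, D)
        t = self.temporal(x.transpose(0, 1))   # sequence dim = time: (N, T, D)
        t = t.transpose(0, 1)                  # back to (T, N, D)
        tokens = torch.cat([s, t], dim=1)      # stack both streams as tokens
        fused = self.fusion(tokens.flatten(0, 1).unsqueeze(0))  # (1, 2*T*N, D)
        return self.decoder(fused.mean(dim=1)) # pooled features -> action

model = NaviSTARSketch()
obs = torch.randn(8, 5, 64)   # 8 timesteps, 5 agents, 64-d features
action = model(obs)           # tensor of shape (1, 2)
```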

Framework of Transformer

NaviSTAR neural network framework: (a) the Spatial Transformer leverages a multi-head attention layer and a graph convolution network along the time dimension to represent spatial attention features and spatial relational features; (b) the Temporal Transformer utilizes multi-head attention layers to capture each individual agent's long-term temporal attention dependencies; and (c) the Multi-Modal Transformer fuses heterogeneous spatial and temporal features via a multi-head cross-modal transformer block [1] and a self-attention transformer block [2] to abstract the uncertainty of multi-modal crowd movements.
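To make part (a) concrete, the sketch below combines multi-head attention (pairwise attention features) with a simple one-hop graph convolution (relational features) over the same set of agents. The layer structure, row-normalized adjacency, and fusion by summation are assumptions for illustration; the actual NaviSTAR layers may differ.

```python
# Hypothetical spatial layer: attention features + graph relational features.
import torch
import torch.nn as nn

class SpatialTransformerLayer(nn.Module):
    def __init__(self, dim=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gcn_proj = nn.Linear(dim, dim)  # simple one-hop GCN projection
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, adj):
        # x: (T, N, D) agent features per timestep; adj: (N, N) crowd graph.
        attn_out, attn_map = self.attn(x, x, x)       # attn_map: (T, N, N)
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        gcn_out = self.gcn_proj((adj / deg) @ x)      # mean-aggregated neighbors
        return self.norm(x + attn_out + gcn_out), attn_map

x = torch.randn(8, 5, 64)        # 8 timesteps, 5 agents
adj = torch.ones(5, 5)           # fully connected crowd graph
feats, maps = SpatialTransformerLayer()(x, adj)  # maps: (8, 5, 5)
```

The returned `attn_map` is exactly the kind of per-timestep spatial attention map visualized in the figures below.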

Attention Map and Attention Matrix Visualization

Attention Maps

An illustration of spatial attention maps and temporal attention maps: sub-figures (a), (b), and (c) exhibit the spatial attention maps of different agents at the same timestep; (d), (e), and (f) present the temporal attention maps at different timesteps from the same agent's view. The radius of each circle represents that agent's importance from the perspective of the ego agent, shown as the red circle.
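A hypothetical way to render such a map with matplotlib, scaling circle radius by the attention weight assigned from the ego agent's perspective; the positions and weights below are random placeholders, not data from the paper.

```python
# Sketch of an attention-map figure: circle radius encodes attention weight.
import numpy as np
import matplotlib.pyplot as plt

def plot_attention_map(positions, weights, ego=0):
    """positions: (N, 2) agent xy; weights: (N,) attention from the ego agent."""
    fig, ax = plt.subplots()
    for i, ((x, y), w) in enumerate(zip(positions, weights)):
        color = "red" if i == ego else "tab:blue"
        ax.add_patch(plt.Circle((x, y), radius=0.1 + 0.5 * w, color=color, alpha=0.6))
    ax.set_xlim(-3, 3); ax.set_ylim(-3, 3); ax.set_aspect("equal")
    plt.show()

rng = np.random.default_rng(0)
pos = rng.uniform(-2, 2, size=(6, 2))
attn = rng.dirichlet(np.ones(6))   # one row of an attention matrix sums to 1
plot_attention_map(pos, attn)
```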

Attention Matrix

An illustration of attention matrices: visualization of an example NaviSTAR attention weight group, consisting of spatial and temporal attention matrices at the output of the spatial and temporal transformers, cross-attention matrices at the final layer of the multi-modal transformer, and a final attention matrix from the self-attention transformer.
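For reference, any one of these matrices can be rendered as a simple heatmap; the matrix below is random, purely for illustration.

```python
# Sketch of rendering one attention matrix as a heatmap.
import numpy as np
import matplotlib.pyplot as plt

attn = np.random.default_rng(1).dirichlet(np.ones(6), size=6)  # (N, N), rows sum to 1
fig, ax = plt.subplots()
im = ax.imshow(attn, cmap="viridis")
ax.set_xlabel("attended agent"); ax.set_ylabel("querying agent")
fig.colorbar(im, label="attention weight")
plt.show()
```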

Simulation Experiments and Real-world User Study

Video: NaviSTAR.mp4

Note: The robot is forced to stop whenever the action produced by its policy involves a turn of more than 90 degrees.
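A minimal sketch of how such a rule might be applied to the policy output; the function name and the action format (linear velocity plus turn angle) are assumptions, not the experiment code.

```python
# Hypothetical safety filter implementing the note above.
import math

def apply_turn_limit(v, turn_rad, max_turn_deg=90.0):
    """Return (v, turn), or a stop command when the commanded turn is too sharp."""
    if abs(math.degrees(turn_rad)) > max_turn_deg:
        return 0.0, 0.0   # enforced stop
    return v, turn_rad
```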

References

[1] Wang, Ruiqi, et al. "Husformer: A Multi-Modal Transformer for Multi-Modal Human State Recognition." arXiv preprint arXiv:2209.15182 (2022). 

[2] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).