SRLM: Human-in-Loop Interactive Social Robot Navigation with Large Language Model and Deep Reinforcement Learning
Weizheng Wang,
To be submitted.
[Paper]
Abstract
An interactive social robotic assistant must provide services in complex and crowded spaces while adapting its behavior based on real-time human language commands or feedback. In this paper, we propose a novel hybrid approach called Social Robot Planner (SRLM), which integrates Large Language Models (LLM) and Deep Reinforcement Learning (DRL) to navigate through human-filled public spaces and provide multiple social services. SRLM infers global planning from human-in-loop commands in real time, and encodes social information into an LLM-based large navigation model (LNM) for low-level motion execution. Moreover, a DRL-based planner is designed to maintain benchmark performance; it is blended with the LNM by a large feedback model (LFM) to address the instability of current text- and LLM-driven navigation. Finally, SRLM demonstrates outstanding performance in extensive experiments.
Architecture of SRLM
The SRLM framework mainly comprises the following three submodels: (1) the language navigation model (LNM); (2) the reinforcement learning navigation model (RLNM); and (3) the language feedback model (LFM).
SRLM is implemented as a human-in-loop interactive social robot navigation framework that executes human commands by incorporating an LM-based planner, a feedback-based planner, and a DRL-based planner. First, users' requests or real-time feedback are processed, or replanned, into high-level task guidance for the three action executors via the LLM. Then, an image-to-text encoder and a spatio-temporal graph HRI encoder convert the robot's local observations into features that serve as LNM and RLNM inputs, which in turn generate the RL-based, LM-based, and feedback-based actions. Lastly, these three actions are adaptively fused by a low-level execution decoder to produce the robot behavior output of SRLM.
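As a concrete illustration of the fusion step, the sketch below blends the three candidate actions with a learned gating network. The module name, dimensions, and softmax-gated blending are illustrative assumptions, not the released SRLM implementation.

```python
import torch
import torch.nn as nn

class ExecutionDecoder(nn.Module):
    """Fuses RL-based, LM-based, and feedback-based actions into one command."""

    def __init__(self, action_dim: int = 2, feat_dim: int = 64):
        super().__init__()
        # Gating network: maps the concatenated candidate actions to
        # softmax weights, one per planner (an assumed fusion scheme).
        self.gate = nn.Sequential(
            nn.Linear(3 * action_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, 3),
        )

    def forward(self, a_rl, a_lm, a_fb):
        # Each a_*: (batch, action_dim) candidate action from one planner.
        candidates = torch.stack([a_rl, a_lm, a_fb], dim=1)        # (B, 3, A)
        weights = torch.softmax(self.gate(candidates.flatten(1)), dim=-1)
        # Weighted blend of the three candidate actions.
        return (weights.unsqueeze(-1) * candidates).sum(dim=1)    # (B, A)

decoder = ExecutionDecoder()
action = decoder(torch.randn(1, 2), torch.randn(1, 2), torch.randn(1, 2))
print(action.shape)  # torch.Size([1, 2])
```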
LNM Prompt Details
The prompt engineering of the LNM comprises the task description, global guidance, data annotation, initialization, historical data, additional information, and the encoded state, and directly generates low-level robot actions.
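For illustration, the sketch below assembles these components into a single prompt string. The section labels, state encoding, and JSON action format are assumptions about a plausible template, not the paper's exact prompt.

```python
# Hypothetical LNM prompt template covering the components listed above.
LNM_PROMPT_TEMPLATE = """\
[Task Description] You are a social robot navigating a crowded public space.
[Global Guidance] {guidance}
[Data Annotation] Actions are (v, w): linear and angular velocity in SI units.
[Initialization] Robot starts at {start}; the goal is {goal}.
[Historical Data] Last {k} state-action pairs: {history}
[Additional Information] {extra}
[Encoded State] {state}
Respond with a single low-level action as JSON: {{"v": <m/s>, "w": <rad/s>}}.
"""

prompt = LNM_PROMPT_TEMPLATE.format(
    guidance="Guide the user to the information desk.",
    start="(0.0, 0.0)", goal="(5.0, 3.0)", k=4,
    history="[((0.1, 0.2), (0.4, 0.0)), ...]",
    extra="Two pedestrians approaching from the left.",
    state="robot: (1.2, 0.8, 0.3 m/s); humans: [(2.0, 1.1), (2.5, 0.4)]",
)
print(prompt)
```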
RLNM Details
The RLNM is composed of two parts: 1) a spatial-temporal graph transformer block, and 2) a multi-modal transformer block. These blocks abstract environmental dynamics and human-robot interactions into an ST-graph for safe path planning in crowd-filled environments. The spatial transformer captures hybrid spatial interactions and generates spatial attention maps, while the temporal transformer represents long-term temporal dependencies and creates temporal attention maps. The multi-modal transformer adapts to the uncertainty of multi-modal crowd movements, aggregating all heterogeneous spatial and temporal features. Finally, the planner generates the next-timestep action through a decoder.
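The sketch below illustrates the factored attention pattern described above: spatial attention across agents within each timestep, followed by temporal attention along each agent's trajectory. The layer sizes, the use of PyTorch MultiheadAttention, and the omission of the multi-modal block are simplifying assumptions.

```python
import torch
import torch.nn as nn

class STGraphTransformer(nn.Module):
    """Spatial attention over agents per timestep, temporal attention per agent."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (T, N, d) -- T timesteps, N agents (robot + humans), d features.
        # Spatial attention: agents attend to each other within each timestep
        # (T acts as the batch dimension, N as the sequence).
        s, _ = self.spatial(x, x, x)                  # (T, N, d)
        # Temporal attention: each agent attends over its own trajectory
        # (N acts as the batch dimension, T as the sequence).
        t_in = s.transpose(0, 1)                      # (N, T, d)
        t, _ = self.temporal(t_in, t_in, t_in)        # (N, T, d)
        return t.transpose(0, 1)                      # (T, N, d)

enc = STGraphTransformer()
feats = enc(torch.randn(8, 5, 64))  # 8 timesteps, 5 agents
print(feats.shape)  # torch.Size([8, 5, 64])
```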
LFM Details
The LFM reconciles the outputs of the LNM and RLNM to stabilize the final mixed action. Its graph-of-thought (GoT) construction evaluates and scores the two candidate executions, generating additional evidence and chains of intermediate steps from different perspectives.
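As a minimal sketch of this arbitration step, the function below blends the two candidate actions using confidence scores such as a GoT-style evaluator might assign. The score range, the linear blending rule, and the fallback behavior are illustrative assumptions.

```python
def blend_actions(a_lnm, a_rlnm, score_lnm: float, score_rlnm: float):
    """Blend two 2D actions by evaluator confidence scores (each in [0, 1])."""
    total = score_lnm + score_rlnm
    if total == 0:  # assumed fallback: keep the DRL action if both are rejected
        return a_rlnm
    w = score_lnm / total
    return tuple(w * x + (1 - w) * y for x, y in zip(a_lnm, a_rlnm))

# Example: the evaluator trusts the DRL action more in a dense crowd.
action = blend_actions((0.5, 0.1), (0.3, -0.2), score_lnm=0.4, score_rlnm=0.8)
print(action)  # approximately (0.3667, -0.1)
```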
Learning Curve
Comparison Simulation Experiments and Trajectory Illustrations
More Learning Curves and Experiments of SRLM with Different Seeds
More Testcase Visualization
Simulation Experiments and Real-world User Study Video