Unifying Large Language Model and Deep Reinforcement Learning for Human-in-Loop Interactive Socially-aware Navigation
Anonymous Authors
To be submitted to IEEE RA-L
Abstract
Navigating human-filled spaces is crucial for interactive social robots that support advanced services, such as cooperative carrying, which enables service provision in complex and crowded environments while adapting behavior based on real-time human language commands or feedback. However, existing social robot navigation planners face two major challenges: managing real-time user inputs and ensuring socially compliant behavior in unfamiliar, zero-shot environments. In response, we introduce SALM, an interactive, human-in-loop Socially-Aware navigation Large language Model framework that dynamically integrates deep reinforcement learning (DRL) with large language model (LLM) capabilities. SALM leverages contextual semantic understanding from real-time human-robot interactions to convert high-level user commands into precise, low-level control actions. A high-level LLM module parses user input and guides the simultaneous generation of navigation commands by both a large language navigation model (LNM) and a DRL-based navigation model (RLNM). A memory mechanism archives temporal data for continuous refinement, while a multi-step graph-of-thoughts inference-based large language feedback model (LFM) adaptively fuses the strengths of both planning approaches. Experimental evaluations demonstrate that SALM not only enhances navigational precision in crowded, dynamic environments but also significantly improves system adaptability, offering tailored behaviors that align with individual user preferences and real-time feedback.
Architecture of SALM
SALM architecture: SALM is implemented as a human-in-loop interactive social robot navigation framework that executes human commands by incorporating an LM-based planner, a feedback-based planner, and a DRL-based planner. First, users' requests or real-time feedback are processed or replanned by the LLM into high-level task guidance for the three action executors. Then, the image-to-text encoder and the spatio-temporal graph HRI encoder convert the robot's local observations into features that serve as input to the LNM and RLNM, producing the LM-based, RL-based, and feedback-based actions. Finally, these three actions are adaptively fused by a low-level execution decoder to produce the robot behavior output of SALM.
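To make the data flow concrete, the following is a minimal sketch of one SALM control step ending in the fusion performed by the low-level execution decoder; the Action type, the planner callables, and the equal-weight scorer are illustrative placeholders rather than the actual SALM implementation.

from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Action:
    vx: float  # forward velocity command (m/s)
    vy: float  # lateral velocity command (m/s)

def fuse_actions(actions: Sequence[Action], weights: Sequence[float]) -> Action:
    """Weighted mixture of candidate actions (stand-in for the execution decoder)."""
    total = sum(weights)
    return Action(
        vx=sum(w * a.vx for w, a in zip(weights, actions)) / total,
        vy=sum(w * a.vy for w, a in zip(weights, actions)) / total,
    )

def salm_step(
    guidance: str,                                    # high-level task guidance from the LLM
    state: dict,                                      # encoded local observation features
    planners: List[Callable[[str, dict], Action]],    # LNM, RLNM, and feedback planner
    scorer: Callable[[List[Action], str, dict], List[float]],  # LFM-style scoring
) -> Action:
    candidates = [plan(guidance, state) for plan in planners]
    weights = scorer(candidates, guidance, state)
    return fuse_actions(candidates, weights)

# Example with trivial stand-in planners and an equal-weight scorer:
if __name__ == "__main__":
    lnm = lambda g, s: Action(0.4, 0.0)
    rlnm = lambda g, s: Action(0.3, 0.1)
    feedback = lambda g, s: Action(0.2, -0.1)
    equal = lambda acts, g, s: [1.0] * len(acts)
    print(salm_step("follow the user", {}, [lnm, rlnm, feedback], equal))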
Large Language Navigation Model (LNM) Illustration
An illustration of the large language navigation model (LNM): The prompt engineering of the LNM comprises the task description, global guidance, data annotation, initialization, historical data, additional information, and the encoded state, and directly generates low-level robot actions.
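As a concrete illustration, the following is a minimal sketch of how the prompt sections listed above could be assembled into a single LNM prompt; the function name, field wording, and JSON action format are assumptions, not the exact prompt used in the paper.

def build_lnm_prompt(
    task_description: str,
    global_guidance: str,
    data_annotation: str,
    initialization: str,
    history: list,          # previous (state, action) records from the memory mechanism
    additional_info: str,
    encoded_state: str,     # text produced by the image-to-text and HRI encoders
) -> str:
    history_text = "\n".join(f"- {h}" for h in history) or "- (none)"
    return (
        f"Task description:\n{task_description}\n\n"
        f"Global guidance:\n{global_guidance}\n\n"
        f"Data annotation:\n{data_annotation}\n\n"
        f"Initialization:\n{initialization}\n\n"
        f"Historical data:\n{history_text}\n\n"
        f"Additional information:\n{additional_info}\n\n"
        f"Current encoded state:\n{encoded_state}\n\n"
        "Output a low-level robot action as JSON: {\"vx\": <m/s>, \"vy\": <m/s>}."
    )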
Reinforcement Learning Navigation Model (RLNM) Illustration [1]
The RLNM is composed of two parts: 1) a spatial-temporal graph transformer block, and 2) a multi-modal transformer block. These blocks abstract environmental dynamics and human-robot interactions into an ST-graph for safe path planning in crowd-filled environments. The spatial transformer captures hybrid spatial interactions and generates spatial attention maps, while the temporal transformer models long-term temporal dependencies and creates temporal attention maps. The multi-modal transformer adapts to the uncertainty of multi-modal crowd movements, aggregating all heterogeneous spatial and temporal features. Finally, the planner generates the next-timestep action via a decoder.
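The following is a structural sketch (in PyTorch) of the blocks described above, not the released NaviSTAR/RLNM code; the layer sizes, the additive combination of spatial and temporal features, and the mean-pooling before the decoder are assumptions.

import torch
import torch.nn as nn

class RLNMSketch(nn.Module):
    def __init__(self, feat_dim: int = 64, n_heads: int = 4, action_dim: int = 2):
        super().__init__()
        # Spatial transformer: attends over agents at each timestep.
        self.spatial = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        # Temporal transformer: attends over timesteps for each agent.
        self.temporal = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        # Multi-modal transformer: aggregates heterogeneous spatial/temporal features.
        self.fusion = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        self.decoder = nn.Linear(feat_dim, action_dim)  # next-timestep action head

    def forward(self, st_graph: torch.Tensor) -> torch.Tensor:
        # st_graph: (batch, timesteps T, agents N, feat_dim)
        b, t, n, d = st_graph.shape
        spatial = self.spatial(st_graph.reshape(b * t, n, d)).reshape(b, t, n, d)
        temporal = self.temporal(
            st_graph.permute(0, 2, 1, 3).reshape(b * n, t, d)
        ).reshape(b, n, t, d).permute(0, 2, 1, 3)
        fused = self.fusion((spatial + temporal).reshape(b, t * n, d))
        return self.decoder(fused.mean(dim=1))  # (batch, action_dim)

# Example: one ST-graph with 5 timesteps and 6 agents.
# action = RLNMSketch()(torch.randn(1, 5, 6, 64))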
Large Language Feedback Model (LFM) Illustration
LFM framework: The LFM reconciles the outputs of the LNM and RLNM to stabilize the final mixture action. The Graph-of-Thoughts (GoT) construction of the LFM evaluates and scores the two candidate executions using additional generated evidence and chains of intermediate reasoning steps from different perspectives.
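The following is a minimal sketch of the LFM fusion step under the assumption that each GoT branch returns a scalar score per candidate action; the score_fn interface and the aspect names are placeholders for the LLM-generated evidence, not the actual prompt design.

from typing import Callable, Dict, List, Sequence

def lfm_fuse(
    actions: Dict[str, Sequence[float]],          # e.g. {"lnm": (vx, vy), "rlnm": (vx, vy)}
    aspects: List[str],                           # GoT evaluation perspectives
    score_fn: Callable[[str, str, Sequence[float]], float],  # one GoT branch -> score in [0, 1]
) -> List[float]:
    # Each (aspect, action) pair stands in for an intermediate GoT node; the scores
    # are aggregated at the final node into normalized fusion weights.
    totals = {name: sum(score_fn(aspect, name, act) for aspect in aspects)
              for name, act in actions.items()}
    norm = sum(totals.values()) or 1.0
    weights = {name: s / norm for name, s in totals.items()}
    # Weighted mixture of the LNM and RLNM actions (final robot behavior).
    dims = len(next(iter(actions.values())))
    return [sum(weights[name] * actions[name][i] for name in actions) for i in range(dims)]

# Example with a placeholder scorer standing in for an LLM call:
# fused = lfm_fuse(
#     {"lnm": (0.4, 0.0), "rlnm": (0.2, 0.1)},
#     aspects=["safety", "social comfort", "goal progress"],
#     score_fn=lambda aspect, name, a: 0.5,
# )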
Comparison Simulation Experiments and Trajectory Illustrations
The illustration of human-in-loop interactive social navigation: The social robot navigates toward the destination (red star) together with the user (blue circle), passing among ten pedestrians (green circles).
We designed two tasks for social navigation: a human-following task and a point-to-point navigation task, as shown in the following figures.
Additionally, a dynamic simulated human-feedback function is designed, in which the user's feedback (e.g., changing the goal position) is randomly issued to the robot with 50% probability.
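A minimal sketch of such a simulated feedback function, assuming a rectangular workspace and a goal-change feedback type, is shown below; the bounds and message wording are illustrative.

import random

def simulated_human_feedback(current_goal, workspace=((-5.0, 5.0), (-5.0, 5.0))):
    """Return a (possibly updated) goal position and the user's feedback text."""
    if random.random() < 0.5:
        # With 50% probability, the simulated user issues new feedback (a new goal).
        new_goal = (random.uniform(*workspace[0]), random.uniform(*workspace[1]))
        return new_goal, f"Please move to the new goal at {new_goal}."
    return current_goal, ""  # no feedback at this step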
Global LLM (GLLM) Implementation Details Visualization
Reinforcement Learning Navigation Model (RLNM) Implementation Details Visualization [1]
We directly employ this path planner as our RLNM.
(Input: obs -> self.actor (NaviSTAR) -> action)
Note: The code of NaviSTAR is publicly available at the GitHub link [link].
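The following is a minimal usage sketch of wrapping a pretrained policy following the obs -> self.actor -> action flow above; the RLNMWrapper class and its interface are illustrative, not the published NaviSTAR API.

import torch

class RLNMWrapper:
    def __init__(self, actor: torch.nn.Module):
        self.actor = actor  # pretrained NaviSTAR policy network

    @torch.no_grad()
    def act(self, obs: torch.Tensor) -> torch.Tensor:
        return self.actor(obs)  # obs -> self.actor (NaviSTAR) -> action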
Large Language Navigation Model (LNM) Implementation Details Visualization
The Entire Text Demo of LNM Input-Output
Large Language Feedback Model (LFM) Implementation Details Visualization
(Each node represents an intermediate LLM thought step of the Graph-of-Thoughts (GoT))
(Each edge represents an operation of the Graph-of-Thoughts (GoT))
The Entire Text Demo of LFM (node1, node3, node6, node8) Output
The Entire Text Demo of LFM (node10) Input-Output
References