Unifying Large Language Model and Deep Reinforcement Learning for Human-in-Loop Interactive Socially-aware Navigation
Anonymous Authors
To be submitted to IEEE RA-L
Abstract
Navigating human-filled spaces is crucial for interactive social robots that support advanced services, such as cooperative carrying, which enables service provision in complex and crowded environments while adapting behavior based on real-time human language commands or feedback. However, existing social robot navigation planners face two major challenges: managing real-time user inputs and ensuring socially compliant behavior in unfamiliar, zero-shot environments. In response, we introduce SALM, an interactive, human-in-loop Socially-Aware navigation Large language Model framework that dynamically integrates deep reinforcement learning (DRL) with large language model (LLM) capabilities. SALM leverages contextual semantic understanding from real-time human-robot interactions to convert high-level user commands into precise, low-level control actions. A high-level LLM module parses user input and guides the simultaneous generation of navigation commands by both a large language navigation model (LNM) and a DRL-based navigation model (RLNM). A memory mechanism archives temporal data for continuous refinement, while a multi-step graph-of-thoughts inference-based large language feedback model (LFM) adaptively fuses the strengths of both planning approaches. Experimental evaluations demonstrate that SALM not only enhances navigational precision in crowded, dynamic environments but also significantly improves system adaptability, offering tailored behaviors that align with individual user preferences and real-time feedback.
Architecture of SALM
SALM architecture: SALM is implemented as a human-in-loop interactive social robot navigation framework that executes human commands by incorporating an LM-based planner, a feedback-based planner, and a DRL-based planner. First, users' requests or real-time feedback are processed or replanned by the LLM into high-level task guidance for the three action executors. Then, the image-to-text encoder and the spatio-temporal graph HRI encoder convert the robot's local observations into features that serve as input to the LNM and RLNM, producing the LM-based, RL-based, and feedback-based actions. Finally, these three actions are adaptively fused by a low-level execution decoder to produce the robot behavior output of SALM.
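To make the data flow concrete, the following is a minimal sketch of one SALM control step ending in the fusion performed by the low-level execution decoder; the Action type, the planner callables, and the equal-weight scorer are illustrative placeholders rather than the actual SALM implementation.

from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Action:
    vx: float  # forward velocity command (m/s)
    vy: float  # lateral velocity command (m/s)

def fuse_actions(actions: Sequence[Action], weights: Sequence[float]) -> Action:
    """Weighted mixture of candidate actions (stand-in for the execution decoder)."""
    total = sum(weights)
    return Action(
        vx=sum(w * a.vx for w, a in zip(weights, actions)) / total,
        vy=sum(w * a.vy for w, a in zip(weights, actions)) / total,
    )

def salm_step(
    guidance: str,                                    # high-level task guidance from the LLM
    state: dict,                                      # encoded local observation features
    planners: List[Callable[[str, dict], Action]],    # LNM, RLNM, and feedback planner
    scorer: Callable[[List[Action], str, dict], List[float]],  # LFM-style scoring
) -> Action:
    candidates = [plan(guidance, state) for plan in planners]
    weights = scorer(candidates, guidance, state)
    return fuse_actions(candidates, weights)

# Example with trivial stand-in planners and an equal-weight scorer:
if __name__ == "__main__":
    lnm = lambda g, s: Action(0.4, 0.0)
    rlnm = lambda g, s: Action(0.3, 0.1)
    feedback = lambda g, s: Action(0.2, -0.1)
    equal = lambda acts, g, s: [1.0] * len(acts)
    print(salm_step("follow the user", {}, [lnm, rlnm, feedback], equal))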
Large Language Navigation Model (LNM) Illustration
An illustration of the large language navigation model (LNM): The prompt engineering of the LNM comprises the task description, global guidance, data annotation, initialization, historical data, additional information, and the encoded state, and directly generates low-level robot actions.
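As a concrete illustration, the following is a minimal sketch of how the prompt sections listed above could be assembled into a single LNM prompt; the function name, field wording, and JSON action format are assumptions, not the exact prompt used in the paper.

def build_lnm_prompt(
    task_description: str,
    global_guidance: str,
    data_annotation: str,
    initialization: str,
    history: list,          # previous (state, action) records from the memory mechanism
    additional_info: str,
    encoded_state: str,     # text produced by the image-to-text and HRI encoders
) -> str:
    history_text = "\n".join(f"- {h}" for h in history) or "- (none)"
    return (
        f"Task description:\n{task_description}\n\n"
        f"Global guidance:\n{global_guidance}\n\n"
        f"Data annotation:\n{data_annotation}\n\n"
        f"Initialization:\n{initialization}\n\n"
        f"Historical data:\n{history_text}\n\n"
        f"Additional information:\n{additional_info}\n\n"
        f"Current encoded state:\n{encoded_state}\n\n"
        "Output a low-level robot action as JSON: {\"vx\": <m/s>, \"vy\": <m/s>}."
    )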
Reinforcement Learning Navigation Model (RLNM) Illustration [1]
The RLNM is composed of two parts: 1) a spatial-temporal graph transformer block, and 2) a multi-modal transformer block. These blocks abstract environmental dynamics and human-robot interactions into an ST-graph for safe path planning in crowd-filled environments. The spatial transformer captures hybrid spatial interactions and generates spatial attention maps, while the temporal transformer models long-term temporal dependencies and creates temporal attention maps. The multi-modal transformer adapts to the uncertainty of multi-modal crowd movements, aggregating all heterogeneous spatial and temporal features. Finally, the planner generates the next-timestep action via a decoder.
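The following is a structural sketch (in PyTorch) of the blocks described above, not the released NaviSTAR/RLNM code; the layer sizes, the additive combination of spatial and temporal features, and the mean-pooling before the decoder are assumptions.

import torch
import torch.nn as nn

class RLNMSketch(nn.Module):
    def __init__(self, feat_dim: int = 64, n_heads: int = 4, action_dim: int = 2):
        super().__init__()
        # Spatial transformer: attends over agents at each timestep.
        self.spatial = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        # Temporal transformer: attends over timesteps for each agent.
        self.temporal = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        # Multi-modal transformer: aggregates heterogeneous spatial/temporal features.
        self.fusion = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        self.decoder = nn.Linear(feat_dim, action_dim)  # next-timestep action head

    def forward(self, st_graph: torch.Tensor) -> torch.Tensor:
        # st_graph: (batch, timesteps T, agents N, feat_dim)
        b, t, n, d = st_graph.shape
        spatial = self.spatial(st_graph.reshape(b * t, n, d)).reshape(b, t, n, d)
        temporal = self.temporal(
            st_graph.permute(0, 2, 1, 3).reshape(b * n, t, d)
        ).reshape(b, n, t, d).permute(0, 2, 1, 3)
        fused = self.fusion((spatial + temporal).reshape(b, t * n, d))
        return self.decoder(fused.mean(dim=1))  # (batch, action_dim)

# Example: one ST-graph with 5 timesteps and 6 agents.
# action = RLNMSketch()(torch.randn(1, 5, 6, 64))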
Large Language Feedback Model (LFM) Illustration
LFM framework: The LFM reconciles the outputs of the LNM and RLNM to stabilize the final mixture action. The Graph-of-Thoughts (GoT) construction of the LFM evaluates and scores the two candidate executions using additional generated evidence and chains of intermediate reasoning steps from different perspectives.
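The following is a minimal sketch of the LFM fusion step under the assumption that each GoT branch returns a scalar score per candidate action; the score_fn interface and the aspect names are placeholders for the LLM-generated evidence, not the actual prompt design.

from typing import Callable, Dict, List, Sequence

def lfm_fuse(
    actions: Dict[str, Sequence[float]],          # e.g. {"lnm": (vx, vy), "rlnm": (vx, vy)}
    aspects: List[str],                           # GoT evaluation perspectives
    score_fn: Callable[[str, str, Sequence[float]], float],  # one GoT branch -> score in [0, 1]
) -> List[float]:
    # Each (aspect, action) pair stands in for an intermediate GoT node; the scores
    # are aggregated at the final node into normalized fusion weights.
    totals = {name: sum(score_fn(aspect, name, act) for aspect in aspects)
              for name, act in actions.items()}
    norm = sum(totals.values()) or 1.0
    weights = {name: s / norm for name, s in totals.items()}
    # Weighted mixture of the LNM and RLNM actions (final robot behavior).
    dims = len(next(iter(actions.values())))
    return [sum(weights[name] * actions[name][i] for name in actions) for i in range(dims)]

# Example with a placeholder scorer standing in for an LLM call:
# fused = lfm_fuse(
#     {"lnm": (0.4, 0.0), "rlnm": (0.2, 0.1)},
#     aspects=["safety", "social comfort", "goal progress"],
#     score_fn=lambda aspect, name, a: 0.5,
# )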
Comparison Simulation Experiments and Trajectory Illustrations
The illustration of human-in-loop interactive social navigation: The social robot navigates toward the destination (red star) together with the user (blue circle), passing among ten pedestrians (green circles).
We designed two tasks for social navigation: a human-following task and a point-to-point navigation task, as shown in the following figures.
Additionally, a dynamic simulated human-feedback function is designed, in which the user's feedback (e.g., changing the goal position) is randomly issued to the robot with 50% probability.
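A minimal sketch of such a simulated feedback function, assuming a rectangular workspace and a goal-change feedback type, is shown below; the bounds and message wording are illustrative.

import random

def simulated_human_feedback(current_goal, workspace=((-5.0, 5.0), (-5.0, 5.0))):
    """Return a (possibly updated) goal position and the user's feedback text."""
    if random.random() < 0.5:
        # With 50% probability, the simulated user issues new feedback (a new goal).
        new_goal = (random.uniform(*workspace[0]), random.uniform(*workspace[1]))
        return new_goal, f"Please move to the new goal at {new_goal}."
    return current_goal, ""  # no feedback at this step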
Global LLM (GLLM) Implementation Details Visualization
Reinforcement Learning Navigation Model (RLNM) Implementation Details Visualization [1]
We directly employ this path planner as our RLNM.
(Input: obs -> self.actor (NaviSTAR) -> action)
Note: The code of NaviSTAR is publicly available at the GitHub link [link].
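The following is a minimal usage sketch of wrapping a pretrained policy following the obs -> self.actor -> action flow above; the RLNMWrapper class and its interface are illustrative, not the published NaviSTAR API.

import torch

class RLNMWrapper:
    def __init__(self, actor: torch.nn.Module):
        self.actor = actor  # pretrained NaviSTAR policy network

    @torch.no_grad()
    def act(self, obs: torch.Tensor) -> torch.Tensor:
        return self.actor(obs)  # obs -> self.actor (NaviSTAR) -> action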
Large Language Navigation Model (LNM) Implementation Details Visualization
The Entire Text Demo of LNM Input-Output
Large Language Feedback Model (LFM) Implementation Details Visualization
(Each node represents an intermediate LLM thought step of the Graph-of-Thoughts (GoT))
(Each edge represents an operation of the Graph-of-Thoughts (GoT))
The Entire Text Demo of LFM (node1, node3, node6, node8) Output
The Entire Text Demo of LFM (node10) Input-Output
References