InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment


Yuxing Long*, Wenzhe Cai*, Hongcheng Wang, Guanqi Zhan, Hao Dong

CFCS, School of Computer Science, Peking University

PKU-Agibot Lab, School of Computer Science, Peking University

National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University

School of Automation, Southeast University

University of Oxford


arXiv | Code | Video

Abstract

Enabling robots to navigate following diverse language instructions in unexplored environments is an attractive goal for human-robot interaction. However, this goal is challenging because different navigation tasks require different strategies, and the scarcity of instruction navigation data hinders training a single model that masters varied strategies. As a result, previous methods are constrained to one specific type of navigation instruction. In this work, we propose InstructNav, a generic instruction navigation system. InstructNav makes the first endeavor to handle various instruction navigation tasks without any navigation training or pre-built maps. To reach this goal, we introduce Dynamic Chain-of-Navigation (DCoN) to unify the planning process for different types of navigation instructions. Furthermore, we propose Multi-sourced Value Maps to model the key elements of instruction navigation, so that linguistic DCoN plans can be converted into robot-actionable trajectories. With InstructNav, we complete the R2R-CE task in a zero-shot way for the first time and outperform many task-trained methods. InstructNav also surpasses the previous SOTA by 10.48% on zero-shot Habitat ObjNav and by 86.34% on demand-driven navigation (DDN). Real-robot experiments on diverse indoor scenes further demonstrate our method's robustness to environment and instruction variations.

Dynamic Chain-of-Navigation

Different navigation tasks require different strategies. To unify the planning process, we propose the Dynamic Chain-of-Navigation (DCoN), which leverages the reasoning ability inherent in LLMs to output a structured plan in the format "Action 1 - Landmark A - Action 2 - Landmark B ...". The DCoN prediction conditions on the current map information, the task instruction, and the history of previous DCoN results. The predicted "Action" and "Landmark" are then used to decide the navigation trajectory, as sketched below.
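To make the planning loop concrete, here is a minimal sketch of how a DCoN prompt could be assembled and its output parsed. The prompt wording, the `query_llm` placeholder, and the single-step parsing logic are illustrative assumptions, not the paper's exact implementation.

```python
import re

def build_dcon_prompt(instruction, explored_objects, history_chain):
    """Assemble an LLM prompt from the task instruction, the objects
    observed so far on the online map, and previous DCoN predictions."""
    return (
        f"Navigation instruction: {instruction}\n"
        f"Objects observed on the map so far: {', '.join(explored_objects)}\n"
        f"Previous chain-of-navigation: {history_chain or 'None'}\n"
        "Predict the next step strictly in the format "
        "'<Action> - <Landmark>', e.g. 'Go to - the sofa'."
    )

def parse_dcon_step(llm_output):
    """Extract the (action, landmark) pair from the LLM's reply."""
    match = re.match(r"\s*(.+?)\s*-\s*(.+?)\s*$", llm_output)
    if match is None:
        raise ValueError(f"Unparsable DCoN step: {llm_output!r}")
    return match.group(1), match.group(2)

# Example usage (query_llm is a hypothetical stand-in for any
# chat-completion API):
# prompt = build_dcon_prompt("Go to the kitchen and find a mug",
#                            ["door", "hallway", "sofa"], history_chain="")
# action, landmark = parse_dcon_step(query_llm(prompt))
```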

Multi-Sourced Value Map

DCoN expresses its reasoning only in text. To translate these textual results into actionable robot trajectories, we introduce the Multi-Sourced Value Map, which assigns a value score to every coordinate in the online point-cloud map. The multi-sourced value map consists of four parts: an intuition value map, a trajectory value map, an action value map, and a semantic value map. The combined value map is then used to select the best navigation waypoint as well as a collision-free trajectory, as illustrated below.
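As a concrete illustration, the sketch below fuses four per-coordinate value maps into a single score map and picks the highest-value traversable cell as the next waypoint. The uniform weights, function names, and 2D array layout are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def combine_value_maps(intuition, trajectory, action, semantic, weights=None):
    """Fuse the four HxW value maps (aligned with the online point-cloud
    map) into a single score map. Uniform weighting is an assumption,
    not the paper's tuning."""
    weights = weights or (0.25, 0.25, 0.25, 0.25)
    maps = (intuition, trajectory, action, semantic)
    return sum(w * m for w, m in zip(weights, maps))

def select_waypoint(combined, traversable_mask):
    """Pick the highest-value coordinate among traversable cells."""
    scores = np.where(traversable_mask, combined, -np.inf)
    return np.unravel_index(np.argmax(scores), scores.shape)  # (row, col)

# A collision-free trajectory to the chosen waypoint can then be planned
# with any off-the-shelf planner (e.g., A* on the traversable grid).
```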

Experiments

We evaluate InstructNav on multiple navigation tasks, including Object-Goal Navigation (ObjNav), VLN-CE, and Demand-Driven Navigation (DDN). Our training-free approach surpasses many training-based methods on VLN-CE and outperforms the previous SOTA by 10.48% on ObjNav and by 86.34% on Demand-Driven Navigation. InstructNav also achieves competitive results with open-source models such as LLaMA and LLaVA.

Project Video