Discuss Before Moving: Visual Language Navigation via Multi-expert Discussions


Yuxing Long, Xiaoqi Li, Wenzhe Cai, Hao Dong

Center on Frontiers of Computing Studies (CFCS), School of CS, Peking University

ICRA 2024 

arXiv, Code 

Abstract

Visual language navigation (VLN) is an embodied task that demands a wide range of skills, including understanding, perception, and planning. For such a multifaceted challenge, previous VLN methods rely entirely on a single model's own reasoning to make predictions within one round. However, existing models, even the most advanced large language model GPT4, still struggle to handle multiple tasks through single-round self-thinking. In this work, drawing inspiration from expert consultation meetings, we introduce a novel zero-shot VLN framework in which large models with distinct abilities serve as domain experts. Our proposed navigation agent, DiscussNav, actively discusses with these experts to collect essential information before moving at every step. These discussions cover critical navigation subtasks such as instruction understanding, environment perception, and completion estimation. Through comprehensive experiments, we demonstrate that discussions with domain experts effectively facilitate navigation by perceiving instruction-relevant information, correcting inadvertent errors, and sifting through inconsistent movement decisions. Results on the representative VLN task R2R show that our method surpasses the leading zero-shot VLN model by a large margin on all metrics. Additionally, real-robot experiments demonstrate clear advantages of our method over single-round self-thinking.

Method


Fig.1. Demonstration of navigation agent DiscussNav powered by large language model (GPT4) 


Fig.2. Establishment of domain experts and discussions between the DiscussNav agent and the experts
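The per-step discussion described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the released DiscussNav code: the names `discuss_then_move`, `StepResult`, `Expert`, and `propose_action` are hypothetical, each expert is modeled as a plain text-to-text function standing in for a large model, and the majority vote over several proposals is only an assumed stand-in for how inconsistent movement decisions might be sifted.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical abstraction: an expert maps a text query to a text answer.
# In practice each expert would wrap a large model; here they are stubs.
Expert = Callable[[str], str]


@dataclass
class StepResult:
    action: str            # chosen movement, or "stop"
    votes: Dict[str, int]  # how often each candidate action was proposed


def discuss_then_move(
    instruction: str,
    observation: str,
    experts: Dict[str, Expert],
    propose_action: Callable[[str], str],
    n_proposals: int = 3,
) -> StepResult:
    """Run one navigation step: consult the experts, then choose a movement."""
    # 1. Instruction understanding: extract what the instruction asks for.
    plan = experts["instruction"](instruction)
    # 2. Environment perception: describe the scene with the plan in mind.
    scene = experts["perception"](f"{observation} | looking for: {plan}")
    # 3. Completion estimation: decide whether the instruction is finished.
    progress = experts["completion"](f"plan: {plan} | scene: {scene}")
    if progress == "done":
        return StepResult("stop", {"stop": 1})
    # 4. Assumed sifting rule: sample several decisions, keep the majority.
    context = f"plan: {plan} | scene: {scene} | progress: {progress}"
    votes = Counter(propose_action(context) for _ in range(n_proposals))
    action = votes.most_common(1)[0][0]
    return StepResult(action, dict(votes))


# Usage with stub experts (all answers are made up for illustration).
stub_experts = {
    "instruction": lambda q: "hallway, laundry room, arched entry",
    "perception": lambda q: "a hallway with paintings on the wall",
    "completion": lambda q: "in progress",
}
step = discuss_then_move(
    "Walk into the arched hallway.", "front view", stub_experts,
    propose_action=lambda ctx: "forward",
)
print(step.action, step.votes)
```

Keeping the experts behind a single function type means any of them can be swapped for a different model without touching the step logic, which is the main point of the sketch.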

DiscussNav Deployed in the Simulation


Walk into the arched hallway near the native American paintings on the wall. Walk past the doorway to the laundry room and down the hall past the candles on the wall. Continue to the open arched entry at the end of the hall.

Once at the top of the stairs, turn left and walk towards the first doorway.  Enter the first doorway and walk towards the bed.  Once at the foot of the bed, turn left and walk through the door.  You should now be standing in front of a sink and mirror.


Walk forward across the room and walk through the pantry followed by the kitchen.  Stand at the end of the kitchen.

DiscussNav Deployed in the Real-World

Open-vocabulary Landmarks


Move forward passing the table with a robot arm. Then, turn right at the corner.

Fine-grained Landmarks

Move to the potted plant on the brown kitchen counter. Then, go to the teddy bear on the black chair. Lastly, turn right to the table with a paper box on it.

Room Change

Go to the fire extinguisher first. Turn right to exit the office through the wooden door. Stop when you enter the hallway.

BibTeX 

@article{long2023discuss,
  title={Discuss before moving: Visual language navigation via multi-expert discussions},
  author={Long, Yuxing and Li, Xiaoqi and Cai, Wenzhe and Dong, Hao},
  journal={arXiv preprint arXiv:2309.11382},
  year={2023}
}