VLN-Game: Vision-Language Equilibrium Search
for Zero-Shot Semantic Navigation
Bangguo Yu¹², Yuzhen Liu¹✉️, Lei Han¹, Hamidreza Kasaei²✉️, Tingguang Li¹✉️, and Ming Cao²
¹Tencent Robotics X ²University of Groningen
Paper Link: https://arxiv.org/pdf/2411.11609
Following human instructions to explore and search for a specified target in an unfamiliar environment is a crucial skill for mobile service robots. Most previous work on object goal navigation focuses on a single input modality as the target, which limits its ability to handle language descriptions containing detailed attributes and spatial relationships. To address this limitation, we propose VLN-Game, a novel zero-shot framework for visual target navigation that can process both object names and descriptive language targets effectively. Specifically, our approach constructs a 3D object-centric spatial map by integrating pre-trained visual-language features with a 3D reconstruction of the physical environment. The framework then identifies the most promising areas to explore in search of potential target candidates, and a game-theoretic vision-language model determines which candidate best matches the given language description. Experiments conducted on the Habitat-Matterport 3D (HM3D) dataset demonstrate that the proposed framework achieves state-of-the-art performance in both object goal navigation and language-based navigation tasks. Moreover, we show that VLN-Game can be easily deployed on real-world robots. The success of VLN-Game highlights the promising potential of using game-theoretic methods with compact vision-language models to advance decision-making capabilities in robotic systems.
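To make the equilibrium-search idea concrete, below is a minimal, self-contained sketch of choosing among detected candidates via a simple two-player coordination game, solved approximately with multiplicative-weights (Hedge) updates. The payoff signals (a per-candidate CLIP similarity and a per-candidate VLM consistency score), the function name equilibrium_select, and the update rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Illustrative sketch (not the paper's exact algorithm): select among candidate
# targets by approximating an equilibrium of a two-player coordination game.
# Player 1 (a "proposer") weights candidates by CLIP similarity to the target
# description; player 2 (a "verifier") weights them by a VLM consistency score
# for the described attributes and spatial relations. Both score vectors are
# stand-ins supplied by the caller.
def equilibrium_select(clip_scores: np.ndarray,
                       vlm_scores: np.ndarray,
                       iters: int = 100,
                       eta: float = 0.5) -> int:
    """Approximate equilibrium via exponentiated-weight (Hedge) updates."""
    p = np.full_like(clip_scores, 1.0 / len(clip_scores))  # proposer policy
    q = np.full_like(vlm_scores, 1.0 / len(vlm_scores))    # verifier policy
    for _ in range(iters):
        # A candidate's payoff grows when the other player also puts mass on
        # it (coordination) and when its own score is high.
        p *= np.exp(eta * clip_scores * q)
        p /= p.sum()
        q *= np.exp(eta * vlm_scores * p)
        q /= q.sum()
    return int(np.argmax(p * q))  # candidate both players agree on

# Toy usage: three candidates; candidate 1 is favored by both signals.
idx = equilibrium_select(np.array([0.2, 0.9, 0.4]), np.array([0.1, 0.8, 0.7]))
assert idx == 1
```

The product p * q rewards candidates on which the perceptual signal and the language-reasoning signal agree, which is the intuition behind using an equilibrium rather than either score alone.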
This framework utilizes posed RGB-D frames to generate a 3D object-centric map and an exploration map for robot navigation. Target descriptions are parsed by an LLM to set the primary navigation goals. Using CLIP-based similarity assessments, the system evaluates the relevance between the target and environmental features to direct exploration. Upon detecting a potential target, a game-theoretic vision-language model analyzes the spatial relationships described in the target instructions. Once the long-term goal is reached or the target is identified, a local policy dictates the robot's final actions.
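As an illustration of the CLIP-based similarity assessment mentioned above, here is a minimal sketch that scores per-object CLIP image embeddings against a language target using the Hugging Face transformers CLIP interface. The helper name score_candidates and the assumption that the map stores one aggregated CLIP embedding per object are ours, not the authors' actual implementation.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_candidates(target_text: str, object_features: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between a language target and per-object CLIP features.

    object_features: (N, D) image embeddings, one row per object in the map
    (assumed to be aggregated from the views that observed that object).
    Returns an (N,) similarity vector; the argmax is the most promising candidate.
    """
    inputs = processor(text=[target_text], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_feat = model.get_text_features(**inputs)            # (1, D)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    obj_feat = object_features / object_features.norm(dim=-1, keepdim=True)
    return (obj_feat @ text_feat.T).squeeze(-1)                  # (N,)

# Toy usage with random stand-in features (512-d matches ViT-B/32 embeddings).
scores = score_candidates("a black office chair near a desk", torch.randn(5, 512))
print(int(scores.argmax()))
```

The same scoring can be applied to frontier regions of the exploration map to decide where to search next, and its output is one of the two signals fed into the equilibrium selection sketched earlier.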
Find a tv_screen
Find a bed
A case showing the process of finding a white desk that is located in front of the cabinet and has a window in front of it.
Find a black office chair
Find a black office chair between a whiteboard and a desk
Find a sitting person
Find a person sitting on the sofa
Find a brown sofa
Find a brown sofa near a plant
Wrong detection of the TV
Wrong detection of the couch
Part of this work was conducted during the first author’s internship at Tencent Robotics X.