VLN-Game: Vision-Language Equilibrium Search
for Zero-Shot Semantic Navigation
Bangguo Yu¹², Yuzhen Liu¹✉️, Lei Han¹, Hamidreza Kasaei²✉️, Tingguang Li¹✉️, and Ming Cao²
¹Tencent Robotics X ²University of Groningen
Paper Link: https://arxiv.org/pdf/2411.11609
Following human instructions to explore and search for a specified target in an unfamiliar environment is a crucial skill for mobile service robots. Most previous work on object goal navigation focuses on a single input modality as the target, which limits its ability to handle language descriptions containing detailed attributes and spatial relationships. To address this limitation, we propose VLN-Game, a novel zero-shot framework for visual target navigation that can process both object names and descriptive language targets effectively. Specifically, our approach constructs a 3D object-centric spatial map by integrating pre-trained visual-language features with a 3D reconstruction of the physical environment. The framework then identifies the most promising areas to explore in search of potential target candidates, and a game-theoretic vision-language model determines which candidate best matches the given language description. Experiments conducted on the Habitat-Matterport 3D (HM3D) dataset demonstrate that the proposed framework achieves state-of-the-art performance in both object goal navigation and language-based navigation tasks. Moreover, we show that VLN-Game can be easily deployed on real-world robots. The success of VLN-Game highlights the promising potential of using game-theoretic methods with compact vision-language models to advance decision-making capabilities in robotic systems.
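To make the equilibrium-search idea concrete, below is a minimal, self-contained sketch of choosing among detected candidates via a simple two-player coordination game, solved approximately with multiplicative-weights (Hedge) updates. The payoff signals (a per-candidate CLIP similarity and a per-candidate VLM consistency score), the function name equilibrium_select, and the update rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Illustrative sketch (not the paper's exact algorithm): select among candidate
# targets by approximating an equilibrium of a two-player coordination game.
# Player 1 (a "proposer") weights candidates by CLIP similarity to the target
# description; player 2 (a "verifier") weights them by a VLM consistency score
# for the described attributes and spatial relations. Both score vectors are
# stand-ins supplied by the caller.
def equilibrium_select(clip_scores: np.ndarray,
                       vlm_scores: np.ndarray,
                       iters: int = 100,
                       eta: float = 0.5) -> int:
    """Approximate equilibrium via exponentiated-weight (Hedge) updates."""
    p = np.full_like(clip_scores, 1.0 / len(clip_scores))  # proposer policy
    q = np.full_like(vlm_scores, 1.0 / len(vlm_scores))    # verifier policy
    for _ in range(iters):
        # A candidate's payoff grows when the other player also puts mass on
        # it (coordination) and when its own score is high.
        p *= np.exp(eta * clip_scores * q)
        p /= p.sum()
        q *= np.exp(eta * vlm_scores * p)
        q /= q.sum()
    return int(np.argmax(p * q))  # candidate both players agree on

# Toy usage: three candidates; candidate 1 is favored by both signals.
idx = equilibrium_select(np.array([0.2, 0.9, 0.4]), np.array([0.1, 0.8, 0.7]))
assert idx == 1
```

The product p * q rewards candidates on which the perceptual signal and the language-reasoning signal agree, which is the intuition behind using an equilibrium rather than either score alone.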
This framework utilizes posed RGB-D frames to generate a 3D object-centric map and an exploration map for robot navigation. Target descriptions are parsed by an LLM to set the primary navigation goals. Using CLIP-based similarity assessments, the system evaluates the relevance between the target and environmental features to direct exploration. Upon detecting a potential target, a game-theoretic vision-language model analyzes the spatial relationships described in the target instructions. Once the long-term goal is reached or the target is identified, a local policy dictates the robot's final actions.
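As an illustration of the CLIP-based similarity assessment mentioned above, here is a minimal sketch that scores per-object CLIP image embeddings against a language target using the Hugging Face transformers CLIP interface. The helper name score_candidates and the assumption that the map stores one aggregated CLIP embedding per object are ours, not the authors' actual implementation.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_candidates(target_text: str, object_features: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between a language target and per-object CLIP features.

    object_features: (N, D) image embeddings, one row per object in the map
    (assumed to be aggregated from the views that observed that object).
    Returns an (N,) similarity vector; the argmax is the most promising candidate.
    """
    inputs = processor(text=[target_text], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_feat = model.get_text_features(**inputs)            # (1, D)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    obj_feat = object_features / object_features.norm(dim=-1, keepdim=True)
    return (obj_feat @ text_feat.T).squeeze(-1)                  # (N,)

# Toy usage with random stand-in features (512-d matches ViT-B/32 embeddings).
scores = score_candidates("a black office chair near a desk", torch.randn(5, 512))
print(int(scores.argmax()))
```

The same scoring can be applied to frontier regions of the exploration map to decide where to search next, and its output is one of the two signals fed into the equilibrium selection sketched earlier.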
Find a tv_screen
Find a bed
A case showing the process of finding a white desk that is located in front of the cabinet and has a window in front of it.
Find a black office chair
Find a black office chair between a whiteboard and a desk
Find a sitting person
Find a person sitting on the sofa
Find a brown sofa
Find a brown sofa near a plant
Wrong detection of the TV
Wrong detection of the couch
Part of this work was conducted during the first author’s internship at Tencent Robotics X.