ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation

ICML 2023

Kaiwen Zhou*, Kaizhi Zheng*, Connor Pryor*, Yilin Shen#, Hongxia Jin#, Lise Getoor*, Xin Eric Wang*

UC Santa Cruz*, Samsung Research America#

Abstract

The ability to accurately locate and navigate to a specific object is a crucial capability for embodied agents that operate in the real world and interact with objects to complete tasks. Such object navigation tasks usually require large-scale training in visual environments with labeled objects, an approach that generalizes poorly to novel objects and environments. In this work, we present a novel zero-shot object navigation method, Exploration with Soft Commonsense constraints (ESC), that transfers commonsense knowledge in pre-trained models to open-world object navigation without any navigation experience or any other training on the visual environments. First, ESC leverages a pre-trained vision-and-language model for open-world prompt-based grounding and large language models (e.g., DeBERTa, ChatGPT) for navigation-related room-level and object-level commonsense reasoning. Then ESC converts commonsense knowledge into navigation actions by modeling it as soft logic predicates for efficient exploration. Extensive experiments on the MP3D, HM3D, and RoboTHOR benchmarks show that ESC improves significantly over baselines and achieves new state-of-the-art results for zero-shot object navigation (e.g., a 288% relative Success Rate improvement over CoW on MP3D).

Method

Our method enables the agent to efficiently explore the environment with commonsense knowledge from LLMs. To achieve this, we first acquire object-level and room-level commonsense from pre-trained LLMs.
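As a rough illustration of the commonsense acquisition step (not the released implementation, which is not public), the knowledge can be thought of as a table of scores for goal and context pairs, filled by querying a language model once per pair. The prompt wording, the function names `build_commonsense_table` and `fake_llm`, and the scores returned by the stand-in scorer are all assumptions made for this sketch.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical prompt template; the wording used in the paper may differ.
PROMPT = (
    "How likely is it to find a {goal} near a {context}? "
    "Answer with a number between 0 and 1."
)


def build_commonsense_table(
    goals: List[str],
    contexts: List[str],
    ask_llm: Callable[[str], float],
) -> Dict[Tuple[str, str], float]:
    """Query the scorer once per (goal, context) pair and clamp results to [0, 1]."""
    table: Dict[Tuple[str, str], float] = {}
    for goal in goals:
        for context in contexts:
            score = ask_llm(PROMPT.format(goal=goal, context=context))
            table[(goal, context)] = min(1.0, max(0.0, score))
    return table


def fake_llm(prompt: str) -> float:
    """Stand-in for a real LLM call; the scores are made up for illustration."""
    made_up = {PROMPT.format(goal="bed", context="bedroom"): 0.9}
    return made_up.get(prompt, 0.1)


if __name__ == "__main__":
    table = build_commonsense_table(["bed", "toilet"], ["bedroom", "bathroom"], fake_llm)
    for pair, score in table.items():
        print(pair, score)
```

Because the table is computed before navigation starts, no model query is needed for these priors while the agent is moving; a context can be either a room type or an object category.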

During navigation, the agent first recognizes objects and room types from its egocentric views. It then projects the object, room, and spatial information onto a semantic map. A navigation map containing frontiers and obstacle information is also constructed and maintained from depth information and the agent pose.
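To make the notion of a frontier concrete, here is a minimal sketch that assumes the navigation map is a 2D grid whose cells are unknown, free, or obstacle, and that a frontier is a free cell bordering unexplored space. The cell encoding and the function name `find_frontiers` are assumptions for illustration; the paper's map representation may differ.

```python
from typing import List, Tuple

# Assumed cell encoding for the navigation map (illustrative only).
UNKNOWN, FREE, OBSTACLE = -1, 0, 1


def find_frontiers(grid: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (row, col) of every free cell with at least one unknown 4-neighbor."""
    rows, cols = len(grid), len(grid[0])
    frontiers: List[Tuple[int, int]] = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            # Check the up, down, left, and right neighbors inside the grid.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break
    return frontiers


if __name__ == "__main__":
    grid = [
        [FREE, FREE, UNKNOWN],
        [FREE, OBSTACLE, UNKNOWN],
        [FREE, FREE, FREE],
    ]
    print(find_frontiers(grid))
```

In practice, adjacent frontier cells are usually grouped into clusters so that the agent chooses among a handful of candidate regions rather than individual cells; that grouping is omitted here.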

Based on the semantic map and the commonsense knowledge, the agent repeatedly selects a frontier and navigates to it to explore the environment. Once it detects a goal object, the agent navigates directly to it.
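The selection step can be sketched as follows. This is a deliberately simplified stand-in: it replaces the soft logic inference described in the paper with a plain additive score (commonsense priors of labels mapped near a frontier, minus a travel cost). The names `frontier_score` and `select_target`, the `radius` and `distance_weight` parameters, and all numeric values are assumptions for illustration.

```python
import math
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def frontier_score(
    frontier: Cell,
    agent: Cell,
    goal: str,
    semantic_map: Dict[str, List[Cell]],
    commonsense: Dict[Tuple[str, str], float],
    radius: float = 3.0,
    distance_weight: float = 0.05,
) -> float:
    """Sum the priors of labels mapped within `radius` of the frontier, minus a travel cost."""
    prior = 0.0
    for label, cells in semantic_map.items():
        if any(math.dist(frontier, cell) <= radius for cell in cells):
            prior += commonsense.get((goal, label), 0.0)
    return prior - distance_weight * math.dist(agent, frontier)


def select_target(
    agent: Cell,
    goal: str,
    frontiers: List[Cell],
    semantic_map: Dict[str, List[Cell]],
    commonsense: Dict[Tuple[str, str], float],
) -> Optional[Cell]:
    """Go straight to the goal if it is already mapped; otherwise pick the best frontier."""
    goal_cells = semantic_map.get(goal, [])
    if goal_cells:
        return min(goal_cells, key=lambda cell: math.dist(agent, cell))
    if not frontiers:
        return None  # Nothing left to explore.
    return max(
        frontiers,
        key=lambda f: frontier_score(f, agent, goal, semantic_map, commonsense),
    )


if __name__ == "__main__":
    priors = {("toilet", "sink"): 0.9, ("toilet", "sofa"): 0.1}
    seen = {"sink": [(10, 1)], "sofa": [(1, 10)]}
    print(select_target((0, 0), "toilet", [(0, 10), (10, 0)], seen, priors))
```

With these made-up numbers, both frontiers are equally far from the agent, so the one next to the sink wins because a sink is a stronger cue for a toilet than a sofa is. The soft constraints bias exploration toward promising regions without ruling out the others.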

Questions?

Check the paper for more details. The source code for the method cannot be publicly released due to company policy, but don't hesitate to contact Kaiwen Zhou at kzhou35@ucsc.edu about re-implementation.

Citation

@inproceedings{zhou2023esc,
  author = {Zhou, Kaiwen and Zheng, Kaizhi and Pryor, Connor and Shen, Yilin and Jin, Hongxia and Getoor, Lise and Wang, Xin Eric},
  title = {ESC: Exploration with Soft Commonsense Constraints for Zero-Shot Object Navigation},
  year = {2023},
  publisher = {JMLR.org},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  articleno = {1806},
  numpages = {14},
  location = {Honolulu, Hawaii, USA},
  series = {ICML'23}
}