The in-context learning ability of Transformer models has brought new possibilities to visual navigation. In this paper, we focus on the video navigation setting, where an in-context navigation policy must be learned purely from videos in an offline manner, without access to the actual environment. For this setting, we propose Navigate Only Look Once (NOLO), a method for learning a navigation policy with in-context ability: it adapts to new scenes by taking the corresponding context video as input, without fine-tuning or re-training. To enable learning from videos, we first propose a pseudo action labeling procedure that uses optical flow to recover action labels from egocentric videos. Offline reinforcement learning is then applied to learn the navigation policy. Through extensive experiments on different scenes, we show that our algorithm outperforms baselines by a large margin, demonstrating the in-context learning ability of the learned policy.
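As a rough illustration of how actions might be recovered from raw video with optical flow, here is a minimal sketch. It assumes a discrete action space (`move_forward`, `turn_left`, `turn_right`), a dominant-motion heuristic, and the thresholds shown; all of these are illustrative assumptions, not the paper's exact labeling procedure.

```python
# Hedged sketch: recover a pseudo action label from two consecutive frames
# via dense optical flow. The action set and thresholds are assumptions for
# illustration, not NOLO's exact labeling procedure.
import cv2
import numpy as np

def pseudo_action_label(prev_frame: np.ndarray, next_frame: np.ndarray,
                        turn_thresh: float = 2.0) -> str:
    """Label the transition between two consecutive BGR frames
    (e.g., read with cv2.VideoCapture)."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Dense Farneback flow: flow[y, x] = (dx, dy) per pixel.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mean_dx = float(flow[..., 0].mean())
    # Heuristic: camera yaw shifts the whole image horizontally, while
    # forward translation makes flow diverge from the image center with
    # little net horizontal drift.
    if mean_dx > turn_thresh:    # scene drifts right => camera turned left
        return "turn_left"
    if mean_dx < -turn_thresh:   # scene drifts left => camera turned right
        return "turn_right"
    return "move_forward"
```

Labels produced this way could then serve as the action annotations consumed by the offline reinforcement learning stage.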
Most existing research on 3D visual indoor navigation relies on heterogeneous modules and sensors, whose cumulative errors can hinder policy generalization and deployment:
- Modules: SLAM, object segmentation, keypoint matchers, depth estimators
- Sensors: GPS+Compass, RGB-D cameras, IMU
Humans possess an innate ability to navigate a place after watching a traversal video of it, a setting we call video navigation. But learning a video navigation policy for mobile robots is challenging:
- no ground-truth actions or topological structure of the scenes are available
- the navigation intention must be inferred implicitly from the context video
- only RGB images are available, with no depth or other sensor readings
NOLO promises to equip mobile agents with human-like video navigation capabilities by learning an in-context policy purely from 30-second egocentric RGB video clips.
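At inference time, the workflow implied above could look like the following rollout loop, where a frozen Transformer policy conditions on the context video and a target image. The `policy` and `env` interfaces here are hypothetical stand-ins for illustration, not the released API.

```python
import torch

@torch.no_grad()
def navigate(policy: torch.nn.Module,
             context_video: torch.Tensor,  # (T, 3, H, W) context clip
             target_image: torch.Tensor,   # (3, H, W) goal observation
             env, max_steps: int = 200) -> bool:
    """Roll out an in-context policy in a new scene without fine-tuning.

    Assumes `policy(context, target, obs)` returns logits over a discrete
    action space and `env` exposes a minimal reset()/step() interface;
    both are illustrative assumptions.
    """
    obs = env.reset()                      # (3, H, W) current RGB frame
    for _ in range(max_steps):
        logits = policy(context_video, target_image, obs)
        action = int(logits.argmax(dim=-1))
        obs, done = env.step(action)       # only RGB comes back: no depth,
        if done:                           # GPS, or other sensor readings
            return True
    return False
```

Adapting to a new scene then amounts to swapping in that scene's context video; the policy weights stay fixed.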
Left: training scenes in ROBOTHOR.
Right: two groups of (context video, target image, policy rollout):
- Blue: policy inference in scenes with a seen topological structure but an unseen layout
- Orange: policy inference in entirely unseen scenes with different target objects
@article{zhou2024nolo,
  title={NOLO: Navigate Only Look Once},
  author={Zhou, Bohan and Zhang, Zhongbin and Wang, Jiangxing and Lu, Zongqing},
  journal={arXiv preprint arXiv:2408.01384},
  year={2024}
}