LookOut: Real-World Humanoid Egocentric Navigation
Boxiao Pan, Adam W Harley, C. Karen Liu*, Leonidas J. Guibas*
(* Equal advising)
Stanford University
International Conference on Computer Vision (ICCV), 2025
[Arxiv] [Data]
We introduce the challenging problem of predicting a sequence of future 6D head poses from an egocentric video. This setting differs from prior works primarily in two aspects:
The predictions include both translations, for collision-free navigation, and rotations, for learning the active information-gathering behaviors humans demonstrate through head-turning events.
We consider real-world scenarios with a significant presence of both static and dynamic obstacles.
Solving this task in a generalized setting has important implications for real-world humanoid egocentric navigation.
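To make the problem setup concrete, below is a minimal sketch of the task interface: given a short history of egocentric RGB frames, the model outputs a sequence of future 6D head poses, each a 3D translation plus a 3D rotation. The `HeadPose` container, function name, frame resolution, and 12-step horizon are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class HeadPose:
    """One 6D head pose: 3D translation plus 3D rotation (3x3 matrix)."""
    translation: np.ndarray  # shape (3,), position of the head in a world frame
    rotation: np.ndarray     # shape (3, 3), orientation of the head/camera

def predict_future_poses(past_frames: np.ndarray, horizon: int) -> list[HeadPose]:
    """Hypothetical interface of the forecasting task: map T past egocentric
    frames of shape (T, H, W, 3) to `horizon` future head poses. The real
    predictor is a learned model; this stub just returns identity poses."""
    return [HeadPose(np.zeros(3), np.eye(3)) for _ in range(horizon)]

# Example call: 8 past frames at 224x224, forecast 12 future poses.
frames = np.zeros((8, 224, 224, 3), dtype=np.uint8)
future = predict_future_poses(frames, horizon=12)
print(len(future), future[0].translation.shape, future[0].rotation.shape)
```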
To solve this problem, we propose a framework that reasons over temporally aggregated 3D latent features, modeling the geometric and semantic constraints of both the static and dynamic parts of the environment.
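As a rough, hypothetical sketch of what reasoning over temporally aggregated 3D latent features could look like, the toy module below pools per-frame 3D feature volumes, fuses them over time with a GRU, and decodes a sequence of future translations and rotations (as unit quaternions). All module names, dimensions, and the quaternion parameterization are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class TemporalPoseForecaster(nn.Module):
    """Toy example: aggregate per-frame 3D latent features over time,
    then decode a fixed-horizon sequence of future 6D head poses."""
    def __init__(self, feat_dim=256, hidden=512, horizon=12):
        super().__init__()
        self.horizon = horizon
        self.pool = nn.AdaptiveAvgPool3d(1)          # collapse each 3D feature volume
        self.temporal = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon * 7)   # 3 translation + 4 quaternion per step

    def forward(self, latents):                      # latents: (B, T, C, D, H, W)
        B, T = latents.shape[:2]
        x = self.pool(latents.flatten(0, 1)).flatten(1)  # (B*T, C)
        x = x.view(B, T, -1)                             # (B, T, C)
        _, h = self.temporal(x)                          # h: (1, B, hidden)
        out = self.head(h[-1]).view(B, self.horizon, 7)
        trans, quat = out[..., :3], out[..., 3:]
        quat = nn.functional.normalize(quat, dim=-1)     # unit quaternions
        return trans, quat

model = TemporalPoseForecaster()
latents = torch.randn(2, 8, 256, 16, 16, 16)  # 2 clips, 8 timesteps of 3D features
trans, quat = model(latents)
print(trans.shape, quat.shape)  # (2, 12, 3) and (2, 12, 4)
```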
Due to the lack of training data, we further contribute a data collection pipeline using a pair of Project Aria glasses, and present a dataset collected through this approach. Our dataset, dubbed the Aria Navigation Dataset (AND), consists of 4 hours of recordings of users navigating real-world scenarios. It covers diverse situations and navigation behaviors, providing a valuable resource for learning real-world egocentric navigation policies.
Extensive experiments show that our model learns human-like navigation behaviors such as waiting / slowing down, rerouting, and looking around for traffic while generalizing to unseen environments.
Green denotes ground-truths; Red represents model predictions.
Squares are projected camera frustums, representing rotations; curves are translations projected onto the ground.