LM-Nav: Robotic Navigation with Large Pre-Trained Models
of Language, Vision, and Action

Dhruv Shah*, Błażej Osiński*, Brian Ichter, Sergey Levine

UC Berkeley, University of Warsaw, Robotics @ Google

Conference on Robot Learning (CoRL) 2022
Auckland, New Zealand


Oral Talk at Foundation Models for Decision Making Workshop at NeurIPS 2022

Oral Talk at Bay Area Machine Learning Symposium (BayLearn) 2022

Summary Video

Problem Statement



Given a high-level textual instruction for navigating a real-world environment, how can we get a robot to follow it solely from egocentric visual observations?

Main Idea

Our key insight is that we can use pre-trained models of images and language to provide a textual interface to visual navigation models!
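Concretely, LM-Nav combines three off-the-shelf pre-trained models: a large language model (GPT-3) parses the free-form instruction into an ordered list of landmarks, a vision-language model (CLIP) grounds those landmarks in the images stored at the nodes of a topological graph built by a visual navigation model (ViNG), and a graph search picks the route that best visits the landmarks, which the navigation model then executes. The sketch below is an illustrative Python reconstruction of that recipe rather than the released code: the LLM and VLM calls are replaced by dummy stand-ins, the search objective is simplified, and every function and variable name is hypothetical.

import heapq
import math
import random


def parse_landmarks(instruction):
    # Stand-in for the LLM step (GPT-3 in the paper): extract an ordered
    # list of landmark phrases from the free-form instruction.
    return ["a stop sign", "a picnic table", "a blue dumpster"]


def landmark_log_probs(landmarks, num_nodes, seed=0):
    # Stand-in for the VLM step (CLIP in the paper): log-probability that
    # the image stored at each graph node depicts each landmark. Real
    # scores would come from image-text similarity; these are random.
    rng = random.Random(seed)
    return {(n, l): math.log(rng.uniform(0.05, 0.95))
            for n in range(num_nodes) for l in landmarks}


def plan(edges, landmarks, log_probs, start, alpha=0.1):
    # Dijkstra over states (node, k), where k landmarks have been matched.
    # Matching landmark k at the current node costs -log p(landmark | image);
    # traversing an edge costs alpha * distance. Minimizing the total trades
    # off detection confidence against path length (a simplification of the
    # paper's graph-search objective).
    K = len(landmarks)
    dist = {(start, 0): 0.0}
    prev = {}
    heap = [(0.0, start, 0)]
    while heap:
        d, node, k = heapq.heappop(heap)
        if d > dist.get((node, k), math.inf):
            continue
        if k == K:
            # All landmarks matched: reconstruct the node sequence.
            path, state = [node], (node, k)
            while state in prev:
                state = prev[state]
                if path[-1] != state[0]:
                    path.append(state[0])
            return list(reversed(path)), d
        # Option 1: match the next landmark at the current node.
        nd = d - log_probs[(node, landmarks[k])]
        if nd < dist.get((node, k + 1), math.inf):
            dist[(node, k + 1)] = nd
            prev[(node, k + 1)] = (node, k)
            heapq.heappush(heap, (nd, node, k + 1))
        # Option 2: move along an edge of the topological graph.
        for u, v, length in edges:
            if node in (u, v):
                nxt = v if node == u else u
                nd = d + alpha * length
                if nd < dist.get((nxt, k), math.inf):
                    dist[(nxt, k)] = nd
                    prev[(nxt, k)] = (node, k)
                    heapq.heappush(heap, (nd, nxt, k))
    return None, math.inf


if __name__ == "__main__":
    # Toy 4-node topological graph: (node_a, node_b, distance in meters).
    edges = [(0, 1, 10.0), (1, 2, 12.0), (2, 3, 8.0), (0, 3, 30.0)]
    landmarks = parse_landmarks(
        "Go past the stop sign and the picnic table, then stop near the blue dumpster.")
    log_probs = landmark_log_probs(landmarks, num_nodes=4)
    path, cost = plan(edges, landmarks, log_probs, start=0)
    print("planned node sequence:", path, "objective:", round(cost, 3))

Dijkstra applies here because both cost terms are nonnegative (-log p >= 0 and alpha * distance >= 0); in the real system, the planned node sequence is then executed by the navigation model's low-level policy.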

LM-Nav Following Instructions in the Real World

Disambiguating Instructions

Language is an inherently ambiguous means of task specification, and multiple paths to the goal may satisfy a given instruction. In such cases, an instruction-following system must be able to disambiguate between paths in the environment based on fine-grained modifications to the instruction. We show an experiment in which LM-Nav is given two slightly different instructions for reaching the same goal.


LM-Nav succeeds in following these instructions by discovering two separate paths in the environment, as desired.
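As a toy, self-contained illustration of this effect (all route names, match probabilities, and lengths below are invented for the example), adding a single landmark phrase to the parsed instruction can flip which of two candidate routes the graph-search objective prefers, where the objective is landmark-match confidence minus a path-length penalty, as in the sketch above.

import math

# Invented per-route landmark-match probabilities and lengths for two
# candidate routes to the same goal (not measured data).
routes = {
    "route via the park": {"a picnic table": 0.9, "a stop sign": 0.2, "length_m": 220.0},
    "route via the parking lot": {"a picnic table": 0.1, "a stop sign": 0.8, "length_m": 180.0},
}

def route_score(route, landmarks, alpha=0.005):
    # Sum of landmark log-probabilities minus a penalty on route length.
    info = routes[route]
    return sum(math.log(info[l]) for l in landmarks) - alpha * info["length_m"]

# Instruction A mentions only the stop sign; instruction B also mentions
# the picnic table. The extra phrase changes which route scores best.
for landmarks in (["a stop sign"], ["a picnic table", "a stop sign"]):
    best = max(routes, key=lambda r: route_score(r, landmarks))
    print(landmarks, "->", best)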

BibTeX

@inproceedings{shah2022lmnav,
  title={{LM}-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action},
  author={Dhruv Shah and Blazej Osinski and Brian Ichter and Sergey Levine},
  booktitle={6th Annual Conference on Robot Learning},
  year={2022},
  url={https://openreview.net/forum?id=UW5A3SweAH}
}