LM-Nav: Robotic Navigation with Large Pre-Trained Models
of Language, Vision, and Action
Dhruv Shah*, Błażej Osiński*, Brian Ichter, Sergey Levine
UC Berkeley, University of Warsaw, Robotics @ Google
Conference on Robot Learning (CoRL) 2022
Auckland, New Zealand
Oral Talk at Foundation Models for Decision Making Workshop at NeurIPS 2022
Oral Talk at Bay Area Machine Learning Symposium (BayLearn) 2022
Problem Statement
Given a high-level textual instruction for navigating a real-world environment, how can we get a robot to follow it solely from egocentric visual observations?
Main Idea
Our key insight is that we can utilize pre-trained models of images and language to provide a textual interface to visual navigation models!
Given a set of observations from the target environment, the goal-conditioned distance function (part of the Visual Navigation Model, VNM) is used to infer connectivity between them and construct a topological graph of the environment.
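As a rough illustration, the sketch below builds such a graph with networkx, assuming a hypothetical `distance_fn` that returns the VNM's predicted traversal cost between two observations and an arbitrary `EDGE_THRESHOLD` cutoff; neither is the paper's exact implementation.

```python
# Minimal sketch: connect pairs of observations whose predicted distance is small.
# `distance_fn` and `EDGE_THRESHOLD` are assumptions for illustration.
import itertools
import networkx as nx

EDGE_THRESHOLD = 20.0  # hypothetical cutoff on predicted traversal cost

def build_topological_graph(observations, distance_fn):
    graph = nx.DiGraph()
    graph.add_nodes_from(range(len(observations)))
    for i, j in itertools.permutations(range(len(observations)), 2):
        d = distance_fn(observations[i], observations[j])
        if d < EDGE_THRESHOLD:
            graph.add_edge(i, j, weight=d)
    return graph
```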
The Large Language Model (LLM) is used to parse the natural language instruction into a sequence of landmarks that serve as intermediate subgoals for navigation.
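A hedged sketch of this step is shown below; `llm_complete` stands in for any text-completion interface, and the prompt format is illustrative rather than the one used in the paper.

```python
# Illustrative landmark extraction with an LLM; the prompt and `llm_complete`
# callable are assumptions, not the paper's exact interface.
PROMPT_TEMPLATE = (
    "Extract the landmarks mentioned in the navigation instruction, "
    "in the order they should be visited, one per line.\n"
    "Instruction: {instruction}\nLandmarks:"
)

def extract_landmarks(instruction: str, llm_complete) -> list[str]:
    completion = llm_complete(PROMPT_TEMPLATE.format(instruction=instruction))
    return [line.strip("- ").strip() for line in completion.splitlines() if line.strip()]
```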
The Vision-and-Language Model (VLM) is used to ground the robot’s visual observations in landmark phrases. The VLM infers a joint probability distribution over the landmark descriptions and the images (which form nodes in the above graph).
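One plausible way to score every (image, landmark) pair is with the open-source CLIP package, sketched below; the "a photo of ..." prompt template and the specific checkpoint are assumptions for illustration.

```python
# Sketch: CLIP match probabilities between graph-node images and landmark phrases.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def landmark_probabilities(image_paths, landmarks):
    """Return one probability vector (over landmarks) per image."""
    text = clip.tokenize([f"a photo of {l}" for l in landmarks]).to(device)
    probs = []
    with torch.no_grad():
        for path in image_paths:
            image = preprocess(Image.open(path)).unsqueeze(0).to(device)
            logits_per_image, _ = model(image, text)
            probs.append(logits_per_image.softmax(dim=-1).cpu().numpy()[0])
    return probs
```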
Using the VLM's probability distribution and the VNM's inferred graph connectivity, a novel search algorithm retrieves an optimal plan in the environment that (i) satisfies the original instruction, and (ii) is the shortest path in the graph that does so.
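The snippet below is a simplified stand-in for that search, not the paper's exact algorithm: it runs Dijkstra over a product graph whose states pair a graph node with the number of landmarks matched so far, trading off landmark log-probability against path length via a hypothetical weight `alpha`.

```python
# Simplified product-graph search; `alpha` and the scoring are illustrative.
import math
import networkx as nx

def plan_path(graph, probs, num_landmarks, start_node, alpha=0.1):
    """probs[node][k] = P(landmark k | image at node)."""
    product = nx.DiGraph()
    for u, v, data in graph.edges(data=True):
        for k in range(num_landmarks + 1):
            # Traverse an edge without matching a landmark: pay only travel cost.
            product.add_edge((u, k), (v, k), weight=alpha * data["weight"])
    for node in graph.nodes:
        for k in range(num_landmarks):
            # "Consume" landmark k at this node: pay its negative log-probability.
            cost = -math.log(max(probs[node][k], 1e-6))
            product.add_edge((node, k), (node, k + 1), weight=cost)
    lengths, paths = nx.single_source_dijkstra(product, (start_node, 0))
    goal_states = [s for s in lengths if s[1] == num_landmarks]
    if not goal_states:
        return None
    best = min(goal_states, key=lengths.get)
    return [node for node, _ in paths[best]]
```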
This plan is then executed by the goal-conditioned policy, which is part of the VNM.
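A minimal sketch of that execution loop is given below, assuming a hypothetical `robot` interface, a `policy(observation, subgoal_image)` callable for the VNM's goal-conditioned policy, and an `is_close` subgoal-reached check.

```python
# Sketch of plan execution; the robot interface and `is_close` are assumptions.
def execute_plan(plan_nodes, observations, policy, robot, is_close, max_steps=500):
    for node in plan_nodes:
        subgoal_image = observations[node]
        for _ in range(max_steps):
            current = robot.get_observation()
            if is_close(current, subgoal_image):
                break  # subgoal reached, move on to the next landmark
            robot.apply_action(policy(current, subgoal_image))
```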
Disambiguating Instructions
Language is an inherently ambiguous means of task specification, and there may be multiple paths to the goal that satisfy the given instructions. In such cases, an instruction-following system must be able to disambiguate paths in the environment using fine-grained modifications to the instruction. We show an experiment where LM-Nav is tasked with two slightly different instructions to the same goal.
LM-Nav succeeds in following these instructions by discovering two separate paths in the environment, as desired.
BibTeX
@inproceedings{shah2022lmnav,
  title={{LM}-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action},
  author={Dhruv Shah and Blazej Osinski and Brian Ichter and Sergey Levine},
  booktitle={6th Annual Conference on Robot Learning},
  year={2022},
  url={https://openreview.net/forum?id=UW5A3SweAH}
}