LM-Nav: Robotic Navigation with Large Pre-Trained Models
of Language, Vision, and Action
Dhruv Shah*, Błażej Osiński*, Brian Ichter, Sergey Levine
UC Berkeley, University of Warsaw, Robotics @ Google
Conference on Robot Learning (CoRL) 2022
Auckland, New Zealand
Oral Talk at Foundation Models for Decision Making Workshop at NeurIPS 2022
Oral Talk at Bay Area Machine Learning Symposium (BayLearn) 2022
Problem Statement
Given a high-level textual instruction for navigating a real-world environment, how can we get a robot to follow it solely from egocentric visual observations?
Main Idea
Our key insight is that we can utilize pre-trained models of images and language to provide a textual interface to visual navigation models!
Given a set of observations from the target environment, the goal-conditioned distance function (part of the Visual Navigation Model, VNM) is used to infer connectivity between them and construct a topological graph of the environment.
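As a rough illustration, the sketch below builds such a graph with networkx, assuming a hypothetical `distance_fn` that returns the VNM's predicted traversal cost between two observations and an arbitrary `EDGE_THRESHOLD` cutoff; neither is the paper's exact implementation.

```python
# Minimal sketch: connect pairs of observations whose predicted distance is small.
# `distance_fn` and `EDGE_THRESHOLD` are assumptions for illustration.
import itertools
import networkx as nx

EDGE_THRESHOLD = 20.0  # hypothetical cutoff on predicted traversal cost

def build_topological_graph(observations, distance_fn):
    graph = nx.DiGraph()
    graph.add_nodes_from(range(len(observations)))
    for i, j in itertools.permutations(range(len(observations)), 2):
        d = distance_fn(observations[i], observations[j])
        if d < EDGE_THRESHOLD:
            graph.add_edge(i, j, weight=d)
    return graph
```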
The Large Language Model (LLM) is used to parse the natural language instruction into a sequence of landmarks that serve as intermediate subgoals for navigation.
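A hedged sketch of this step is shown below; `llm_complete` stands in for any text-completion interface, and the prompt format is illustrative rather than the one used in the paper.

```python
# Illustrative landmark extraction with an LLM; the prompt and `llm_complete`
# callable are assumptions, not the paper's exact interface.
PROMPT_TEMPLATE = (
    "Extract the landmarks mentioned in the navigation instruction, "
    "in the order they should be visited, one per line.\n"
    "Instruction: {instruction}\nLandmarks:"
)

def extract_landmarks(instruction: str, llm_complete) -> list[str]:
    completion = llm_complete(PROMPT_TEMPLATE.format(instruction=instruction))
    return [line.strip("- ").strip() for line in completion.splitlines() if line.strip()]
```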
The Vision-and-Language Model (VLM) is used to ground the robot’s visual observations in landmark phrases. The VLM infers a joint probability distribution over the landmark descriptions and the images (which form nodes in the above graph).
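One plausible way to score every (image, landmark) pair is with the open-source CLIP package, sketched below; the "a photo of ..." prompt template and the specific checkpoint are assumptions for illustration.

```python
# Sketch: CLIP match probabilities between graph-node images and landmark phrases.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def landmark_probabilities(image_paths, landmarks):
    """Return one probability vector (over landmarks) per image."""
    text = clip.tokenize([f"a photo of {l}" for l in landmarks]).to(device)
    probs = []
    with torch.no_grad():
        for path in image_paths:
            image = preprocess(Image.open(path)).unsqueeze(0).to(device)
            logits_per_image, _ = model(image, text)
            probs.append(logits_per_image.softmax(dim=-1).cpu().numpy()[0])
    return probs
```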
Using the VLM's probability distribution and the VNM's inferred graph connectivity, a novel search algorithm retrieves an optimal plan in the environment that (i) satisfies the original instruction, and (ii) is the shortest path in the graph that does so.
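The snippet below is a simplified stand-in for that search, not the paper's exact algorithm: it runs Dijkstra over a product graph whose states pair a graph node with the number of landmarks matched so far, trading off landmark log-probability against path length via a hypothetical weight `alpha`.

```python
# Simplified product-graph search; `alpha` and the scoring are illustrative.
import math
import networkx as nx

def plan_path(graph, probs, num_landmarks, start_node, alpha=0.1):
    """probs[node][k] = P(landmark k | image at node)."""
    product = nx.DiGraph()
    for u, v, data in graph.edges(data=True):
        for k in range(num_landmarks + 1):
            # Traverse an edge without matching a landmark: pay only travel cost.
            product.add_edge((u, k), (v, k), weight=alpha * data["weight"])
    for node in graph.nodes:
        for k in range(num_landmarks):
            # "Consume" landmark k at this node: pay its negative log-probability.
            cost = -math.log(max(probs[node][k], 1e-6))
            product.add_edge((node, k), (node, k + 1), weight=cost)
    lengths, paths = nx.single_source_dijkstra(product, (start_node, 0))
    goal_states = [s for s in lengths if s[1] == num_landmarks]
    if not goal_states:
        return None
    best = min(goal_states, key=lengths.get)
    return [node for node, _ in paths[best]]
```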
This plan is then executed by the goal-conditioned policy, which is part of the VNM.
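A minimal sketch of that execution loop is given below, assuming a hypothetical `robot` interface, a `policy(observation, subgoal_image)` callable for the VNM's goal-conditioned policy, and an `is_close` subgoal-reached check.

```python
# Sketch of plan execution; the robot interface and `is_close` are assumptions.
def execute_plan(plan_nodes, observations, policy, robot, is_close, max_steps=500):
    for node in plan_nodes:
        subgoal_image = observations[node]
        for _ in range(max_steps):
            current = robot.get_observation()
            if is_close(current, subgoal_image):
                break  # subgoal reached, move on to the next landmark
            robot.apply_action(policy(current, subgoal_image))
```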
Disambiguating Instructions
Language is an inherently ambiguous means of task specification, and there may be multiple paths to the goal that satisfy the given instructions. In such cases, an instruction-following system must be able to disambiguate paths in the environment using fine-grained modifications to the instruction. We show an experiment where LM-Nav is tasked with two slightly different instructions to the same goal.
LM-Nav succeeds in following these instructions by discovering two separate paths in the environment, as desired.
BibTeX
@inproceedings{shah2022lmnav,
  title={{LM}-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action},
  author={Dhruv Shah and Blazej Osinski and Brian Ichter and Sergey Levine},
  booktitle={6th Annual Conference on Robot Learning},
  year={2022},
  url={https://openreview.net/forum?id=UW5A3SweAH}
}