Cognitive Mapping and Planning for Visual Navigation

Saurabh Gupta^", James Davidson", Sergey Levine^", Rahul Sukthankar", Jitendra Malik^"

^ UC Berkeley, " Google

To Appear at Computer Vision and Pattern Recognition (CVPR) 2017


We introduce a neural architecture for navigation in novel environments. Our proposed architecture learns to map from first-person viewpoints and plans a sequence of actions towards goals in the environment. The Cognitive Mapper and Planner (CMP) is based on two key ideas: a) a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the planner, and b) a spatial memory with the ability to plan given an incomplete set of observations about the world. CMP constructs a top-down belief map of the world and applies a differentiable neural net planner to produce the next action at each time step. The accumulated belief of the world enables the agent to track visited regions of the environment. Our experiments demonstrate that CMP outperforms both reactive strategies and standard memory-based architectures and performs well in novel environments. Furthermore, we show that CMP can also achieve semantically specified goals, such as 'go to a chair'.

Problem Statement

We study the problem of visual navigation in novel environments. We consider geometric tasks (where the task is specified as an offset relative to the robot's current location) and semantic tasks (where the task is specified as reaching an object of a particular category).
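To make the geometric task specification concrete, the sketch below expresses a goal as an offset in the robot's own frame. It is only an illustration: the function name `egocentric_offset`, the pose representation (x, y, heading in radians measured counter-clockwise from the world x-axis), and the (forward, left) output convention are all assumptions, not details taken from the paper.

```python
import math

def egocentric_offset(robot_x, robot_y, robot_theta, goal_x, goal_y):
    """Express a world-frame goal as (forward, left) offsets in the robot frame.

    Assumed convention (hypothetical): robot_theta is the heading in radians,
    counter-clockwise from the world x-axis.
    """
    dx = goal_x - robot_x
    dy = goal_y - robot_y
    cos_t, sin_t = math.cos(robot_theta), math.sin(robot_theta)
    # Rotate the world-frame displacement by -theta into the robot frame.
    forward = cos_t * dx + sin_t * dy
    left = -sin_t * dx + cos_t * dy
    return forward, left
```

Because the offset is relative to the current pose, it would have to be recomputed after every move; a semantic goal such as 'go to a chair' carries no such offset at all.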


Our learned navigation network consists of a mapper and planner module. The mapper writes into a latent memory that corresponds to an egocentric map of the environment, while the planner uses this memory to output navigational actions. The map is not supervised explicitly, but rather emerges naturally from the learning process.

The mapper module processes first-person images from the robot and integrates the observations into a latent memory, which corresponds to an egocentric top-view map of the environment. The mapping operation is not supervised explicitly: the mapper is free to write into memory whatever information is most useful for the planner. In addition to filling in obstacles, the mapper also stores confidence values in the map, which allows it to make probabilistic predictions about unobserved parts of the map by exploiting learned patterns.
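One simple way to picture how a belief map with confidences can accumulate observations over time is a confidence-weighted average. The sketch below is a hand-written stand-in under stated assumptions, not the learned mapper: the name `fuse_beliefs` is hypothetical, and the previous belief is assumed to have already been aligned to the current egocentric frame.

```python
import numpy as np

def fuse_beliefs(prev_belief, prev_conf, obs_belief, obs_conf, eps=1e-8):
    """Confidence-weighted fusion of an old belief map with a new observation.

    All arrays share the same shape. Assumption: prev_belief is already
    expressed in the current egocentric frame.
    """
    total_conf = prev_conf + obs_conf
    # Cells that nothing has observed (zero total confidence) stay at 0.
    fused = (prev_belief * prev_conf + obs_belief * obs_conf) / np.maximum(total_conf, eps)
    return fused, total_conf

# Usage: two cells; the first has never been seen, the second is known to be occupied.
prev_belief = np.array([[0.0, 1.0]])
prev_conf = np.array([[0.0, 1.0]])
obs_belief = np.array([[1.0, 0.0]])
obs_conf = np.array([[1.0, 3.0]])
belief, conf = fuse_beliefs(prev_belief, prev_conf, obs_belief, obs_conf)
```

In this toy update, a cell seen with high confidence dominates older, weaker evidence, and the accumulated confidence records which regions have been observed.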

The hierarchical planner takes the egocentric multi-scale belief of the world output by the mapper and uses value iteration, expressed as convolutions and channel-wise max-pooling, to output a policy. The planner is trainable and differentiable, and back-propagates gradients to the mapper. The planner operates at multiple scales of the problem (scale 0 is the finest), which makes planning efficient.
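The idea of value iteration written as convolutions followed by a channel-wise max can be sketched on a small grid world. This is a minimal, single-scale illustration with hand-set kernels and a known obstacle map; in the planner described above the corresponding operations are learned and hierarchical. The names `conv2d_same`, `KERNELS`, and `value_iteration`, the four-action move set, and the discount `gamma` are assumptions made for the example.

```python
import numpy as np

def conv2d_same(image, kernel):
    """3x3 cross-correlation with zero padding; output has the input's shape."""
    padded = np.pad(image, 1, mode="constant")
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

# One kernel per action; each reads the value of the cell the action moves into.
ACTIONS = ["up", "down", "left", "right"]
KERNELS = np.zeros((4, 3, 3))
KERNELS[0, 0, 1] = 1.0  # up: cell above
KERNELS[1, 2, 1] = 1.0  # down: cell below
KERNELS[2, 1, 0] = 1.0  # left: cell to the left
KERNELS[3, 1, 2] = 1.0  # right: cell to the right

def value_iteration(free, goal, gamma=0.9, num_iters=50):
    """free: bool array, True where traversable; goal: (row, col) tuple."""
    value = np.zeros(free.shape, dtype=float)
    value[goal] = 1.0
    for _ in range(num_iters):
        # Convolution step: one Q-value channel per action.
        q = np.stack([gamma * conv2d_same(value, k) for k in KERNELS])
        # Channel-wise max step: keep the best action value in each cell.
        value = q.max(axis=0)
        value[~free] = 0.0  # obstacles carry no value
        value[goal] = 1.0   # the goal is absorbing
    policy = q.argmax(axis=0)  # index into ACTIONS for each cell
    return value, policy

# Usage: a wall in the middle column forces a detour through the bottom row.
free = np.array([[1, 0, 1],
                 [1, 0, 1],
                 [1, 1, 1]], dtype=bool)
value, policy = value_iteration(free, goal=(0, 2))
```

With these settings each free cell's value ends up as `gamma` raised to its shortest-path distance from the goal, so following the arg-max action walks the agent around the wall. Since every step is a convolution or a max, the same computation can be made differentiable when the kernels are parameters.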


Experiments are conducted in static simulated environments consisting of real-world 3D scans. We report performance on held-out novel test environments. We report mean distance to goal, 75th percentile distance to goal, and success rate at the end of the episode for our proposed method (CMP), a reactive baseline, and an LSTM-based baseline. The top table shows results for the geometric task; the bottom table shows results for the semantic task (go to a 'chair', 'door' or 'table').
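For readers who want to compute the same three summary statistics on their own runs, a minimal sketch follows. The function name `navigation_metrics` and the `success_threshold` parameter are assumptions: the text above does not state how close to the goal an episode must end to count as a success, so the threshold is left to the caller.

```python
import numpy as np

def navigation_metrics(final_distances, success_threshold):
    """Summarise end-of-episode distances to goal over a set of episodes.

    success_threshold is a hypothetical parameter: an episode counts as a
    success if it ends within this distance of the goal.
    """
    d = np.asarray(final_distances, dtype=float)
    return {
        "mean_distance": float(d.mean()),
        "p75_distance": float(np.percentile(d, 75)),
        "success_rate": float((d <= success_threshold).mean()),
    }

# Usage with made-up distances, purely to show the call.
metrics = navigation_metrics([0.0, 1.0, 2.0, 3.0, 4.0], success_threshold=1.0)
```

Reporting the 75th percentile alongside the mean shows how far from the goal the worse episodes end, which a mean alone can hide.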

In this video we show some representative examples of successful and failed navigation episodes for our proposed model in a held-out novel environment (the agent was not trained on this environment). Note that for the results shown in the video, the agent uses first-person depth images as input, but we display RGB images for easier visualization.