ExAug: Robot-Conditioned Navigation Policies via Geometric Experience Augmentation


Noriaki Hirose1,2, Dhruv Shah1, Ajay Sridhar1, and Sergey Levine1

IEEE International Conference on Robotics and Automation 2023 (ICRA 2023)


1: University of California, Berkeley,   2: Toyota Motor North America

[ICRA 2023],  [arXiv],  [code]

Abstract

Machine learning techniques rely on large and diverse datasets to achieve good generalization. Computer vision, natural language processing, and other applications can often reuse public datasets to train many different models. However, due to differences in physical configurations, it is challenging to leverage public datasets for training robotic control policies on new robot platforms and/or for new tasks. In this work, we propose a novel framework, ExAug, to augment the experiences of different robot platforms from multiple datasets in diverse environments. ExAug leverages a simple principle: by extracting 3D information in the form of a point cloud, we can create much more complex and structured augmentations, utilizing both synthetic image generation and geometry-aware penalization, that would have been suitable in the same situation for a different robot with a different size, turning radius, and camera placement. The trained policy is evaluated on two new robot platforms with three different cameras in indoor and outdoor environments with obstacles.

Overview

ExAug addresses the question of how to train a policy, from public datasets alone, that can control robots with different camera types, camera placements, robot sizes, and velocity constraints in various environments.

We propose a new form of data augmentation, ExAug, which augments robot experiences in public datasets to train a robot-conditioned control policy for image-based navigation. First, we use view augmentation to augment the observations in the datasets. Then, we train the policy with a novel geometry-aware objective so that it accounts for the robot's parameters. Our key insight is that we can generate such augmentations for free if we have access to the scene's geometry. We use scenes reconstructed by self-supervised monocular depth estimation for both processes. As a result, ExAug enables us to train a generalized policy across a range of robot parameters, such as camera intrinsic parameters, viewpoints, robot sizes, and velocity limitations.
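As a minimal illustration of the shared ingredient behind both processes, the sketch below unprojects a depth map into a point cloud with pinhole intrinsics. The intrinsics, image size, and function name are our own illustrative choices, not values or code from the paper, which also handles spherical and fisheye cameras.

# Minimal sketch (not the authors' code): recovering a point cloud from a
# depth map with pinhole intrinsics, the 3D representation that both the
# view augmentation and the geometry-aware objective rely on.
import numpy as np

def unproject_depth(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """depth: (H, W) metric depth; K: (3, 3) intrinsics. Returns (H*W, 3) points."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pixels @ np.linalg.inv(K).T          # back-project pixel rays
    return rays * depth.reshape(-1, 1)          # scale each ray by its depth

# Toy usage with a flat 2 m depth map and made-up intrinsics.
K = np.array([[320.0, 0.0, 160.0],
              [0.0, 320.0, 120.0],
              [0.0, 0.0, 1.0]])
points = unproject_depth(np.full((240, 320), 2.0), K)
print(points.shape)  # (76800, 3)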

Navigation

ExAug enables us to train a policy for our mobile robot with arbitrary cameras at arbitrary poses using public datasets. In addition, our policy can control robots with different sizes and velocity limitations in various environments. First, we show navigation with three different cameras: 1) a spherical camera (left), 2) a wide FoV camera (center), and 3) a narrow FoV camera (right). Our method works in challenging indoor and outdoor environments with static and dynamic obstacles (e.g., pedestrians).

nav_sph.mp4

Spherical camera

nav_fish.mp4

Wide FoV camera (fisheye)

nav_realsense.mp4

Narrow FoV camera (RealSense)

Next, we evaluate our policy on another new robot platform, the LoCoBot. The LoCoBot carries its camera at a different pose and operates under different velocity constraints. It successfully navigates challenging outdoor environments with some unseen obstacles.

nav_locobot.mp4

We show how robot behavior changes when we give our policy different robot parameters. In our first video, we vary the robot size given as input. With a small robot radius, our control policy can use a narrow path to reach the goal position. However, with a larger robot radius, our policy correctly gives up on the narrow path and traverses the more open path to avoid collisions.

variation_robot_size.mp4
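To make the role of the robot radius concrete, here is a rough sketch of a geometry-aware collision penalty. It is our own simplification, not the exact ExAug objective: predicted waypoints are penalized whenever their clearance to the nearest obstacle point drops below the radius supplied to the policy.

# Illustrative sketch (not the exact ExAug objective): a soft collision
# penalty that grows when predicted waypoints come closer to obstacle
# points than the robot radius supplied as a policy condition.
import numpy as np

def collision_penalty(waypoints: np.ndarray, points: np.ndarray, robot_radius: float) -> float:
    """waypoints: (T, 2) planned x-y positions; points: (N, 2) obstacle points."""
    # Pairwise distances between each waypoint and each obstacle point.
    dists = np.linalg.norm(waypoints[:, None, :] - points[None, :, :], axis=-1)
    nearest = dists.min(axis=1)                       # closest obstacle per waypoint
    # Hinge-style penalty: zero when clearance exceeds the robot radius.
    return float(np.maximum(robot_radius - nearest, 0.0).sum())

obstacles = np.array([[1.0, 0.2], [2.0, -0.1]])
path = np.array([[0.5, 0.0], [1.0, 0.0], [1.5, 0.0]])
print(collision_penalty(path, obstacles, robot_radius=0.15))  # small radius: no penalty
print(collision_penalty(path, obstacles, robot_radius=0.50))  # large radius: penalized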

The following videos show how robot behavior changes under different velocity limitations. By feeding the velocity limitation into the policy, it can generate appropriate velocity commands within the given upper and lower bounds. In this case, the robot with the more severe limitation can avoid collisions with unseen obstacles and reach the goal position. The robot with the less severe limitation can turn sharply, leaving more space between itself and the obstacles for safer collision avoidance.

variation_velocity_limitation.mp4
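One simple way to respect per-robot velocity bounds, shown below as an assumption rather than the paper's exact mechanism, is to squash the network's raw outputs with tanh and rescale them to each robot's limits; in ExAug the limits are additionally given to the policy as a condition.

# Sketch of one possible way (our assumption, not necessarily the paper's)
# to keep policy outputs inside robot-specific velocity limits: squash raw
# network outputs with tanh and rescale them to the commanded bounds.
import numpy as np

def bound_velocity(raw: np.ndarray, v_max: float, w_max: float) -> np.ndarray:
    """raw: (2,) unbounded network output -> (linear, angular) command within limits."""
    squashed = np.tanh(raw)                     # map to (-1, 1)
    return squashed * np.array([v_max, w_max])  # rescale to each robot's limits

raw_output = np.array([1.2, -0.4])
print(bound_velocity(raw_output, v_max=0.5, w_max=1.0))   # more severe limitation
print(bound_velocity(raw_output, v_max=1.5, w_max=2.0))   # less severe limitation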

For the quantitative analysis in our manuscript, we placed obstacles on or between the subgoal positions after collecting the subgoal images. Since we train our policy with a geometry-aware objective, it can recognize these new obstacles at inference time and reach the position of the goal image without any collisions.

View Augmentation

We augment the images in the public datasets so that new cameras at arbitrary poses can be used for navigation. Our view augmentation generates images by projecting the estimated point cloud onto the image plane of the target camera. In the left-side video, we show three different synthetic images generated from the spherical camera images in the GO Stanford dataset. In the right-side video, we show examples from the RECON dataset. During training, we feed these synthetic images in the forward pass and update the policy by minimizing our geometry-aware objectives.

view_aug_gs.mp4

Synthetic images from GO Stanford dataset

view_aug_recon.mp4

Synthetic images from RECON dataset
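The sketch below illustrates the reprojection idea behind these synthetic views, in a simplified pinhole-to-pinhole setting with nearest-pixel splatting and no occlusion handling. The camera parameters and function names are hypothetical, and the actual pipeline also covers spherical and fisheye cameras.

# Minimal sketch (our simplification of the idea, not the paper's renderer):
# synthesize a target-camera view by lifting source pixels to 3D with the
# estimated depth and re-projecting them through the target camera's
# intrinsics and pose.
import numpy as np

def synthesize_view(img, depth, K_src, K_tgt, T_tgt_from_src, out_hw):
    """img: (H, W, 3) source image; depth: (H, W) metric depth; T_tgt_from_src: (4, 4)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    pts = (pix @ np.linalg.inv(K_src).T) * depth.reshape(-1, 1)      # lift pixels to 3D
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    pts_tgt = (pts_h @ T_tgt_from_src.T)[:, :3]                      # move into target frame
    proj = pts_tgt @ K_tgt.T
    z = proj[:, 2]
    valid = z > 1e-6                                                 # keep points in front of the camera
    uv = np.round(proj[valid, :2] / z[valid, None]).astype(int)
    colors = img.reshape(-1, 3)[valid]
    out = np.zeros((out_hw[0], out_hw[1], 3), dtype=img.dtype)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < out_hw[1]) & (uv[:, 1] >= 0) & (uv[:, 1] < out_hw[0])
    out[uv[inside, 1], uv[inside, 0]] = colors[inside]               # nearest-pixel splat
    return out

# Toy usage: same intrinsics, target camera shifted by a hypothetical 10 cm.
img = np.random.randint(0, 255, (240, 320, 3), dtype=np.uint8)
depth = np.full((240, 320), 2.0)
K = np.array([[320.0, 0.0, 160.0], [0.0, 320.0, 120.0], [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 0.1
print(synthesize_view(img, depth, K, K, T, (240, 320)).shape)  # (240, 320, 3)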

In the following figures, we compare the synthetic images generated by our view augmentation with the raw images from the target camera. Our view augmentation generates images that closely correspond to the target camera. Since the synthetic images reuse pixel values near the boundaries of the spherical image, they are somewhat blurry and noisy in some cases. However, they contain sufficient geometric information to train the policy.

Self-supervised monocular depth estimation

We leverage the point cloud estimated from a monocular camera image for both view augmentation and training the policy with the geometry-aware objective. To estimate the point cloud, we train the depth network with self-supervised monocular depth estimation, without ground-truth depth. Our previous method allows us to use multiple datasets without camera intrinsic parameters and to learn the depth network from highly distorted images as well as pinhole images. Here, we show some examples of depth estimation. During training, we supervise the pose estimation with odometry or velocity measurements so that the estimated depth has metric scale. Implementation details are given in the original paper.
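The sketch below gives the flavor of the standard self-supervised recipe this builds on, not our exact implementation: a photometric error between a target frame and a neighboring frame warped into it, plus a pose term supervised by odometry so the depth network recovers metric scale. All tensor shapes and weights are illustrative.

# Rough sketch of the standard self-supervised depth recipe (not the authors'
# exact losses): photometric reconstruction error plus odometry-supervised
# pose, which anchors the scale of the estimated depth.
import torch
import torch.nn.functional as F

def photometric_loss(target, warped):
    """target, warped: (B, 3, H, W) images; simple L1 photometric error."""
    return (target - warped).abs().mean()

def pose_supervision_loss(pred_pose, odom_pose):
    """pred_pose, odom_pose: (B, 6) translation + rotation vectors."""
    return F.mse_loss(pred_pose, odom_pose)

# Toy tensors standing in for network outputs and warped neighbor frames.
target = torch.rand(2, 3, 64, 64)
warped = torch.rand(2, 3, 64, 64)      # neighbor frame warped with predicted depth and pose
pred_pose = torch.rand(2, 6)
odom_pose = torch.rand(2, 6)
loss = photometric_loss(target, warped) + 0.1 * pose_supervision_loss(pred_pose, odom_pose)
print(loss.item())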


Video

Full video (3 minutes)

Dataset

In order to train a generalizable policy that can leverage diverse datasets, we pick three publicly available navigation datasets that vary in their collection platform, visual sensors, and dynamics. This allows us to train policies that can learn shared representations across these widely varying datasets, and generalize to new environments (both indoors and outdoors) and new robots.

GO Stanford dataset (camera: spherical camera,  platform: Turtlebot2,  location: buildings on the Stanford campus)

RECON dataset (camera: wide FoV camera, platform: Jackal, location: grassy fields)

KITTI odometry dataset (camera: narrow FoV camera, platform: vehicle, location: European city)

Appendix

We provide additional explanations of the masked normalization weight in the geometry-aware objective, our view synthesis, and our navigation system in real environments. These details are also given in the appendix of the arXiv version.

Masked normalization weight

Process of our view synthesis

Navigation system

BibTeX

@inproceedings{hirose2023exaug,
  title={ExAug: Robot-conditioned navigation policies via geometric experience augmentation},
  author={Hirose, Noriaki and Shah, Dhruv and Sridhar, Ajay and Levine, Sergey},
  booktitle={2023 IEEE International Conference on Robotics and Automation (ICRA)},
  pages={4077--4084},
  year={2023},
  organization={IEEE}
}