ExAug: Robot-Conditioned Navigation Policies via Geometric Experience Augmentation
Noriaki Hirose1,2, Dhruv Shah1, Ajay Sridhar1, and Sergey Levine1
IEEE International Conference on Robotics and Automation 2023 (ICRA 2023)
1: University of California, Berkeley, 2: Toyota Motor North America
Abstract
Machine learning techniques rely on large and diverse datasets to achieve good generalization performance. Computer vision, natural language processing, and other applications can often reuse public datasets to train many different models. However, due to differences in physical configurations, it is challenging to leverage public datasets to train robotic control policies on new robot platforms and/or for new tasks. In this work, we propose a novel framework, ExAug, to augment the experiences of different robot platforms from multiple datasets in diverse environments. ExAug leverages a simple principle: by extracting 3D information in the form of a point cloud, we can create much more complex and structured augmentations, utilizing both synthetic image generation and geometry-aware penalization, producing experiences that would have been suitable in the same situation for a different robot with a different size, turning radius, and camera placement. The trained policy is evaluated on two new robot platforms with three different cameras in indoor and outdoor environments with obstacles.
Overview
We propose ExAug to address the question of how to train a policy that can control robots with different camera types, camera placements, robot sizes, and velocity constraints in various environments using public datasets.
We propose a new form of data augmentation, ExAug, which augments robot experiences in public datasets to train a robot-conditioned control policy for image-based navigation. First, we use view augmentation to augment the observations in the datasets. Then, we train the policy with a novel geometry-aware objective so that it accounts for the robot parameters. Our key insight is that we can generate such augmentations for free if we have access to the scene's geometry. We use scenes reconstructed by self-supervised monocular depth estimation for both processes. As a result, ExAug enables us to train a generalized policy across a range of robot parameters, such as camera intrinsic parameters, viewpoints, robot sizes, and velocity limitations.
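As a concrete illustration, the following is a minimal sketch (in PyTorch) of what a robot-conditioned policy can look like: the observation and goal images are encoded and concatenated with an embedding of the robot parameters before predicting a velocity command. The module names, network sizes, and the specific parameters in the conditioning vector are illustrative assumptions, not our exact architecture.

```python
# Minimal sketch of a robot-conditioned policy (illustrative, not our exact model).
import torch
import torch.nn as nn

class RobotConditionedPolicy(nn.Module):
    def __init__(self, obs_feat_dim=256, robot_param_dim=4, hidden_dim=256):
        super().__init__()
        # Image encoder shared by the current observation and the goal image.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, obs_feat_dim), nn.ReLU(),
        )
        # Robot parameters (e.g., radius, v_max, w_max, camera height) are
        # embedded and concatenated with the visual features.
        self.param_embed = nn.Sequential(nn.Linear(robot_param_dim, 64), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(2 * obs_feat_dim + 64, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # (linear, angular) velocity command
        )

    def forward(self, obs_img, goal_img, robot_params):
        z = torch.cat([self.encoder(obs_img),
                       self.encoder(goal_img),
                       self.param_embed(robot_params)], dim=-1)
        return self.head(z)
```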
Navigation
ExAug enables us to train a policy for our mobile robot with arbitrary cameras at arbitrary poses using public datasets. In addition, our policy can control robots with different sizes and velocity limitations in various environments. First, we show navigation with three different cameras: 1) a spherical camera (left), 2) a wide FoV camera (center), and 3) a narrow FoV camera (right). Our method can work in challenging indoor and outdoor environments with static and dynamic obstacles (e.g., pedestrians).
Spherical camera
Wide FoV camera (fisheye)
Narrow FoV camera (realsense)
Next, we evaluate our policy on another new robot platform, the LoCoBot. The LoCoBot has its camera at a different pose and operates under different velocity constraints. It successfully navigated challenging outdoor environments with some unseen obstacles.
We show differences in robot behaviors by giving different robot parameters to our policy. In our first video, we vary the robot size we input. With a small robot radius, our control policy can use a narrow path to reach the goal position. However, with a larger robot radius, our policy correctly gives up on the narrow path and traverses the more open path to avoid collisions.
The following videos show differences in robot behaviors under different velocity limitations. By inputting the velocity limitation into the policy, it can generate appropriate velocity commands within the upper and lower boundaries. In this case, the robot with a more severe limitation can avoid collisions with unseen obstacles and reach the goal position. The robot with a less severe limitation can turn sharply, making more space against the obstacles for safer collision avoidance.
In the quantitative analysis in our manuscript, we placed obstacles on or between the subgoal positions after collecting the subgoal images. We train our policy with a geometry-aware objective, which allows it to recognize the new obstacles at inference time and reach the position of the goal image without any collisions.
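As an illustration of the kind of penalty involved, the sketch below scores a predicted trajectory against the estimated point cloud using the robot radius, and separately penalizes commands that exceed the velocity limits. It assumes the trajectory and the point cloud are expressed in the robot frame on the ground plane; the soft-margin form, the margin value, and the function names are illustrative rather than the exact objective in the paper.

```python
# Illustrative geometry-aware penalties (not the paper's exact loss).
import torch

def collision_penalty(waypoints, point_cloud_xy, robot_radius, margin=0.05):
    """waypoints: (T, 2) predicted positions; point_cloud_xy: (N, 2) obstacle
    points projected onto the ground plane; robot_radius: scalar tensor."""
    # Distance from each predicted waypoint to its nearest obstacle point.
    dists = torch.cdist(waypoints, point_cloud_xy)   # (T, N)
    nearest = dists.min(dim=1).values                # (T,)
    # Penalize waypoints that come closer than the robot radius (+ margin).
    return torch.relu(robot_radius + margin - nearest).mean()

def velocity_limit_penalty(velocities, v_max, w_max):
    """velocities: (T, 2) predicted (linear, angular) commands."""
    over_v = torch.relu(velocities[:, 0].abs() - v_max)
    over_w = torch.relu(velocities[:, 1].abs() - w_max)
    return (over_v + over_w).mean()
```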
View Augmentation
We augment the images in the public datasets so that new cameras at arbitrary poses can be used for navigation. Our view augmentation generates images by projecting the estimated depth onto the image plane of the target camera. In the left video, we show three different synthetic images generated from a spherical camera image in the GO Stanford dataset. In the right video, we show examples from the RECON dataset. In training, we perform forward calculations by feeding these synthetic images, and we update the policy by minimizing our geometry-aware objectives.
Synthetic images from GO Stanford dataset
Synthetic images from RECON dataset
In the following figures, we compare the synthetic images generated by our view augmentation with the raw images from the target camera. Our view augmentation can generate images corresponding to the target camera. Since the synthetic images use pixel values on the boundaries of the spherical image, they are somewhat blurry and noisy in some cases. However, the synthetic images contain sufficient geometric information to train the policy.
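For reference, the following is a minimal sketch of the reprojection underlying such view synthesis: unproject every source pixel with its estimated depth, transform the resulting point cloud into the target camera frame, and project it with the target intrinsics. It assumes pinhole models for both cameras and omits occlusion handling (z-buffering) and hole filling, whereas our actual augmentation also handles spherical and fisheye projections.

```python
# Illustrative view synthesis by point-cloud reprojection (pinhole-to-pinhole).
import numpy as np

def synthesize_view(src_img, src_depth, K_src, K_tgt, T_tgt_from_src, out_hw):
    H, W = src_depth.shape
    # 1) Unproject every source pixel to a 3D point using its estimated depth.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, HW)
    pts_src = np.linalg.inv(K_src) @ pix * src_depth.reshape(1, -1)    # (3, HW)
    # 2) Transform the point cloud into the target camera frame.
    pts_h = np.vstack([pts_src, np.ones((1, pts_src.shape[1]))])
    pts_tgt = (T_tgt_from_src @ pts_h)[:3]
    # 3) Project into the target image and splat the source colors
    #    (no z-buffering, so occluded points may overwrite visible ones).
    proj = K_tgt @ pts_tgt
    uv = (proj[:2] / np.clip(proj[2:], 1e-6, None)).round().astype(int)
    Ht, Wt = out_hw
    out = np.zeros((Ht, Wt, 3), dtype=src_img.dtype)
    valid = (proj[2] > 0) & (uv[0] >= 0) & (uv[0] < Wt) & (uv[1] >= 0) & (uv[1] < Ht)
    out[uv[1, valid], uv[0, valid]] = src_img.reshape(-1, 3)[valid]
    return out
```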
Self-supervised monocular depth estimation
We leverage the point cloud estimated from a monocular camera image for both view augmentation and training the policy with a geometry-aware objective. To estimate the point cloud, we train the depth network using self-supervised monocular depth estimation without ground-truth depth. Our previous method allows us to use multiple datasets without camera intrinsic parameters and to learn the depth network from highly distorted images as well as pinhole images. Here, we show some examples of depth estimation. In training, we supervise the pose estimation using odometry or velocity measurements so that the depth estimates are learned at metric scale. Implementation details are given in the original paper.
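To give a sense of the training signal, here is a minimal sketch of a self-supervised photometric objective: the source frame is warped into the target view using the predicted depth and the odometry-derived relative pose, and the reconstruction error supervises the depth at metric scale. It assumes a single source frame, a pinhole intrinsic matrix K, and a plain L1 error standing in for the full loss described in the paper, which also handles distorted (non-pinhole) images.

```python
# Illustrative self-supervised photometric loss with odometry-supervised pose.
import torch
import torch.nn.functional as F

def photometric_loss(tgt_img, src_img, tgt_depth, T_src_from_tgt, K):
    """tgt_img, src_img: (1, 3, H, W); tgt_depth: (1, 1, H, W);
    T_src_from_tgt: (4, 4) relative pose from odometry (fixes metric scale);
    K: (3, 3) pinhole intrinsics. All tensors on the same device."""
    _, _, H, W = tgt_img.shape
    device = tgt_img.device
    # Back-project target pixels to 3D using the predicted depth.
    v, u = torch.meshgrid(torch.arange(H, device=device),
                          torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).float().reshape(3, -1)
    cam = torch.linalg.inv(K) @ pix * tgt_depth.reshape(1, -1)
    # Move the points into the source frame and project them.
    cam_h = torch.cat([cam, torch.ones(1, cam.shape[1], device=device)], dim=0)
    src = (T_src_from_tgt @ cam_h)[:3]
    proj = K @ src
    uv = proj[:2] / proj[2:].clamp(min=1e-6)
    # Normalize to [-1, 1] and sample source colors at the projected locations.
    grid = torch.stack([2 * uv[0] / (W - 1) - 1,
                        2 * uv[1] / (H - 1) - 1], dim=-1).reshape(1, H, W, 2)
    warped = F.grid_sample(src_img, grid, align_corners=True)
    # L1 photometric error between the reconstruction and the target frame.
    return (warped - tgt_img).abs().mean()
```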
Video
Full video (3 minutes)
Dataset
In order to train a generalizable policy that can leverage diverse datasets, we pick three publicly available navigation datasets that vary in their collection platform, visual sensors, and dynamics. This allows us to train policies that can learn shared representations across these widely varying datasets, and generalize to new environments (both indoors and outdoors) and new robots.
GO Stanford dataset (camera: spherical camera, platform: TurtleBot2, location: university buildings)
RECON dataset (camera: wide FoV camera, platform: Jackal, location: grassy fields)
KITTI odometry dataset (camera: narrow FoV camera, platform: vehicle, location: European city)
Appendix
We provide additional explanations of the masked normalization weight in the geometry-aware objective, our view synthesis, and our navigation system in real environments. The details are also explained in the appendix of the arXiv version.
BibTeX
@inproceedings{hirose2023exaug,
title={ExAug: Robot-conditioned navigation policies via geometric experience augmentation},
author={Hirose, Noriaki and Shah, Dhruv and Sridhar, Ajay and Levine, Sergey},
booktitle={2023 IEEE International Conference on Robotics and Automation (ICRA)},
pages={4077--4084},
year={2023},
organization={IEEE}
}