World Model Based Sim2Real Transfer for Visual Navigation

Chen Liu*, Kiran Lekkala*, Laurent Itti

  iLab at University of Southern California

Abstract

Sim2Real transfer has gained popularity because it enables transferring models trained in inexpensive simulators to the real world. This paper presents a novel system that fuses components of a traditional World Model into a robust system, trained entirely within a simulator, that transfers zero-shot to the real world. To facilitate transfer, we use an intermediary representation based on Bird's Eye View (BEV) images. Our robot thus learns to navigate in a simulator by first learning to translate complex First-Person View (FPV) RGB images into BEV representations, and then learning to navigate using those representations. When later tested in the real world, the robot uses the perception model to translate FPV RGB images into embeddings that are consumed by the downstream policy. The incorporation of state-checking modules using Anchor images and a Mixture Density LSTM not only interpolates uncertain and missing observations but also enhances the robustness of the model when exposed to the real-world environment. We trained the model on data collected with a differential-drive robot in the CARLA simulator. The effectiveness of our methodology is demonstrated by deploying the trained models on a real-world differential-drive robot. Lastly, we publicly release a comprehensive codebase, dataset, and models for training and deployment.

Approach Overview

In this paper, we formulate a new setting for zero-shot Sim2Real transfer for visual navigation without maps. By decoupling the Perception (CNN) from the Control (Policy), we gain two advantages: the model can be deployed on a future robot with unknown dynamics, and reinforcement learning becomes more sample efficient. To enhance the robustness and zero-shot sim-to-real transfer of the Perception model, we add a Memory (LSTM) to the pipeline. We train these three models separately using data obtained from the simulator. For sim-to-real transfer, we propose a method that converts first-person view (FPV) images into bird's-eye view (BEV) embeddings. These BEV embeddings act as intermediate representations, effectively bridging the gap between simulated training environments and real-world testing for the downstream models.
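To make the decoupling concrete, the following is a minimal sketch of how the three components could be chained at inference time. The class and argument names (e.g., `Pipeline`, `goal`, `hidden`) are illustrative assumptions, not the released interfaces.

```python
# Minimal sketch of the decoupled Perception -> Memory -> Policy pipeline
# (module and argument names are hypothetical; this is not the released code).
import torch
import torch.nn as nn

class Pipeline(nn.Module):
    def __init__(self, perception: nn.Module, memory: nn.Module, policy: nn.Module):
        super().__init__()
        self.perception = perception  # CNN: FPV RGB frame -> BEV embedding
        self.memory = memory          # LSTM wrapper: (embedding, hidden) -> (filtered state, hidden)
        self.policy = policy          # MLP: filtered state + goal -> action

    def forward(self, fpv_image, goal, hidden):
        z = self.perception(fpv_image)          # per-frame BEV embedding
        state, hidden = self.memory(z, hidden)  # temporally smoothed latent state
        action = self.policy(torch.cat([state, goal], dim=-1))
        return action, hidden
```

Because the Policy only ever sees the latent state produced by Perception and Memory, swapping the robot body (or the perception backbone) does not require retraining the other components.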

Perception Model

We propose a method for training a Perception model on simulated data to generate BEV representations from FPV RGB images through contrastive learning. Notably, without any fine-tuning, the model robustly translates real-world street-view images into binary BEV images at test time, demonstrating strong generalization.
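As an illustration of the kind of objective involved, below is a generic InfoNCE-style contrastive loss that pairs FPV embeddings with their corresponding BEV embeddings within a batch. This is a sketch under the assumption of a standard batch-wise contrastive setup; the exact loss used in the released code may differ.

```python
# Generic InfoNCE-style contrastive objective pairing FPV and BEV embeddings
# (a sketch, not the exact released loss; temperature is a placeholder value).
import torch
import torch.nn.functional as F

def contrastive_loss(fpv_emb, bev_emb, temperature=0.07):
    fpv = F.normalize(fpv_emb, dim=-1)    # (B, D) embeddings from the FPV encoder
    bev = F.normalize(bev_emb, dim=-1)    # (B, D) embeddings of the paired BEV images
    logits = fpv @ bev.t() / temperature  # similarity of every FPV to every BEV in the batch
    targets = torch.arange(fpv.size(0), device=fpv.device)  # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)
```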


Temporal Model with Robustness Modules

We enhance the robustness of Perception by integrating a Memory model, allowing predictions to be based on a sequence of historical observations rather than a single instance. This Memory model serves a dual purpose: it facilitates Temporal State Checking (TSC) to discard error-prone predictions, and it supports state interpolation during periods of observation latency. Prediction stability is further improved through Anchor State Checking (ASC), wherein predictions are refined against a pre-defined distribution before being passed to the Policy or propagated to future states. These robustness modules are designed specifically to improve zero-shot sim-to-real transfer performance.
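The sketch below illustrates one way the two checks could operate on embeddings. The helper names, the use of cosine similarity, and the threshold value are assumptions for illustration only.

```python
# Illustrative sketches of Anchor State Checking (ASC) and Temporal State Checking (TSC)
# (helper names, similarity metric, and threshold are assumptions).
import torch
import torch.nn.functional as F

def anchor_state_check(pred_emb, anchor_embs):
    """ASC: refine a predicted embedding by snapping it to the nearest anchor embedding."""
    sims = F.cosine_similarity(pred_emb.unsqueeze(0), anchor_embs, dim=-1)  # (N,) similarities
    return anchor_embs[sims.argmax()]

def temporal_state_check(obs_emb, lstm_pred_emb, threshold=0.8):
    """TSC: reject an error-prone observation that disagrees with the LSTM's prediction."""
    agreement = F.cosine_similarity(obs_emb, lstm_pred_emb, dim=-1)
    return obs_emb if agreement >= threshold else lstm_pred_emb
```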

Open Source Codebases

We are releasing four open-source repositories:

(1) a differential-drive robot simulator Schoomatic

Schoomatic is a robot simulator built on CARLA and Unreal Engine 4, retaining all intrinsic CARLA features such as NPC traffic, variable weather conditions, and global waypoint planning. Our contribution involves importing and developing the dynamics and collision model specifically for a differential-drive robot. Additionally, our codebase provides integration with RLLib and ROS environments.
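As a rough sketch of what the RLLib integration enables, a Gym-style Schoomatic environment could be registered and trained as below. The environment class name, its constructor arguments, and the `env_config` keys are hypothetical, and the exact RLlib API may vary across Ray versions.

```python
# Hypothetical sketch of plugging a Schoomatic Gym-style environment into RLlib
# (SchoomaticEnv and the env_config keys are assumptions, not the released interface).
from ray.tune.registry import register_env
from ray.rllib.algorithms.ppo import PPOConfig

def make_env(env_config):
    from schoomatic_env import SchoomaticEnv  # assumed Gym wrapper around the simulator
    return SchoomaticEnv(**env_config)

register_env("schoomatic", make_env)

config = PPOConfig().environment("schoomatic", env_config={"town": "Town01"})
algo = config.build()
algo.train()  # one training iteration
```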

(2) code for pretraining CNN and LSTM

This repository provides extensive resources, including code, simulated datasets, and pre-trained weights essential for reproducing the CNN and LSTM in our method. The CNN pretraining involves a Variational AutoEncoder (VAE) and contrastive learning. In parallel, the LSTM is pretrained on random trajectories collected in the Schoomatic simulator. The implementation uses PyTorch.
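For reference, a standard VAE objective of the kind used in such pretraining is sketched below: a reconstruction term over the binary BEV targets plus a KL divergence on the latent. The `beta` weight and reduction are placeholders, not the repository's exact hyper-parameters.

```python
# Generic VAE objective (reconstruction + KL divergence) as used in CNN pretraining
# (a sketch; beta and the reduction are placeholder choices).
import torch
import torch.nn.functional as F

def vae_loss(recon_bev, target_bev, mu, logvar, beta=1.0):
    # recon_bev: sigmoid outputs in [0, 1]; target_bev: binary BEV images
    recon = F.binary_cross_entropy(recon_bev, target_bev, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x) || N(0, I))
    return recon + beta * kld
```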

(3) code for visual navigation reinforcement learning

Focused on the point-to-point visual navigation task for differential-drive robots, this repository provides training of a control policy with Proximal Policy Optimization (PPO). Note that the policy is trained independently of the CNN and LSTM implementations; the repository is therefore compatible with other perception methods, which can be used by replacing our observation space.
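One way to swap in a different perception method is an observation wrapper that replaces raw frames with embeddings before they reach the policy. The sketch below uses the gymnasium API; the class name, embedding dimension, and encoder interface are illustrative assumptions.

```python
# Sketch of swapping in a different perception module via an observation wrapper
# (class name, embed_dim, and the encoder callable are illustrative assumptions).
import gymnasium as gym
import numpy as np

class PerceptionWrapper(gym.ObservationWrapper):
    def __init__(self, env, encoder, embed_dim=512):
        super().__init__(env)
        self.encoder = encoder  # any callable: FPV RGB frame -> embedding vector
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(embed_dim,), dtype=np.float32
        )

    def observation(self, fpv_frame):
        # Replace the raw frame observation with the perception embedding
        return np.asarray(self.encoder(fpv_frame), dtype=np.float32)
```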

(4) code for real-world deployment

We deploy our models on a real-world robot for testing. This codebase contains the implementation of the robustness modules and presents a complete inference pipeline within our revised world model framework. The ROS package processes observations from the RGB sensor and outputs action commands accordingly. For environments demanding accelerated CNN inference, we provide compatibility with Coral Edge TPU implementations.
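The following is a minimal rospy node mirroring that deployment flow: an RGB image comes in, the inference pipeline runs, and a velocity command goes out. The topic names and the `infer` callable are placeholders for the released pipeline, not its actual interface.

```python
# Minimal rospy node sketch: RGB image in, velocity command out
# (topic names and the infer() callable are placeholders, not the released interface).
import rospy
from sensor_msgs.msg import Image
from geometry_msgs.msg import Twist
from cv_bridge import CvBridge

class NavigationNode:
    def __init__(self, infer):
        self.infer = infer  # callable: RGB array -> (linear velocity, angular velocity)
        self.bridge = CvBridge()
        self.cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
        rospy.Subscriber("/camera/image_raw", Image, self.on_image, queue_size=1)

    def on_image(self, msg):
        rgb = self.bridge.imgmsg_to_cv2(msg, desired_encoding="rgb8")
        linear, angular = self.infer(rgb)
        cmd = Twist()
        cmd.linear.x, cmd.angular.z = linear, angular
        self.cmd_pub.publish(cmd)

def dummy_infer(rgb):
    # Placeholder: replace with the CNN + LSTM + policy inference pipeline
    return 0.0, 0.0

if __name__ == "__main__":
    rospy.init_node("wm_sim2real_nav")
    NavigationNode(infer=dummy_infer)
    rospy.spin()
```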