Driven by Vision

Learning Navigation by Visual Localization and Trajectory Prediction

Learning to Localize and Navigate from Vision Alone


When driving, people make decisions based on current traffic as well as their desired route. They have a mental map of known routes and are often able to navigate without needing directions. Current published self-driving models improve their performance when given additional GPS information. Here we aim to push self-driving research forward and perform route planning even in the complete absence of GPS at inference time.

Our system learns to predict, in real time, the vehicle's current location and future trajectory on a known map, given only the raw video stream and the final destination. Trajectories consist of instant steering commands that depend on present traffic, as well as longer-term navigation decisions towards a specific destination. Along with our novel approach to localization and navigation from visual data, we also introduce a new large-scale dataset in an urban environment, which consists of video and GPS streams collected with a smartphone while driving. The GPS is automatically processed to obtain supervision labels and to create an analytical representation of the traversed map.

In tests, our solution outperforms published state-of-the-art methods on visual localization and steering and provides reliable navigation assistance between any two known locations. We also show that our system can adapt to short- and long-term changes in weather conditions or in the structure of the urban environment.


Below we make our entire Urban European Driving dataset and code publicly available.

The High-level Structure of the Visual Localization and Navigation System

Our system learns to predict location and future trajectory conditioned on destination using only visual input. The system is composed of two parts, the Localization by Vision Module (LOVis) and the Navigation by Vision Module (NAVis).

The first ConvNet module (LOVis) learns to predict the location and heading on the map using an image segmentation-based approach. The segmented dot on the map (bottom location map) gives the x-y coordinates of the location, while the angle between the half-circle (top map) and the full localization circle (bottom map) gives the heading angle in exact degrees.

The vehicle's future heading angle, in exact degrees, can also be estimated from the predicted trajectory, which provides full information about pose and direction of movement for up to seven seconds into the future.
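To make this concrete, here is a minimal sketch, in Python/NumPy, of how the location and heading could be decoded from the two predicted masks. The function name, mask layout and image-axis convention are our own illustrative assumptions, not the exact procedure of the released code.

    import numpy as np

    def decode_pose(circle_mask, half_circle_mask):
        # circle_mask:      H x W soft mask of the full localization circle (location)
        # half_circle_mask: H x W soft mask of the half-circle (orientation cue)
        # Returns the (x, y) map-pixel location and a heading angle in degrees.
        ys, xs = np.mgrid[0:circle_mask.shape[0], 0:circle_mask.shape[1]]

        # Location: weighted centroid of the full circle.
        w = circle_mask / (circle_mask.sum() + 1e-8)
        x_loc, y_loc = (xs * w).sum(), (ys * w).sum()

        # Heading: direction from the circle centroid towards the half-circle centroid
        # (the image y axis points down, hence the minus sign).
        wh = half_circle_mask / (half_circle_mask.sum() + 1e-8)
        x_h, y_h = (xs * wh).sum(), (ys * wh).sum()
        heading_deg = np.degrees(np.arctan2(-(y_h - y_loc), x_h - x_loc))

        return (x_loc, y_loc), heading_deg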

For the navigation task, two road map crops are taken around the current location: one showing all roads, the other showing only the intended route, conditioned on the final destination. From these road map crops and the video frames around the current moment in time, the second model (NAVis) predicts the navigation trajectory for the next seven seconds, also conditioned on the final destination. The trajectory estimation task becomes especially interesting and useful in intersections with several possible routes, as shown in the demos below.
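As an illustration of this interface only (not the architecture from the paper), the following PyTorch-style sketch takes the stacked recent RGB frames and the two road-map crops and regresses seven future 2D waypoints. All layer sizes and the number of input frames are assumptions made for the example.

    import torch
    import torch.nn as nn

    class NAVisSketch(nn.Module):
        # Minimal sketch of the NAVis interface: one encoder for the stacked RGB
        # frames, one for the two map crops, and a head that regresses 7 waypoints.
        def __init__(self, num_frames=4):
            super().__init__()
            self.frame_enc = nn.Sequential(
                nn.Conv2d(3 * num_frames, 32, 5, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.map_enc = nn.Sequential(
                nn.Conv2d(2, 16, 5, stride=2), nn.ReLU(),
                nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            # Regress 7 future (x, y) waypoints, one per second.
            self.head = nn.Linear(64 + 32, 7 * 2)

        def forward(self, frames, map_crops):
            # frames:    (B, 3 * num_frames, H, W) recent RGB frames, stacked on channels
            # map_crops: (B, 2, Hm, Wm) channel 0 = all roads, channel 1 = route to destination
            feat = torch.cat([self.frame_enc(frames), self.map_enc(map_crops)], dim=1)
            return self.head(feat).view(-1, 7, 2)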

Visual Localization and Navigation Demos

Driven by Vision: Demonstration of Visual Localization


On the right we show the localization capability of our LOVis system from visual input only.

Often GPS-based systems suffer significant delays or fail completely. However, visual data is always instantly available at a very low cost. Our Driven by Vision system is able to geo-localize the vehicle with near-GPS accuracy using only visual information.

As can be seen in the demo, the red dot (our visual location prediction) is very close to the blue dot (the GPS-based location). In fact, the two almost always completely overlap.

Navigation demos with NAVis: the two videos display two different cases, each with a different final destination, both exiting the same intersection.

Please observe how the NAVis system is able to provide the correct trajectory, which is different in each case due to a different final destination.

At the top of the video frames, please see the two local maps (with roads shown in white): the map on the left shows all roads, while the map on the right shows only the roads belonging to the correct route to the destination. Both maps, together with RGB frames around the current moment in time, are given to the NAVis network in order to estimate the correct trajectory for the next seven seconds, which aims to provide the correct directions to the final destination while respecting driving rules (e.g., stopping at a red light) and avoiding collisions with the vehicles in front (e.g., going around a stopped car).

In the videos, the red trajectory is our prediction, while the blue one is the correct trajectory, recorded using GPS while the pilot was driving the correct route during testing. Such ground truth, collected using human pilots, is needed for training (on the training set of videos) and evaluation (on the test set of videos, not seen during training).

Driven by Vision: Demonstration of Visual Navigation

Traditional GPS-based systems often suffer delays, which makes navigation very difficult, especially in large intersections with many possible routes and heavy traffic. A delay in the GPS navigation response, or a complete failure, could make it impossible for drivers to make the correct turn on time.

An intelligent system based on visual information, which is instantly available at low cost, could make the navigation prediction quicker, often more robust and even safer for driving in heavy traffic when decisions need to be made very quickly.

Our Driven by Vision system is able to predict the correct trajectory for the next seven seconds even in intersections with many outgoing roads as possible routes. The trajectory (defined as pose as a function of time) implicitly provides the future speed and acceleration, thus giving richer driving guidance based on the current traffic conditions. This information is more complete and more helpful than that given by current GPS-based or other vision-based systems in the literature, which only predict the instantaneous location and steering angle.

The trajectory predicted by our system in large intersections depends on the final destination, as can also be seen in the navigation demos on the left. Note that in the demos the trajectories are defined by the small dots, which represent locations relative to the current pose of the vehicle as functions of time: one dot per second in the future.
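Since the dots are one second apart, future speed and acceleration follow directly from the predicted waypoints by finite differences. A small illustrative sketch in Python (assuming the waypoints are expressed in metres relative to the current pose):

    import numpy as np

    def speed_and_acceleration(waypoints, dt=1.0):
        # waypoints: (7, 2) predicted future positions, relative to the current pose,
        # one per second. Returns per-step speed (m/s) and acceleration (m/s^2).
        pts = np.vstack([[0.0, 0.0], waypoints])   # prepend the current position
        step = np.diff(pts, axis=0)                # displacement per second
        speed = np.linalg.norm(step, axis=1) / dt  # 7 speed values
        accel = np.diff(speed) / dt                # 6 acceleration values
        return speed, accel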

Code and Dataset Available for Research Use


We make our Urban European Driving (UED) Dataset available here.

We also provide the training and testing code for our Driven by Vision system here.


Contact

For any questions regarding the usage of the code and the dataset, please contact us:

Iulia Paraicu at iulia.paraicu@gmail.com and Marius Leordeanu at leordeanu@gmail.com.

Additional Technical and Experimental Details

Automatic Construction of the Analytical Driving Map

A driving map of the known area, in which our vision-based navigation system can operate, is automatically constructed during training, when the data is collected, as follows.

The X-Y coordinates of the visited locations are collected during training using GPS data. These coordinates are then used to fit path segments between map nodes (intersections) by least squares, as higher-order polynomials giving the X and Y coordinates as functions of a single parameter: the distance travelled between the two nodes that the segment connects (its start and end nodes). Nodes (intersections) in the map graph are automatically found as intersections of two or more path segments. The final analytical map then becomes a large graph consisting of all the nodes (intersections) and edges (analytical path segments), which represent the road segments that connect those nodes.
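For illustration, a minimal sketch of this fitting step in Python/NumPy; the polynomial degree and sampling step are assumptions for the example, not values taken from the paper.

    import numpy as np

    def fit_path_segment(xy, degree=3):
        # xy: (N, 2) GPS-derived X-Y points visited along one segment, in order.
        # Returns polynomial coefficients for x(s) and y(s), where s is the cumulative
        # distance travelled from the start node, plus the total segment length.
        s = np.concatenate([[0.0], np.cumsum(np.linalg.norm(np.diff(xy, axis=0), axis=1))])
        px = np.polyfit(s, xy[:, 0], degree)  # least-squares fit x(s)
        py = np.polyfit(s, xy[:, 1], degree)  # least-squares fit y(s)
        return px, py, s[-1]

    def sample_segment(px, py, length, step=1.0):
        # Sample the fitted segment every `step` metres, e.g. to draw the map.
        s = np.arange(0.0, length + step, step)
        return np.stack([np.polyval(px, s), np.polyval(py, s)], axis=1)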

In the figure above we present: A - the graph structure of the driving map; B - cropped sections of the analytically obtained map overlaid on the corresponding regions from Google Maps; C - the sub-part (orange) of the map traversed in the second phase of data collection, after a period of 14 months, during which significant structural and architectural changes took place along the road.

As presented next, in the example images below, our system is able to adapt to such structural and environmental changes, being robust to changes in weather, time of day, structural changes and other sources of noise (e.g., motion blur).

The system is robust to structural, environmental and weather changes

Above we present representative examples of changes in environmental conditions.

A. Long-term structural changes after 14 months. B. Short-term changes in weather conditions (artificially simulated).

The system was trained using images spanning a 14-month period, taken under different conditions of weather, time of day, traffic and other environmental changes. Moreover, we also added to the training set versions of the same images with artificially simulated weather conditions, for more robust training.

Our system is able to adapt to such short- and long-term environmental changes, showing excellent localization and trajectory prediction accuracy when tested on novel, unseen data, often under severe, real-world condition changes.

The Deep Net Architectures of the Driven by Vision System

In the figures above we present the network architectures that we propose for the localization module LOVis (A) and the navigation module NAVis (B).

The localization module (A) outputs the vehicle's 2D pose (location and orientation) using a segmentation representation: a small circle on the map representing the vehicle's location and a half-circle, of the same size and at the same location, whose orientation corresponds to the orientation of the vehicle w.r.t. the world coordinate system.
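As an illustration of this representation (not the exact supervision code), here is a small sketch of how such a circle and half-circle target could be rasterized for a given ground-truth pose; the two-channel layout, radius and axis convention are our own assumptions.

    import numpy as np

    def render_pose_target(map_h, map_w, x, y, heading_deg, radius=5):
        # Channel 0: full circle at the vehicle location.
        # Channel 1: the half of that circle lying in the heading direction.
        ys, xs = np.mgrid[0:map_h, 0:map_w]
        dist2 = (xs - x) ** 2 + (ys - y) ** 2
        circle = (dist2 <= radius ** 2).astype(np.float32)

        # Half-plane through (x, y) whose normal points along the heading
        # (image y axis points down, hence the minus sign).
        hx, hy = np.cos(np.radians(heading_deg)), -np.sin(np.radians(heading_deg))
        half = circle * (((xs - x) * hx + (ys - y) * hy) >= 0).astype(np.float32)
        return np.stack([circle, half], axis=0)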

The navigation module (B) outputs the predicted future trajectory (for the next seven seconds) as seven 2D points, one per second into the future.

Papers


Iulia Paraicu and Marius Leordeanu, Driven by Vision: Learning Navigation by Visual Localization and Trajectory Prediction, submitted to Special Issue on Sensors and Computer Vision Techniques for 3D Object Modeling, Sensors. Under review.

Iulia Paraicu and Marius Leordeanu, Learning Navigation by Visual Localization and Trajectory Prediction, arXiv preprint arXiv:1910.02818, 2019. PDF

Team members

MSc. Iulia Paraicu

Polytechnic University of Bucharest

MSc. Victor Robu

Polytechnic University of Bucharest

Institute of Mathematics of the Romanian Academy

Prof. Dr. Marius Leordeanu

Polytechnic University of Bucharest

Institute of Mathematics of the Romanian Academy

Funding

This work is funded through UEFISCDI, under projects:

EEA Norway Grant EEA-RO-2018-0496 and PN-III-P1-1.2-PCCDI-2017-0734