Research projects

2020


Marius Leordeanu, Mihai Pirvu, Dragos Costea, Alina Marcu, Emil Slusanschi, Rahul Sukthankar

We address the challenging problem of semi-supervised learning in the context of multiple visual interpretations of the world by finding consensus in a graph of neural networks. Each graph node is a scene interpretation layer, while each edge is a deep net that transforms the layer at one node into the layer at another node. During the supervised phase, edge networks are trained independently. During the subsequent unsupervised stage, edge nets are trained on the pseudo-ground truth provided by consensus among the multiple paths that reach a given edge's start and end nodes. These paths act as ensemble teachers for that edge, and strong consensus is used as a high-confidence supervisory signal. The unsupervised learning process is repeated over several generations, in which each edge becomes a "student" and also part of different ensemble "teachers" for training other students. By optimizing such consensus between different paths, the graph reaches consistency and robustness over multiple interpretations and generations, in the face of unknown labels. We give theoretical justifications for the proposed idea and validate it on a large dataset. We show how predictions of different representations, such as depth, semantic segmentation, surface normals and pose, can be effectively learned from RGB input through self-supervised consensus in our graph. We also compare to state-of-the-art methods for multi-task and semi-supervised learning and show superior performance.

Keywords: Semi-supervised learning, Ensemble learning, Knowledge distillation, Generalization
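To make the consensus step concrete, the following Python sketch shows one plausible way pseudo-ground truth could be built from the predictions of multiple ensemble paths reaching the same node. The median/variance choice and the agreement threshold are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def consensus_pseudo_labels(path_predictions, agreement_thresh=0.05):
    """Given predictions for the same target node from several ensemble
    paths (each an H x W x C array), build a pseudo-ground-truth map and
    a confidence mask keeping only pixels where the paths strongly agree.

    NOTE: illustrative sketch only; the exact consensus measure and
    threshold are not specified in the abstract.
    """
    preds = np.stack(path_predictions, axis=0)         # (P, H, W, C)
    pseudo_gt = np.median(preds, axis=0)               # robust ensemble answer
    disagreement = np.mean(np.std(preds, axis=0), -1)  # per-pixel spread
    confident = disagreement < agreement_thresh        # strong-consensus mask
    return pseudo_gt, confident

# An edge net ("student") would then be trained only on pixels where
# `confident` is True, using `pseudo_gt` as the supervisory signal.
```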


Alina Marcu, Vlad Licaret, Dragos Costea, Marius Leordeanu

Semantic segmentation is a crucial task for robot navigation and safety. However, current supervised methods require large amounts of pixelwise annotation to yield accurate results. Labeling is a tedious and time-consuming process that has hampered progress in low-altitude UAV applications. This paper makes an important step towards automatic annotation by introducing SegProp, a novel iterative flow-based method, with a direct connection to spectral clustering in space and time, that propagates semantic labels to frames lacking human annotations. The labels are further used in semi-supervised learning scenarios. Motivated by the lack of a large video aerial dataset, we also introduce Ruralscapes, a new dataset with high-resolution (4K) images and manually annotated dense labels every 50 frames - the largest of its kind, to the best of our knowledge. SegProp automatically annotates the remaining 98% of unlabeled frames with an accuracy exceeding 90% (F-measure), significantly outperforming other state-of-the-art label propagation methods. Moreover, when integrating other methods as modules inside SegProp's iterative label propagation loop, we achieve a significant boost over the baseline labels. Finally, we test SegProp in a full semi-supervised setting: we train several state-of-the-art deep neural networks on the frames labeled automatically by SegProp and test them on completely novel videos. In every case, we demonstrate a significant improvement over the purely supervised scenario.

Keywords: Semi-supervised learning, Video Semantic Segmentation, UAVs, Scene Understanding, Label Propagation, CNNs
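As a rough illustration of flow-based label propagation (SegProp's full iterative procedure, with its connection to spectral clustering, is considerably more involved), the Python sketch below warps labels from the two nearest annotated frames and keeps only the pixels where the two votes agree. The flow convention and the 'uncertain' label value are assumptions made for the example.

```python
import numpy as np

UNCERTAIN = 255  # label for pixels where the two propagations disagree

def warp_labels(labels, flow):
    """Carry a (H, W) integer label map along a dense flow field
    (H, W, 2) with nearest-neighbour lookup. flow[y, x] points from the
    unlabeled frame back to the annotated frame (an assumed convention)."""
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return labels[src_y, src_x]

def propagate(prev_labels, next_labels, flow_to_prev, flow_to_next):
    """Propagate labels from the annotated frames before and after the
    current one; keep agreements, mark the rest as uncertain."""
    fwd = warp_labels(prev_labels, flow_to_prev)
    bwd = warp_labels(next_labels, flow_to_next)
    return np.where(fwd == bwd, fwd, UNCERTAIN)
```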


2019


Alina Marcu, Dragos Costea, Vlad Licaret, Marius Leordeanu

Semantic segmentation is a crucial task for robot navigation and safety. However, it requires large amounts of pixelwise annotation to yield accurate results. While recent progress in computer vision algorithms has been heavily boosted by large ground-level datasets, labeling time has hampered progress in low-altitude UAV applications, mostly due to the difficulty posed by large variations in object scale and pose. Motivated by the lack of a large video aerial dataset, we introduce a new one, with high-resolution (4K) images and manually annotated dense labels every 50 frames. To help the video labeling process, we make an important step towards automatic annotation and propose SegProp, an iterative flow-based method with geometric constraints that propagates semantic labels to frames lacking human annotations. This results in a dataset with more than 50k annotated frames - the largest of its kind, to the best of our knowledge. Our experiments show that SegProp surpasses current state-of-the-art label propagation methods by a significant margin. Furthermore, when training a semantic segmentation deep neural net using the automatically annotated frames, we obtain a compelling overall performance boost at test time of 16.8% mean F-measure over a baseline trained only with manually labeled frames.

Keywords: Scene Understanding, Label Propagation, CNNs, UAVs
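For reference, the mean F-measure behind the quoted 16.8% boost can be computed per class as below; the abstract does not spell out the exact averaging protocol, so this is the standard per-class definition.

```python
import numpy as np

def mean_f_measure(pred, gt, num_classes):
    """Mean per-class F-measure between predicted and ground-truth label
    maps. Classes absent from both maps are skipped (our assumption)."""
    scores = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        if tp + fp + fn == 0:   # class absent from both maps
            continue
        scores.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(scores))
```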


Learning Navigation by Visual Localization and Trajectory Prediction

Iulia Paraicu, Marius Leordeanu

When driving, people make decisions based on current traffic as well as their desired route. They have a mental map of known routes and are often able to navigate without needing directions. Current self-driving models improve their performance when given additional GPS information. Here we aim to push self-driving research forward and perform route planning even in the absence of GPS. Our system learns to predict, in real time, the vehicle's current location and future trajectory, as a function of time, on a known map, given only the raw video stream and the intended destination. The GPS signal is available only at training time, and training data annotation is fully automatic. Unlike other published models, we predict the vehicle's trajectory up to seven seconds ahead, from which complete steering, speed and acceleration information can be derived for the entire time span. Trajectories capture navigational information at multiple levels, from instant steering commands that depend on present traffic and obstacles ahead, to longer-term navigation decisions towards a specific destination. We collect our dataset with a regular car and a smartphone that records video and GPS streams. The GPS data is used to derive ground-truth supervision labels and to create an analytical representation of the traversed map. In tests, our system outperforms published methods on visual localization and steering, and gives accurate navigation assistance between any two known locations.

Keywords: Autonomous Driving, Navigation, Vision-based localization
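Since the abstract notes that steering, speed and acceleration can all be derived from the predicted trajectory, here is a minimal sketch of that derivation via finite differences; the sampling rate dt is an assumed parameter, not a value from the paper.

```python
import numpy as np

def controls_from_trajectory(xy, dt=0.1):
    """Derive speed, acceleration and heading from a predicted trajectory:
    an (N, 2) array of map positions sampled every `dt` seconds (dt is an
    assumed sampling rate; the paper predicts up to seven seconds ahead).
    """
    vel = np.gradient(xy, dt, axis=0)           # (N, 2) velocity vectors
    speed = np.linalg.norm(vel, axis=1)         # m/s
    accel = np.gradient(speed, dt)              # m/s^2
    heading = np.arctan2(vel[:, 1], vel[:, 0])  # radians; steering commands
    return speed, accel, heading                # follow from heading changes
```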


2018


Alina Marcu, Dragos Costea, Vlad Licaret, Mihai Pirvu, Emil Slusanschi, Marius Leordeanu

The emergence of relatively low-cost UAVs has prompted global concern about the safe operation of such devices. Since most of them can 'autonomously' fly by means of GPS waypoints, the lack of higher-level logic for emergency scenarios leads to an abundance of incidents involving property damage or personal injury. To tackle this problem, we propose a small, embeddable ConvNet for both depth and safe-landing-area estimation. Furthermore, since labeled training data in the 3D aerial domain is scarce and ground-level images are unsuitable, we create a novel synthetic aerial 3D dataset obtained from 3D reconstructions. We use the synthetic data to learn to estimate depth from in-flight images and to segment them into 'safe-landing' and 'obstacle' regions. Our experiments demonstrate compelling results in practice on both synthetic data and real RGB drone footage.

Keywords: UAVs, CNNs, Depth estimation, Safe landing

Oral presentation at the UAVision Workshop, European Conference on Computer Vision (ECCV) 2018
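As an illustration of the dual-head idea (a single small encoder feeding both a depth head and a safe-landing head in one forward pass), here is a PyTorch sketch; the layer sizes are placeholders, not the published architecture.

```python
import torch
import torch.nn as nn

class TinyDepthLandingNet(nn.Module):
    """Minimal sketch of a small two-head ConvNet: one head regresses
    per-pixel depth, the other segments 'safe-landing' vs 'obstacle'."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        def head(out_ch):  # lightweight decoder back to input resolution
            return nn.Sequential(
                nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, out_ch, 4, stride=2, padding=1),
            )
        self.depth_head = head(1)    # depth regression
        self.landing_head = head(2)  # safe-landing / obstacle logits

    def forward(self, x):
        z = self.encoder(x)
        return self.depth_head(z), self.landing_head(z)
```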


Alina Marcu, Dragos Costea, Emil Slusanschi, Marius Leordeanu

We propose a novel multi-task, multi-stage neural network that handles semantic segmentation and localization at the same time, in a single forward pass. The first stage of our network predicts pixelwise class labels, while the second stage provides a precise location using two branches: one branch uses a regression network, while the other predicts a location map trained as a segmentation task. Structurally, our architecture uses encoder-decoder modules at each stage, with the same encoder structure reused. Furthermore, its size is kept small enough to be tractable on an embedded GPU. We achieve commercial-GPS-level localization accuracy from satellite images with a spatial resolution of 1 square meter per pixel over a city-wide area of interest. On the task of semantic segmentation, we obtain state-of-the-art results on two challenging datasets, the Inria Aerial Image Labeling dataset and Massachusetts Buildings.

Keywords: Localization, Multi-task learning, UAVs

Poster at Transylvanian Machine Learning Summer School 2018
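A rough PyTorch sketch of the two-stage layout described above, with the second stage consuming the first stage's segmentation and splitting into a regression branch and a location-map branch; module sizes and the shared encoder-decoder template are illustrative assumptions.

```python
import torch
import torch.nn as nn

def enc_dec(in_ch, out_ch):
    """Encoder-decoder template reused at both stages (sizes are
    illustrative, not the published architecture)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1),
    )

class TwoStageLocNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.stage1 = enc_dec(3, num_classes)       # pixelwise class labels
        self.loc_map = enc_dec(3 + num_classes, 1)  # location as segmentation
        self.regress = nn.Sequential(               # location as regression
            nn.Conv2d(3 + num_classes, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2),
        )

    def forward(self, x):
        seg = self.stage1(x)
        x2 = torch.cat([x, seg], dim=1)  # stage 2 sees image + stage-1 output
        return seg, self.loc_map(x2), self.regress(x2)
```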


Dragos Costea, Alina Marcu, Emil Slusanschi, Marius Leordeanu

Road detection from aerial images is a challenging task for humans and machines alike. Occlusion, the lack of visual cues and the slim class border between roads and other road-like structures (such as pathways or private alleys) make the problem inherently ambiguous, requiring logic that goes beyond the input image. We propose a three-stage method for the task of road segmentation: first, an ensemble of multiple U-Net-like CNNs generates binary road masks. Second, another CNN learns to refine the road segmentations based on the fusion of the road maps from the first stage. Third, missing links are added based on the inferred road graph to improve the final segmentation.

Keywords: Road segmentation, CNNs, aerial images

Poster at the DeepGlobe Workshop, in conjunction with Computer Vision and Pattern Recognition (CVPR) 2018
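The second stage, fusing the ensemble's binary masks with a small refinement CNN, could look like the PyTorch sketch below; the layer sizes and ensemble size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RoadMaskFusion(nn.Module):
    """Stage-two sketch: a small CNN that fuses the binary road masks
    produced by an ensemble of U-Net-like models into a refined mask."""
    def __init__(self, n_models=3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(n_models, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),  # refined road logits
        )

    def forward(self, masks):  # masks: (B, n_models, H, W)
        return self.fuse(masks)

# Stage three would then vectorize the refined mask into a road graph and
# add missing links between nearby but disconnected road endpoints.
```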


2017


Dragos Costea, Alina Marcu, Emil Slusanschi, Marius Leordeanu

Recognizing roads and intersections in aerial images is a challenging problem in computer vision with many real-world applications, such as localization and navigation for unmanned aerial vehicles (UAVs). The problem is currently gaining momentum in computer vision and is still far from being solved. While recent approaches have greatly improved due to advances in deep learning, they provide only pixel-level semantic segmentations. In this paper, we argue that roads and intersections should be recognized at the higher semantic level of road graphs - with roads being edges that connect intersection nodes. Towards this goal we present a method consisting of two stages. In the first stage, we detect roads and intersections with a novel, dual-hop generative adversarial network (DH-GAN) that segments images at the level of pixels. In the second stage, given the pixelwise road segmentation, we find its best-covering road graph by applying a smoothing-based graph optimization procedure. Our approach is able to outperform recently published methods and baselines on a large dataset with European roads.

Keywords: Road segmentation, tracing, optimization, CNNs, aerial images

Oral presentation at the UAVision Workshop, International Conference on Computer Vision (ICCV) 2017 | Best presentation award
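To make the pixels-to-graph step concrete, here is a simplified Python stand-in based on skeletonization; the paper's actual smoothing-based graph optimization is a different and more robust procedure, so this only illustrates the shape of the output.

```python
import numpy as np
from skimage.morphology import skeletonize

def road_graph(mask):
    """Turn a binary road segmentation into graph edges between adjacent
    skeleton pixels (a simplified stand-in, not the published method)."""
    skel = skeletonize(mask > 0)
    ys, xs = np.nonzero(skel)
    nodes = set(zip(ys.tolist(), xs.tolist()))
    edges = []
    for y, x in nodes:  # scan half of the 8-neighbourhood to avoid duplicates
        for dy, dx in [(0, 1), (1, 0), (1, 1), (1, -1)]:
            if (y + dy, x + dx) in nodes:
                edges.append(((y, x), (y + dy, x + dx)))
    return nodes, edges
```

Intersections would then correspond to graph nodes with more than two incident edges.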


Alina Marcu, Marius Leordeanu

The importance of visual context in object recognition has been intensively studied over the years. With the advent of deep convolutional neural networks (CNNs), the use of contextual information in such systems has started to receive attention in the literature. Despite advances in deep learning, aerial image analysis still poses many great challenges. Satellite images are often taken under poor lighting conditions and contain low-resolution objects that are frequently occluded. For this particular task, visual context could be of great help, but there are still very few papers that consider context in aerial image understanding. Our work addresses the task of object segmentation in aerial images with a novel dual-stream deep convolutional neural network that integrates local object appearance and global contextual information into a unified network. Our model learns to combine local object appearance and global semantic knowledge simultaneously and in a complementary way, so that together they form a powerful classifier. Experiments on the Massachusetts Buildings dataset demonstrate the superiority of our model over state-of-the-art methods. We also introduce two new challenging datasets for the tasks of building and road segmentation. While our local-global model could also be useful in general recognition tasks, we clearly demonstrate the effectiveness of visual context in conjunction with deep nets in aerial image understanding.

Keywords: Road segmentation, road vectorization, optimization, CNNs, aerial images

Oral presentation at AI-CAV workshop in conjunction with AAAI 2017, San Francisco (CA).
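A minimal PyTorch sketch of the dual-stream idea, with one pathway for the local patch and another for a downscaled view of the larger scene, concatenated only in the final layers; all layer sizes and the binary output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualStreamNet(nn.Module):
    """Sketch of the local-global model: two independent pathways whose
    features are fused only at the end (sizes are ours, not the paper's)."""
    def __init__(self):
        super().__init__()
        def stream():
            return nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            )
        self.local, self.context = stream(), stream()
        self.classifier = nn.Linear(2 * 32 * 4 * 4, 2)  # object vs background

    def forward(self, patch, scene):
        # patch: local crop; scene: downscaled wider view around the crop
        feats = torch.cat([self.local(patch), self.context(scene)], dim=1)
        return self.classifier(feats)
```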


2016


Alina Marcu, Marius Leordeanu

Visual context is important in object recognition, yet how to use it effectively is still an open problem in computer vision. With the advent of deep convolutional neural networks (CNNs), the use of contextual information in such systems has started to receive attention in the literature. At the same time, aerial imagery is gaining momentum. While advances in deep learning have brought good progress in aerial image analysis, this problem still poses many great challenges. Aerial images are often taken under poor lighting conditions and contain low-resolution objects, many times occluded by trees or taller buildings. In this domain in particular, visual context could be of great help, but there are still very few papers that consider context in aerial image understanding. Here we introduce context as a complementary way of recognizing objects. We propose a dual-stream deep neural network model that processes information along two independent pathways, one for local and another for global visual reasoning. The two are later combined in the final layers of processing. Our model learns to combine local object appearance and information from the larger scene at the same time and in a complementary way, such that together they form a powerful classifier. We test our dual-stream network on the task of segmenting buildings and roads in aerial images and obtain state-of-the-art results on the Massachusetts Buildings dataset. We also introduce two new datasets, for building and road segmentation respectively, and study the relative importance of local appearance vs. the larger scene, as well as their performance in combination. While our local-global model could also be useful in general recognition tasks, we clearly demonstrate the effectiveness of visual context in conjunction with deep nets for aerial image understanding.

Keywords: Semantic segmentation, CNNs, aerial images


Dragos Costea, Marius Leordeanu

Aerial image analysis at a semantic level is important in many applications with strong potential impact in industry and consumer use, such as automated mapping, urban planning, real estate, environment monitoring and disaster relief. The problem is enjoying great interest in computer vision and remote sensing, due to increased computing power and improvements in automated image understanding algorithms. In this work we address the task of automatic geolocalization of aerial images through the recognition and matching of roads and intersections. Our proposed method is a novel contribution that could enable many applications of aerial image analysis when GPS data is not available. We offer a complete pipeline for geolocalization: from the detection of roads and intersections, to the identification of the enclosing geographic region by matching detected intersections to previously learned, manually labeled ones, followed by accurate geometric alignment between the detected roads and the manually labeled maps. We test on a novel dataset with aerial images of two European cities and use the publicly available OpenStreetMap project to collect ground-truth road annotations. We show in extensive experiments that our approach produces highly accurate localizations in the challenging case when we train on images from one city and test on the other, with relatively poor aerial image quality. We also show that the alignment between detected roads and pre-stored manual annotations can be effectively used to improve the quality of the road detection results.

Keywords: Localization, semantic segmentation, CNNs, aerial images

Poster at British Machine Vision Conference, 2016
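Once detected intersections have been matched to map intersections, the geometric alignment step can be illustrated with a standard least-squares rigid fit (Kabsch algorithm); the abstract does not state which alignment model the paper uses, so this sketch assumes a rigid 2D transform and an already-established matching.

```python
import numpy as np

def align_intersections(detected, mapped):
    """Least-squares rigid alignment between matched detected intersections
    and their map counterparts, both (N, 2) arrays of 2D coordinates."""
    d0 = detected - detected.mean(axis=0)
    m0 = mapped - mapped.mean(axis=0)
    u, _, vt = np.linalg.svd(d0.T @ m0)
    r = (u @ vt).T               # 2x2 rotation
    if np.linalg.det(r) < 0:     # avoid reflections
        vt[-1] *= -1
        r = (u @ vt).T
    t = mapped.mean(axis=0) - detected.mean(axis=0) @ r.T
    return r, t                  # map_point ~ r @ detected_point + t

# Usage: r, t = align_intersections(det_pts, map_pts)
#        localized = det_pts @ r.T + t  # detections placed on the map
```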