Semantics through Time: Semi-supervised Segmentation of Aerial Videos with Iterative Label Propagation

Alina Marcu

PhD Student,

"Simion Stoilow" Institute of Mathematics of the Romanian Academy

&

University Politehnica of Bucharest

Vlad Licăreț

AI Engineer,

University Politehnica of Bucharest

Dragoș Costea

PhD Student,

"Simion Stoilow" Institute of Mathematics of the Romanian Academy

&

University Politehnica of Bucharest

Prof. Dr. Marius Leordeanu

Supervisor,

"Simion Stoilow" Institute of Mathematics of the Romanian Academy

&

University Politehnica of Bucharest

Accepted as an oral presentation at the 15th Asian Conference on Computer Vision (ACCV 2020)

Abstract

Semantic segmentation is a crucial task for robot navigation and safety. However, current supervised methods require large amounts of pixel-wise annotations to yield accurate results. Labeling is a tedious and time-consuming process that has hampered progress in low-altitude UAV applications. This paper makes an important step towards automatic annotation by introducing SegProp, a novel iterative flow-based method, with a direct connection to spectral clustering in space and time, that propagates semantic labels to frames lacking human annotations. The labels are further used in semi-supervised learning scenarios. Motivated by the lack of a large aerial video dataset, we also introduce Ruralscapes, a new dataset with high-resolution (4K) images and manually annotated dense labels every 50 frames, the largest of its kind to the best of our knowledge. Our novel SegProp automatically annotates the remaining unlabeled 98% of frames with an accuracy exceeding 90% (F-measure), significantly outperforming other state-of-the-art label propagation methods. Moreover, when integrating other methods as modules inside SegProp's iterative label propagation loop, we achieve a significant boost over the baseline labels. Finally, we test SegProp in a full semi-supervised setting: we train several state-of-the-art deep neural networks on training frames automatically labeled by SegProp and test them on completely novel videos. In every case, we demonstrate a significant improvement over the purely supervised scenario.

Main contributions:

  • We present SegProp, an iterative semantic label propagation method in video, which outperforms the current state-of-the-art.

  • We introduce Ruralscapes, the largest high-resolution (4K) video dataset for aerial semantic segmentation, with 50,835 fully annotated frames and 12 semantic classes.

  • SegProp can be easily integrated with other label propagation methods to further improve their initial segmentation results.

  • We test SegProp in semi-supervised learning scenarios and compare with state-of-the-art deep neural nets for semantic segmentation.

Overview


SegProp: our method for automatic propagation of semantic labels in the context of semi-supervised segmentation in aerial videos.

Step 1. First, we sample the UAV videos at regular intervals (e.g., one or two frames per second). The resulting frames are then manually labeled.

Step 2. We automatically propagate labels to the remaining unlabeled frames using our SegProp algorithm, based on class voting at the pixel level, according to inward and outward label propagation flows between the current frame and an annotated frame. The propagation flows can be based on optical flow, a homography transformation, or another propagation method, as shown in our experiments. SegProp iteratively propagates the segmentation class votes until convergence, improving performance over iterations.

Step 3. We then mix all the generated annotations with the manually labeled ground truth to train powerful deep networks for semantic segmentation and significantly improve performance on unseen videos.
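The core operation in Step 2 is warping a label map from an annotated frame to an unlabeled frame along a dense flow field. The following is a minimal NumPy sketch of such a nearest-neighbour label warp, not the paper's implementation; the function name and flow convention (backward flow from the target frame to the annotated source frame, as produced e.g. by cv2.calcOpticalFlowFarneback) are illustrative assumptions.

```python
import numpy as np

def warp_labels(labels, flow):
    """Warp an integer label map from an annotated frame to a target frame.

    labels: (H, W) int array of class ids for the annotated frame.
    flow:   (H, W, 2) backward flow; each target pixel (x, y) reads its
            label from source location (x + flow[..., 0], y + flow[..., 1]).
    """
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Follow the flow back to the source pixel, rounding to the nearest
    # integer location and clamping at the image border.
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return labels[src_y, src_x]
```

Nearest-neighbour lookup (rather than interpolation) keeps the output a valid map of discrete class ids, which is what the voting step consumes.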

SegProp: Semantic Propagation through Time

We present SegProp, an iterative flow-based method to propagate, through space and time, the semantic segmentation labels to video frames that lack human annotations. SegProp propagates labels in an iterative fashion, forward and backward in time from annotated frames, by looping several times through the video and accumulating class votes at each iteration. At convergence, the majority class wins at each pixel. From a theoretical point of view, SegProp relates to spectral MAP labeling in graphical models and has convergence and improvement guarantees. In practice, we demonstrate the effectiveness of SegProp in several ways. First, we show that SegProp propagates labels to unlabeled frames with an accuracy that outperforms competing methods by a significant margin. Second, we show that other label propagation methods can be readily integrated as modules inside the SegProp propagation loop, with a significant boost in performance. And third, we demonstrate SegProp's effectiveness in a semi-supervised learning scenario, in which several state-of-the-art deep networks for semantic segmentation are trained on the automatically annotated frames and tested on novel videos, with a significant improvement over the purely supervised case.
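The vote-accumulation step described above can be sketched in a few lines of NumPy: each propagated label map (e.g. one warped forward from the previous annotated frame and one warped backward from the next) casts one vote per pixel, and the majority class wins. This is a simplified illustration under assumed conventions, not the paper's code; in the full algorithm the resulting maps are themselves re-propagated over several iterations until the votes converge.

```python
import numpy as np

def majority_vote(warped_maps, num_classes):
    """Accumulate per-pixel class votes from several propagated label maps
    and return the majority class at each pixel.

    warped_maps: list of (H, W) int arrays of class ids, e.g. label maps
                 warped forward and backward in time from annotated frames.
    """
    h, w = warped_maps[0].shape
    votes = np.zeros((h, w, num_classes), dtype=np.int32)
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    for m in warped_maps:
        # Each map casts exactly one vote per pixel for its predicted class.
        votes[rows, cols, m] += 1
    return votes.argmax(axis=-1)
```

Because votes are simply accumulated, extra propagation modules (homography-based, model-based, etc.) can contribute additional vote maps to the same tensor, which is how other methods plug into the loop.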

The Ruralscapes Dataset

We have collected 20 high-quality 4K videos portraying rural areas. Ruralscapes comprises various landscapes, different flying scenarios at multiple altitudes, and objects across a wide span of scales. The video sequence length varies from 11 seconds up to 2 minutes and 45 seconds. The dataset consists of 17 minutes of drone flight, resulting in a total of 50,835 fully annotated frames with 12 classes. Of those, 1,047 were manually annotated, once every second. To the best of our knowledge, it is the largest dataset for semantic segmentation from real UAV videos.

Labels offer a good level of detail, but, due to the reduced spatial resolution of small objects, accurate segmentation is difficult. Some classes, such as haystack, are very small by the nature of the dataset; others, such as person, also feature close-ups. Manual labeling is a time-consuming process. Based on the feedback received from the 21 volunteers from Liceul "Petru Cercel" in Târgoviște who segmented the dataset, it took them on average 45 minutes to label an entire frame. This translates into 846 human hours needed to segment the 1,047 manually labeled frames.

A. Ruralscapes classes. Labels overlaid over RGB image with detail magnification, offering a good level of detail. Ruralscapes also offers large variation in object scale.

B. Ruralscapes statistics. (Left) Distribution of pixels per class. Being a rural landscape, the dominant classes are buildings, land and forest. Due to the high altitude, smaller classes such as haystack, car and person hold a very small percentage. (Right) Number of labeled images in which each class is present.

Experimental Analysis

  1. Comparisons to other label propagation methods

  • Qualitative results of our label propagation method. Our iterative SegProp method provides labels that are less noisy and more consistent over larger propagation distances. Also, by looking both forward and backward in time, we can better handle occlusion: this is easily visible in the second row, at the bottom of the image, where forward camera movement obscures a bridge.

2. Semi-supervised learning with automatically generated labels

  • Qualitative results on the testing set. The results show that our proposed method leads to significantly more accurate segmentation in the semi-supervised scenario than in the supervised case. SegProp clearly benefits the smaller, under-represented classes such as person (third row).

3. Ablation studies: the effect of the propagation module

  • Homography propagation module

  • Other vote propagation modules

  • Influence of temporal propagation length

We run SegProp with additional votes alongside our optical-flow-based mappings, measuring the mean F-measure over all classes. For the version with homography voting, we also run the final filtering step. Bold values indicate the best results.
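For intuition on the homography propagation module, the sketch below warps a label map with a given 3x3 homography as one extra vote source. This is an illustrative NumPy-only sketch under assumed conventions (the homography maps target pixels to source pixels), not the paper's implementation; in practice the matrix would be estimated from matched keypoints, e.g. with cv2.findHomography, and the warp done with cv2.warpPerspective using nearest-neighbour interpolation.

```python
import numpy as np

def homography_warp_labels(labels, H):
    """Warp an integer label map with a 3x3 homography H that maps
    target-frame pixel coordinates (x, y, 1) to source-frame coordinates.

    labels: (H, W) int array of class ids for the annotated source frame.
    H:      (3, 3) homography matrix, assumed given (e.g. estimated from
            keypoint matches between the two frames).
    """
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Homogeneous coordinates of every target pixel.
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = H @ pts
    # Dehomogenize, round to the nearest source pixel, clamp at the border.
    src_x = np.clip(np.round(src[0] / src[2]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(src[1] / src[2]).astype(int), 0, h - 1)
    return labels[src_y, src_x].reshape(h, w)
```

A single global homography models the dominant (mostly planar) ground motion well at high altitude, which is why it complements per-pixel optical flow votes on smooth regions while flow handles parallax better.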

A. Label propagation example showing typical optical flow voting difficulties. From left to right: RGB frame with manual white label overlaid, flow-based voting, homography-based voting.

B. The influence of increasingly larger temporal gaps between labeled frames on segmentation performance (mean F-measure over all classes on a subset of videos labeled with a frequency of 25 frames).

Useful Links & Citation


If you intend to use our work, please cite the following:

@article{marcu2020semantics,
  title={Semantics through Time: Semi-supervised Segmentation of Aerial Videos with Iterative Label Propagation},
  author={Marcu, Alina and Licaret, Vlad and Costea, Dragos and Leordeanu, Marius},
  journal={arXiv preprint arXiv:2010.01910},
  year={2020}
}

Useful Demos

SafeUAVNet trained with SegProp - Qualitative Results on Unseen Videos

TEST_unseen_videos.mp4

Annotation Tool Tutorial

annotation_tool_frameSeg.mp4

Acknowledgements

This work was funded by UEFISCDI, under Projects EEA-RO-2018-0496 and PN-III-P1-1.2-PCCDI-2017-0734.

We express our sincere thanks to Aurelian Marcu and The Center for Advanced Laser Technologies (CETAL) for providing access to their GPU computational resources.