Towards Automatic Annotation for Semantic Segmentation in Drone Videos

Alina Marcu

PhD Candidate, Institute of Mathematics of the Romanian Academy

Dragos Costea

PhD Candidate, University Politehnica of Bucharest

Vlad Licaret

AI Engineer

Prof. Dr. Marius Leordeanu

Supervisor, Institute of Mathematics of the Romanian Academy & University Politehnica of Bucharest

Abstract

Semantic segmentation is a crucial task for robot navigation and safety. However, it requires huge amounts of pixelwise annotations to yield accurate results. While recent progress in computer vision algorithms has been heavily boosted by large ground-level datasets, the labeling time has hampered progress in low-altitude UAV applications, mostly due to the difficulty imposed by large object scales and pose variations. Motivated by the lack of a large video aerial dataset, we introduce a new one, with high resolution (4K) images and manually-annotated dense labels every 50 frames. To help the video labeling process, we make an important step towards automatic annotation and propose SegProp, an iterative flow-based method with geometric constraints to propagate the semantic labels to frames that lack human annotations. This results in a dataset with more than 50k annotated frames - the largest of its kind, to the best of our knowledge. Our experiments show that SegProp surpasses current state-of-the-art label propagation methods by a significant margin. Furthermore, when training a semantic segmentation deep net using the automatically annotated frames, we obtain a compelling overall performance boost at test time of 16.8% mean F-measure over a baseline trained only with manually-labeled frames.

Idea & Pipeline

In this paper we introduce Ruralscapes, the largest high resolution (4K) video dataset for aerial semantic segmentation, taken in flight over rural areas in Eastern Europe. We then start from a relatively small subset of manually labeled frames in a video and apply SegProp, our novel iterative label propagation algorithm, to automatically annotate the whole sequence. Given a start and an end frame of a video sequence, SegProp finds pixelwise correspondences between labeled and unlabeled frames, to assign a class to each pixel in the video based on an iterative class voting procedure. In this way we generate huge amounts of labeled data (over 50k segmented frames) to use in training deep neural networks and show that the automatically labeled training frames help significantly in boosting the performance at test time.

Our pipeline can be divided into three steps. The first and most important is the data labeling step. We leverage the advantages of high quality 4K aerial videos, such as small frame-to-frame changes (50 frames per second), and manually annotate a relatively small fraction of frames, sampled at 1 frame per second. Then, we automatically generate a label for each intermediate frame between two labeled ones, using the SegProp algorithm. As a final step, we mix the manually and automatically annotated frames and use them for training.
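The frame bookkeeping behind the first step can be sketched as follows (an illustrative sketch only; the frame rate and keyframe interval follow the paper, while the function and variable names are ours):

```python
# Videos run at 50 fps; manual labels are drawn at 1 frame per second,
# i.e., one keyframe every 50 frames. All remaining frames are filled
# in automatically by label propagation (SegProp).
FPS = 50
KEYFRAME_STEP = FPS

def split_frames(num_frames):
    """Return (manual, propagated) frame indices for one video."""
    manual = list(range(0, num_frames, KEYFRAME_STEP))
    propagated = [i for i in range(num_frames) if i % KEYFRAME_STEP != 0]
    return manual, propagated

manual, propagated = split_frames(500)   # a 10-second clip
print(len(manual), len(propagated))      # 10 manual keyframes, 490 to propagate
```

At this sampling rate roughly 2% of the frames need human effort, which matches the dataset's ratio of 1,047 manual labels to 50,835 total frames.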


Main contributions:

  • We introduce Ruralscapes, the largest high resolution (4K) video dataset for aerial semantic segmentation, composed of 50,835 fully annotated frames with 12 semantic classes.
  • We propose an iterative, optical flow based label propagation method, termed SegProp, with geometric constraints, that outperforms similar state-of-the-art algorithms.
  • We show that our method can easily integrate other similar label propagation methods in order to further improve the segmentation results.


Ruralscapes: A Dataset for Rural UAV Scene Understanding with Large Altitude Changes

We designed a user-friendly tool that facilitates drawing the contour of objects (in the form of polygons). For each selected polygon we can assign one of the 12 available classes. The class set includes background objects such as forest, land, hill, sky, residential, road or river, and also, some foreground, countable objects, like person, church, haystack, fence and car.

We have collected 20 high quality 4K videos portraying rural areas. Ruralscapes comprises various landscapes, different flying scenarios at multiple altitudes and objects across a wide span of scales. The video sequence length varies from 11 seconds up to 2 minutes and 45 seconds. The dataset consists of 17 minutes of drone flight, resulting in 50,835 fully annotated frames with 12 classes. Of those, 1,047 were manually annotated. To the best of our knowledge, it is the largest dataset for semantic segmentation from real UAV videos.

Sample label image overlaid on top of its corresponding RGB image with detail magnification. Small classes such as haystack and car are difficult to segment accurately, but overall the labeled frames contain a very good level of detail. The dataset offers a large variation in object scale: classes generally easy to segment up close such as buildings turn into difficult classes far away from the camera.

Class pixel distribution. Being a rural landscape, the dominant classes are buildings, land and forest (73.01% combined). Due to the flight altitude, smaller classes such as haystack, car and person hold a very small percentage. Nevertheless, the class set supports common UAV tasks such as mapping, navigation with obstacle avoidance and safe landing, as well as more complex applications such as package delivery.

Proposed Automatic Label Propagation

Overview of the proposed method for automatic propagation of semantic labels in the context of aerial semantic segmentation


A. The UAV videos are sampled at one frame per second and the resulting frames are manually labeled.

B. The labels are propagated to the remaining frames using our SegProp algorithm, based on class voting at the pixel level according to (1) forward and backward flow from the current frame to a manually annotated frame, (2) region-based homography maps computed between the current and manually labeled frames and (3) iterating steps (1) and (2) over neighboring frames.
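The per-pixel voting at the core of step B can be sketched as follows (a simplified, numpy-only illustration; the function names, the flow convention and the nearest-neighbour lookup are our assumptions - the full SegProp additionally combines homography-based votes and iterates the procedure over neighboring frames):

```python
import numpy as np

NUM_CLASSES = 12  # Ruralscapes semantic classes

def warp_labels(labels, flow):
    """Pull keyframe labels into the current frame using a dense flow
    field, where flow[y, x] is the displacement from the current frame
    to the keyframe. Out-of-bounds pixels cast no vote (-1)."""
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    valid = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    out = np.full((h, w), -1, dtype=int)
    out[valid] = labels[src_y[valid], src_x[valid]]
    return out

def class_vote(warped_label_maps):
    """Fuse several warped label maps (e.g., from the forward and
    backward flows) by per-pixel majority vote over the classes."""
    h, w = warped_label_maps[0].shape
    votes = np.zeros((h, w, NUM_CLASSES), dtype=int)
    for warped in warped_label_maps:
        for c in range(NUM_CLASSES):
            votes[..., c] += (warped == c)
    return votes.argmax(axis=-1)
```

With the forward and backward warps as inputs, each pixel simply takes the class that most of its correspondences agree on, which is what makes the scheme robust to individual flow errors.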

C. All frames are used to train a UNet-like CNN with dilated convolutions from our previous work.
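The reason dilated convolutions suit this network is that they enlarge the receptive field without adding parameters, which helps with the large object scales in aerial views. A 1-D toy sketch (ours, not the actual network code) makes the effect concrete:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Valid 1-D dilated convolution (correlation form), for illustration.
    A kernel of size k with dilation d spans (k - 1) * d + 1 inputs."""
    k = len(kernel)
    span = (k - 1) * dilation + 1     # effective receptive field
    out_len = len(x) - span + 1
    out = np.zeros(out_len)
    for i in range(out_len):
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out

x = np.arange(10, dtype=float)
# A 3-tap kernel with dilation 2 sees 5 consecutive inputs.
print(dilated_conv1d(x, np.ones(3), dilation=2))  # [6. 9. 12. 15. 18. 21.]
```

Stacking layers with growing dilation rates grows the receptive field exponentially while the parameter count grows only linearly.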

Label propagation results

RGB frame with the manual label overlaid in white; flow-based voting only; homography-based voting only; and the full combined flow-and-homography voting propagation. While the homography-based voting produces "cleaner" semantic regions, an agreement between optical flow and homography is desirable.
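The homography-based votes can be obtained by warping keyframe labels through a 3x3 homography estimated per region, along the following lines (a minimal sketch under our assumptions; the paper computes region-based homography maps rather than a single global one, and the function name is ours):

```python
import numpy as np

def warp_labels_homography(labels, H):
    """Pull keyframe labels into the current frame through a 3x3
    homography H mapping current-frame pixels (x, y, 1) to keyframe
    pixels. Nearest-neighbour lookup; out-of-bounds pixels get -1."""
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    ones = np.ones_like(xs)
    pts = np.stack([xs, ys, ones], axis=-1) @ H.T   # homogeneous coords
    src_x = np.rint(pts[..., 0] / pts[..., 2]).astype(int)
    src_y = np.rint(pts[..., 1] / pts[..., 2]).astype(int)
    valid = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    out = np.full((h, w), -1, dtype=int)
    out[valid] = labels[src_y[valid], src_x[valid]]
    return out
```

Because a homography models the dominant planar motion of the scene, its warps are smooth within each region, which explains the "cleaner" regions above; the flow-based votes then recover the non-planar, object-level motion the homography misses.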

Experimental Analysis

The picture above depicts our qualitative results on the testing set. SegProp helps both small classes (person, haystack) and large classes (e.g., the sky and forest in the second row and the land in the background of the third row). Thus, not only are the small classes better represented, but the large ones also benefit from a more spatially coherent detection - e.g., the grass close to the humans in the third row.

Links

ArXiv Paper

Code

Coming Soon

Ruralscapes Dataset

Our Segmentation Tool

Coming Soon

Demo (video)

demo_final_icra_2020.mp4

Cite

If you intend to use our work, please cite the following:

@misc{marcu2019automatic,
    title={Towards Automatic Annotation for Semantic Segmentation in Drone Videos},
    author={Alina Marcu and Dragos Costea and Vlad Licaret and Marius Leordeanu},
    year={2019},
    eprint={1910.10026},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Acknowledgements

This work was supported by UEFISCDI, under Projects EEA-RO-2018-0496 and PN-III-P1-1.2-PCCDI-2017-0734.