Dual Local-Global Contextual Pathways for Recognition in Aerial Imagery

Team members

Alina Marcu

Supervisors

Prof. dr. Marius Leordeanu

Overview

We study the importance of visual context for the task of object detection in aerial images, also highlighting the great challenges this problem poses.

Aerial images are often taken under poor lighting conditions and contain low-resolution objects that are frequently occluded by trees or taller buildings. In this domain in particular, spatial contextual information could be of great help.

We introduce context as a complementary way of recognizing objects. We propose a dual-stream deep convolutional neural network that processes information along two independent pathways, one for local and another for global visual reasoning. Our model learns to combine local object appearance with information from the larger scene in a complementary way, such that together they form a powerful classifier.

We test our dual-stream network on the task of segmentation of buildings and roads in aerial images. While our local-global model could also be useful in general recognition tasks, we clearly demonstrate the effectiveness of visual context in conjunction with deep nets for aerial image understanding.

Main Contributions

We study the relative importance of local appearance versus the larger scene, as well as their performance in combination.
We propose a novel dual-stream deep CNN architecture with two independent processing pathways, one for local and the other for global image interpretation. With it we demonstrate the importance of the larger visual scene context, since current techniques in aerial imagery focus only on local object appearance.
We show in our experiments that the two pathways learn to process information complementarily in order to obtain an improved output.
We also demonstrate experimentally the relevance of contextual information for semantic segmentation in aerial images and show superior performance to current state-of-the-art methods on the publicly available Massachusetts Buildings Dataset.

Preliminary Work and Intuition

A: Local appearance is often not sufficiently informative for segmentation in low-resolution aerial images. The larger context provides vital information even for highly localized tasks such as fine object segmentation. The exact shape of the house in the example on the right is better perceived when looking at the larger residential area, which contains other houses of similar shapes and orientations. Thus, local structure could be better interpreted in the context of the larger scene.

B: Our model for residential area detection has poor localization but a low false positive rate within the larger neighbourhood. It can be effectively combined, in a simple classification tree, with the local semantic segmentation model, which has higher localization accuracy but a relatively high false positive rate. Our intuition is that, when the two architectures are combined, the residential area detector will filter out the houses hallucinated by the local model while the shape of each building is still maintained.
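The filtering intuition in B can be sketched as a two-level decision rule. This is only an illustration under stated assumptions: the function name `combine_maps`, the per-pixel probability maps and both 0.5 thresholds are hypothetical, and the classification tree actually used in the experiments may be structured differently.

```python
def combine_maps(local_prob, area_prob, area_threshold=0.5, local_threshold=0.5):
    """Toy depth-two decision rule over two per-pixel probability maps.

    local_prob: building probability from the local segmentation model.
    area_prob:  residential-area probability from the coarse area detector.
    Both are lists of rows with identical shape. Thresholds are assumed values.
    """
    mask = []
    for local_row, area_row in zip(local_prob, area_prob):
        row = []
        for p_local, p_area in zip(local_row, area_row):
            # First split: is the pixel inside a residential area at all?
            if p_area < area_threshold:
                row.append(0)  # outside: treat local detections as hallucinations
            else:
                # Second split: rely on the local model for the precise shape.
                row.append(1 if p_local >= local_threshold else 0)
        mask.append(row)
    return mask


local_prob = [[0.9, 0.9], [0.2, 0.8]]
area_prob = [[0.9, 0.1], [0.9, 0.9]]
print(combine_maps(local_prob, area_prob))  # prints [[1, 0], [0, 1]]
```

In this toy case the confident local detection in the top-right pixel is discarded because the area detector places it outside any residential neighbourhood.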

A Dual Local-Global CNN for Semantic Segmentation

Our proposed dual-stream, local-global architecture, LG-Seg Net, is built by modifying and joining two state-of-the-art deep neural networks: VGG-Net [1], used here for local image interpretation (L-Seg), and AlexNet [2], used here for global interpretation of the contextual scene (G-Seg). Note that the L-Seg network is deeper but narrower, with smaller filter sizes (and a smaller input in our case), and is thus better suited for more detailed local processing. The G-Seg network, which is shallower (fewer layers) but wider (larger input and filters), takes more information into consideration at once and is thus more appropriate for global processing of larger areas. The two pathways are joined in the final fully-connected layers, which combine information about object and context into a unified and balanced higher-level image interpretation.
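The joining step can be illustrated with a toy late-fusion layer. This is a minimal sketch, not the LG-Seg implementation: the feature vectors, the weights, the single fully-connected layer and the sigmoid output are all assumptions, chosen only to show how concatenated local and global features feed one joint decision.

```python
import math


def fully_connected(inputs, weights, biases):
    """One dense layer: each output is a weighted sum of all inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(weight_row, inputs)) + b
        for weight_row, b in zip(weights, biases)
    ]


def fuse_pathways(local_features, global_features, weights, biases):
    """Concatenate both pathway outputs and map them to per-unit probabilities."""
    joint = list(local_features) + list(global_features)
    return [1.0 / (1.0 + math.exp(-z)) for z in fully_connected(joint, weights, biases)]


# Hypothetical values: two local features and one global "not residential" feature.
weights = [[2.0, 0.0, -3.0]]  # the negative weight lets global context inhibit
biases = [0.0]
in_neighbourhood = fuse_pathways([1.0, 0.5], [0.0], weights, biases)
in_empty_field = fuse_pathways([1.0, 0.5], [1.0], weights, biases)
print(in_neighbourhood[0] > 0.5, in_empty_field[0] < 0.5)  # prints True True
```

The same local evidence yields a positive decision inside a neighbourhood and a negative one in an empty field, because the fused layer sees both feature sets at once.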

Results

We perform experiments on building and road segmentation on three datasets from different regions of the world: USA, Western Europe and Romania. These datasets vary greatly in terms of quality and content.

Detection of Massachusetts Buildings

Qualitative building detection results on the Massachusetts Dataset. Note the high level of regularity of buildings and roads, which look very similar to each other. This allows our model to learn almost perfectly and match human performance.

Quantitative results on this dataset are provided below:

Detection of European Buildings

Qualitative comparison between the local L-Seg, global G-Seg and local-global LG-Seg architectures. LG-Seg performs best. By reasoning over a larger area, LG-Seg is able to remove false positive responses. Note that LG-Seg is also able to produce more accurate building shapes.

Mean F-measure results of our models on the European Buildings Dataset:

LG-Seg is superior, with over 1.5% improvement in F-measure, on average, over L-Seg. The improvement is especially significant in regions of low residential density, where the local model tends to hallucinate buildings. G-Seg does poorly by itself, as it cannot capture fine segmentation details, but it becomes valuable as a scene processing pathway within the LG-Seg framework. By reasoning over a larger area, LG-Seg is able to remove false positives and to produce more accurate building shapes. We stress that the qualitative difference between the local-global approach and the single deep net is clearly visible on the output map in non-residential areas, where the single net hallucinates houses. As these structures are very small, the false positives change the average F-measure only slightly. Thus the 1.5-2% quality difference is significant in aerial imagery, where the positive structures are relatively small.
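To make the numerical argument concrete, the sketch below computes a plain pixel-wise F-measure (the harmonic mean of precision and recall) on a toy tile. The function name, the masks and the use of an unrelaxed pixel-wise score are assumptions for illustration; the evaluation protocol behind the reported numbers may differ.

```python
def f_measure(pred, truth):
    """Pixel-wise F-measure between two binary masks given as lists of rows."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1  # building correctly detected
            elif p and not t:
                fp += 1  # hallucinated building pixel
            elif t and not p:
                fn += 1  # missed building pixel
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy tile: 100 building pixels in the top half, empty ground in the bottom half.
truth = [[1] * 10 for _ in range(10)] + [[0] * 10 for _ in range(10)]
clean = [row[:] for row in truth]
hallucinated = [row[:] for row in truth]
hallucinated[15][3] = 1  # a tiny false "house" in the empty half
hallucinated[15][4] = 1

print(f_measure(clean, truth), round(f_measure(hallucinated, truth), 4))
```

Here the spurious two-pixel structure lowers the score by only about one percentage point (from 1.0 to 200/202), even though it is an obvious error on the output map.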

Detection of Romanian Roads

This dataset offers a different task, that of road detection, and also a much more challenging one due to limitations and variations in the data. Unlike the other image sets, this one is of significantly lower quality, with large variations in road structure, type, width and length. Moreover, the roads are often completely occluded by trees, and the ground truth road maps do not correctly match what is seen in the image. For these reasons, the recognition problem on this dataset is extremely difficult and pushes deep learning to its limits, as reflected by the significantly lower performance.

Counting Romanian Houses

An obvious application of building detection, useful in areas such as real estate and cadastral mapping, urban planning and landscape monitoring, is the detection and counting of houses within a given area.
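One simple way to count houses from a segmentation output is to count connected regions in the binary building mask. The sketch below assumes exactly that; the counting procedure used in the experiments is not described here, and the function name, the 4-connectivity and the `min_area` filter are hypothetical choices.

```python
from collections import deque


def count_houses(mask, min_area=1):
    """Count 4-connected foreground regions with at least min_area pixels."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over one region, measuring its area.
            area = 0
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                count += 1
    return count


mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
]
print(count_houses(mask), count_houses(mask, min_area=2))  # prints 3 2
```

Raising `min_area` discards isolated pixels, which is one way to keep small false positives from inflating the count. Touching buildings would still merge into a single region under this scheme.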

Local-Global Complementarity

We designed a set of experiments in order to better understand the role of each subnet. After training the full LG-Seg model, we performed the following: first, we ran the model over the test images, providing the local pathway with the correct image input but giving a blank image to the global pathway. The blank image was filled with the mean of the original input image, computed separately for each RGB channel. Then we performed the opposite experiment and switched the inputs, giving the original image to the global subnet and blank images to the local one. The idea was to see how, in the fully trained model, each path contributes to the final decision.
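The input preparation for these experiments can be sketched as follows. This is a simplified illustration: the function names are hypothetical, images are plain nested lists of RGB triples, and both pathways are given the same-sized image here, whereas in the actual model the local and global inputs differ in size.

```python
def mean_color_image(image):
    """Return a same-sized image filled with the per-channel mean of the input."""
    height, width = len(image), len(image[0])
    pixel_count = height * width
    means = [
        sum(pixel[channel] for row in image for pixel in row) / pixel_count
        for channel in range(3)  # R, G and B are averaged separately
    ]
    return [[list(means) for _ in range(width)] for _ in range(height)]


def ablation_inputs(image):
    """Build the three input pairs: local only, global only, and both pathways."""
    blank = mean_color_image(image)
    return {
        "local_only": {"local": image, "global": blank},
        "global_only": {"local": blank, "global": image},
        "full": {"local": image, "global": image},
    }


image = [[[0, 0, 0], [2, 4, 6]]]  # a 1 x 2 toy RGB image
blank = mean_color_image(image)
inputs = ablation_inputs(image)
print(blank)  # prints [[[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]]
```

Using the mean colour rather than zeros keeps the blank input close to the intensity range the network saw during training.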

In these experiments we aim to find out what the two pathways have learned. The second column shows results when only the global pathway receives the real image, while the other is given a blank image as input. The third column shows the opposite case, when only the local pathway is given real information. The fourth column presents the output of the network running normally, with both pathways receiving image input. Note that the global subnet learns to detect residential areas, similarly to our initial classifier for such regions. Moreover, the residential area segmentation produced by LG-Seg is superior to the one produced by our initial residential area detector, even though LG-Seg was never asked to learn about residential areas. The local pathway, on the other hand, focuses only on small, detailed structures. The imbalance between the energy levels of the outputs is due to the fact that one of the inputs is blank, which unbalances the way energy flows at the highest fully-connected layers. The results also suggest that the two pathways have roles of both reinforcement and inhibition. For example, the local pathway will inhibit the global positive outputs in spaces between buildings, whereas the global pathway will inhibit the local hallucinations in areas of low residential density. We can safely conclude that the two pathways are complementary.

References

[1] - Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

[2] - Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.

[3] - Mnih, Volodymyr. Machine learning for aerial image labeling. Diss. University of Toronto, 2013.

[4] - Saito, Shunta, and Yoshimitsu Aoki. "Building and road detection from large aerial imagery." SPIE/IS&T Electronic Imaging. International Society for Optics and Photonics, 2015.

Paper

More details about our paper can be found here.

Code

Our code and models will soon be publicly available.

Datasets

We provide the links to the datasets used in our experiments. Each dataset is divided into train, valid and test sets and contains the RGB satellite images, along with their corresponding pixel-wise ground truth maps.

European Buildings Dataset | Romanian Buildings Dataset | Romanian Roads Dataset | Massachusetts Buildings Dataset

Cite

Marcu, Alina, and Leordeanu, Marius. "Object Contra Context: Dual Local-Global Semantic Segmentation in Aerial Images." AAAI Workshops (2017): n. pag. Web. 3 Dec. 2018

@article{marcu2016dual,
  title={Dual local-global contextual pathways for recognition in aerial imagery},
  author={Marcu, Alina and Leordeanu, Marius},
  journal={arXiv preprint arXiv:1605.05462},
  year={2016}