A Multi-Stage Multi-Task NN for Aerial Scene Interpretation and Localization

Team members

Alina Marcu, Dragos Costea

Supervisors

Prof. dr. Marius Leordeanu

Prof. dr. Emil Slusanschi


Idea

A multi-stage multi-task neural network that handles segmentation and localization at the same time, in a single forward pass.

Architecture

Stage 1 is designed for semantic segmentation

  • our network predicts pixelwise class labels
  • roads can be used as a unique footprint of an urban area
    • we train MSMT-Stage-1 for road detection

Stage 2 provides a precise location using two branches

  • one branch uses a regression network
  • the other is used to predict a location map trained as a segmentation task


LocDecoder-R-2 predicts the location as two real-valued numbers, longitude and latitude.

LocDecoder-S-128 predicts a localization map of size 128x128 on the whole area of possible locations. White pixels denote probable locations of the input image.
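As a rough illustration of how a 128x128 localization map could be turned back into coordinates, the sketch below takes the intensity-weighted centroid of the bright pixels and maps it linearly onto a geographic bounding box. The function name, the threshold, the bounding-box argument and the "row 0 is north" convention are assumptions made for this example; they are not taken from the paper.

```python
def decode_location_map(loc_map, bounds, threshold=0.5):
    """Return (lon, lat) of the weighted centroid of bright pixels, or None.

    loc_map:   list of rows with values in [0, 1] (e.g. 128 rows of 128 values).
    bounds:    (lon_min, lat_min, lon_max, lat_max) of the area covered by the
               map. This layout is an assumption for the example.
    threshold: pixels below this value are ignored (hypothetical default).
    """
    lon_min, lat_min, lon_max, lat_max = bounds
    height = len(loc_map)
    width = len(loc_map[0])

    total = row_sum = col_sum = 0.0
    for r, row in enumerate(loc_map):
        for c, value in enumerate(row):
            if value >= threshold:
                total += value
                # Use pixel centers, hence the +0.5.
                row_sum += value * (r + 0.5)
                col_sum += value * (c + 0.5)

    if total == 0.0:
        return None  # no pixel passed the threshold

    x = col_sum / total / width    # fraction of the way from west to east
    y = row_sum / total / height   # fraction of the way from north to south
    lon = lon_min + x * (lon_max - lon_min)
    lat = lat_max - y * (lat_max - lat_min)  # row 0 assumed to be north
    return lon, lat
```

A centroid is only one possible read-out; taking the single brightest pixel would be an equally plausible choice when the map has several separate blobs.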


Localization dataset

We collected 9531 images of 512x512 pixels, each randomly chosen within a 100 m x 100 m area around a road intersection. In total they cover a European urban area of around 70 square kilometers.
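The sampling step can be sketched as follows: pick a uniform random offset inside a 100 m square and convert it to a longitude/latitude shift. This sketch assumes the square is centered on the intersection and uses a simple equirectangular approximation; the function name and the meters-per-degree constant are illustrative and not part of the original pipeline.

```python
import math
import random

# Approximate length of one degree of latitude, in meters (assumed constant).
METERS_PER_DEG_LAT = 111_320.0


def sample_center(intersection_lon, intersection_lat, side_m=100.0, rng=random):
    """Sample an image center uniformly in a side_m x side_m square.

    The square is assumed to be centered on the intersection. `rng` can be a
    random.Random instance to make the sampling reproducible.
    """
    half = side_m / 2.0
    dx = rng.uniform(-half, half)  # east-west offset in meters
    dy = rng.uniform(-half, half)  # north-south offset in meters

    # Equirectangular approximation: fine for offsets of a few tens of meters.
    dlat = dy / METERS_PER_DEG_LAT
    dlon = dx / (METERS_PER_DEG_LAT * math.cos(math.radians(intersection_lat)))
    return intersection_lon + dlon, intersection_lat + dlat
```

Passing a seeded `random.Random` as `rng` gives the same sample centers on every run, which helps when a train/test split has to be regenerated.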

Each grey disk in the figure depicts a region of 500 meters radius around the training (blue centers) and testing (red centers) data.

Localization and alignment

The localization network predicts a single point. We then extract the roads at that location and match them against the roads from OpenStreetMap (OSM). Aligning the two sets of roads yields an offset from the OSM roads, which gives the final location, as shown below.

Unfortunately, the OSM roads never match the real ground truth perfectly, which affects localization performance.
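The page does not say how the extracted roads are matched against OSM. As one simple possibility, the sketch below does a brute-force search over small integer pixel shifts and keeps the shift with the largest overlap between the two road masks. The function name, the set-of-pixels representation and the search radius are all assumptions for illustration, not the authors' method.

```python
def best_offset(predicted, reference, max_shift=5):
    """Find the pixel shift that best aligns two road masks.

    predicted, reference: sets of (row, col) road pixels.
    Returns ((d_row, d_col), overlap), where adding the shift to every
    predicted pixel maximizes the number of pixels shared with reference.
    Ties are resolved in favor of the first shift encountered.
    """
    best_shift = (0, 0)
    best_overlap = -1
    for d_row in range(-max_shift, max_shift + 1):
        for d_col in range(-max_shift, max_shift + 1):
            shifted = {(r + d_row, c + d_col) for r, c in predicted}
            overlap = len(shifted & reference)
            if overlap > best_overlap:
                best_overlap = overlap
                best_shift = (d_row, d_col)
    return best_shift, best_overlap
```

Multiplying the returned shift by the ground resolution of one pixel would give the offset in meters. Because the score is plain overlap, any mismatch between OSM and the true road positions carries over directly into the estimated offset.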

Segmentation accuracy

For the task of semantic segmentation, we report state-of-the-art results on the publicly available Inria dataset, using our MSMT-Stage-1 network.

Qualitative results shown below:

Figure columns, left to right: RGB input image, MSMT-Stage-1 prediction, ground truth.

Quantitative results shown below:

Localization accuracy

  • 96.84% of the test locations have an error of less than 20 m without alignment
  • 94.56% of the test locations are within 2.5 m of the ground truth location
    • 97.58% are within 5 m, which roughly matches the accuracy of a commercial GPS
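Percentages like the ones above can be computed from per-image distance errors. The sketch below is a minimal example, assuming predictions and ground truth are given as (longitude, latitude) pairs in degrees; it uses the haversine formula with a mean Earth radius, and both helper names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, an approximation


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def fraction_within(errors_m, threshold_m):
    """Fraction of errors that are at most threshold_m meters."""
    if not errors_m:
        raise ValueError("errors_m must not be empty")
    return sum(1 for e in errors_m if e <= threshold_m) / len(errors_m)
```

Given lists `predicted` and `truth` of (lon, lat) pairs, `errors = [haversine_m(*p, *t) for p, t in zip(predicted, truth)]` followed by `fraction_within(errors, 5.0)` gives the share of test locations within 5 m.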

Detailed error comparison depicted below

Links

ArXiv paper:

Full paper

Cite

Marcu, Alina, et al. "A Multi-Stage Multi-Task Neural Network for Aerial Scene Interpretation and Geolocalization." arXiv preprint arXiv:1804.01322 (2018).

@article{marcu2018multi,
  title={A Multi-Stage Multi-Task Neural Network for Aerial Scene Interpretation and Geolocalization},
  author={Marcu, Alina and Costea, Dragos and Slusanschi, Emil and Leordeanu, Marius},
  journal={arXiv preprint arXiv:1804.01322},
  year={2018}
}