Semi-Supervised Learning for Multi-Task Scene Understanding by Neural Graph Consensus

Prof. Dr. Marius Leordeanu

Supervisor, Institute of Mathematics of the Romanian Academy & University Politehnica of Bucharest

Mihai Pirvu

PhD Candidate, Institute of Mathematics of the Romanian Academy

Dragos Costea

PhD Candidate, University Politehnica of Bucharest

Alina Marcu

PhD Candidate, Institute of Mathematics of the Romanian Academy

Prof. Dr. Emil Slusanschi

Supervisor, University Politehnica of Bucharest

Rahul Sukthankar

Distinguished Scientist/Director at Google Research

Accepted at the 35th AAAI Conference on Artificial Intelligence (AAAI 2021)

AAAI 2021 Paper (coming soon)

Abstract

We address the challenging problem of semi-supervised learning in the context of multiple visual interpretations of the world by finding consensus in a graph of neural networks. Each graph node is a scene interpretation layer, while each edge is a deep net that transforms one layer at one node into another from a different node. During the supervised phase edge networks are trained independently. During the next unsupervised stage edge nets are trained on the pseudo-ground truth provided by consensus among multiple paths that reach the nets' start and end nodes. These paths act as ensemble teachers for any given edge and strong consensus is used for high-confidence supervisory signal. The unsupervised learning process is repeated over several generations, in which each edge becomes a "student" and also part of different ensemble "teachers" for training other students. By optimizing such consensus between different paths, the graph reaches consistency and robustness over multiple interpretations and generations, in the face of unknown labels. We give theoretical justifications of the proposed idea and validate it on a large dataset. We show how prediction of different representations such as depth, semantic segmentation, surface normals and pose from RGB input could be effectively learned through self-supervised consensus in our graph. We also compare to state-of-the-art methods for multi-task and semi-supervised learning and show superior performance.

Idea

We propose the Neural Graph Consensus (NGC) model, a multi-task graph of deep neural networks, which approaches one of the most difficult problems in AI: unsupervised learning of multiple scene interpretations. We create a graph (or, more generally, a hypergraph) of deep networks, such that each node in the graph represents a different interpretation layer of the world (e.g. the semantic segmentation layer, depth layer, or motion layer). Each edge (or hyperedge) has a dedicated deep network, which we term an EdgeNet, that predicts the layer at one node from the layer, or layers, at one or several other nodes. The deep nets are trained in turn, using as supervision the consensual output of all paths reaching the same output node (when such consensus takes place). Since we do not have strict, well-defined "worlds" as in reinforcement learning (RL), we let the natural structure in the data emerge, as in graph clustering. Similar to RL, in which agents in heavily constrained environments with well-defined goals evolve by playing against each other, we put pressure on the deep nets by training each one against the others, in a complementary fashion. Thus, each net brings something new, while also learning to agree with the other paths that reach the same output.
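The graph structure described above can be sketched minimally as follows. This is not the authors' code; the names (`EdgeNet`, `NGCGraph`, the example layer names, and the placeholder networks) are illustrative assumptions, showing only how nodes map to interpretation layers and how (hyper)edges carry dedicated networks.

```python
# Minimal sketch of an NGC-style hypergraph: nodes are scene-interpretation
# layers, edges are "EdgeNets" that map one (or several) input layers to an
# output layer. All names are illustrative, not the authors' implementation.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class EdgeNet:
    inputs: Tuple[str, ...]    # input layer names, e.g. ("rgb",)
    output: str                # output layer name, e.g. "depth"
    net: Callable              # the deep network dedicated to this (hyper)edge

@dataclass
class NGCGraph:
    nodes: List[str] = field(default_factory=list)
    edges: List[EdgeNet] = field(default_factory=list)

    def edges_into(self, node: str) -> List[EdgeNet]:
        """All (hyper)edges ending at `node` -- together, the paths through
        them act as the ensemble of teachers for that node."""
        return [e for e in self.edges if e.output == node]

# Example topology: RGB feeds depth, segmentation and normals; depth also
# feeds normals, so "normals" has two candidate teacher edges.
g = NGCGraph(nodes=["rgb", "depth", "semantic", "normals"])
g.edges.append(EdgeNet(("rgb",), "depth", net=lambda x: x))
g.edges.append(EdgeNet(("rgb",), "semantic", net=lambda x: x))
g.edges.append(EdgeNet(("depth",), "normals", net=lambda x: x))
g.edges.append(EdgeNet(("rgb",), "normals", net=lambda x: x))

print(len(g.edges_into("normals")))  # two edges reach the "normals" node
```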

Unsupervised learning with NGC

We expect to have access to limited labeled data, enough for separately pretraining a good part of our nets. Once we connect them into an NGC graph, we can start the unsupervised learning phase. Let us consider unsupervised training of the net associated with the hyperedge having input clique (Lj, Lk) and output node Lq (in red below). There could be many prediction paths (in green) going from other layers and sensors that reach Lq. The idea is that we let prediction information flow through the green paths and use their consensual outputs as supervisory signal for the red network. In our experiments we take the median output as consensus in the case of regression, and majority voting in the case of classification. Note that all involved layers that do not come from sensors (e.g. Li, Lj, Lk) are established by the same consensus strategy as Lq. Unsupervised learning proceeds in turns: each net becomes a student of the NGC (hyper)graph and is trained by the mutual consensus coming from the contextual pathways that reach the same output node. NGC thus becomes a democratic self-supervised system in which agreement becomes the ultimate teacher in the face of unlabeled data.
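The consensus step above can be sketched in a few lines, assuming each candidate path has already produced a per-pixel prediction for the output node. Median consensus (regression) and majority voting (classification) follow the text; the agreement-based confidence mask shown here is an illustrative assumption about how "strong consensus" could be selected, not the authors' exact criterion.

```python
# Sketch of NGC-style consensus pseudo-labels from multiple prediction paths.
import numpy as np

def consensus_regression(path_preds: np.ndarray) -> np.ndarray:
    """path_preds: (n_paths, H, W) predictions -> (H, W) median pseudo-label."""
    return np.median(path_preds, axis=0)

def consensus_classification(path_preds: np.ndarray, n_classes: int) -> np.ndarray:
    """path_preds: (n_paths, H, W) integer labels -> (H, W) majority vote."""
    one_hot = np.eye(n_classes, dtype=np.int64)[path_preds]  # (n_paths, H, W, C)
    return one_hot.sum(axis=0).argmax(axis=-1)

def agreement_mask(path_preds: np.ndarray, tol: float = 0.1) -> np.ndarray:
    """Keep only pixels where the paths strongly agree (low spread);
    the tolerance threshold is a hypothetical choice."""
    return path_preds.std(axis=0) < tol

# Three paths predicting depth for a 2x2 image:
preds = np.array([[[1.0, 2.0], [3.0, 4.0]],
                  [[1.1, 2.0], [3.0, 9.0]],
                  [[0.9, 2.1], [3.1, 4.1]]])
pseudo = consensus_regression(preds)   # per-pixel medians
mask = agreement_mask(preds, tol=0.5)  # pixel (1,1) disagrees -> masked out
```

Only the pixels selected by the mask would then contribute to the loss of the "student" EdgeNet, so the teacher signal stays high-confidence.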

Dataset

To test the NGC approach on many scene representations, we capture a large dataset in a virtual environment (which we will publicly release), in which a drone flies above a city and learns to predict, from a single image, the scene depth, the 3D surface normals (in both the world and camera reference frames), its absolute location and orientation (6D pose), the scene wireframe (object boundaries), as well as the semantic segmentation of the scene into 12 classes: building, fence, pedestrian, pole, road line, road, sidewalk, vegetation, vehicle, wall, traffic sign, other (figure below). The dataset is divided into four parts: a supervised training set (subdivided into 8k images for training and 2k for validation), two test sets (10k images each, for unsupervised learning iterations 1 and 2) and a separate evaluation set (10k images, never seen during learning).

Quantitative results

Qualitative results

The plots show performance improvement (green) vs performance degradation (red) on five tasks. We compare the original predictions with two graph iterations and their distilled EdgeNets.

Useful Links

Demo (video)

While our method processes single frames independently, we present the results in the form of a continuous video in order to better assess the quality of the approach in the temporal domain. Note that no temporal information was used by our NGC implementation: each frame is processed independently of the others.

The camera follows a part of the evaluation set trajectory -- approximately 20% of the total images. This continuous flight path was never seen during training or semi-supervised learning. We provide snippets of five different tasks: depth, semantic segmentation, surface normals with reference to the camera (C), surface normals with reference to the world (W) and wireframe.


Cite

If you intend to use our work, please cite the following:

@article{leordeanu2020semi,
  title={Semi-Supervised Learning for Multi-Task Scene Understanding by Neural Graph Consensus},
  author={Leordeanu, Marius and Pirvu, Mihai and Costea, Dragos and Marcu, Alina and Slusanschi, Emil and Sukthankar, Rahul},
  journal={arXiv preprint arXiv:2010.01086},
  year={2020}
}

Acknowledgements

This work was supported by UEFISCDI, under Projects EEA-RO-2018-0496 and PN-III-P1-1.2-PCCDI-2017-0734.

We would like to express our gratitude to Aurelian Marcu and The Center for Advanced Laser Technologies (CETAL) for providing us with GPU training resources.