Iterative Knowledge Exchange Between Deep Learning and Space-Time Spectral Clustering for Unsupervised Segmentation in Video

Abstract

We propose a dual system for unsupervised object segmentation in video, which brings together two modules with complementary properties: a space-time graph that discovers objects in videos and a deep network that learns powerful object features. The system uses an iterative knowledge exchange policy. A novel spectral space-time clustering process on the graph produces unsupervised segmentation masks, which are passed to the network as pseudo-labels. The net learns to segment in single frames what the graph discovers in video and passes back to the graph strong image-level features that improve its node-level features in the next iteration. Knowledge is exchanged for several cycles until convergence. The graph has one node per video pixel, yet object discovery is fast: it uses a novel power iteration algorithm that computes the main space-time cluster as the principal eigenvector of a special Feature-Motion matrix without actually computing the matrix. A thorough experimental analysis validates our theoretical claims and demonstrates the effectiveness of the cyclical knowledge exchange. We also perform experiments in the supervised scenario, incorporating features pretrained with human supervision. We achieve state-of-the-art results in both unsupervised and supervised scenarios on four challenging datasets: DAVIS, SegTrack, YouTube-Objects, and DAVSOD.

Papers

E. Haller, A.M. Florea and M. Leordeanu. Iterative Knowledge Exchange Between Deep Learning and Space-Time Spectral Clustering for Unsupervised Segmentation in Video. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 44, No. 11, November 2022 [paper]

Our preliminary work is included in: E. Haller, A.M. Florea and M. Leordeanu. Spacetime Graph Optimization for Video Object Segmentation. arXiv 2019 [paper]

Overview of the full Iterative Knowledge Exchange (IKE) system, with the graph and the deep network modules that supervise each other over several cycles until reaching an equilibrium in the video segmentation output:

We propose a dual Iterative Knowledge Exchange (IKE) model coupling space-time spectral clustering with deep object segmentation, able to learn without any human annotation. The graph module exploits the spatio-temporal consistencies inherent in a video sequence but has no access to deep features. The deep segmentation module exploits high-level image representations but requires a supervisory signal. The two components are complementary and together form an efficient self-supervised system able to discover and learn to segment the primary object of a video sequence. The space-time graph is the first to play the teacher role by discovering objects unsupervised. Next, the deep model is trained from scratch for each video sequence, using the graph's output as pseudo-ground truth. At the second cycle, the graph module incorporates knowledge from the segmentation model, adding the powerful deep learned features that correspond to each of its nodes. The process in which the graph and the deep net exchange knowledge in this manner repeats over several cycles until convergence is reached. Below, we present the architecture of the proposed system, highlighting its two main modules: Graph Module and Network Module.
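The exchange cycle described above can be sketched on a toy problem. This is a minimal illustration under strong assumptions, not the released implementation: the graph module is replaced by a power iteration on a stand-in matrix F F^T built from per-node features, and the deep network is replaced by a one-dimensional least-squares regressor. All names (`graph_discover`, `train_net`, `run_ike`, `make_toy_video`) and the toy data are hypothetical.

```python
import math
import random


def graph_discover(node_feats, num_iters=50):
    """Stand-in graph module: power iteration on F F^T, computed matrix-free."""
    n, dim = len(node_feats), len(node_feats[0])
    x = [1.0 / math.sqrt(n)] * n
    for _ in range(num_iters):
        # y = F^T x (length dim), then x = F y (length n)
        y = [sum(f[k] * xi for f, xi in zip(node_feats, x)) for k in range(dim)]
        x = [sum(fk * yk for fk, yk in zip(f, y)) for f in node_feats]
        norm = math.sqrt(sum(v * v for v in x))
        x = [v / norm for v in x]
    top = max(x)
    return [v / top for v in x]  # soft mask scaled to [0, 1]


def train_net(inputs, pseudo_labels):
    """Stand-in network module: 1-D least-squares fit to the pseudo-labels."""
    n = len(inputs)
    mean_a = sum(inputs) / n
    mean_y = sum(pseudo_labels) / n
    var_a = sum((a - mean_a) ** 2 for a in inputs)
    cov = sum((a - mean_a) * (y - mean_y) for a, y in zip(inputs, pseudo_labels))
    w = cov / var_a
    b = mean_y - w * mean_a
    # Predictions are clipped so they can be reused as non-negative features.
    return lambda a: min(1.0, max(0.0, w * a + b))


def run_ike(graph_cues, appearance, max_cycles=10, tol=1e-6):
    """Alternate graph and net steps until the graph mask stops changing."""
    feats = [[g] for g in graph_cues]  # first cycle: no learned features yet
    prev_mask, net_out = None, None
    for cycle in range(1, max_cycles + 1):
        mask = graph_discover(feats)           # graph acts as teacher
        predict = train_net(appearance, mask)  # net is refit from scratch
        net_out = [predict(a) for a in appearance]
        # Learned per-node features flow back into the graph for the next cycle.
        feats = [[g, p] for g, p in zip(graph_cues, net_out)]
        if prev_mask is not None:
            change = max(abs(m - q) for m, q in zip(mask, prev_mask))
            if change < tol:
                break
        prev_mask = mask
    return mask, net_out, cycle


def make_toy_video(num_nodes=40, seed=0):
    """Hypothetical data: the first half of the nodes belongs to the object."""
    rng = random.Random(seed)
    is_object = [i < num_nodes // 2 for i in range(num_nodes)]
    graph_cues = [(0.8 if o else 0.2) + 0.2 * rng.random() for o in is_object]
    appearance = [(0.7 if o else 0.3) + 0.2 * rng.random() for o in is_object]
    return graph_cues, appearance, is_object


if __name__ == "__main__":
    cues, app, _ = make_toy_video()
    mask, net_out, cycles = run_ike(cues, app)
    print("cycles run:", cycles)
```

The only structural point the sketch is meant to convey is the direction of information flow: pseudo-labels go from the graph to the net, and learned per-node features go from the net back to the graph.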

Qualitative Results

Results of our Unsupervised Iterative Knowledge Exchange system

Convergence and stability of Graph Module - Unsupervised setup

The Graph Module is the core of our solution, so we begin by illustrating its convergence process during the first cycle. We present results considering different initializations for node labels (random, uniform, Gaussian or ground truth), experimentally proving that irrespective of the initialization, the method converges towards the same solution entirely defined by the Feature-Motion matrix (quantitative results of this experiment are presented in Fig. 6 of our original manuscript). This property of being able to converge to a stable solution, which naturally captures the main object in the scene, regardless of the initialization, is a key strength of our approach. It is entirely due to our unique Feature-Motion matrix formulation, which elegantly couples appearance and motion, offering the main object as its principal eigenvector in the space-time graph.
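The initialization-independence property can be reproduced on a small stand-in problem. The sketch below is not the Feature-Motion matrix from the paper: it assumes a hypothetical non-negative matrix of the form F F^T, where each row of F is a node feature vector, and applies it to a vector as F (F^T x) so the N x N matrix is never built. The function names and the toy features are illustrative assumptions.

```python
import math
import random


def normalize(x):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def matvec_free(features, x):
    """Compute (F F^T) x as F (F^T x) without building the N x N matrix."""
    dim = len(features[0])
    # y = F^T x, a vector of length dim
    y = [sum(f[k] * xi for f, xi in zip(features, x)) for k in range(dim)]
    # result = F y, a vector of length N
    return [sum(fk * yk for fk, yk in zip(f, y)) for f in features]


def power_iteration(features, x0, num_iters=100):
    """Approximate the principal eigenvector of F F^T starting from x0."""
    x = normalize(x0)
    for _ in range(num_iters):
        x = normalize(matvec_free(features, x))
    return x


def make_toy_features(num_nodes=60, dim=4, seed=0):
    """Hypothetical features: the first third of the nodes forms the object."""
    rng = random.Random(seed)
    feats = []
    for i in range(num_nodes):
        base = 1.0 if i < num_nodes // 3 else 0.2
        feats.append([base + 0.05 * rng.random() for _ in range(dim)])
    return feats


if __name__ == "__main__":
    feats = make_toy_features()
    n = len(feats)
    rng = random.Random(1)
    inits = {
        "uniform": [1.0] * n,
        "random": [rng.random() for _ in range(n)],
        "ground truth": [1.0 if i < n // 3 else 0.0 for i in range(n)],
    }
    solutions = {name: power_iteration(feats, x0) for name, x0 in inits.items()}
    ref = solutions["uniform"]
    for name, sol in solutions.items():
        gap = max(abs(a - b) for a, b in zip(ref, sol))
        print(name, "max difference to uniform init:", gap)
```

The starting vectors here are non-negative and non-zero, so none of them is orthogonal to the (positive) principal eigenvector of this toy matrix. Each matrix-vector product costs on the order of N times the feature dimension, rather than N squared.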

Convergence of IKE system - Unsupervised setup

We present qualitative results of the evolution over cycles of both Graph Module and Network Module in the unsupervised setup (no pretraining using humanly annotated data). The given examples highlight the complementarity of the two modules and their agreement at convergence. Both modules help each other from one iteration to the next. Once again, we believe our method has two main advantages over previous ones: 1) the mathematical formulation and solution in the Graph Module is stable, converges, and does not depend on initialization (given a certain Feature-Motion matrix); the object is naturally discovered as the principal eigenvector of the Feature-Motion matrix. 2) The Feature-Motion matrix depends on the quality of features; we provide a way for learning such powerful features, in a self-supervised manner, by using the Net module. Therefore, the novelty of the Graph-Net dual system is in its ability to learn by itself more powerful features, which are then used by the Graph at the next cycle.
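The first advantage rests on the standard power iteration argument, which can be written as follows. The notation is illustrative and not taken from the paper: M stands in for the Feature-Motion matrix and is assumed symmetric with a unique largest eigenvalue \lambda_1, the v_i are its orthonormal eigenvectors, and x^{(k)} holds the node labels after k steps.

```latex
x^{(k+1)} = \frac{M x^{(k)}}{\lVert M x^{(k)} \rVert_2},
\qquad
x^{(0)} = \sum_i c_i v_i, \quad c_1 \neq 0
\;\Longrightarrow\;
x^{(k)} \propto v_1 + \sum_{i \ge 2} \frac{c_i}{c_1}
\left( \frac{\lambda_i}{\lambda_1} \right)^{k} v_i
\;\longrightarrow\; v_1 .
```

Under these assumptions the starting labels only affect the coefficients c_i, which are damped geometrically, so the limit is determined by M alone. A starting vector fails only in the degenerate case where it is exactly orthogonal to v_1.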

Unsupervised vs. Supervised Setups

To highlight the power of our formulation, we provide a comparative analysis between our unsupervised IKE and IKE using pretrained features from an FCN backbone. We also present the segmentation maps generated by the FCN backbone to illustrate their limitations and highlight the advantages of our solution. Our self-supervised learned features, which follow the natural cluster, can outperform heavily pretrained features.

In the first subsequence, containing an unusual bird (the Vogelkop superb bird-of-paradise), and in the subsequence with a parachute jumper, we observe that the supervised backbone is unable to detect the object, while our solution, both unsupervised and supervised, successfully handles the situation.
In sequences containing multiple objects, like the subsequence with breakdance or the subsequence with a camel, the supervised backbone extracts the masks of all the people/camels. With this prior, our supervised formulation is unable to remove the additional objects. We highlight that the unsupervised formulation works well in these scenarios and is not distracted by other objects.

Spatio-Temporal consistency of IKE - Unsupervised Setup

Qualitative comparison of our unsupervised solution with two state-of-the-art unsupervised methods (ELM and FST) highlighting our solution's spatio-temporal cluster consistency.

Our soft-segmentation masks are cleaner, containing fewer background pixels than other solutions, as a result of exploiting spatio-temporal consistencies (e.g., parkour and horseman subsequences).
The segmentations of our IKE are smoother and more complete as a result of using the deep representation, which is more locally consistent.

Spatio-Temporal consistency of IKE - Supervised Setup

Qualitative comparison of our supervised solution with two state-of-the-art supervised methods (3DC-Seg and MATNet), highlighting our solution's spatio-temporal consistency. In the supervised formulation, IKE receives the features of 3DC-Seg; thus, we can observe its power with respect to the considered baseline.

We observe that our solution can clean up some artifacts that are otherwise present in the 3DC-Seg output (e.g., scooter subsequence).
Similar to the unsupervised setup, IKE results are generally smoother and contain fewer background pixels.

Failure cases: breaking the main assumptions of IKE - Unsupervised Setup

The development of our system starts from three main assumptions regarding the object of interest: 1) has a different motion pattern than the background scene; 2) has a different appearance, statistically, than the scene; 3) is the main element of the scene in the sense that it indeed forms the strongest space-time cluster in the scene. When these assumptions do not hold in practice, our system fails to correctly identify the object considered by human annotators as the object of interest. In this video, we illustrate such scenarios.

A. IKE assumes the existence of a main object, forming the main cluster of the Feature-Motion matrix. We illustrate scenarios in which IKE focuses on multiple objects that share common motion and appearance patterns but are not all considered main objects by the human annotators (e.g., the canal or the robot subsequences). We do not consider this a true failure case, as our solution naturally finds those objects as the main cluster of the Feature-Motion matrix, although multiple strongly connected objects form this cluster. In cluttered, inconclusive scenes, the object selected is not necessarily the one indicated by the annotators (e.g., the carnival subsequence).
B. IKE assumes that the main object has a distinctive appearance and motion pattern. If this is not the case, it will be unable to identify the main object. In the provided examples, the main objects are similar to the background area in both motion and appearance. In both subsequences, the motion is induced by camera motion.
C. In IKE, the main object is the main cluster of the space-time graph, the one that stands out. In the case of static sequences, the motion contrast between object and background is missing (e.g., cow subsequence, where the object is almost static and far away from the camera). Also, very short sequences do not contain enough temporal change for a correct spatio-temporal clustering/separation from the background (e.g., subsequences with boat and cat).

Quantitative Results

Below, we present the performance evolution of the full Iterative Knowledge Exchange model. We follow the evolution of both the Graph and Network Modules over several cycles. The Graph runs for an additional cycle to benefit from the best representation of the Network. The three settings are: a) the unsupervised setup; b) the unsupervised setup without the Network Module; c) the supervised setup over a DeepLabv3 backbone. Even though the Network Module usually outperforms the Graph during the first cycle, the Graph Module exceeds the Network at convergence. The two modules' complementarity becomes evident when we consider the case without the Network (row b), which shows a large performance drop compared to a). Even when starting from strong supervised features (row c), IKE still brings a significant performance boost.

Final performance of our unsupervised system, compared with top three unsupervised methods, on different datasets.

Final performance of our supervised system, compared with top three supervised methods, on different datasets.

Team

Emanuela Haller

PhD Student & Machine Learning Researcher

University Politehnica of Bucharest

Bitdefender

Emanuela Haller received her Bachelor's degree in Computer Science from University Politehnica of Bucharest and her Master's degree in Artificial Intelligence from the same institution. She has a strong background in mathematics and general computer science and is currently focusing on fundamental research. Her current work is directed towards unsupervised video sequence analysis, focusing on the zero-shot video object segmentation task. She is currently a Ph.D. student at the University Politehnica of Bucharest and part of the Theoretical Research team at Bitdefender.

Prof. Dr. Adina Magda Florea

University Politehnica of Bucharest

Adina Magda Florea is Professor at the Department of Computer Science of University Politehnica of Bucharest and Head of the Artificial Intelligence and Multi-Agent Systems Laboratory (https://aimas.cs.pub.ro/). Her research interests are in multi-agent systems, machine learning, ambient intelligence, social robots and human-robot interaction. She is Senior Member of IEEE, Senior Member of ACM, and President of the Romanian Association for Artificial Intelligence.

Prof. Dr. Marius Leordeanu

University Politehnica of Bucharest

Institute of Mathematics of the Romanian Academy

Bitdefender

Marius Leordeanu is Associate Professor at the University Politehnica of Bucharest (UPB) and Senior Researcher at the Institute of Mathematics of the Romanian Academy (IMAR). Marius obtained his Bachelor's in Mathematics and Computer Science at Hunter College, City University of New York (2003) and PhD in Robotics at Carnegie Mellon University (2009). At UPB he introduced the graduate courses on computer vision and robotics and at IMAR he organizes an advanced computer vision reading group with weekly meetings. His current research spans different areas in vision and learning, with focus on unsupervised learning, the space-time domain, drones and aerial scene understanding, optimization on graphs and neural nets and relating vision and language. In 2020 Marius published a book, Unsupervised Learning in Space and Time (Springer), which pushes his research towards developing a more general model for unsupervised learning in space and time. For his work on unsupervised learning for graph matching, Marius received the "Grigore Moisil Prize" (2014), the top award at the intersection of Mathematics and Computer Science, given by the Romanian Academy.

Code

https://github.com/emanuelahaller/IKE

If you intend to use our work, please cite the following:

@article{Haller2021IKE,
  title={Iterative Knowledge Exchange Between Deep Learning and Space-Time Spectral Clustering for Unsupervised Segmentation in Videos},
  author={Haller, Emanuela and Florea, Adina Magda and Leordeanu, Marius},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2021},
  publisher={IEEE}
}

@article{Haller2019SpacetimeGO,
  title={Spacetime Graph Optimization for Video Object Segmentation},
  author={Emanuela Haller and A. Florea and M. Leordeanu},
  journal={ArXiv},
  year={2019},
  volume={abs/1907.03326}
}

Acknowledgements

This work is funded in part by UEFISCDI, under Projects EEARO-2018-0496 and PN-III-P1-1.2-PCCDI-2017-0734.