HyperVision
Unsupervised Visual Learning by Intelligent Equilibrium in
Hypergraphs of Neural Networks
What is HyperVision?
Learning with minimal human supervision has enormous practical advantages and strong potential impact across many sectors of human activity. While deep networks have produced great advances in all corners of AI, they remain strongly dependent on supervised learning. The more classical graph models are much better suited to the unsupervised case: most unsupervised tasks are deeply rooted in clustering, which can naturally be formulated as graph partitioning. The HyperVision project aims to bring together the powers of graphical models and deep networks, in a natural combination, to tackle the unsupervised learning task. In a nutshell, our three main objectives are:
Objective 1: unsupervised discovery and segmentation of primary objects in video by considering their consistency, in terms of movement and appearance in space and time. Develop powerful unsupervised space-time graph clustering algorithms and enhance them with the power of deep neural nets.
Objective 2: move towards unsupervised learning of full scene semantic segmentation in combination with other tasks such as the prediction of depth, 3D structure, motion and pose.
Objective 3: develop the final HyperVision model based on self-supervised learning by intelligent equilibrium in a multi-task hypergraph of neural nets.
Key idea behind HyperVision
HyperVision puts together different interpretation layers of the scene, such as 3D structure, motion, and the semantic segmentation of objects and activities in space and time, into a unified hypergraph, in which a single hyper-edge, modeled by a deep net, is trained without supervision on the consensus output of the other neural pathways reaching the same output layer. The system will reach a state of intelligent equilibrium and become able to learn from unlabeled data.
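The consensus mechanism described above can be illustrated with a minimal sketch. Here several "pathways" predict the same output layer, their per-element median acts as a pseudo-label, and a student edge is trained toward it. The median rule, the toy dimensions and all variable names are illustrative assumptions, not the project's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.random((4, 4))                      # unknown ground truth (never used for training)

# Three pathways reaching the same output layer, each with different noise levels.
pathways = [target + rng.normal(0, s, target.shape) for s in (0.05, 0.1, 0.2)]

# Consensus output across pathways: a robust ensemble teacher.
consensus = np.median(np.stack(pathways), axis=0)

# Self-supervised loss for the student edge: regress onto the consensus.
student_pred = rng.random(target.shape)          # untrained student output
loss = np.mean((student_pred - consensus) ** 2)

# The consensus tends to be closer to the truth than an average pathway,
# which is what lets each new student generation improve without labels.
err_consensus = np.abs(consensus - target).mean()
err_pathways = np.mean([np.abs(p - target).mean() for p in pathways])
```

In practice each pathway would itself be a chain of neural nets through the hypergraph; the sketch only shows why agreement among independent pathways can serve as a supervisory signal.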
Results of the HyperVision Project
Realized Objective 1: Unsupervised discovery and segmentation of primary objects in video by considering their consistency, in terms of movement and appearance, in space and time; develop powerful unsupervised space-time graph clustering algorithms and enhance them with the power of deep neural nets. In this final scientific report, we present our novel approach to the first objective, published in TPAMI (Haller et al., TPAMI 2022). The TPAMI article is a substantial piece of work, with many theoretical results and extensive experimental validation, and we consider that it fully covers the goals of the first objective. Other results we have recently published (Marcu et al., ICCV 2023; Pirvu et al., ICCV 2023) also address unsupervised learning, not from the perspective of discovering a single primary object in video, but from the larger perspective of semantic segmentation over many classes, together with other tasks (e.g., depth and surface normal estimation). Thus, these works (Marcu et al., ICCV 2023; Pirvu et al., ICCV 2023) relate more closely to the goals set for Objective 2.
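The space-time graph clustering idea behind Objective 1 can be sketched as follows: treat space-time regions as graph nodes, connect them by motion-and-appearance affinity, and read a soft foreground segmentation off the principal eigenvector of the affinity matrix. This is only a toy illustration under synthetic affinities, not the published algorithm; the block structure of the matrix and the threshold rule are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30                                   # space-time nodes (e.g., region tracklets)
labels = np.array([1] * 10 + [0] * 20)   # hidden: first 10 nodes form the object

# Toy affinity: high within the object cluster, low elsewhere, plus noise.
W = 0.1 * rng.random((n, n))
W[np.ix_(labels == 1, labels == 1)] += 1.0
W = (W + W.T) / 2                        # symmetric pairwise affinities
np.fill_diagonal(W, 0)

# Power iteration: the leading eigenvector concentrates on the strongest
# cluster, giving each node a soft "objectness" score.
x = np.ones(n)
for _ in range(100):
    x = W @ x
    x /= np.linalg.norm(x)

segmentation = x > x.mean()              # threshold into object / background
```

In this toy setting, thresholding the eigenvector recovers the hidden object nodes; real video requires defining the affinities from motion and appearance cues, which is where the deep nets come in.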
Realized Objective 2: Move towards unsupervised learning of full scene segmentation and understanding in combination with other tasks such as the prediction of depth, 3D structure, motion and pose. Regarding the task of semantic segmentation with minimal human annotation, we have already published, as part of the project, two papers (Leordeanu et al., AAAI 2021 and Haller et al., BMVC 2021), in which we present how the second objective can be achieved with two different novel versions of our self-supervised multi-task hypergraph consensus concept. In these papers we also show that the same key idea is effective for learning, with minimal supervision, to predict depth, motion and 3D pose. Moreover, in two recent papers (Pirvu et al., CVPR 2021 and Licaret et al., ICRA 2022), we propose effective methods for unsupervised learning of monocular metric depth estimation, which run in near real-time and can also be deployed on embedded devices, such as Unmanned Aerial Vehicles (UAVs).
Our latest published papers (Bicsi et al., ICCV 2023, oral presentation; Marcu et al., ICCV 2023; Pirvu et al., ICCV 2023) complete our full and final Self-Supervised Multi-Hypergraph HyperVision model in different contexts, such as observing the world from the air (using UAVs), from satellites (using data from NASA), and from general videos drawn from multiple activity-recognition datasets.
Realized Objective 3: Develop the final HyperVision model based on self-supervised learning by intelligent equilibrium in a multi-task hypergraph of neural nets. Regarding the last and most general objective of the HyperVision project, our latest published papers (Bicsi et al., ICCV 2023, oral presentation; Marcu et al., ICCV 2023; Pirvu et al., ICCV 2023) constitute the development of the full multi-task hypergraph, with unsupervised learning capabilities using consensus among multiple pathways as teachers for the next generation of single student networks. We have tested this general model in several contexts: full scene understanding from drones (Marcu et al., ICCV 2023), for which we introduced DroneScapes, a novel dataset covering many different scenes from Romania and one from Norway; better understanding the Earth's climate and observations (Pirvu et al., ICCV 2023) from many layers (over 20) of satellite data collected by NASA; and applying the self-supervision consensus across multiple datasets for activity recognition in videos (Bicsi et al., ICCV 2023).
These final results build on our previous models and works published in earlier years of the HyperVision project (Leordeanu et al., AAAI 2021 and Haller et al., BMVC 2021), which were the fundamental steps necessary for ultimately achieving our final self-supervised hypergraph model. Additionally, we have proposed a novel Graph Neural Network (GNN) model, recurrent in space and time (Duta et al., NeurIPS 2021), in which we show how the nodes of the graph learn by themselves, unsupervised, to attach to different salient entities in the video, while the only supervision signal during learning is the semantic class of the video.
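A recurrent space-time GNN of the kind mentioned above can be sketched with a single message-passing update applied repeatedly across video frames: each node aggregates its neighbors' features and mixes them with its own state. The update rule, weight shapes and graph here are illustrative assumptions, not the published model's architecture.

```python
import numpy as np

rng = np.random.default_rng(2)
num_nodes, dim = 6, 8
A = (rng.random((num_nodes, num_nodes)) < 0.4).astype(float)  # random adjacency
np.fill_diagonal(A, 0)
deg = np.maximum(A.sum(1, keepdims=True), 1)                  # avoid division by zero

W_self = rng.normal(0, 0.1, (dim, dim))                       # self-transform weights
W_msg = rng.normal(0, 0.1, (dim, dim))                        # message-transform weights

def gnn_step(h, A):
    """One recurrent update: mean-aggregate neighbor features, then mix with self state."""
    msg = (A @ h) / deg
    return np.tanh(h @ W_self + msg @ W_msg)

h = rng.normal(size=(num_nodes, dim))   # node states at the first frame
for _t in range(3):                     # unroll the same update over three frames
    h = gnn_step(h, A)
```

Sharing the same weights across time steps is what makes the update recurrent; in the actual model the nodes would additionally learn where in the frame to attach.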
Final remarks: Our final three publications at ICCV 2023, along with the other works published in prestigious journals (TPAMI 2022) and top international conferences (AAAI 2021, BMVC 2021, NeurIPS 2021, CVPR 2021, ICRA 2022), demonstrate that all the main objectives of the project have been successfully met. One line of research worth exploring further is along the temporal dimension, starting from the results demonstrated in the TPAMI 2022 journal paper, where the time dimension (which brings coherence in space over time) is a crucial component of fully unsupervised object discovery. Our latest paper (Masala et al., ICCV 2023) moves towards the vision-and-language domain, for which the current HyperVision project is a strong basis. Our next step will be to use state-of-the-art Large Language Models together with our own multi-task graph and hypergraph approaches to better connect the fields of vision and language understanding.
Scientific impact of the project results
The impact of the HyperVision project in the medium and long term is expected to be strong, at a top international level, given the number of scientific papers published at this level and the presentations given at summer schools, conferences and workshops we organized. We therefore see the project's impact along two main directions: publication of journal articles and conference papers, and dissemination through presentations at top international venues:
Impact through Publications
The scientific efforts during the three years of the HyperVision project resulted in 10 publications: 1 article in a high-impact journal (TPAMI 2022) and 9 papers in top-level conferences and their workshops (AAAI 2021, BMVC 2021, CVPR Workshops 2021, NeurIPS 2021, ICRA 2022, and 3 papers in ICCV Workshops 2023). Please note that all these venues are at the very top of their fields: the workshop series are considered Rank A publications, while the main conferences are Rank A*.
Main result: Given its prestige and very high impact factor, the best single result of the HyperVision project is probably the TPAMI 2022 article (Haller, Florea and Leordeanu, TPAMI 2022). However, considering the HyperVision model as a whole, a model unique in the literature with its self-supervised multi-task hypergraph structure, equally important is the group of papers (Leordeanu et al., AAAI 2021; Haller et al., BMVC 2021; and the three workshop papers Marcu et al., ICCV 2023, Pirvu et al., ICCV 2023 and Bicsi et al., ICCV 2023), which represent the evolution of the HyperVision idea from the initial concept to the full, more applied model, in various real-world applications.
Impact through Dissemination
We disseminated the results of the project in presentations at international and national venues, which received very strong, positive feedback from the artificial intelligence community. We believe these will have a long-lasting effect by drawing attention to our work directly, through face-to-face meetings and presentations, beyond the standard communication through published papers. The following dissemination activities had a particularly high impact:
1) Organization of workshops at prestigious conferences: the Embedded Vision Workshop (EVW) in 2021, 2022 and 2023, in conjunction with the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR, Rank A*), where the PI Marius Leordeanu was co-chair during all three years, for the full duration of the project. As part of the prestigious EVW workshop, which has a tradition of 19 editions, we discussed and presented the ideas and results of the HyperVision project extensively.
(Workshop webpage: https://embeddedvisionworkshop.wordpress.com/)
2) Organization of the autumn school Machine Learning and Vision for Industrial Applications (MALVIC), Kristiansand, Norway, 2021, where the project PI gave a two-hour lecture on the concepts and results of the research directly related to the project and its publications.
3) Oral presentation of the paper (Bicsi, Alexe, Ionescu and Leordeanu, ICCV 2023) at the International Conference on Computer Vision 2023, as part of the Representation Learning with Very Limited Images (LIMIT) Workshop in Paris, by the PI Dr. Marius Leordeanu; the presentation received very positive feedback, with an interesting Q&A session (https://lsfsl.net/limit23/).
4) Oral and poster presentations by the PI Dr. Marius Leordeanu and the doctoral students involved in the research that is directly related to HyperVision, at the Romanian AI Days 2022 (in Brasov), a prestigious event at the national level that brings together prominent researchers from Romania and abroad (https://days.airomania.eu/).
5) Oral presentation and co-organization by the PI Dr. Marius Leordeanu of the Trust-AI Exploratory Workshop at the Smart Diaspora 2023 conference in Timisoara, a prestigious national-level event that brought together Romanian researchers from all over the world.
HyperVision Team
Principal Investigator
Prof. Dr. Marius Leordeanu
leordeanu@gmail.com and marius.leordeanu@upb.ro
Phone: +40746 033 711
Professor of Computer Science at Politehnica University of Bucharest
Senior Researcher at Institute of Mathematics of the Romanian Academy
Doctoral Students and Junior Researchers
Alina Marcu
Alexandra Budisteanu
budisteanu.alexandra@yahoo.com
Victor Robu
Research Report for 2023
Research Report for 2022
Research Report for 2021
Funding
This research work is funded by UEFISCDI under Project Code PN-III-P4-ID-PCE-2020-2819.
The project is implemented at the Institute of Mathematics "Simion Stoilow" of the Romanian Academy.