Selected Research Projects

From the Romanian Research Laboratory to the Global Industry

Presenting our work, together with Fordaq-Romania, at the Romanian Parliament, in front of distinguished members of the government (Ministry of Research, Ministry of Education, Ministry of Defense) and academia (top representatives of the most prestigious universities in the country), on developing the first applications in the world for detecting and measuring lumber in real time (TallyExpress) and for determining the quality and defects of timber (Neural Grader), based on our novel patented artificial intelligence algorithms.

A short introductory demo of TallyExpress, which currently has over 150 clients in the United States, is presented on the left.

Published US Patents

Leordeanu, Marius, Vlad Licaret, Tudor Buzu, Iulia Muntianu, and Cătălin Mutu. "Automatic detection, counting, and measurement of lumber boards using a handheld device." U.S. Patent 10,586,321, 2020.


Leordeanu, Marius, Iulia-Adriana Muntianu, Dragos Cristian Costea, and Cătălin Mutu. "Automatic detection, counting, and measurement of logs using a handheld device." U.S. Patent 11,189,022, 2021.


Leordeanu, Marius, Alina Elena Marcu, Iulia Muntianu, and Cătălin Mutu. "Automatic detection, counting, and measurement of lumber boards using a handheld device." U.S. Patent 11,216,905, 2022.


Research and Development Project (awarded through competition):


Neural Grader - an automatic system for semantic analysis and grading of wood in images, using efficient computer vision methods and deep convolutional neural networks. Project funded through European Funds, POC/524/2/2 (1.2 million euros).


We address the challenging problem of semi-supervised learning in the context of multiple visual interpretations of the world by finding consensus in a graph of neural networks. Each graph node is a scene interpretation layer, while each edge is a deep net that transforms one layer at one node into another from a different node. During the supervised phase, edge networks are trained independently. During the next, unsupervised stage, edge nets are trained on the pseudo-ground truth provided by consensus among multiple paths that reach the nets' start and end nodes. These paths act as ensemble teachers for any given edge, and strong consensus is used as a high-confidence supervisory signal. The unsupervised learning process is repeated over several generations, in which each edge becomes a "student" and also part of different ensemble "teachers" for training other students. By optimizing such consensus between different paths, the graph reaches consistency and robustness over multiple interpretations and generations, in the face of unknown labels. We give theoretical justifications of the proposed idea and validate it on a large dataset. We show how predictions of different representations, such as depth, semantic segmentation, surface normals and pose from RGB input, can be effectively learned through self-supervised consensus in our graph. We also compare to state-of-the-art methods for multi-task and semi-supervised learning and show superior performance.
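To make the consensus idea more concrete, here is a minimal Python sketch of how pseudo-labels could be extracted from an ensemble of paths that end in the same representation. The function name, the stand-in "paths" and the agreement threshold are our own illustrative assumptions, not the actual Neural Graph Consensus implementation.

```python
# A minimal, illustrative sketch of the consensus idea (not the actual NGC code).
# Edge "nets" are stand-in callables; names and the agreement threshold are assumptions.
import numpy as np

def consensus_pseudo_labels(paths, x, agreement_thresh=0.05):
    """Run several paths that all end in the same representation and keep only
    the predictions where the ensemble agrees (low variance across paths)."""
    outputs = np.stack([path(x) for path in paths])   # (num_paths, ...) predictions
    mean_pred = outputs.mean(axis=0)                  # ensemble "teacher" estimate
    disagreement = outputs.std(axis=0)                # per-element spread across paths
    confident = disagreement < agreement_thresh       # strong-consensus mask
    return mean_pred, confident

# Toy usage: three fake paths predicting a 4x4 "depth" map from an RGB-like input.
rng = np.random.default_rng(0)
x = rng.random((4, 4, 3))
paths = [lambda im, b=b: im.mean(axis=-1) + b for b in (0.0, 0.01, 0.02)]
pseudo, mask = consensus_pseudo_labels(paths, x)
# An edge net would then be trained only on pseudo[mask] as self-supervised targets.
```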

Papers

Marius Leordeanu, Mihai Pirvu, Dragos Costea, Alina Marcu, Emil Slusanschi, Rahul Sukthankar, Semi-Supervised Learning for Multi-Task Scene Understanding by Neural Graph Consensus, AAAI Conference on Artificial Intelligence (AAAI), 2021

For more information and results please visit our Project Website.


At the top we present the convergence of the spectral space-time clustering in the graph module at the first iteration. At the bottom we present some final results of the graph and the network modules, after several cycles of self-supervised training, in comparison to the ground truth.



We propose a dual system for unsupervised object segmentation in video, which brings together two modules with complementary properties: a space-time graph that discovers objects in videos and a deep network that learns powerful object features. The system uses an iterative knowledge exchange policy. A novel spectral space-time clustering process on the graph produces unsupervised segmentation masks passed to the network as pseudo-labels. The net learns to segment in single frames what the graph discovers in video and passes back to the graph strong image-level features that improve its node-level features in the next iteration. Knowledge is exchanged for several cycles until convergence. The graph has one node per each video pixel, but the object discovery is fast. It uses a novel power iteration algorithm computing the main space-time cluster as the principal eigenvector of a special Feature-Motion matrix without actually computing the matrix. The thorough experimental analysis validates our theoretical claims and proves the effectiveness of the cyclical knowledge exchange. We also perform experiments on the supervised scenario, incorporating features pretrained with human supervision. We achieve state-of-the-art level on unsupervised and supervised scenarios on four challenging datasets: DAVIS, SegTrack, YouTube-Objects, and DAVSOD.
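As a rough illustration of finding the principal eigenvector without materializing the pixel-level affinity matrix, the sketch below runs power iteration with an implicit matrix-vector product. Here we simply assume the affinity factors as M = F Fᵀ for a pixel-feature matrix F; the paper's Feature-Motion matrix has its own specific structure.

```python
# A small sketch of power iteration with an *implicit* matrix: assuming the
# pixel-affinity matrix factors as M = F @ F.T, the principal eigenvector is found
# without ever building the n x n matrix. Illustrative only.
import numpy as np

def implicit_power_iteration(F, num_iters=50):
    n = F.shape[0]
    v = np.ones(n) / np.sqrt(n)
    for _ in range(num_iters):
        v = F @ (F.T @ v)         # M @ v computed as two thin matrix-vector products
        v /= np.linalg.norm(v)    # renormalize to avoid overflow
    return v                      # approximates the principal eigenvector of M

# Toy check on random per-pixel features (1000 "pixels", 16-dim features).
rng = np.random.default_rng(1)
F = rng.random((1000, 16))
saliency = implicit_power_iteration(F)   # would be reshaped back to the video grid
```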

Please see our demos on the left. For more information, please visit our Project Website.

Papers

Haller, Emanuela, Adina Magda Florea, and Marius Leordeanu. "Iterative Knowledge Exchange Between Deep Learning and Space-Time Spectral Clustering for Unsupervised Segmentation in Videos." arXiv preprint arXiv:2012.07123 (2020).

A hierarchical approach for vision to language generation

Automatically describing videos in natural language is an ambitious problem, which could bridge our understanding of vision and language. We propose a hierarchical approach, by first generating video descriptions as sequences of simple sentences, followed at the next level by a more complex and fluent description in natural language. While the simple sentences describe simple actions in the form of (subject, verb, object), the second-level paragraph descriptions, indirectly using information from the first-level description, present the visual content in a more compact, coherent and semantically rich manner.

To this end, we introduce the first video dataset in the literature that is annotated with captions at two levels of linguistic complexity. We perform extensive tests that demonstrate that our hierarchical linguistic representation, from simple to complex language, allows us to train a two-stage network that is able to generate significantly more complex paragraphs than current one-stage approaches.

Publications:

Vlad Bogolin, Ioana Croitoru and Marius Leordeanu, A hierarchical approach to automatic vision to language generation: from simple sentences to complex natural language, International Conference on Computational Linguistics (COLING), 2020

Semi-supervised Semantic Segmentation of Aerial Videos

Semantic segmentation is a crucial task for robot navigation and safety. However, current supervised methods require a large amount of pixel-wise annotations to yield accurate results. Labeling is a tedious and time-consuming process that has hampered progress in low-altitude UAV applications. This paper makes an important step towards automatic annotation by introducing SegProp, a novel iterative flow-based method, with a direct connection to spectral clustering in space and time, to propagate the semantic labels to frames that lack human annotations. The labels are further used in semi-supervised learning scenarios. Motivated by the lack of a large video aerial dataset, we also introduce Ruralscapes, a new dataset with high resolution (4K) images and manually-annotated dense labels every 50 frames - the largest of its kind, to the best of our knowledge.

Our novel SegProp automatically annotates the remaining unlabeled 98% of frames with an accuracy exceeding 90% (F-measure), significantly outperforming other state-of-the-art label propagation methods. Moreover, when integrating other methods as modules inside SegProp's iterative label propagation loop, we achieve a significant boost over the baseline labels. Finally, we test SegProp in a full semi-supervised setting: we train several state-of-the-art deep neural networks on the SegProp-automatically-labeled training frames and test them on completely novel videos. We convincingly demonstrate, every time, a significant improvement over the supervised scenario.
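The core building block, propagating labels from an annotated frame to a nearby unlabeled frame along dense optical flow, can be sketched in a few lines. This is a simplified, hedged version of the idea (single direction, nearest-neighbour warping); the full SegProp pipeline iterates and combines propagations from both temporal directions.

```python
# A hedged sketch of flow-based label propagation (simplified; not the full SegProp loop).
import numpy as np
from scipy.ndimage import map_coordinates

def propagate_labels(labels_src, flow_dst_to_src):
    """Warp integer labels from an annotated frame to an unlabeled frame.
    flow_dst_to_src[y, x] = (dy, dx) points from the target pixel back to the source."""
    h, w = labels_src.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = yy + flow_dst_to_src[..., 0]
    src_x = xx + flow_dst_to_src[..., 1]
    # Nearest-neighbour sampling keeps labels categorical.
    warped = map_coordinates(labels_src.astype(float), [src_y, src_x],
                             order=0, mode="nearest")
    return warped.astype(labels_src.dtype)

# Toy usage: shift a tiny 2-class mask one pixel to the right.
labels = np.zeros((4, 4), dtype=np.int64); labels[:, 0] = 1
flow = np.zeros((4, 4, 2)); flow[..., 1] = -1.0   # target looks one pixel to the left
print(propagate_labels(labels, flow))
```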

More information can be found on the project website here.

Publications:

Alina Marcu, Vlad Licaret, Dragos Costea and Marius Leordeanu, Semantics through Time: Semi-supervised Segmentation of Aerial Videos with Iterative Label Propagation, Asian Conference on Computer Vision (ACCV) 2020

Spectral Object Segmentation in Space and Time with Fast 3D Convolutions

We formulate object segmentation in video as a spectral graph clustering problem in space and time, in which nodes are pixels and their relations form local neighbourhoods. We claim that the strongest cluster in this pixel-level graph represents the salient object segmentation. We compute the main cluster using a novel and fast 3D filtering technique that finds the spectral clustering solution, namely the principal eigenvector of the graph's adjacency matrix, without building the matrix explicitly - which would be intractable. Our method is based on the power iteration, which we prove is equivalent to performing a specific set of 3D convolutions in the space-time feature volume. This allows us to avoid creating the matrix and enables a fast parallel implementation on GPU. We show that our method is much faster than the classical power iteration applied directly on the adjacency matrix. Different from other works, ours is dedicated to preserving object consistency in space and time at the level of pixels. In experiments, we obtain consistent improvement over the top state-of-the-art methods on the DAVIS-2016 dataset. We also achieve top results on the well-known SegTrackv2 dataset.
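The "power iteration as 3D filtering" intuition can be illustrated with a toy example: if pairwise affinities factor into per-pixel feature terms weighted by a Gaussian in space-time, then each matrix-vector product reduces to a 3D Gaussian filtering step. The affinity below is our own simplified stand-in, not the paper's exact formulation.

```python
# A simplified sketch: if M_ij = gaussian(distance_ij) * f_i * f_j, then M @ v is
# just f * GaussianFilter3D(f * v), so the n x n matrix is never built.
import numpy as np
from scipy.ndimage import gaussian_filter

def spectral_cluster_3d(features, sigma=2.0, num_iters=30):
    """features: (T, H, W) per-pixel foreground scores; returns the main space-time cluster."""
    v = np.ones_like(features)
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = features * gaussian_filter(features * v, sigma=sigma)  # one implicit M @ v
        v /= np.linalg.norm(v)
    return v   # principal eigenvector, shaped as a soft space-time segmentation

video_scores = np.random.default_rng(2).random((8, 32, 32))
mask = spectral_cluster_3d(video_scores)
```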

Publications:

Elena Burceanu and Marius Leordeanu, A 3D Convolutional Approach to Spectral Object Segmentation in Space and Time, International Joint Conference on Artificial Intelligence (IJCAI), 2020


Improving automatic visual recognition with EEG signals

Classifying visual information is an apparently simple and effortless task in our everyday routine, but can we automatically predict what we see from signals emitted by the brain?

While other researchers have already attempted to answer this question, we are the first to show that a commercially available BCI could be effectively used for visual image classification in real-world scenarios -- when testing takes place at a completely different time than training data collection.

The task is difficult, as it requires relating the noisy and low-level EEG signals to complex and highly semantic visual categories. In this paper, we propose different learning approaches and show that simpler classifiers, such as Ridge Regression with Gabor filtering of the input EEG signal, can be more effective than powerful Long Short-Term Memory networks and Convolutional Neural Networks in this case of limited and noisy training data. We analyzed the importance of each electrode for the visual classification task and noticed that the sensors with the highest accuracy were the ones that recorded brain activity from regions known to be correlated more with higher-level recognition and cognitive processes and less with lower-level visual signal processing. This result is also in accordance with research in computer vision with deep neural networks, which shows that semantic visual features are learned only at higher levels of neural depth.

While EEG signals are weaker by themselves for the task of visual classification, we demonstrate that they could be powerful when combined with deep visual features extracted from the image, improving performance from 91% to over 97% in a multi-class recognition setting. Our tests show that EEG input brings additional information that is not learned by artificial deep networks on the given image training set. Thus, a commercially available BCI could be effectively used in conjunction with a deep learning based vision system to form together a stronger visual recognition system that is suitable for real-world applications.
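A rough sketch of the Gabor-plus-Ridge baseline mentioned above is given below: each EEG channel is convolved with a small bank of 1D Gabor filters and the response energies feed a linear ridge classifier. The filter parameters, window lengths and random data are placeholder assumptions, not the actual settings used in the paper.

```python
# A minimal sketch of a "Gabor filtering + Ridge" pipeline for EEG classification.
# Filter parameters and the toy data are assumptions, not the paper's configuration.
import numpy as np
from sklearn.linear_model import RidgeClassifier

def gabor_1d(freq, sigma, length=64, fs=128.0):
    t = (np.arange(length) - length // 2) / fs
    return np.exp(-t**2 / (2 * sigma**2)) * np.cos(2 * np.pi * freq * t)

def gabor_features(eeg, freqs=(4, 8, 13, 30), sigma=0.05):
    """eeg: (channels, samples) -> flat vector of filter-response energies."""
    feats = []
    for ch in eeg:
        for f in freqs:
            resp = np.convolve(ch, gabor_1d(f, sigma), mode="same")
            feats.append(np.sqrt(np.mean(resp**2)))   # RMS energy per (channel, band)
    return np.array(feats)

rng = np.random.default_rng(3)
X = np.array([gabor_features(rng.standard_normal((14, 256))) for _ in range(40)])
y = rng.integers(0, 2, size=40)          # toy binary visual-category labels
clf = RidgeClassifier(alpha=1.0).fit(X, y)
```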

Publications:

Nicolae Cudlenco, Nirvana Popescu and Marius Leordeanu, Reading into the mind's eye: boosting automatic visual recognition with EEG signals, Neurocomputing 2019

Learning from Synthetic Data to Detect Vital Signs in Videos

Automatically detecting vital signs in videos, such as the estimation of heart and respiration rates, is a challenging research problem in computer vision with important applications in the medical field. One of the key difficulties in tackling this task is the lack of sufficient supervised training data, which severely limits the use of powerful deep neural networks. In this paper we address this limitation through a novel deep learning approach, in which a recurrent deep neural network is trained to detect vital signs in the infrared thermal domain from purely synthetic data. What is most surprising is that our novel method for synthetic training data generation is general, relatively simple and uses almost no prior medical domain knowledge. Moreover, our system, which is trained in a purely automatic manner and needs no human annotation, also learns to predict the respiration or heart intensity signal for each moment in time and to detect the region of interest that is most relevant for the given task, e.g. the nose area in the case of respiration. We test the effectiveness of our proposed system on the recent LCAS dataset and obtain state-of-the-art results.
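To give a feel for what purely synthetic training targets could look like, the sketch below generates a quasi-periodic "respiration" intensity curve with drifting rate and noise, paired with its rate label. This is our own toy generator, not the paper's actual synthesis procedure.

```python
# A rough sketch of a synthetic respiration-signal generator (illustrative only).
import numpy as np

def synthetic_respiration(duration_s=30.0, fs=30.0, rate_bpm=None, rng=None):
    rng = rng or np.random.default_rng()
    rate_bpm = rate_bpm or rng.uniform(10, 25)                    # breaths per minute
    t = np.arange(0, duration_s, 1.0 / fs)
    phase = 2 * np.pi * (rate_bpm / 60.0) * t
    phase += 0.3 * np.cumsum(rng.standard_normal(t.size)) / fs    # slow rate drift
    signal = np.sin(phase) + 0.1 * rng.standard_normal(t.size)    # additive sensor noise
    return signal, rate_bpm                                       # (intensity curve, label)

sig, bpm = synthetic_respiration()
```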

Publications:

Florin Condrea, Victor-Andrei Ivan and Marius Leordeanu, In Search of Life: Learning from Synthetic Data to Detect Vital Signs in Videos, Computer Vision and Pattern Recognition Workshops (CVPRW), 2020

Machine learning methods for automatic analysis and prediction of COVID-19

Abstract [1]: Analyzing and understanding the transmission and evolution of the COVID-19 pandemic is mandatory in order to design the best social and medical policies, foresee their outcomes and deal with all the subsequent socio-economic effects. We address this important problem from a computational and machine learning perspective. We are, to the best of our knowledge, the first to tackle the task for the case of Romania. More specifically, we want to statistically estimate all the relevant parameters of the spreading of the new coronavirus COVID-19, such as the reproduction number, fatality rate or length of the infectiousness period, based on Romanian patients, and to be able to predict future outcomes. This endeavor is important, since it is well known that these factors vary across the globe and might depend on many causes, including social, medical, age and genetic factors. At the core of our computational approach in [1] lies the recently published, state-of-the-art work of Chowdhury et al. [2020], which proposes an improved version of SEIR, the classic, established model for infectious diseases. We want to infer all the parameters of the model, which govern the evolution of the pandemic in Romania, based on the only reliable, true measurement, which is the number of deaths. The true number of infected people is in reality impossible to know precisely. Once the model parameters are estimated, we are able to predict all the other relevant measures, such as the number of exposed and infectious people and many other factors, as shown in this paper. To this end, we propose a self-supervised approach to train a deep convolutional network to guess the correct set of Modified-SEIR model parameters, given the observed number of daily fatalities. Then, starting from these initial parameters, we refine the solution with a stochastic coordinate descent approach. We compare our deep learning optimization scheme with the classic grid search approach and show great improvement in both computational time and prediction accuracy. We find an optimistic result for the case fatality rate in Romania, which may be around 0.3%, and we also demonstrate that our model is able to correctly predict the number of daily fatalities for up to three weeks into the future (the latest available data at the moment of writing), while staying around the intervals defined by the recent machine learning approach (Gu [2020]) currently used in the United States of America and the predictions from IHME (IHME COVID-19 health service utilization forecasting team [2020]).
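The fit-to-deaths idea can be sketched compactly: simulate a basic SEIR model with a death compartment and score candidate parameters only against observed daily fatalities. The model below and the crude random-search refinement are simplified stand-ins for the paper's Modified-SEIR network and stochastic coordinate descent, with made-up parameter ranges.

```python
# A compact, illustrative SEIRD simulation fitted to observed daily deaths.
import numpy as np

def simulate_seird(beta, sigma, gamma, mu, days, N=19_000_000, E0=100):
    S, E, I, R, D = N - E0, E0, 0.0, 0.0, 0.0
    daily_deaths = []
    for _ in range(days):                          # simple daily Euler steps
        new_inf = beta * S * I / N
        dS, dE = -new_inf, new_inf - sigma * E
        dI = sigma * E - gamma * I - mu * I
        dD = mu * I
        S, E, I, R, D = S + dS, E + dE, I + dI, R + gamma * I, D + dD
        daily_deaths.append(dD)
    return np.array(daily_deaths)

def fit_to_deaths(observed, n_trials=2000, rng=None):
    rng = rng or np.random.default_rng(0)
    best, best_loss = None, np.inf
    for _ in range(n_trials):                      # crude random search over parameters
        p = dict(beta=rng.uniform(0.1, 0.6), sigma=rng.uniform(0.1, 0.5),
                 gamma=rng.uniform(0.05, 0.3), mu=rng.uniform(0.0005, 0.02))
        loss = np.mean((simulate_seird(**p, days=len(observed)) - observed) ** 2)
        if loss < best_loss:
            best, best_loss = p, loss
    return best

# Toy usage: generate "observed" deaths with known parameters, then recover them.
observed = simulate_seird(beta=0.3, sigma=0.2, gamma=0.1, mu=0.005, days=60)
best_params = fit_to_deaths(observed, n_trials=500)
```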

Abstract [2]: We propose a regime separation for the analysis of COVID-19 in Romania, combined with the SIR and SIRD mathematical models. The main regimes we study are the free spread of the virus, the quarantine and partial relaxation, and finally the relaxation regime. The main model we use is SIR, which is a classical model, but because we cannot fully trust the reported numbers of infected or recovered people, we base our analysis on the number of deceased people, which is more reliable. To deal with this, we introduce a simple modification of the SIR model to account for the deceased separately. This, in turn, is our basis for fitting the parameters. The estimation of the parameters is done in two steps. The first one consists of training a neural network based on SIR models to detect the regime changes. Once this is done, we fit the main parameters of the SIRD model using a grid search. Finally, we use the fitted parameters to make predictions about the evolution over a time frame of one month.

Publications:

[1] RD Stochiţoiu, T Rebedea, I Popescu, M Leordeanu, A self-supervised neural-analytic method to predict the evolution of COVID-19 in Romania, arXiv preprint arXiv:2006.12926.

[2] M Petrica, R Stochitoiu, M Leordeanu, I Popescu, A regime switching on Covid19 analysis and prediction in Romania, arXiv preprint arXiv:2007.13494.

Machine Learning and Virtual Reality for Automatic Therapy of Phobias

In [1] we investigate various machine learning classifiers used in our Virtual Reality (VR) system for treating acrophobia. The system automatically estimates fear level based on multimodal sensory data and a self-reported emotion assessment. There are two modalities of expressing fear ratings: the 2-choice scale, where 0 represents relaxation and 1 stands for fear; and the 4-choice scale, with the following correspondence: 0—relaxation, 1—low fear, 2—medium fear and 3—high fear. A set of features was extracted from the sensory signals using various metrics that quantify brain (electroencephalogram—EEG) and physiological linear and non-linear dynamics (Heart Rate—HR and Galvanic Skin Response—GSR). The novelty consists in the automatic adaptation of the exposure scenario according to the subject's affective state. We acquired data from acrophobic subjects who had undergone an in vivo pre-therapy exposure session, followed by a Virtual Reality therapy and an in vivo evaluation procedure. Various machine and deep learning classifiers were implemented and tested, with and without feature selection, in both a user-dependent and user-independent fashion. The results showed a very high cross-validation accuracy on the training set and good test accuracies, ranging from 42.5% to 89.5%. The most important features for fear level classification were GSR, HR and the values of the EEG in the beta frequency range. For determining the next exposure scenario, a dominant role was played by the target fear level, a parameter computed by taking into account the patient's estimated fear level.

Publications:

[1] O Bălan, G Moise, A Moldoveanu, M Leordeanu, F Moldoveanu, An Investigation of Various Machine and Deep Learning Techniques Applied in Automatic Fear Level Detection and Acrophobia Virtual Therapy, Sensors 20 (2), 496, 2020

[2] O Bălan, G Moise, L Petrescu, A Moldoveanu, M Leordeanu, F Moldoveanu, Emotion Classification Based on Biophysical Signals and Machine Learning Techniques, Symmetry 12 (1), 21, 2020

[3] O Bălan, G Moise, A Moldoveanu, F Moldoveanu, M Leordeanu, Classifying the Levels of Fear by Means of Machine Learning Techniques and VR in a Holonic-Based System for Treating Phobias. Experiments and Results, International Conference on Human-Computer Interaction, 357-372.

[4] O Balan, G Moise, A Moldoveanu, F Moldoveanu, M Leordeanu, Automatic Adaptation of Exposure Intensity in VR Acrophobia Therapy, Based on Deep Neural Networks, European Conference on Information Systems, Stockholm, Sweden, 2019

[5] O Bălan, G Moise, A Moldoveanu, M Leordeanu, F Moldoveanu, Fear level classification based on emotional dimensions and machine learning techniques, Sensors 19 (7), 1738, 2019

Curriculum Learning for Generative Adversarial Networks

Despite the significant advances in recent years, Generative Adversarial Networks (GANs) are still notoriously hard to train. In this paper, we propose three novel curriculum learning strategies for training GANs. All strategies are first based on ranking the training images by their difficulty scores, which are estimated by a state-of-the-art image difficulty predictor. Our first strategy is to divide images into gradually more difficult batches. Our second strategy introduces a novel curriculum loss function for the discriminator that takes into account the difficulty scores of the real images. Our third strategy is based on sampling from an evolving distribution, which favors the easier images during the initial training stages and gradually converges to a uniform distribution, in which samples are equally likely, regardless of difficulty. We compare our curriculum learning strategies with the classic training procedure on two tasks: image generation and image translation. Our experiments indicate that all strategies provide faster convergence and superior results. For example, our best curriculum learning strategy applied on spectrally normalized GANs (SNGANs) fooled human annotators into thinking that generated CIFAR-like images are real in 25.0% of the presented cases, while the SNGANs trained using the classic procedure fooled the annotators in only 18.4% of cases. Similarly, in image translation, the human annotators preferred the images produced by the Cycle-consistent GAN (CycleGAN) trained using curriculum learning in 40.5% of cases and those produced by the CycleGAN based on classic training in only 19.8% of cases, with 39.7% of cases labeled as ties.
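The third strategy, sampling from an evolving distribution, can be sketched with a temperature that grows during training: early on, easier images get higher probability, and the distribution flattens toward uniform as training progresses. The schedule and temperature values below are illustrative choices, not the paper's exact ones.

```python
# A small sketch of curriculum sampling that anneals toward a uniform distribution.
import numpy as np

def curriculum_sampling_probs(difficulty, step, total_steps, t_min=0.1, t_max=50.0):
    """difficulty: per-image difficulty scores (higher = harder)."""
    temp = t_min + (t_max - t_min) * (step / total_steps)   # anneal temperature upward
    logits = -np.asarray(difficulty) / temp                  # easy images -> larger logits
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

difficulty = np.random.default_rng(4).random(1000)
early = curriculum_sampling_probs(difficulty, step=0, total_steps=100_000)
late = curriculum_sampling_probs(difficulty, step=100_000, total_steps=100_000)
# `late` is nearly uniform; a batch would be drawn with np.random.choice(1000, 64, p=early).
```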

Publications:

Petru Soviany, Claudiu Ardei, Radu Tudor Ionescu, Marius Leordeanu, Image Difficulty Curriculum for Generative Adversarial Networks (CuGAN), Winter Conference on Applications of Computer Vision (WACV), 2020.


Recurrent Space-time Graph Neural Networks

Learning in the space-time domain remains a very challenging problem in machine learning and computer vision. Current computational models for understanding spatio-temporal visual data are heavily rooted in the classical single-image based paradigm. It is not yet well understood how to integrate information in space and time into a single, general model. We propose a neural graph model, recurrent in space and time, suitable for capturing both the local appearance and the complex higher-level interactions of different entities and objects within the changing world scene. Nodes and edges in our graph have dedicated neural networks for processing information. Nodes operate over features extracted from local parts in space and time and over previous memory states. Edges process messages between connected nodes at different locations and spatial scales or between past and present time. Messages are passed iteratively in order to transmit information globally and establish long range interactions. Our model is general and could learn to recognize a variety of high level spatio-temporal concepts and be applied to different learning tasks. We demonstrate, through extensive experiments and ablation studies, that our model outperforms strong baselines and top published methods on recognizing complex activities in video. Moreover, we obtain state-of-the-art performance on the challenging Something-Something human-object interaction dataset.
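A bare-bones illustration of one spatial message-passing round is sketched below: every directed edge has its own small "network" (here just a linear map with a tanh), and node states are updated from aggregated incoming messages. This is purely illustrative; the actual model is recurrent over time and uses learned neural modules for nodes and edges.

```python
# A toy message-passing step between graph nodes (illustrative, not the real model).
import numpy as np

rng = np.random.default_rng(5)
num_nodes, dim = 6, 16
node_state = rng.standard_normal((num_nodes, dim))
edges = [(i, j) for i in range(num_nodes) for j in range(num_nodes) if i != j]
edge_W = {e: rng.standard_normal((dim, dim)) * 0.1 for e in edges}   # one map per edge

def message_passing_step(node_state):
    incoming = np.zeros_like(node_state)
    for (src, dst), W in edge_W.items():
        incoming[dst] += np.tanh(node_state[src] @ W)          # edge-specific message
    return 0.5 * node_state + 0.5 * incoming / (num_nodes - 1)  # simple node update

for _ in range(3):                      # a few rounds let information spread globally
    node_state = message_passing_step(node_state)
```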

Publications:

Andrei Nicolicioiu, Iulia Duta and Marius Leordeanu, Recurrent Space-time Graph Neural Networks, Conference on Neural Information Processing Systems (NeurIPS), 2019

Smile Project - AI and Computer Vision meet Visual Art

What is Smile Project?

We propose a new system at the intersection of art, technology and artificial intelligence that responds to the user's smile, body poses and movements. Smile Project changes the way a user experiences art, through immersive human-AI interaction.

More information can be found on the Smile Project Site.

Exhibitions:

C. Lazar, N. Rosia, P. Lucaci and M. Leordeanu, SmileProject: Deep Immersive Art with Realtime Human-AI Interaction, exhibited at:

ArtWalkStreet Art Festival, Bucharest, September 2019

Diploma Art Festival, Bucharest, October 2019

Binar Festival, November 2019

Smile Project at ArtWalkStreet Exhibition, Bucharest - September 2019

Smile Project at DIPLOMA FESTIVAL - October 2019

Smile Project at BINAR 2019 Art and Technology Exhibition, Bucharest

Automatic detection and measurement of wood with a smartphone

An image processing system receives an image depicting a bundle of boards. The bundle of boards has a front face that is perpendicular to a long axis of boards and the image is captured at an angle relative to the long axis. The image processing system applies a homographic transformation to estimate a frontal view of the front face and identifies a plurality of divisions between rows in the estimate. For each adjacent pair of the plurality of divisions between rows, a plurality of vertical divisions is identified. The image processing system identifies a set of bounding boxes defined by pairs of adjacent divisions between rows and pairs of adjacent vertical divisions. The image processing system may filter and/or merge some bounding boxes to better match the bounding boxes to individual boards. Based on the bounding boxes, the image processing system determines the number of boards in the bundle.
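The first steps described above, rectifying the bundle's front face with a homography and then finding the horizontal divisions between board rows, can be sketched as follows. The corner points, output resolution and thresholds are placeholders, and this toy version is far simpler than the patented pipeline.

```python
# A hedged sketch of front-face rectification and row-division detection (illustrative).
import cv2
import numpy as np

def rectify_front_face(image, corners, out_w=800, out_h=600):
    """corners: 4 (x, y) points of the bundle front face, clockwise from top-left."""
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    H = cv2.getPerspectiveTransform(np.float32(corners), dst)
    return cv2.warpPerspective(image, H, (out_w, out_h))

def find_row_divisions(frontal_gray, min_gap=10):
    """Rows of boards are separated by darker gaps -> local minima of the row means."""
    profile = frontal_gray.mean(axis=1)
    thresh = profile.mean() - profile.std()
    candidates = np.where(profile < thresh)[0]
    divisions = [y for i, y in enumerate(candidates)
                 if i == 0 or y - candidates[i - 1] > min_gap]   # keep one line per gap
    return divisions
```

Vertical divisions within each row pair, bounding-box construction, filtering and merging would follow the same pattern on the rectified image.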

Patents:

M Leordeanu, V Licaret, T Buzu, IA Muntianu, C Mutu, Automatic detection, counting, and measurement of lumber boards using a handheld device, US Patent 10,586,321, 2020

3D Object Detection with Geometric Constraints

We propose Shift R-CNN, a hybrid model for monocular 3D object detection, which combines deep learning with the power of geometry. We adapt a Faster R-CNN network for regressing initial 2D and 3D object properties and combine it with a least squares solution for the inverse 2D-to-3D geometric mapping problem, using the camera projection matrix. The closed-form solution of the mathematical system, along with the initial output of the adapted Faster R-CNN, is then passed through a final ShiftNet network that refines the result using our newly proposed Volume Displacement Loss. Our novel, geometrically constrained deep learning approach to monocular 3D object detection obtains top results on the KITTI 3D Object Detection Benchmark, being the best among all monocular methods that do not use any pre-trained network for depth estimation.
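As a simplified illustration of the closed-form 2D-to-3D step: given the camera intrinsics, an estimated object rotation and the 3D box corner offsets, together with the pixel locations of the projected corners, the translation follows from a linear least-squares system. This toy version assumes known corner correspondences and is not the exact constraint set used in Shift R-CNN.

```python
# A toy least-squares recovery of object translation from projected 3D box corners.
import numpy as np

def solve_translation(K, R, corners_3d, corners_2d):
    """corners_3d: (n, 3) corner offsets in the object frame; corners_2d: (n, 2) pixels."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    A, b = [], []
    for X, (u, v) in zip(corners_3d, corners_2d):
        q = R @ X
        # From u = (fx*Px + cx*Pz)/Pz with P = q + t, linear in t:
        A.append([fx, 0.0, cx - u]); b.append(-(fx * q[0] + (cx - u) * q[2]))
        A.append([0.0, fy, cy - v]); b.append(-(fy * q[1] + (cy - v) * q[2]))
    t, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return t

# Toy check: project known corners with a ground-truth translation, then recover it.
K = np.array([[700.0, 0, 320], [0, 700.0, 240], [0, 0, 1]])
R, t_true = np.eye(3), np.array([1.0, -0.5, 12.0])
X = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-0.5, 0.5) for sz in (-2, 2)])
P = (R @ X.T).T + t_true
uv = (K @ P.T).T; uv = uv[:, :2] / uv[:, 2:3]
print(solve_translation(K, R, X, uv))   # ~ [1.0, -0.5, 12.0]
```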

Publications:

Andretti Naiden, Vlad Paunescu, Gyeongmo Kim, ByeongMoon Jeon, Marius Leordeanu, Shift R-CNN: Deep Monocular 3D Object Detection with Closed-Form Geometric Constraints, 2019 IEEE International Conference on Image Processing (ICIP), 2019

More details about our work, with state-of-the-art results on the challenging VOT2017 benchmark, can be found here.

Learning a robust society of tracking parts using co-occurrence constraints

Object tracking is one of the first and most fundamental problems that has been addressed in computer vision. While it has attracted the interest of many researchers over several decades of computer vision, it is far from being solved. The task is hard for many reasons. Difficulties could come from severe changes in object appearance, presence of background clutter and occlusions that might take place in the video.

The only ground-truth knowledge given to the tracker is the bounding box of the object in the first frame. Thus, without knowing in advance the properties of the object being tracked, the tracking algorithm must learn them on the fly. It must adapt correctly and make sure it does not jump toward other objects in the background. That is why the possibility of drifting to the background poses one of the main challenges in tracking.

Finding consensus among many different language generating networks produces more meaningful descriptions.

Paper and demos are available on our project site.

Describing visual data into natural language is a very challenging task, at the intersection of computer vision, natural language processing and machine learning. Language goes well beyond the description of physical objects and their interactions and can convey the same abstract idea in many ways. It is both about content at the highest semantic level as well as about fluent form. Here we propose an approach to describe videos in natural language by reaching a consensus among multiple encoder-decoder networks. Finding such a consensual linguistic description, which shares common properties with a larger group, has a better chance to convey the correct meaning. We propose and train several network architectures and use different types of image, audio and video features. Each model produces its own description of the input video and the best one is chosen through an efficient, two-phase consensus process. We demonstrate the strength of our approach by obtaining state of the art results on the challenging MSR-VTT dataset. Project site is here.
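The consensus selection step can be illustrated very simply: each model proposes a caption, and the one that is most similar, on average, to all the others is kept. The word-overlap similarity below is only a stand-in for the actual two-phase consensus procedure.

```python
# A light sketch of consensus caption selection (illustrative similarity measure).
def consensus_caption(captions):
    def overlap(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(1, len(wa | wb))
    scores = [sum(overlap(c, other) for other in captions if other is not c)
              for c in captions]
    return captions[scores.index(max(scores))]

captions = ["a man is playing a guitar on stage",
            "a person plays guitar",
            "a man is playing guitar on a stage",
            "someone cooks food in a kitchen"]
print(consensus_caption(captions))   # the guitar description shared by most models
```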

Unsupervised learning from visual data is one of the most difficult challenges in computer vision, being a fundamental task for understanding how visual recognition works. From a practical point of view, learning from unsupervised visual input has an immense practical value, as very large quantities of unlabeled videos can be collected at low cost. In this paper, we address the task of unsupervised learning to detect and segment foreground objects in single images. We achieve our goal by training a student pathway, consisting of a deep neural network. It learns to predict from a single input image (a video frame) the output for that particular frame, of a teacher pathway that performs unsupervised object discovery in video. Our approach is different from the published literature that performs unsupervised discovery in videos or in collections of images at test time. We move the unsupervised discovery phase during the training stage, while at test time we apply the standard feed-forward processing along the student pathway. This has a dual benefit: firstly, it allows in principle unlimited possibilities of learning and generalization during training, while remaining very fast at testing. Secondly, the student not only becomes able to detect in single images significantly better than its unsupervised video discovery teacher, but it also achieves state of the art results on two important current benchmarks. Project site is here.

We offer a fast solution to unsupervised object discovery in video with state-of-the-art performance. Our code is available here.

We address an essential problem in computer vision, that of unsupervised object segmentation in video, where a main object of interest in a video sequence should be automatically separated from its background. An efficient solution to this task would enable large-scale video interpretation at a high semantic level in the absence of the costly manually labeled ground truth. We propose an efficient unsupervised method for generating foreground object soft-segmentation masks based on automatic selection and learning from highly probable positive features. We show that such features can be selected efficiently by taking into consideration the spatio-temporal, appearance and motion consistency of the object during the whole observed sequence. We also emphasize the role of the contrasting properties between the foreground object and its background. Our model is created in two stages: we start from pixel-level analysis, on top of which we add a regression model trained on a descriptor that considers information over groups of pixels and is both discriminative and invariant to many changes that the object undergoes throughout the video. We also present theoretical properties of our unsupervised learning method, which under some mild constraints is guaranteed to learn a correct discriminative classifier even in the unsupervised case. Our method achieves competitive and even state-of-the-art results on the challenging YouTube-Objects and SegTrack datasets, while being at least one order of magnitude faster than the competition. We believe that the competitive performance of our method in practice, along with its theoretical properties, constitutes an important step towards solving unsupervised discovery in video. Project site is here.
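A rough numpy sketch of the "learn from highly probable positives" idea: take the most confident foreground and background pixels from a rough cue (a crude saliency or motion map), fit a ridge regressor on simple per-pixel descriptors, and predict a soft mask everywhere. The thresholds and features are illustrative, not the descriptors used in the paper.

```python
# A hedged sketch of soft-segmentation from highly probable positive/negative pixels.
import numpy as np

def soft_segmentation(features, rough_cue, top=0.05, bottom=0.40, lam=1.0):
    """features: (H, W, d) per-pixel descriptors; rough_cue: (H, W) crude foreground score."""
    f = features.reshape(-1, features.shape[-1])
    c = rough_cue.ravel()
    pos = c >= np.quantile(c, 1 - top)            # highly probable positive pixels
    neg = c <= np.quantile(c, bottom)             # highly probable background pixels
    X = np.vstack([f[pos], f[neg]])
    y = np.hstack([np.ones(pos.sum()), np.zeros(neg.sum())])
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])                 # bias term
    w = np.linalg.solve(X1.T @ X1 + lam * np.eye(X1.shape[1]), X1.T @ y)
    mask = (np.hstack([f, np.ones((f.shape[0], 1))]) @ w).reshape(rough_cue.shape)
    return np.clip(mask, 0, 1)
```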

Our Sparse-to-Dense Matching Approach for Estimating Optical Flow and Occlusions

Our main objective is to design efficient algorithms for automatic video understanding both at the mid-level and higher-levels of interpretation, by computing dense correspondences between pairs of video frames at the mid-level, and then automatically discover meaningful visual patterns and their geometric and temporal relationships at higher levels of video understanding. We start by developing and applying our methods to several important computer vision tasks, such as motion estimation and occlusion region detection, unsupervised or weakly supervised object discovery in video, as well as classification of video with respect to different semantic classes, such as different object categories, scenes and activities. Project site can be accessed here.

Our SafeUAV system for learning to estimate depth and safe landing areas for UAVs from synthetic data.

The domain of Computer Vision studies and develops computational methods and systems that are capable of perceiving the world through images and videos in a smart manner, as close as possible to the level of human visual perception. Despite being a relatively new subfield of Artificial Intelligence and Robotics, Computer Vision currently enjoys fast-growing development in both scientific research and industry. Recent success is due not only to the development of effective machine learning algorithms, but also to the substantial increase in computation power and data storage capabilities.

Computer Vision will play an important role in the world of tomorrow, having the potential to improve quality of life and future technologies. Here we are committed to developing such smart vision systems, which should be capable of operating in close relationship with various areas of robotics, such as autonomous aerial vehicles. We aim to develop high-performance prototypes through scientific research, as well as create technological systems that have immediate usability. Thus, we shall try to discover new aspects derived from the connections between eye, sight and thinking. We shall also develop computing systems that may support such complex cognitive processes.

The programme addresses Bachelor’s and Master’s students who are passionate about science and are eager to study such methods that enable the automated interpretation of images and videos. The objective of the programme is to help us form a team of students and engineers, with the appropriate theoretical and practical skills and knowledge. In particular, we will focus on the following three directions:

1) The identification of common areas between collections of video frames or photographs and their subsequent matching and geometric alignment.

2) The semantic segmentation of images. This task involves finding image regions that belong to certain semantic categories, such as residential areas, forests, parks, roads, lakes, or rivers, among others.

3) The detection and recognition of various object categories, such as houses or cars. We want to determine the way these object categories or area types interact with each other, at the contextual interpretation level, in order to facilitate their efficient detection and recognition.

What do we undertake? We aim to design and implement efficient algorithmic solutions. To this purpose, we will first address the task of mid-level interpretation. In particular we will focus on geometric image alignment, including the identification of correspondences between aerial image features. We will also develop methods for creating panoramic views - the result will be a single aerial map containing several frames aligned in the same coordinate system. We will also consider the estimation of the motion field (or dense matches) between successive frames. Furthermore, we will seek to develop machine learning methods for the categorization and detection of various areas and object types. We will also analyze their contextual relationship in order to obtain a full semantic segmentation and interpretation of aerial images and videos.

Many more details, papers, datasets and results can be found here.

Automatic discovery of foreground objects in video sequences is an important problem in computer vision with applications to object tracking, video segmentation and classification. We propose an efficient method for the discovery of object bounding boxes and the corresponding soft-segmentation masks across multiple video frames. We offer a graph matching formulation for bounding box selection and refinement using second and higher order terms. Our objective function takes into consideration local, frame-based information, as well as spatiotemporal and appearance consistency over multiple frames. First, we find an initial pool of candidate boxes using a novel and fast foreground estimation method in video, based on Principal Component Analysis. Then, we match the boxes across multiple frames using pairwise geometric and appearance terms. Finally, we refine their location and soft-segmentation using higher order potentials that establish appearance regularity over multiple frames. We test our method on the large scale YouTube-Objects dataset [2] and obtain state-of-the-art results on several object classes. Paper and code are available on the project site.
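The initial foreground-estimation step based on Principal Component Analysis can be illustrated as follows: model the background with the top principal components of the frame matrix and treat the reconstruction residual as the foreground score. The component count and the toy data are our own choices, not the exact procedure from the paper.

```python
# A hedged sketch of PCA-based foreground estimation in video.
import numpy as np

def pca_foreground(frames, k=3):
    """frames: (T, H, W) grayscale video -> (T, H, W) foreground scores."""
    T, H, W = frames.shape
    X = frames.reshape(T, -1).astype(float)
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    background = (U[:, :k] * S[:k]) @ Vt[:k] + mean     # rank-k background reconstruction
    return np.abs(X - background).reshape(T, H, W)      # residual = likely foreground

video = np.random.default_rng(6).random((20, 40, 40))
fg = pca_foreground(video)   # candidate boxes would be placed on high-residual blobs
```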

Context in the mind: What classes can trigger the idea of a "train"?

Feature selection and ensemble learning are essential problems in computer vision, important for category learning and recognition. Along with the fast-growing development of a wide variety of visual features and classifiers, it is becoming clearer that good feature selection and combination could make a real impact on constructing powerful classifiers for more difficult and higher-level recognition tasks. We propose efficient and accurate methods that discover sparse, compact patterns of input features or classifiers from a vast sea of candidates, with important optimality properties and low computational cost. We compare our approach to well-known, established methods such as boosting, SVMs and Greedy Forward-Backward Selection.

Project page for our joint selections of features and classifiers can be accessed here.

Our proposed boundary model follows the ramp/step shape of natural object edges at different levels of image interpretation

Generalized Edge Detection and Automatic Image Soft-segmentation


Boundary detection is a fundamental computer vision problem that is essential for a variety of tasks, such as contour and region segmentation, symmetry detection and object recognition and categorization. We propose a generalized formulation for boundary detection, with closed-form solution, applicable to the localization of different types of boundaries, such as object edges in natural images and occlusion boundaries from video. Our generalized boundary detection method (Gb) simultaneously combines low-level and mid-level image representations in a single eigenvalue problem and solves for the optimal continuous boundary orientation and strength. The closed-form solution to boundary detection enables our algorithm to achieve state of the art results at a significantly lower computational cost than current methods. We also propose two complementary novel components that can seamlessly be combined with Gb: first, we introduce a soft-segmentation procedure that provides region input layers to our boundary detection algorithm for a significant improvement in accuracy, at negligible computational cost; second, we present an efficient method for contour grouping and reasoning, which when applied as a final post-processing stage, further increases the boundary detection performance.
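To convey the "combine several layers in one eigen-problem" intuition, the sketch below stacks the x/y gradients of every input layer per pixel and takes the leading eigenvector of the resulting 2x2 matrix as the boundary orientation, with its eigenvalue as the strength. This is a simplified, structure-tensor-style stand-in, not the actual Gb formulation.

```python
# A loose, simplified multi-layer boundary sketch (not the real Gb algorithm).
import numpy as np

def multilayer_boundary(layers):
    """layers: (L, H, W) stack of image layers (color channels, soft-segmentations, ...)."""
    gy = np.stack([np.gradient(l, axis=0) for l in layers])   # (L, H, W)
    gx = np.stack([np.gradient(l, axis=1) for l in layers])
    # Per-pixel 2x2 matrix M = sum over layers of [gx, gy]^T [gx, gy].
    mxx, myy, mxy = (gx * gx).sum(0), (gy * gy).sum(0), (gx * gy).sum(0)
    trace, det = mxx + myy, mxx * myy - mxy ** 2
    lam_max = 0.5 * (trace + np.sqrt(np.maximum(trace ** 2 - 4 * det, 0)))
    strength = np.sqrt(lam_max)                                # boundary strength map
    orientation = 0.5 * np.arctan2(2 * mxy, mxx - myy)         # dominant gradient angle
    return strength, orientation
```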

Project page for Generalized Boundary Detector (Gb), Soft-Segmentation and Contour Reasoning can be accessed here.

Representing object categories as graphs of local appearance features related through pair-wise or higher-order relationships.

Graph Matching and MAP Inference Methods: Theory and Applications

Graph Matching and MAP Inference in Markov Random Fields are important problems in computer vision that arise in many current applications. Here we present several efficient methods for graph and hyper-graph matching, MAP inference and parameter learning. We provide links to our publications, code and the datasets, on which we performed experiments and comparisons with other current approaches.

Project page for Methods for Graph Matching, Learning and MAP Inference can be accessed here.


Our smoothing-based optimization algorithm can be successfully applied to a wide range of tasks such as learning graph matching, object segmentation and general model fitting problems.

Smoothing-based Optimization: Theory and Practice

We propose an efficient method for complex optimization problems that often arise in computer vision. While our method is general and could be applied to various tasks, it was mainly inspired by problems in computer vision, and it borrows ideas from scale-space theory. One of the main motivations for our approach is that searching for the global maximum through the scale space of a function is equivalent to looking for the maximum of the original function, with the advantage of having to handle fewer local optima. We demonstrate the effectiveness of our method on different computer vision tasks, such as learning for graph matching and automatic foreground-background segmentation. In our extensive experiments, our proposed smoothing-based optimization approach significantly outperforms well-established methods such as MCMC and Simulated Annealing.
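The coarse-to-fine flavor of smoothing-based optimization can be illustrated with a crude scale-space search: approximately maximize the objective at a large exploration scale, then shrink the scale and restart from the previous solution. The multimodal test function and the schedule below are toy choices, not the algorithm from our work.

```python
# A small illustrative coarse-to-fine (graduated) maximization sketch.
import numpy as np

def smoothed_maximize(f, x0, sigmas=(2.0, 1.0, 0.5, 0.1), samples=200, iters=50, rng=None):
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for sigma in sigmas:                      # coarse-to-fine over the scale space
        for _ in range(iters):
            cand = x + sigma * rng.standard_normal((samples, x.size))
            values = np.array([f(c) for c in cand])
            x = cand[values.argmax()]         # move to the best sample at this scale
    return x

# Many shallow local maxima plus one broad global maximum near the origin.
f = lambda z: -0.05 * np.sum(z**2) + 0.5 * np.sum(np.cos(3 * z))
print(smoothed_maximize(f, x0=np.array([6.0, -7.0])))   # ends up near the global maximum
```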

Project page can be accessed here.