The Vision in Words project is an ambitious endeavor in which we aim to develop efficient methods that learn, with minimal supervision, to describe indoor videos in relatively complex natural language. Our approach builds on an extensive, ongoing research effort of our group, with the most recent results presented in the following papers:
[1] Iulia Duta, Andrei Nicolicioiu, Vlad Bogolin and Marius Leordeanu, Mining for meaning: from vision to language through multiple networks consensus, British Machine Vision Conference (BMVC), 2018. PDF
Abstract
Describing visual data in natural language is a very challenging task, at the intersection of computer vision, natural language processing and machine learning. Language goes well beyond the description of physical objects and their interactions and can convey the same abstract idea in many ways. It is both about content at the highest semantic level as well as about fluent form. Here we propose an approach to describe videos in natural language by reaching a consensus among multiple encoder-decoder networks. Finding such a consensual linguistic description, which shares common properties with a larger group, has a better chance to convey the correct meaning. We propose and train several network architectures and use different types of image, audio and video features. Each model produces its own description of the input video and the best one is chosen through an efficient, two-phase consensus process. We demonstrate the strength of our approach by obtaining state-of-the-art results on the challenging MSR-VTT dataset.
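The core consensus idea can be illustrated with a minimal sketch: given candidate captions from several models, pick the one that agrees most, on average, with all the others. The captions and the simple word-overlap similarity below are illustrative stand-ins, not the paper's actual models or learned similarity measures.

```python
def similarity(a, b):
    """Jaccard word-overlap similarity between two captions (a toy
    stand-in for a learned caption-similarity measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def consensus_caption(candidates):
    """Return the candidate that is most similar, on average,
    to all other candidates -- the 'consensus' description."""
    def avg_sim(c):
        others = [o for o in candidates if o is not c]
        return sum(similarity(c, o) for o in others) / len(others)
    return max(candidates, key=avg_sim)

captions = [
    "a man is cooking in a kitchen",
    "a person is cooking food in a kitchen",
    "a dog runs through a park",
]
print(consensus_caption(captions))  # the outlier caption is rejected
```

The two cooking captions reinforce each other, so one of them is selected over the unrelated outlier.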
More information about this work and demos of our results can be found on our paper website here.
[2] Simion-Vlad Bogolin, Ioana Croitoru, Marius Leordeanu, A hierarchical approach to vision-based language generation: from simple sentences to complex natural language, accepted with Oral Presentation at the International Conference on Computational Linguistics (COLING) 2020. Top conference in computational linguistics – Rank A
Abstract
Automatically describing videos in natural language is an ambitious problem, which could bridge our understanding of vision and language. We propose a hierarchical approach, by first generating video descriptions as sequences of simple sentences, followed at the next level by a more complex and fluent description in natural language. While the simple sentences describe simple actions in the form of (subject, verb, object), the second-level paragraph descriptions, indirectly using information from the first-level description, present the visual content in a more compact, coherent and semantically rich manner. To this end, we introduce the first video dataset in the literature that is annotated with captions at two levels of linguistic complexity. We perform extensive tests that demonstrate that our hierarchical linguistic representation, from simple to complex language, allows us to train a two-stage network that is able to generate significantly more complex paragraphs than current one-stage approaches.
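The two-level representation can be pictured with a toy sketch: first-level output is a list of simple (subject, verb, object) sentences, which the second level turns into a paragraph. A trivial template stands in here for the second-stage network; the example sentences are made up for illustration.

```python
# Level 1: simple actions as (subject, verb, object) triplets,
# as produced by the first-stage captioning network.
svo = [("a man", "opens", "the fridge"),
       ("the man", "takes", "a bottle"),
       ("he", "pours", "a drink")]

# Flatten each triplet into a simple sentence.
simple = [f"{s} {v} {o}" for s, v, o in svo]

# Level 2: a paragraph built from the simple sentences (a template
# stands in for the second-stage paragraph-generation network).
paragraph = ". ".join(s.capitalize() for s in simple) + "."
print(paragraph)
```

In the actual system the second stage is a learned network that produces a more compact and fluent paragraph, rather than a concatenation.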
Our oral presentation can be seen in full here.
[3] Nicolae Cudlenco, Andrei Avram and Marius Leordeanu. Towards a common representation between vision and language in the form of graphs of events in space and time.
Please see the Synthetic Vision to Language site here and the corresponding full research report here, as part of the Vision in Words project.
Such graphs of events are used, on one hand, to generate stories and synthetic videos of those stories. On the other hand, the system will also learn the inverse problem: inferring the story (in natural language) and the corresponding graph of events from a given input video. The ability to automatically generate triplets of the form (graph of events, story in natural language, video) will make it possible for the deep learning system to learn the mapping between vision and language in the synthetic, virtual environment. The hope is that our self-supervised learning approach in the virtual environment will generalize to real data.
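A graph of events in space and time can be sketched as a simple data structure: nodes are events with actors and a time span, and edges carry relations such as temporal order. All class and field names below are illustrative assumptions, not the representation used in the technical report.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """One node: an action, its actors, and when it happens."""
    action: str
    actors: tuple
    t_start: float
    t_end: float

@dataclass
class EventGraph:
    """Events plus directed, labelled relations between them."""
    events: list = field(default_factory=list)
    edges: list = field(default_factory=list)   # (i, j, relation) triples

    def add(self, e):
        self.events.append(e)
        return len(self.events) - 1             # index of the new node

g = EventGraph()
a = g.add(Event("enters_room", ("person",), 0.0, 2.0))
b = g.add(Event("sits_on_chair", ("person",), 2.5, 4.0))
g.edges.append((a, b, "before"))                # temporal-order relation
print(len(g.events), len(g.edges))
```

From such a graph one can render a synthetic video and a story, giving the (graph, story, video) triplets described above.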
A research task directly related to the main objectives of our project is the ability to automatically recognize the main foreground objects in a scene. Doing so in an unsupervised manner, and learning about novel objects, can function as an attention mechanism that is vital for finding the cues and pieces of information necessary for describing the visual scene in natural language.
The success of our work in vision-to-language translation will therefore depend on the ability to automatically detect general or unknown objects in the scene in an unsupervised way. We already have relevant results on this task, presented in our article published in the International Journal of Computer Vision, which is available here. The conference version of our work (published at ICCV 2017), more technical details and qualitative results can be found on our paper's website here.
[1] I.Croitoru, S.V. Bogolin, M. Leordeanu, Unsupervised learning of foreground object segmentation, International Journal of Computer Vision (IJCV) (2019) 127: 1279. https://doi.org/10.1007/s11263-019-01183-3. Top journal in computer vision (Impact factor 6.071).
The article, being in the red zone of top journals in artificial intelligence, received an award from UEFISCDI.
Abstract
Unsupervised learning poses one of the most difficult challenges in computer vision today. The task has an immense practical value with many applications in artificial intelligence and emerging technologies, as large quantities of unlabeled videos can be collected at relatively low cost. In this paper, we address the unsupervised learning problem in the context of detecting the main foreground objects in single images. We train a student deep network to predict the output of a teacher pathway that performs unsupervised object discovery in videos or large image collections. Our approach is different from published methods on unsupervised object discovery. We move the unsupervised learning phase during training time, then at test time we apply the standard feed-forward processing along the student pathway. This strategy has the benefit of allowing increased generalization possibilities during training, while remaining fast at testing. Our unsupervised learning algorithm can run over several generations of student-teacher training. Thus, a group of student networks trained in the first generation collectively create the teacher at the next generation. In experiments our method achieves top results on three current datasets for object discovery in video, unsupervised image segmentation and saliency detection. At test time the proposed system is fast, being one to two orders of magnitude faster than published unsupervised methods.
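The multi-generation teacher-student loop described above can be sketched in a few lines: each generation, the current teacher pseudo-labels the unlabeled videos, several students are trained on those labels, and the student ensemble becomes the next teacher. The function names and the toy run below are illustrative assumptions, not the released code.

```python
def teacher_student_generations(unlabeled_videos, n_generations, n_students,
                                discover_masks, train_student, ensemble):
    """Generic multi-generation loop: teacher -> pseudo-labels ->
    students -> next teacher."""
    teacher = discover_masks                 # generation 0: unsupervised discovery
    for _ in range(n_generations):
        pseudo_labels = [teacher(v) for v in unlabeled_videos]
        students = [train_student(unlabeled_videos, pseudo_labels)
                    for _ in range(n_students)]
        teacher = ensemble(students)         # students collectively form the teacher
    return teacher

# Toy run: "videos" are numbers, a "mask" is the number rounded, and a
# "student" just memorizes the mean pseudo-label (all stand-ins).
videos = [0.2, 0.8, 0.9]
discover = lambda v: round(v)
train = lambda vids, labels: (lambda v: sum(labels) / len(labels))
ens = lambda studs: (lambda v: sum(s(v) for s in studs) / len(studs))

final = teacher_student_generations(videos, n_generations=2, n_students=3,
                                    discover_masks=discover,
                                    train_student=train, ensemble=ens)
print(final(0.5))
```

In the real system the students are deep segmentation networks and the pseudo-labels are soft foreground masks; the loop structure is the same.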
[2] Andrei Nicolicioiu, Iulia Duta and Marius Leordeanu, Recurrent Space-time Graph Neural Networks, Neural Information Processing Systems (NeurIPS), Vancouver, 2019. PDF. Top conference in artificial intelligence – Rank A+
Abstract
Learning in the space-time domain remains a very challenging problem in machine learning and computer vision. Current computational models for understanding spatio-temporal visual data are heavily rooted in the classical single-image based paradigm. It is not yet well understood how to integrate information in space and time into a single, general model. We propose a neural graph model, recurrent in space and time, suitable for capturing both the local appearance and the complex higher-level interactions of different entities and objects within the changing world scene. Nodes and edges in our graph have dedicated neural networks for processing information. Nodes operate over features extracted from local parts in space and time and over previous memory states. Edges process messages between connected nodes at different locations and spatial scales or between past and present time. Messages are passed iteratively in order to transmit information globally and establish long range interactions. Our model is general and could learn to recognize a variety of high level spatio-temporal concepts and be applied to different learning tasks. We demonstrate, through extensive experiments and ablation studies, that our model outperforms strong baselines and top published methods on recognizing complex activities in video. Moreover, we obtain state-of-the-art performance on the challenging Something-Something human-object interaction dataset.
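One iterative message-passing step of the kind described above can be sketched with plain NumPy: each node (a local space-time region) sends a transformed message to its neighbours, aggregates incoming messages, and updates its state from its own features plus the aggregate. The dimensions, weight matrices and update rules below are illustrative, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, d = 4, 8                        # nodes = local space-time regions
h = rng.standard_normal((n_nodes, d))    # node states (features + memory)
A = np.ones((n_nodes, n_nodes)) - np.eye(n_nodes)   # fully connected graph
W_msg = rng.standard_normal((d, d)) * 0.1
W_upd = rng.standard_normal((2 * d, d)) * 0.1

def message_pass(h):
    """One step: nodes emit messages, aggregate over neighbours,
    then update from [own state, aggregated message]."""
    msgs = np.tanh(h @ W_msg)                         # per-node outgoing message
    agg = (A @ msgs) / A.sum(1, keepdims=True)        # mean over neighbours
    return np.tanh(np.concatenate([h, agg], axis=1) @ W_upd)

for _ in range(3):     # iterating propagates information globally
    h = message_pass(h)
print(h.shape)
```

Iterating the step several times is what lets distant nodes exchange information and establish the long-range interactions mentioned in the abstract.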
[3] Marius Leordeanu, Unsupervised Learning in Space and Time: A Modern Approach for Computer Vision using Graph-based Techniques and Deep Neural Networks, Springer, May 2020. 298 pages. Print ISBN 978-3-030-42127-4. The book link on Springer website is here.
Abstract
This book is about unsupervised learning. That is one of the most challenging puzzles that we must solve and put together, piece by piece, in order to decode the secrets of intelligence. Here, we move closer to that goal by connecting classical computational models to newer deep learning ones, then building on some fundamental and intuitive unsupervised learning principles. We want to reduce the unsupervised learning problem to a set of essential ideas and then develop the computational tools needed to implement them in the real world. Eventually we aim to imagine a universal unsupervised learning machine, the Visual Story Network. The book is written for young students as well as experienced researchers, engineers and professors. It presents computational models and optimization algorithms in sufficient technical detail, while also creating and maintaining a big intuitive picture of the main subject.
Different tasks, such as graph matching and clustering, feature selection, classifier learning, unsupervised object discovery and segmentation in video, teacher-student learning over multiple generations as well as recursive graph neural networks are brought together, chapter by chapter, under the same umbrella of unsupervised learning. In the current chapter we introduce the reader to the overall story of the book, which presents a unified picture of the different topics that will be presented in detail in the chapters to follow. Besides sharing the main goal of learning without human supervision, the problems and tasks presented in the book also share common computational graph models and optimization methods, such as spectral graph matching, spectral clustering and the integer projected fixed point method. By bringing together similar mathematical formulations across different tasks, all guided by common intuitive principles towards a universal unsupervised learning system, the book invites the reader to absorb and then participate in the creation of the next generation of artificial intelligence.
[4] Iulia Duta, Andrei Nicolicioiu, Marius Leordeanu, Dynamic Regions Graph Neural Networks for Spatio-Temporal Reasoning, Object Representations for Learning and Reasoning Workshop at Neural Information Processing Systems (NeurIPS), 2020. Workshop at a top conference in artificial intelligence – Rank A+
Abstract
Graph Neural Networks are perfectly suited to capture latent interactions occurring in the spatio-temporal domain. But when an explicit structure is not available, as in the visual domain, it is not obvious what atomic elements should be represented as nodes. They should depend on the context and the kinds of relations that we are interested in. We focus on modeling relations between instances by proposing a method that takes advantage of the locality assumption to create nodes that are clearly localised in space. Current works use external object detectors or fixed regions to extract features corresponding to graph nodes, while we propose a module for generating the regions associated with each node dynamically, without explicit object-level supervision. Conditioned on the input, for each node we predict the location and size of a region and use them to pool node features using a differentiable mechanism. Constructing these localised, adaptive nodes makes our model biased towards object-centric representations and we show that it improves the modeling of visual interactions. By relying on a few localised nodes, our method learns to focus on salient regions, leading to a more explainable model.
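The differentiable pooling idea can be sketched as follows: each node predicts a region (centre and size), and features are pooled under a soft spatial mask, so gradients can flow back to the region parameters. The Gaussian mask and all dimensions below are assumptions for illustration, not the paper's exact module.

```python
import numpy as np

H = W = 8
d = 4
rng = np.random.default_rng(1)
feat = rng.standard_normal((H, W, d))   # feature map from a backbone network

def pool_node(cx, cy, sx, sy):
    """Pool a d-dim node feature under a soft Gaussian region mask.
    (cx, cy) is the predicted centre, (sx, sy) the predicted size;
    everything is differentiable with respect to these parameters."""
    ys, xs = np.mgrid[0:H, 0:W]
    mask = np.exp(-((xs - cx) ** 2 / (2 * sx ** 2)
                    + (ys - cy) ** 2 / (2 * sy ** 2)))
    mask /= mask.sum()                  # normalised soft attention over the map
    return (feat * mask[..., None]).sum(axis=(0, 1))

node = pool_node(cx=3.0, cy=5.0, sx=1.5, sy=1.5)
print(node.shape)
```

Because the mask is a smooth function of the predicted centre and size, the region parameters can be learned end-to-end without object-level supervision.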
Using machine learning and computer vision techniques to treat people's phobias or improve their visual and cognitive abilities is one of the great opportunities of artificial intelligence to improve our everyday lives. As part of the current project we are involved in several research tasks that are focused on this topic, with several articles already published in high quality journals and conferences:
[1] O Balan, G Moise, A Moldoveanu, F Moldoveanu, M Leordeanu, Automatic Adaptation of Exposure Intensity in VR Acrophobia Therapy, Based on Deep Neural Networks, European Conference on Information Systems, Stockholm, Sweden 2019. PDF
Abstract
This paper proposes a real-time Virtual Reality game for treating acrophobia that automatically tailors in-game exposure to heights to the players’ individual characteristics – affective state and physiological features. The elements of novelty are the automatic estimation of fear and the prediction of the next game level based on the electroencephalogram (EEG) and biophysical data – Galvanic Skin Response (GSR) and Heart Rate (HR). Two neural networks were trained with the data recorded in an experiment in which 4 subjects were exposed, in vivo and virtually, to various heights. In order to test the validity of the approach, the same users played the acrophobia game, using two modalities of expressing fear level. After completing a game level, the EEG and biophysical data were averaged and one neural network estimated the current fear score, while the other predicted the next game level. A measure of similarity between the self-estimated fear level during a game epoch and the fear level predicted by the first neural network showed accuracy rates of 73% and 42%, respectively, for the two modalities of expressing fear level. 3 out of 4 users succeeded in obtaining a fear level of 0 (complete relaxation) in the final game epoch.
[2] Cudlenco, N., Popescu, N. and Leordeanu, M., 2020. Reading into the mind’s eye: boosting automatic visual recognition with EEG signals. Neurocomputing. Volume 386, 21 April 2020, Pages 281-292. Q1 journal, with impact factor: 4.438. Link to publication site. Article PDF is available.
The article, being in the red zone of top journals in artificial intelligence, will soon receive an award from UEFISCDI.
Abstract
Classifying visual information is an apparently simple and effortless task in our everyday routine, but can we automatically predict what we see from signals emitted by the brain?
While other researchers have already attempted to answer this question, we are the first to show that a commercially available BCI could be effectively used for visual image classification in real-world scenarios – when testing takes place at a completely different time than training data collection. The task is difficult, as it requires relating the noisy and low-level EEG signals to complex and highly semantic visual categories. In this paper, we propose different learning approaches and show that simpler classifiers such as Ridge Regression with Gabor filtering of the input EEG signal could be more effective than the powerful Long Short Term Memory Networks and Convolutional Neural Networks in this case of limited and noisy training data. We analyzed the importance of each electrode for the visual classification task and noticed that the sensors with the highest accuracy were the ones that recorded brain activity from regions known to be correlated more with higher level recognition and cognitive processes and less to lower-level visual signal processing. The result is also in accordance with research in computer vision with deep neural networks, which shows that semantic visual features are learned only at higher levels of neural depth.
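The "Gabor filtering + Ridge Regression" pipeline mentioned above can be sketched on toy data: convolve each EEG channel with a small bank of Gabor wavelets, use the response energies as features, and fit ridge regression in closed form. The filter parameters, signal lengths and labels below are made up for illustration, not the paper's actual configuration.

```python
import numpy as np

def gabor_kernel(freq, sigma, length=64, fs=128.0):
    """A 1-D Gabor wavelet: a cosine at `freq` Hz under a Gaussian envelope."""
    t = (np.arange(length) - length / 2) / fs
    return np.exp(-t ** 2 / (2 * sigma ** 2)) * np.cos(2 * np.pi * freq * t)

def gabor_features(signal, freqs=(4, 8, 16, 32), sigma=0.05):
    """Energy of the signal's response to each Gabor filter in the bank
    (frequencies loosely match standard EEG bands; values are assumptions)."""
    return np.array([np.sum(np.convolve(signal, gabor_kernel(f, sigma),
                                        mode="same") ** 2)
                     for f in freqs])

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam*I)^-1 X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
signals = rng.standard_normal((20, 128))        # toy single-channel EEG epochs
y = rng.integers(0, 2, 20).astype(float)        # toy binary labels
X = np.stack([gabor_features(s) for s in signals])
w = ridge_fit(X, y)
print(w.shape)
```

With limited, noisy data, this kind of heavily regularised linear model on hand-crafted features can outperform much larger recurrent or convolutional networks, which is the paper's central observation.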
While EEG signals are weaker by themselves for the task of visual classification, we demonstrate that they could be powerful when combined with deep visual features extracted from the image, improving performance from 91% to over 97% in a multi-class recognition setting. Our tests show that EEG input brings additional information that is not learned by artificial deep networks on the given image training set. Thus, a commercially available BCI could be effectively used in conjunction with a deep learning based vision system to form together a stronger visual recognition system that is suitable for real-world applications.