Describing visual data in natural language is a very challenging task, at the intersection of computer vision, natural language processing and machine learning. Language goes well beyond the description of physical objects and their interactions and can convey the same abstract idea in many ways. It is both about content at the highest semantic level and about fluent form. Here we propose an approach to describing videos in natural language by reaching a consensus among multiple encoder-decoder networks. Finding such a consensual linguistic description, which shares common properties with a larger group of candidates, has a better chance of conveying the correct meaning. We propose and train several network architectures and use different types of image, audio and video features. Each model produces its own description of the input video, and the best one is chosen through an efficient, two-phase consensus process. We demonstrate the strength of our approach by obtaining state-of-the-art results on the challenging MSR-VTT dataset.
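As a toy illustration of the consensus idea (not the paper's actual two-phase process, which uses learned models and stronger similarity measures), the sketch below keeps the candidate caption that agrees most, on average, with the other candidates; the bag-of-words cosine similarity is a simplifying stand-in.

```python
# Minimal sketch of consensus-based caption selection: each model proposes
# a caption, and we keep the one most similar, on average, to all others.
from collections import Counter
import math

def cosine_sim(a, b):
    """Cosine similarity between bag-of-words representations of two captions."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def consensus_caption(candidates):
    """Return the candidate with the highest average similarity to the pool."""
    def score(c):
        others = [o for o in candidates if o is not c]
        return sum(cosine_sim(c, o) for o in others) / max(len(others), 1)
    return max(candidates, key=score)

captions = [
    "a man is playing a guitar on stage",
    "a person plays guitar",
    "a man is cooking in a kitchen",
]
print(consensus_caption(captions))  # the two guitar captions agree, so one wins
```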
Unsupervised learning from visual data is one of the most difficult challenges in computer vision and a fundamental step toward understanding how visual recognition works. It also has immense practical value, as very large quantities of unlabeled videos can be collected at low cost. In this paper, we address the task of unsupervised learning to detect and segment foreground objects in single images. We achieve our goal by training a student pathway, consisting of a deep neural network, to predict, from a single input image (a video frame), the output of a teacher pathway that performs unsupervised object discovery in video on that particular frame. Our approach differs from the published literature, which performs unsupervised discovery in videos or in collections of images at test time. We move the unsupervised discovery phase to the training stage, while at test time we apply standard feed-forward processing along the student pathway. This has a dual benefit: first, it allows in principle unlimited possibilities for learning and generalization during training, while remaining very fast at testing; second, the student not only learns to detect objects in single images significantly better than its unsupervised video discovery teacher, but also achieves state-of-the-art results on two important current benchmarks.
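The training setup can be sketched in a few lines of PyTorch, under strong simplifying assumptions: the teacher's unsupervised video discovery is reduced to a placeholder function, and the student is a toy fully convolutional network. Names such as `teacher_soft_mask` are illustrative, not the authors' code.

```python
# Student-teacher sketch: the teacher supervises the student with soft masks
# computed (offline, in the real system) by unsupervised discovery in video;
# the student learns to reproduce them from single frames.
import torch
import torch.nn as nn

student = nn.Sequential(                      # toy fully convolutional student
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

def teacher_soft_mask(frames):
    # placeholder for the unsupervised video discovery pathway: here we
    # just fake a soft mask of the right shape
    return torch.rand(frames.size(0), 1, frames.size(2), frames.size(3))

for step in range(100):
    frames = torch.rand(8, 3, 64, 64)          # a batch of single video frames
    target = teacher_soft_mask(frames)         # pseudo ground truth from teacher
    loss = loss_fn(student(frames), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# at test time only the fast feed-forward student runs: student(new_frame)
```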
We address an essential problem in computer vision: unsupervised object segmentation in video, where the main object of interest in a video sequence should be automatically separated from its background. An efficient solution to this task would enable large-scale video interpretation at a high semantic level in the absence of costly, manually labeled ground truth. We propose an efficient unsupervised method for generating foreground object soft-segmentation masks based on automatic selection of, and learning from, highly probable positive features. We show that such features can be selected efficiently by taking into consideration the spatio-temporal, appearance and motion consistency of the object over the whole observed sequence. We also emphasize the role of the contrasting properties between the foreground object and its background. Our model is created in two stages: we start from pixel-level analysis, on top of which we add a regression model trained on a descriptor that aggregates information over groups of pixels and is both discriminative and invariant to many changes the object undergoes throughout the video. We also present theoretical properties of our unsupervised learning method, which, under some mild constraints, is guaranteed to learn a correct discriminative classifier even in the unsupervised case. Our method achieves competitive and even state-of-the-art results on the challenging YouTube-Objects and SegTrack datasets, while being at least one order of magnitude faster than the competition. We believe that the competitive performance of our method in practice, along with its theoretical properties, constitutes an important step towards solving unsupervised discovery in video.
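A rough sketch of the two-stage idea, under simplifying assumptions: a crude per-pixel cue stands in for the paper's motion and appearance consistency measures, its most confident responses are taken as highly probable positives, and a regressor fitted on pixel descriptors produces the soft mask. Both the cue and the color descriptor below are illustrative stand-ins.

```python
# Stage 1: select highly probable positive (and confident negative) pixels.
# Stage 2: fit a regression model on pixel descriptors to get a soft mask.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
H, W = 60, 80
frame = rng.random((H, W, 3))                 # stand-in video frame (RGB)
cue = rng.random((H, W))                      # stand-in motion/appearance cue

pos = cue > np.quantile(cue, 0.95)            # highly probable positives
neg = cue < np.quantile(cue, 0.50)            # confident background

X = frame.reshape(-1, 3)                      # per-pixel color descriptor
y = np.zeros(H * W)
y[pos.ravel()] = 1.0
train = (pos | neg).ravel()                   # train only on confident pixels

model = Ridge(alpha=1.0).fit(X[train], y[train])
soft_mask = model.predict(X).reshape(H, W).clip(0, 1)
```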
Our main objective is to design efficient algorithms for automatic video understanding at both the mid-level and the higher levels of interpretation: computing dense correspondences between pairs of video frames at the mid-level, and then automatically discovering meaningful visual patterns and their geometric and temporal relationships at the higher levels. We start by developing and applying our methods to several important computer vision tasks, such as motion estimation and occlusion region detection, unsupervised or weakly supervised object discovery in video, and classification of video with respect to different semantic classes, such as object categories, scenes and activities.
The domain of Computer Vision studies and develops computational methods and systems capable of perceiving the world through images and videos in an intelligent manner, as close as possible to the level of human visual perception. Despite being a relatively young subfield of Artificial Intelligence and Robotics, Computer Vision currently enjoys fast-growing development in both scientific research and industry. Its recent success is due not only to the development of effective machine learning algorithms, but also to the substantial increase in computational power and data storage capabilities.
Computer Vision will play an important role in the world of tomorrow, with the potential to improve quality of life and future technologies. Here we are committed to developing such smart vision systems, capable of operating in close relationship with various areas of robotics, such as autonomous aerial vehicles. We aim to develop high-performance prototypes through scientific research, as well as to create technological systems with immediate usability. In doing so, we shall try to discover new aspects of the connections between the eye, sight and thinking, and to develop computing systems that can support such complex cognitive processes.
The programme addresses Bachelor's and Master's students who are passionate about science and eager to study methods that enable the automated interpretation of images and videos. Its objective is to help us form a team of students and engineers with the appropriate theoretical and practical skills and knowledge. In particular, we will focus on the following three directions:
1) The identification of common areas between collections of video frames or photographs and their subsequent matching and geometric alignment.
2) The semantic segmentation of images. This task involves finding image regions that belong to certain semantic categories, such as residential areas, forests, parks, roads, lakes, or rivers, among others.
3) The detection and recognition of various object categories, such as houses or cars. We want to determine how these object categories and area types interact with each other at the level of contextual interpretation, in order to facilitate their efficient detection and recognition.
What do we undertake? We aim to design and implement efficient algorithmic solutions. To this purpose, we will first address the task of mid-level interpretation. In particular, we will focus on geometric image alignment, including the identification of correspondences between aerial image features. We will also develop methods for creating panoramic views, in which several frames are aligned into a single aerial map in the same coordinate system (a minimal alignment pipeline is sketched below). We will also consider the estimation of the motion field (or dense matches) between successive frames. Furthermore, we will develop machine learning methods for the categorization and detection of various areas and object types, and analyze their contextual relationships in order to obtain a full semantic segmentation and interpretation of aerial images and videos.
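The sketch below uses standard OpenCV primitives (ORB features, brute-force matching and a RANSAC homography) as a generic stand-in for the alignment step; the project's own methods may differ, but the pipeline shape is the same: detect, match, align. The two input frames are synthetic so the example is self-contained.

```python
# Generic aerial-frame alignment sketch: detect local features in two
# overlapping frames, match them, estimate a homography with RANSAC, and
# warp one frame into the other's coordinate system.
import cv2
import numpy as np

# synthetic stand-ins for two overlapping aerial frames: a textured image
# and a rotated copy of it
rng = np.random.default_rng(0)
img1 = (rng.random((480, 640)) * 255).astype(np.uint8)
img1 = cv2.GaussianBlur(img1, (7, 7), 0)
M = cv2.getRotationMatrix2D((320, 240), 10, 1.0)   # 10-degree rotation
img2 = cv2.warpAffine(img1, M, (640, 480))

orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

# warp frame 1 into frame 2's coordinates; with more frames, chaining such
# homographies builds the single aerial map described above
aligned = cv2.warpPerspective(img1, H, (640, 480))
```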
Feature selection and ensemble learning are essential problems in computer vision, important for category learning and recognition. With the fast-growing variety of visual features and classifiers, it is becoming clear that good feature selection and combination can make a real impact on constructing powerful classifiers for more difficult, higher-level recognition tasks. We propose an algorithm that efficiently discovers sparse, compact representations of input features or classifiers from a vast sea of candidates, with important optimality properties, low computational cost and excellent accuracy in practice. Unlike boosting, we start with a discriminant linear classification formulation that encourages sparse solutions. We then obtain an equivalent unsupervised clustering problem that jointly discovers ensembles of diverse features: independently valuable, but even more powerful when united in a cluster of classifiers. We evaluate our method on the task of large-scale recognition in video and show that it significantly outperforms classical selection approaches, such as AdaBoost and greedy forward-backward selection, as well as powerful classifiers such as SVMs, in both training speed and performance, especially when training data is limited.
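The paper's joint selection algorithm is specialized; purely as a familiar point of reference for the sparse-selection idea, here is a standard baseline in scikit-learn: an L1-regularized linear classifier whose nonzero weights pick out a compact subset of the candidate features or classifier outputs.

```python
# Sparse feature selection baseline: with an L1 penalty, most weights are
# driven to zero and the surviving indices form a compact feature subset.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 200, 500                                # few samples, many candidates
X = rng.standard_normal((n, d))
y = (X[:, :5].sum(axis=1) > 0).astype(int)     # only 5 features are informative

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(clf.coef_[0])        # indices of surviving features
print(len(selected), "features selected:", selected[:10])
```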
Automatic discovery of foreground objects in video sequences is an important problem in computer vision, with applications to object tracking, video segmentation and classification. We propose an efficient method for discovering object bounding boxes and the corresponding soft-segmentation masks across multiple video frames. We offer a graph matching formulation for bounding box selection and refinement using second- and higher-order terms. Our objective function takes into consideration local, frame-based information as well as spatiotemporal and appearance consistency over multiple frames. First, we find an initial pool of candidate boxes using a novel and fast foreground estimation method in video, based on Principal Component Analysis. Then, we match the boxes across multiple frames using pairwise geometric and appearance terms. Finally, we refine their locations and soft-segmentations using higher-order potentials that establish appearance regularity over multiple frames. We test our method on the large-scale YouTube-Objects dataset [2] and obtain state-of-the-art results on several object classes.
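The PCA-based first step admits a compact sketch: since the background is roughly stationary, it is well captured by the top principal components of the frames, so per-pixel reconstruction error highlights the moving foreground. The sketch below follows this spirit; the details (normalization, smoothing, box extraction) differ from the paper.

```python
# PCA-based foreground estimation: reconstruct each frame from the top-k
# principal components of the clip and use the residual as a soft mask.
import numpy as np

def foreground_from_pca(frames, k=5):
    """frames: (T, H, W) grayscale video; returns per-frame error maps."""
    T, H, W = frames.shape
    X = frames.reshape(T, -1).astype(np.float64)
    mean = X.mean(axis=0)
    Xc = X - mean
    # top-k principal components via SVD of the centered frame matrix
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    recon = (Xc @ Vt[:k].T) @ Vt[:k] + mean
    err = np.abs(X - recon).reshape(T, H, W)
    return err / (err.max() + 1e-8)            # normalized soft foreground maps

video = np.random.rand(30, 48, 64)             # stand-in clip
masks = foreground_from_pca(video, k=5)
```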
Boundary detection is a fundamental computer vision problem that is essential for a variety of tasks, such as contour and region segmentation, symmetry detection and object recognition and categorization. We propose a generalized formulation for boundary detection, with a closed-form solution, applicable to the localization of different types of boundaries, such as object edges in natural images and occlusion boundaries from video. Our generalized boundary detection method (Gb) simultaneously combines low-level and mid-level image representations in a single eigenvalue problem and solves for the optimal continuous boundary orientation and strength. The closed-form solution to boundary detection enables our algorithm to achieve state-of-the-art results at a significantly lower computational cost than current methods. We also propose two complementary novel components that can seamlessly be combined with Gb: first, we introduce a soft-segmentation procedure that provides region input layers to our boundary detection algorithm for a significant improvement in accuracy, at negligible computational cost; second, we present an efficient method for contour grouping and reasoning, which, when applied as a final post-processing stage, further increases the boundary detection performance.
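Gb solves its own specific closed-form eigenvalue problem; as a hedged illustration of the same general idea (many input layers combined into one per-pixel eigen problem yielding both strength and orientation), the sketch below pools gradients of all layers into a single 2x2 tensor whose leading eigenpair gives boundary strength and normal direction.

```python
# Multi-layer boundary sketch: gradients of every input layer (color
# channels, soft-segmentation maps, ...) are accumulated into one 2x2
# matrix per pixel; its leading eigenvalue gives the boundary strength and
# its leading eigenvector the boundary normal.
import numpy as np

def multilayer_boundaries(layers):
    """layers: (L, H, W) stack of image/soft-segmentation layers."""
    L, H, W = layers.shape
    J = np.zeros((H, W, 2, 2))
    for layer in layers:
        gy, gx = np.gradient(layer)
        J[..., 0, 0] += gx * gx
        J[..., 0, 1] += gx * gy
        J[..., 1, 0] += gx * gy
        J[..., 1, 1] += gy * gy
    evals, evecs = np.linalg.eigh(J)           # per-pixel eigen-decomposition
    strength = np.sqrt(evals[..., 1])          # leading eigenvalue -> strength
    normal = evecs[..., :, 1]                  # its eigenvector -> orientation
    theta = np.arctan2(normal[..., 1], normal[..., 0])
    return strength, theta

layers = np.random.rand(4, 64, 64)             # e.g. color + soft-segmentations
strength, theta = multilayer_boundaries(layers)
```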
Graph Matching and MAP Inference in Markov Random Fields are important problems in computer vision that arise in many current applications. We present several efficient methods for graph and hyper-graph matching, MAP inference and parameter learning. We provide links to our publications, code and the datasets on which we performed experiments and comparisons with other current approaches.
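To make the problem concrete, here is a short sketch of spectral graph matching, one classic approach in the family covered here (a simplified illustration, not any specific method listed above): candidate assignments score each other through a pairwise affinity matrix, the principal eigenvector of that matrix ranks the assignments, and a greedy pass enforces one-to-one constraints.

```python
# Spectral graph matching sketch: rank candidate assignments by the
# principal eigenvector of their pairwise affinity matrix, then greedily
# discretize under one-to-one matching constraints.
import numpy as np

def spectral_matching(M, n1, n2):
    """M: (n1*n2, n1*n2) affinity between assignments (i->a, j->b)."""
    v = np.ones(M.shape[0])
    for _ in range(100):                       # power iteration
        v = M @ v
        v /= np.linalg.norm(v)
    match, used_i, used_a = [], set(), set()
    for idx in np.argsort(-v):                 # greedy discretization
        i, a = divmod(idx, n2)
        if i not in used_i and a not in used_a:
            match.append((i, a))
            used_i.add(i); used_a.add(a)
    return match

n1 = n2 = 4
M = np.random.rand(n1 * n2, n1 * n2)
M = (M + M.T) / 2                              # affinities are symmetric
print(spectral_matching(M, n1, n2))
```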
We propose an efficient method for the complex optimization problems that often arise in computer vision. While our method is general and could be applied to various tasks, it was mainly inspired by problems in computer vision and borrows ideas from scale-space theory. One of the main motivations for our approach is that searching for the global maximum through the scale space of a function is equivalent to looking for the maximum of the original function, with the advantage of having fewer local optima to handle. Our method works with any non-negative, possibly non-smooth function and requires only the ability to evaluate the function at any given point. The algorithm is based on a growth transformation, which, unlike gradient methods, is guaranteed to increase the value of the scale-space function at every step. To demonstrate its effectiveness, we present its performance on a few computer vision applications and show that in our experiments it is more effective than well-established methods such as MCMC, Simulated Annealing and the more local Nelder-Mead optimization method.
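A Monte Carlo sketch of the smoothing idea (not the exact growth transformation from the paper): the Gaussian-smoothed objective is ascended by repeatedly moving to the function-weighted mean of samples drawn around the current point, which only needs point evaluations of a non-negative f, and the smoothing scale is annealed over time.

```python
# Smoothing-based global search sketch: optimize a Gaussian-blurred version
# of f by moving to the f-weighted mean of nearby samples, shrinking the
# blur (scale) each iteration; f must be non-negative but may be non-smooth.
import numpy as np

def smoothed_ascent(f, x0, sigma=2.0, n_samples=200, n_iters=50, decay=0.95):
    x = np.asarray(x0, dtype=float)
    rng = np.random.default_rng(0)
    for _ in range(n_iters):
        samples = x + sigma * rng.standard_normal((n_samples, x.size))
        weights = np.array([f(s) for s in samples])      # f >= 0 assumed
        if weights.sum() > 0:
            x = weights @ samples / weights.sum()        # weighted-mean update
        sigma *= decay                                   # shrink the scale
    return x

# a non-smooth, multi-modal test function with its global maximum at the origin
f = lambda x: np.exp(-0.1 * np.abs(x).sum()) * (2 + np.cos(3 * x).sum())
print(smoothed_ascent(f, x0=np.array([4.0, -3.0])))
```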