Research Themes

The research of my team and collaborators focuses on the development of Machine Learning methods for the analysis and interpretation of images, videos and neurophysiological signals (e.g. EEG). The research is driven by applications in Multimedia Indexing and Retrieval, Autonomous Systems and Human Computer Interaction.

An (infrequently updated) list of research themes:

Action recognition and localisation in image sequences

This work aims at developing methods for the recognition and localisation of human and animal action categories in image sequences. Once trained, the methods should be able to detect and localise, in an unseen image sequence, all the actions that belong to one of the known categories.
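A minimal sketch of temporal action localisation via sliding-window scoring. Everything here is synthetic and the linear template merely stands in for a trained model; it illustrates the detect-and-localise setting, not the actual method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame descriptors for a 100-frame sequence (e.g. motion
# histograms); frames 40-59 contain the action of interest.
T, D = 100, 16
frames = rng.normal(0.0, 1.0, (T, D))
frames[40:60] += 2.0  # action frames are shifted in feature space

# A linear scoring template for the category; a trained model is assumed,
# here the mean action descriptor stands in for it.
template = frames[40:60].mean(axis=0)

# Slide a fixed-length temporal window and score every placement.
win = 20
scores = np.array([frames[t:t + win].mean(axis=0) @ template
                   for t in range(T - win + 1)])
best_start = int(np.argmax(scores))  # temporal localisation of the action
```

Spatial localisation extends the same idea by scanning windows over image coordinates as well as over time.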


Facial (expression) analysis

Researchers: Ioannis Patras, Irene Kotsia, Sander Koelstra, Ognjen Rudovic (Imperial College)

This work applies computational intelligence and image processing techniques to the analysis of facial information in images and video. This includes tracking facial features, gaze tracking, recognising the activation of facial muscles (Facial Action Units) and recognising facial expressions such as those associated with the six basic emotions (anger, disgust, fear, happiness, sadness and surprise). Research is conducted not only in controlled environments, but also under challenging conditions, such as occlusion, varying lighting and pose, and spontaneous facial expressions.

Human pose estimation

This line of work focuses on methods for the recovery of the 3D human body pose from a single 2D image. In particular, we learn direct mappings from image observations to the parameters that describe the 3D body pose. We first developed a hierarchical approach to this problem, learning piecewise mappings from observations to human poses; to achieve this we employed Support Vector Machines and multi-valued Relevance Vector Machine (RVM) regressors. We then developed a tensor regression framework that employs two empirical risk functions, formulated using either the Frobenius norm or the group sparsity norm. Using the group sparsity norm we also achieve automatic selection of the rank during the learning process, by favoring a low-rank decomposition of the tensorial weights.
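As a toy illustration of learning a direct mapping from observations to pose parameters, the following fits a Frobenius-norm (ridge) regularised linear regressor on synthetic data; the RVM and tensor regressors of the actual work are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: N image descriptors x, mapped to P pose parameters y
# by an unknown linear map W_true. The real methods learn richer
# (piecewise / tensor-valued) mappings; this is plain ridge regression.
N, D, P = 200, 30, 6
X = rng.normal(size=(N, D))
W_true = rng.normal(size=(D, P))
Y = X @ W_true + 0.01 * rng.normal(size=(N, P))

# Frobenius-norm regularised least squares:
#   W = argmin ||XW - Y||_F^2 + lam ||W||_F^2
lam = 1e-3
W = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)

pose_pred = X[:5] @ W  # predicted pose parameters for five inputs
```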


Human Sensing for Human-Media Interaction

The research aims at the multimodal analysis of user behaviour when interacting with multimedia content. This includes the analysis of both traditional modes of interaction (e.g. mouse and keyboard input) and, mainly, novel means of interaction such as EEG (electroencephalogram) signals, facial expressions and gaze patterns. The research is driven by applications in Multimedia Indexing and Retrieval as well as in Multimodal Human Computer Interaction.

Members: Ioannis Patras, Sander Koelstra, Stefanos Vrochidis (partially in ITI-CERTH)


EEG analysis for implicit tagging

In this work, we aim to analyze neurophysiological user reactions to the presentation of multimedia, for indexing and retrieval. An advantage of using the EEG modality is that it can facilitate implicit tagging, that is, tagging that occurs while the user passively watches multimedia content. We first analyze EEG signals in order to validate tags attached to video content. Subjects are shown a video and a tag, and we aim to determine whether the shown tag is congruent with the presented video by detecting the occurrence of an N400 event-related potential. Tag validation could be used in conjunction with a vision-based recognition system as a feedback mechanism to improve the classification accuracy for multimedia indexing and retrieval. Independent Component Analysis and repeated-measures ANOVA are used for the analysis. Our experimental results show a clear occurrence of the N400 and a significant difference in N400 activation between matching and non-matching tags. The dataset we collected is now available; see here for details.
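The N400 detection logic can be illustrated on fabricated signals: average the trials per condition (the event-related potential) and compare the mean amplitude in a window around 400 ms. This sketch invents the data and omits the ICA preprocessing and ANOVA testing used in the actual study:

```python
import numpy as np

rng = np.random.default_rng(0)

fs = 256                       # sampling rate in Hz (assumed)
t = np.arange(0, 0.8, 1 / fs)  # 0-800 ms post-stimulus
n400 = np.exp(-0.5 * ((t - 0.4) / 0.05) ** 2)  # bump at ~400 ms

def trials(n, with_n400):
    """Simulate n single-trial EEG epochs (unit-variance noise)."""
    noise = rng.normal(0.0, 1.0, (n, t.size))
    # The N400 is a negative deflection, hence the minus sign.
    return noise - 5.0 * n400 if with_n400 else noise

incongruent = trials(40, True)   # non-matching tag: N400 expected
congruent = trials(40, False)    # matching tag: no N400

# Average over trials (the ERP), then compare the mean amplitude
# in the 350-450 ms window between the two conditions.
win = (t >= 0.35) & (t <= 0.45)
amp_incon = incongruent.mean(axis=0)[win].mean()
amp_con = congruent.mean(axis=0)[win].mean()
```

A repeated-measures test on these per-subject window amplitudes is what the study's ANOVA operates on.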

References

  • S. Koelstra, C. Muehl and I. Patras, "EEG analysis for implicit tagging of video data" in Workshop on Affective Brain-Computer Interfaces, Proc. ACII, 2009.

Facial expression recognition

In this work we propose a dynamic-texture-based approach to the recognition of facial Action Units (AUs, atomic facial gestures) and their temporal models (i.e., sequences of the temporal segments neutral, onset, apex, and offset) in near-frontal-view face videos. We introduce a novel approach to modeling the dynamics and the appearance of the face region of an input video, based on non-rigid registration using Free-Form Deformations (FFDs). The extracted motion representation is used to derive motion orientation histogram descriptors in both the spatial and temporal domain, which in turn form the input to a set of AU classifiers. Per AU, a combination of ensemble learners and Hidden Markov Models detects the presence of the AU in question and its temporal segments in an input image sequence. When tested for recognition of all 27 lower and upper face AUs, occurring alone or in combination in 264 sequences from the MMI facial expression database, the proposed method achieved an average event recognition accuracy of 89.2% for a baseline based on Motion History Images (MHI) and of 94.3% for the FFD method. The generalization performance of the FFD method was tested on the Cohn-Kanade database. Finally, we also explored the performance on spontaneous expressions in the Sensitive Artificial Listener dataset.
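A minimal sketch of building a magnitude-weighted motion orientation histogram from a dense motion field; the field here is synthetic, whereas in the method it comes from the FFD registration of consecutive face frames:

```python
import numpy as np

rng = np.random.default_rng(0)

# A dense 2-D motion field over a 32x32 face region (synthetic).
h, w = 32, 32
flow = rng.normal(0.0, 1.0, (h, w, 2))
flow[..., 0] += 2.0  # dominant rightward motion component

# Quantise each motion vector's orientation into B bins,
# weighting each vote by the vector's magnitude.
B = 8
angles = np.arctan2(flow[..., 1], flow[..., 0])   # in [-pi, pi]
mags = np.linalg.norm(flow, axis=-1)
bins = ((angles + np.pi) / (2 * np.pi) * B).astype(int) % B
hist = np.bincount(bins.ravel(), weights=mags.ravel(), minlength=B)
hist /= hist.sum()  # normalised orientation histogram descriptor
```

Concatenating such histograms over a spatio-temporal grid of cells yields a descriptor of the kind fed to the AU classifiers.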

References

  • S. Koelstra, M. Pantic and I. Patras, "A Dynamic Texture based Approach to Recognition of Facial Actions and their Temporal Models", IEEE Trans. Pattern Analysis and Machine Intelligence (accepted).

  • S. Koelstra and M. Pantic, "Non-rigid registration using free-form deformations for recognition of facial actions and their temporal dynamics" in Proc. IEEE Conf. Face and Gesture Recognition, pages 1-8, 2008.


Robust Visual Tracking

This line of work focuses on methods for the robust tracking of objects in image sequences, addressing the issues of (partial) occlusion, changes in the object's appearance (e.g. due to illumination changes) and structure (e.g. deformations), background clutter, and the tracking of multiple interacting targets. We have developed methods for general object tracking, where learning needs to be performed on the fly, as well as methods for domain-specific tracking, such as facial feature tracking, where the appearance, structure and dynamics of the target(s) can be learned offline.

Members: Ioannis Patras


Coupled Regression and Classification for Robust Visual Tracking

Researchers: Ioannis Patras

This work addresses the problem of robust template tracking in image sequences. It falls within the discriminative framework, in which the observations at each frame yield direct probabilistic predictions of the state of the target. Our primary contribution is that we explicitly address the problem that the prediction accuracy for different observations varies and, in some cases, can be very low. To this end, we couple the predictor to a probabilistic classifier which, once trained, can determine the probability that a new observation can accurately predict the state of the target (that is, determine the relevance or reliability of the observation in question). Within the particle filtering framework, we derive a recursive scheme for maintaining an approximation of the posterior probability of the state, in which multiple observations can be used and their predictions moderated by their corresponding relevance. In this way the predictions of the relevant observations are emphasized, while the predictions of the irrelevant observations are suppressed. We apply the algorithm to the problem of 2D template tracking and demonstrate that the proposed scheme outperforms classical methods for discriminative tracking, both for motions of large magnitude and for partial occlusions. See here for details.
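A toy, one-dimensional illustration of the relevance-moderated update: each observation's likelihood is weighted by a relevance score before the particle weights are formed. The relevance values are hand-set here, whereas the method obtains them from a trained probabilistic classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

true_state = 3.0                       # 1-D target position (toy example)
particles = rng.normal(0.0, 2.0, 500)  # prior samples of the state

def likelihood(particles, pred, sigma):
    """Gaussian likelihood of the particles given one observation's
    prediction of the state."""
    return np.exp(-0.5 * ((particles - pred) / sigma) ** 2)

# Two observations: one reliable, one corrupted (e.g. by occlusion).
pred_good, pred_bad = 3.1, -4.0

# Relevance scores: how trustworthy each observation's prediction is.
rel_good, rel_bad = 0.95, 0.05

# Relevance-weighted mixture of the two observation likelihoods.
w = (rel_good * likelihood(particles, pred_good, 0.5)
     + rel_bad * likelihood(particles, pred_bad, 0.5))
w /= w.sum()

estimate = float(particles @ w)  # relevance-moderated posterior mean
```

Because the corrupted observation gets a low relevance, its misleading prediction is suppressed and the estimate stays near the true state.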

Tracking Multiple Interacting Targets

Researchers: Ioannis Patras

This work focuses on methods for tracking multiple targets whose states (e.g. relative positions) are correlated. It is applied to the problem of tracking facial features, in which anatomical constraints are learned from annotated data; the method has been used extensively for tracking facial features in facial expression analysis. It is also applied to the problem of stereo tracking, where stereoscopic constraints are used to robustly track facial features and the iris in a stereoscopic image sequence; the latter is used for gaze tracking with a pair of web cameras.

References

  • I. Patras and M. Pantic, "Particle Filtering with Factorized Likelihoods for Tracking Facial Features", in Proc. IEEE International Conference on Face and Gesture Recognition, Seoul, South Korea, May 2004.

  • E. Pogalin, A. Redert, I. Patras and E.A. Hendriks, "Gaze Tracking by Using Factorized Likelihoods Particle Filtering and Stereo Vision", in Proc. Int'l Symposium on 3D Data Processing, Visualization and Transmission, North Carolina, USA, Jun. 2006.

  • I. Patras and M. Pantic, "Tracking deformable motion", in Proc. IEEE Int'l Conf. on Systems, Man and Cybernetics (SMC '05), Waikoloa, Hawaii, Oct. 2005.


(Semantic) Segmentation

The research aims at the localisation, in images and image sequences, of instances of objects belonging to certain semantic categories. We model object structure and appearance, as well as the context in which objects appear, using probabilistic graphical models and/or classification schemes. We utilize strongly annotated datasets (i.e. data for which the ground-truth segmentation is available), weakly annotated datasets (e.g. where the presence but not the location of an object in an image is given), and datasets from social sites, where ambiguities in the labeling of the training data are typical.


Members: Ioannis Patras, Giuseppe Passino, Spiros Nikolopoulos (partially in ITI-CERTH)


Patch-based semantic labelling of images

This work studies models capable of analysing the structure of an image in terms of relationships among its building parts, or patches. The goal is to discriminate the relevant clues that allow specific low-level patch appearances to be paired with high-level semantic "concepts". In this context, the reasoning can benefit dramatically from the availability of structural data, i.e., information associated with the co-presence and relative location of patches. The main challenge is how to take this information into account while avoiding the complexity explosion associated with the intrinsically high dimensionality of the problem. Graphical models provide a theoretical framework for building a learning paradigm that can efficiently infer relevant clues and use them to derive the class of the objects depicted in a collection of images.
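As a rough illustration of how structural information helps, the sketch below labels a toy grid of patches using a unary term plus a Potts smoothness term, optimised with iterated conditional modes; the actual work uses learned CRF potentials and proper inference, which this does not reproduce:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy grid of patches: unary scores favour class 0 on the left half and
# class 1 on the right; a Potts pairwise term encourages neighbouring
# patches to share a label (a crude stand-in for a learned CRF).
H, W, K = 8, 8, 2
unary = rng.normal(0.0, 0.5, (H, W, K))
unary[:, :4, 0] += 2.0
unary[:, 4:, 1] += 2.0
beta = 1.0  # pairwise smoothness strength

labels = unary.argmax(axis=-1)  # initialise from unaries alone

# Iterated conditional modes: greedily relabel each patch given its
# 4-connected neighbours, repeated for a few sweeps.
for _ in range(5):
    for i in range(H):
        for j in range(W):
            nb = [labels[a, b]
                  for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                  if 0 <= a < H and 0 <= b < W]
            score = unary[i, j] + beta * np.array(
                [sum(n == k for n in nb) for k in range(K)])
            labels[i, j] = int(score.argmax())
```

The pairwise term cleans up isolated unary mistakes, which is the basic benefit that structural data brings to patch labelling.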

References

  • G. Passino, I. Patras, and E. Izquierdo, "Latent Semantics Local Distribution for CRF-based Image Semantic Segmentation", in Proc. British Machine Vision Conference, 2009.

  • G. Passino, I. Patras, and E. Izquierdo, "Context Awareness in Graph-based Image Semantic Segmentation via Visual Word Distributions", in Proc. International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), 2009.

  • G. Passino, I. Patras, and E. Izquierdo, "On the Role of Structure in Part-based Object Detection", in Proc. IEEE International Conference on Image Processing (ICIP), 2008.

  • I. Patras, E. A. Hendriks and R. L. Lagendijk, "Video Segmentation by MAP Labeling of Watershed Segments", IEEE Trans. Pattern Analysis and Machine Intelligence 23(3): 326-332, Mar. 2001.

Learning object models from social media

This work aims at combining the benefits of supervised and unsupervised learning by allowing supervised methods to learn from training samples found in collaborative tagging environments, after some preprocessing. Specifically, drawing from a large pool of weakly annotated samples, our goal is to collect a set of strongly annotated samples suitable for training an object classifier in a supervised manner. We do this by correlating the most populated tag-word with the most populated visual-word in a set of weakly annotated images. Tag-words correspond to clusters of terms that are provided by social users to describe an image and are grouped based on their semantic relatedness. Visual-words correspond to clusters of image regions that are identified by an automatic segmentation algorithm and are grouped based on their visual similarity. The most populated tag-word provides information about the object that the developed classifier is trained to identify, while the most populated visual-word provides the set of strongly annotated samples for training the classifier in a supervised manner. Our method relies on the fact that, due to the common background that most users share, the majority of them tend to contribute similar tags when faced with similar types of visual content. Given this assumption, it is expected that, as the pool of weakly annotated images grows, the most populated tag-word and the most populated visual-word will converge on the same object.
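The pipeline can be caricatured on synthetic data: group tags by a (here hand-made) synonym map standing in for semantic relatedness, pick the most populated tag-word, cluster region descriptors with k-means, and take the most populated visual-word as the source of strongly annotated samples:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Hypothetical weakly annotated pool: user tags for 100 images, plus one
# 2-D region descriptor per image (two visual groups, 85 vs 15 regions).
tags = ["car"] * 60 + ["vehicle"] * 25 + ["tree"] * 15
feats = np.vstack([rng.normal([5, 5], 0.5, (85, 2)),   # car-like regions
                   rng.normal([0, 0], 0.5, (15, 2))])  # other regions

# Tag-words: group semantically related tags (a hand-made synonym map
# stands in for a real relatedness measure) and pick the most populated.
synonyms = {"vehicle": "car"}
tag_counts = Counter(synonyms.get(t, t) for t in tags)
top_tagword = tag_counts.most_common(1)[0][0]

# Visual-words: two-means clustering of the region descriptors.
centers = feats[rng.choice(len(feats), 2, replace=False)]
for _ in range(10):
    dists = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    centers = np.array([feats[assign == k].mean(axis=0)
                        if np.any(assign == k) else centers[k]
                        for k in range(2)])

# The most populated visual-word supplies the candidate strongly
# annotated samples for the class named by the top tag-word.
top_visualword = int(np.bincount(assign).argmax())
strong_samples = feats[assign == top_visualword]
```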


Machine Learning for Multimedia Analysis

This line of work focuses on employing Machine Learning techniques for Multimedia Analysis. The research spans Information Theoretic Learning, Tensor Learning, and Max-margin Non-negative Matrix Factorization, applied to the problems of object, human pose and action recognition. We have developed methods for efficient feature selection and dimensionality reduction, for tensor-based classification and regression, and for the maximization of the margin of a classifier in the feature space.

Information Theoretic Learning

One of the most informative measures for feature extraction is Mutual Information (MI). In terms of Mutual Information, the optimal feature extraction creates new features that jointly have the largest dependency on the target class. However, obtaining an accurate estimate of a high-dimensional MI as well as optimizing with respect to it is not always easy, especially when only small training sets are available. In this work, we proposed an efficient tree-based method for feature extraction in which at each step a new feature is created by selecting and linearly combining two features such that the MI between the new feature and the class is maximized. Both the selection of the features to be combined and the estimation of the coefficients of the linear transform rely on estimating two-dimensional MIs. The estimation of the latter is computationally very efficient and robust. The effectiveness of our method has been evaluated on several real-world data sets.
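The core ingredient, a histogram-based estimate of a two-dimensional MI used to score a linear combination of two features against the class, can be sketched as follows (synthetic data; the tree construction itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_information(x, y, bins=8):
    """Histogram (plug-in) estimate of I(X;Y) in nats for 1-D variables."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0  # avoid log(0) on empty cells
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

# Synthetic data: feature 0 is informative about the class, feature 1 is noise.
n = 2000
cls = rng.integers(0, 2, n).astype(float)
f_informative = cls + 0.3 * rng.normal(size=n)
f_noise = rng.normal(size=n)

# Score linear combinations of the two features by their MI with the
# class, as the tree-based scheme does when it merges a feature pair.
alphas = np.linspace(0, 1, 11)
mi_scores = [mutual_information(a * f_informative + (1 - a) * f_noise, cls)
             for a in alphas]
best_alpha = float(alphas[int(np.argmax(mi_scores))])
```

Only two-dimensional histograms are ever needed, which is what keeps the estimate cheap and robust on small training sets.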

Tensor Learning

This work aims at addressing classification and regression problems within a tensorial framework. We exploit the advantages offered by tensorial representations and propose several tensor learning models. We employ tensors in order to better retain and utilize information about the structure of the high-dimensional space in which the data lie, for example the spatial arrangement of pixel-based features in a 2D image. We formulate our algorithms by expressing the weight parameters as a tensor of multiple modes and employ well-known tensor decompositions. In this way, the weight tensor in the resulting models can allow simultaneous projections to more than one direction along each mode, or can be written as the multiplication of a core tensor with a matrix along each mode. The proposed classification algorithms deal with badly scaled data and are able to achieve compression. We also exploit the information provided by the total or the within-class covariance matrix and whiten the data, thus providing invariance to affine transformations in the feature space. Regarding regression, we approach the problem by employing two empirical risk functions, both formulated using the Frobenius norm for regularization. We also use the group sparsity norm for regularization, favoring in that way a low-rank decomposition of the tensorial weights and achieving automatic selection of the rank during the learning process.
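A toy illustration of why low-rank weight tensors are a natural fit: when the true weight matrix acting on 2D inputs is rank-1, even an unconstrained Frobenius-regularised estimate concentrates almost all of its energy in its top singular component. The tensor models in this work impose such structure directly rather than recovering it after the fact:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D inputs (e.g. small images) with a rank-1 weight matrix:
# y = <W_true, X> + noise, where W_true = u v^T.
n, h, w = 500, 8, 8
u, v = rng.normal(size=h), rng.normal(size=w)
W_true = np.outer(u, v)
X = rng.normal(size=(n, h, w))
Xf = X.reshape(n, -1)
y = Xf @ W_true.ravel() + 0.01 * rng.normal(size=n)

# Unconstrained ridge (Frobenius-regularised) estimate of the weights.
lam = 1e-2
w_hat = np.linalg.solve(Xf.T @ Xf + lam * np.eye(h * w), Xf.T @ y)

# Truncated SVD of the reshaped weight matrix: the energy is almost
# entirely in the leading rank-1 component.
U, s, Vt = np.linalg.svd(w_hat.reshape(h, w))
rank1_energy = s[0] ** 2 / (s ** 2).sum()
```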

Max-margin Non-negative Matrix Factorization

In this work, we propose a maximum-margin framework for classification using Non-negative Matrix Factorization (NMF). In contrast to previous approaches, in which the classification and matrix factorization stages are independent, we incorporate the maximum-margin constraints within the NMF formulation, i.e. we solve for a basis matrix that maximizes the margin of the classifier in the low-dimensional feature space. This results in a non-convex constrained optimization problem with respect to the bases, the projection coefficients and the separating hyperplane, which we propose to solve iteratively: at each iteration we solve a set of convex sub-problems with respect to subsets of the unknown variables. By doing so, we obtain a basis matrix with which we extract features that maximize the margin of the resulting classifier. The performance of the proposed algorithm is evaluated on several publicly available datasets.
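The factorisation sub-problem that the iterative scheme repeatedly solves can be sketched with standard multiplicative NMF updates. This shows only the NMF part, V ≈ WH with all factors non-negative; the margin term and the hyperplane update of the actual method are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Non-negative data matrix V (m features x n samples) and random
# non-negative factors W (basis) and H (projection coefficients).
m, n, r = 20, 50, 5
V = rng.random((m, n))
W = rng.random((m, r))
H = rng.random((r, n))

# Multiplicative updates (Lee-Seung style); the small constant avoids
# division by zero. The reconstruction error is non-increasing.
errors = []
for _ in range(100):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)   # update coefficients
    W *= (V @ H.T) / (W @ H @ H.T + 1e-12)   # update basis
    errors.append(np.linalg.norm(V - W @ H))

# H now holds low-dimensional non-negative features; in the full method
# a max-margin classifier on H's columns is updated in the same loop.
```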
