Research

The goal of my research is to help AI make better sense of human behavior through emotion. Since 2009, I have worked on basic research in facial expression recognition and have explored related topics in emotion recognition, such as multi-modal emotion recognition. Currently, I focus on micro-expression analysis from facial images and on analyzing the interaction between people in a group environment (e.g., how active student collaboration is when students study and discuss together).


1. Facial Expression Recognition

During my Ph.D. studies, I focused primarily on feature extraction for facial expression recognition and attempted to resolve the problems caused by illumination variation (e.g., a poorly lit room), partial occlusion (e.g., wearing sunglasses), and varied view (e.g., freely changing head pose). The main techniques are described in my doctoral thesis; please feel free to read it.

(1) Spatiotemporal Feature Descriptor

Feature representation is an important research topic in facial expression recognition from video sequences. We propose spatiotemporal local monogenic binary patterns (STLMBP) [1] to describe the appearance and motion information of dynamic sequences. Firstly, we use monogenic signal analysis to extract the magnitude and the real and imaginary images of the orientation for each frame, since the magnitude provides rich appearance information and the orientation provides complementary information. Secondly, a phase-quadrant encoding method and a local exclusive-or (XOR) operator are used to encode the real and imaginary orientation images over three orthogonal planes, and the local binary pattern operator is used to capture texture and motion information from the magnitude over the same three orthogonal planes. Finally, both concatenation and multiple kernel learning are exploited for feature fusion. Experimental results on the Extended Cohn-Kanade and Oulu-CASIA facial expression databases demonstrate that the proposed methods outperform state-of-the-art methods and are robust to illumination variations.

Fig. 1.1.1. Procedure of STLMBP
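To make the three-orthogonal-planes idea concrete, below is a minimal Python sketch of the LBP-TOP backbone that STLMBP builds on: basic 8-neighbour LBP codes are computed on the XY, XT, and YT planes of a face video volume and pooled into one histogram. It is only illustrative; the monogenic decomposition, phase-quadrant encoding, XOR operator, and MKL fusion described above are omitted, and all array shapes are assumptions.

```python
import numpy as np

def lbp_8neighbors(plane):
    """Basic 8-neighbour LBP code for every interior pixel of a 2D plane."""
    c = plane[1:-1, 1:-1]
    neighbors = [plane[0:-2, 0:-2], plane[0:-2, 1:-1], plane[0:-2, 2:],
                 plane[1:-1, 2:],   plane[2:, 2:],     plane[2:, 1:-1],
                 plane[2:, 0:-2],   plane[1:-1, 0:-2]]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, n in enumerate(neighbors):
        code |= (n >= c).astype(np.uint8) << bit
    return code

def lbp_top_histogram(volume, bins=256):
    """Concatenate LBP histograms from the XY, XT and YT planes of a T x H x W volume."""
    T, H, W = volume.shape
    hists = []
    for planes in ([volume[t] for t in range(T)],          # XY planes: appearance
                   [volume[:, y, :] for y in range(H)],    # XT planes: horizontal motion
                   [volume[:, :, x] for x in range(W)]):   # YT planes: vertical motion
        codes = np.concatenate([lbp_8neighbors(p).ravel() for p in planes])
        h, _ = np.histogram(codes, bins=bins, range=(0, bins))
        hists.append(h / max(h.sum(), 1))                  # normalise per plane set
    return np.concatenate(hists)                           # 3 * 256 dimensional descriptor

# Toy usage on a random 20-frame 64x64 "face" sequence
video = np.random.rand(20, 64, 64).astype(np.float32)
print(lbp_top_histogram(video).shape)  # (768,)
```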

STLMBP was verified to work promisingly under various illumination conditions. We further proposed an improved spatiotemporal feature descriptor based on STLMBP [2]. The improved descriptor uses not only the magnitude and orientation but also the phase information, which provides complementary cues. STLMBP and the improved STLMBP were evaluated on the Acted Facial Expressions in the Wild (AFEW) database.

Fig. 1.1.2 Procedure of Improved STLMBP

Reference:

[1] Xiaohua Huang, Guoying Zhao, Matti Pietikäinen and Wenming Zheng. Spatiotemporal local monogenic binary patterns for facial expression recognition. IEEE Signal Processing Letters, Vol. 19, No. 3, pp. 243-246, 2012.

[2] Xiaohua Huang, Qiuhai He, Xiaopeng Hong, Guoying Zhao and Matti Pietikäinen. Improved spatiotemporal local monogenic binary pattern for emotion recognition in the wild. Proceedings of 16th ACM International Conference on Multimodal Interaction, pp. 514-520, 2014.


(2) Facial Expression Recognition under Partial Occlusion

Facial occlusion is a challenging research topic in facial expression recognition (FER). It has led us to develop facial representations and occlusion detection methods in order to extend FER to uncontrolled environments. Most previous work has treated these two issues separately and only on static images. We are thus motivated to propose a complete system consisting of facial representations, occlusion detection, and multiple feature fusion for video sequences. To obtain a facial representation that reflects the contributions of facial components to expressions, we derive six feature vectors from the eye, nose, and mouth components; these features with temporal cues are generated by dynamic texture and structural shape descriptors. Occlusion detection, on the other hand, is usually realized by traditional classifiers or model comparison. Recently, sparse representation has been proposed as an efficient tool against occlusion, but in FER it can be correlated with facial identity unless an appropriate facial representation is used. We therefore present an evaluation demonstrating that the proposed facial representation is independent of facial identity, and we then exploit sparse representation and residual statistics for occlusion detection in image sequences. Since concatenating the six feature vectors into one causes the curse of dimensionality, we propose a multiple feature fusion scheme consisting of a fusion module and weight learning. Experimental results on the Extended Cohn-Kanade database and a simulated occlusion database demonstrate that our framework outperforms state-of-the-art methods for FER on normal videos and, especially, on partially occluded videos. The idea and results were published in Pattern Recognition Letters.

Fig. 1.2 The proposed method of dynamic expression recognition against facial occlusion. (a) The procedure of the component-based facial expression representation. (b) An example of occlusion detection in eyes region.
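As a rough illustration of the sparse-representation step, the sketch below reconstructs a component feature (e.g., an eye-region histogram) from a dictionary of unoccluded training features and flags the region as occluded when the reconstruction residual deviates from the training residual statistics. The dictionary size, feature dimension, and the Lasso solver are assumptions for illustration, not the exact formulation of the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def occlusion_score(x, D, alpha=0.01):
    """Reconstruction residual of feature x over a dictionary D (columns = unoccluded training features)."""
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    lasso.fit(D, x)
    return np.linalg.norm(x - D @ lasso.coef_)

def detect_occlusion(x, D, train_residuals, k=3.0):
    """Flag x as occluded if its residual deviates strongly from the training residual statistics."""
    r = occlusion_score(x, D)
    mu, sigma = train_residuals.mean(), train_residuals.std()
    return r > mu + k * sigma, r

# Toy usage with hypothetical dimensions
rng = np.random.default_rng(0)
D = rng.normal(size=(59, 200))   # e.g. 59-d histograms of 200 unoccluded eye regions
train_residuals = np.array([occlusion_score(D[:, i], np.delete(D, i, axis=1)) for i in range(20)])
x_test = rng.normal(size=59)
occluded, r = detect_occlusion(x_test, D, train_residuals)
print(occluded, round(float(r), 3))
```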

Reference:

Xiaohua Huang, Guoying Zhao, Wenming Zheng and Matti Pietikäinen. Towards a dynamic expression recognition system under facial occlusion. Pattern Recognition Letters, Vol. 33, No. 16, pp. 2181-2191, 2012.


(3) Multi-view Facial Expression Recognition

Facial expression recognition (FER) is widely used to analyze the emotional state of human beings. In practice, near-frontal facial images may not be available, so a desirable FER system should allow the user to take any head pose. Several methods for non-frontal-view facial images have recently been proposed that recognize facial expressions by building a discriminative subspace for each specific view. These approaches ignore (1) the discrimination among inter-class samples sharing the same view label and (2) the closeness of intra-class samples across all view labels. We proposed a new method that recognizes arbitrary-view facial expressions by combining discriminative neighborhood preserving embedding with multi-view learning. It first captures the discriminative structure of inter-class samples and then exploits the closeness of intra-class samples with arbitrary views in a low-dimensional subspace. Experimental results on the BU-3DFE and Multi-PIE databases show that our approach achieves promising results for recognizing facial expressions with arbitrary views.

Fig. 1.3 Illustration of multi-view discriminative neighborhood preserving embedding for arbitrary-view FER
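The sketch below conveys the overall idea with a deliberately simplified stand-in: features from all views are projected into a single discriminative low-dimensional subspace (here plain LDA on expression labels, not the proposed multi-view discriminative neighborhood preserving embedding), and an arbitrary-view test sample is classified in that subspace. All data shapes and labels are synthetic assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical pooled training set: features from several head poses share expression labels
rng = np.random.default_rng(1)
X_train = rng.normal(size=(600, 200))    # e.g. texture features from 5 views x 120 samples
y_expr  = rng.integers(0, 6, size=600)   # 6 expression classes; view labels are ignored here

# Project all views into one low-dimensional discriminative subspace ...
lda = LinearDiscriminantAnalysis(n_components=5).fit(X_train, y_expr)
Z_train = lda.transform(X_train)

# ... and classify an arbitrary-view test sample in that subspace
clf = KNeighborsClassifier(n_neighbors=5).fit(Z_train, y_expr)
x_test = rng.normal(size=(1, 200))
print(clf.predict(lda.transform(x_test)))
```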


2. Texture Classification

Recently, local quantized patterns (LQP) were proposed; they use vector quantization to code complicated patterns with a large number of neighbors and several quantization levels, and a lookup table to map patterns to the corresponding indices. Since LQP considers only the sign-based difference, it misses some discriminative information. We proposed completed local quantized patterns (CLQP) for texture classification [3]. Firstly, we use the magnitude-based and orientation-based differences to complement the sign-based difference of LQP. In addition, we use vector quantization to learn three separate codebooks for the local sign, magnitude, and orientation patterns. To reduce the unnecessary computational cost of initialization, we use preselected dominant patterns to initialize the vector quantization. Our experimental results show that CLQP outperforms well-established features including LBP, LTP, CLBP, and LQP on a range of challenging texture classification problems and an infant pain detection problem.

Fig. 2. Overview of completed local quantized patterns
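A hedged sketch of the vector-quantization idea behind CLQP is given below: local sign and magnitude difference vectors are extracted around each pixel and quantized with separate k-means codebooks, whose code indices are pooled into histograms. The real CLQP additionally uses orientation differences, dominant-pattern initialization, and larger neighbourhoods; the k-means initialization and the toy sizes here are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def local_difference_vectors(img):
    """Sign and magnitude difference vectors of the 8 neighbours around each interior pixel."""
    c = img[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    diffs = np.stack([img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx] - c
                      for dy, dx in offsets], axis=-1)
    return (diffs >= 0).reshape(-1, 8).astype(float), np.abs(diffs).reshape(-1, 8)

def learn_codebook(patterns, n_codes=32, seed=0):
    """Vector-quantise local patterns into a compact codebook (k-means stand-in)."""
    return KMeans(n_clusters=n_codes, n_init=4, random_state=seed).fit(patterns)

def quantized_histogram(patterns, codebook):
    idx = codebook.predict(patterns)
    h = np.bincount(idx, minlength=codebook.n_clusters).astype(float)
    return h / h.sum()

# Toy usage on a random texture patch
img = np.random.rand(64, 64)
sign, magn = local_difference_vectors(img)
cb_sign, cb_magn = learn_codebook(sign), learn_codebook(magn)
feat = np.concatenate([quantized_histogram(sign, cb_sign), quantized_histogram(magn, cb_magn)])
print(feat.shape)  # (64,)
```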

3. Micro-expression Recognition

Micro-expressions are very rapid, involuntary facial expressions that reveal suppressed affect. They expose contradictions between the shown facial expression and the underlying emotional state, enabling the recognition of suppressed emotions, and are therefore important for understanding deceitful behavior. Psychologists have been studying them since the 1960s.

Attention to the topic is currently growing both in academia and in the media. However, while general facial expression recognition (FER) has been intensively studied in computer vision for years, little research has been done on automatically analyzing micro-expressions. The biggest obstacle to date has been the lack of a suitable database. In [1] we present a novel spontaneous micro-expression database, SMIC, which includes 164 micro-expression video clips elicited from 16 participants. Micro-expression detection and recognition performance are provided as baselines. SMIC provides sufficient source material for comprehensive testing of automatic micro-expression analysis systems, which has not been possible with any previously published database.

We have recently proposed several spatiotemporal feature descriptors for micro-expression recognition.

3.1 Spatiotemporal Completed Local Quantized Pattern [2]

Spontaneous facial micro-expression analysis has become an active research topic for recognizing suppressed and involuntary facial expressions. Recently, Local Binary Patterns from Three Orthogonal Planes (LBP-TOP) have been employed for micro-expression analysis. However, LBP-TOP suffers from two critical problems that degrade performance: it extracts appearance and motion features only from the sign-based difference between two pixels, ignoring other useful information, and it uses classical pattern types that may not be optimal for the local structure in some applications. We propose SpatioTemporal Completed Local Quantization Patterns (STCLQP) for facial micro-expression analysis. Firstly, STCLQP extracts three kinds of information: sign, magnitude, and orientation components. Secondly, an efficient vector quantization and codebook selection are developed for each component in the appearance and temporal domains to learn compact and discriminative codebooks that generalize the classical pattern types. Finally, spatiotemporal features of the sign, magnitude, and orientation components are extracted from the discriminative codebooks and fused. Experiments are conducted on three publicly available facial micro-expression databases, yielding some interesting findings about the neighboring patterns and the component analysis. Compared with the state of the art, the experimental results demonstrate that STCLQP achieves a substantial improvement in analyzing facial micro-expressions.

3.2 Spatiotemporal Local Binary Pattern with Integral Projection (STLBP-IP)

Recently, there has been increasing interest in inferring micro-expressions from facial image sequences. For micro-expression recognition, feature extraction is a critical issue. We propose a novel framework based on a new spatiotemporal facial representation to analyze micro-expressions with subtle facial movements. Firstly, we propose an integral projection method based on difference images to obtain horizontal and vertical projections, which preserves the shape attributes of facial images and increases the discrimination between micro-expressions. Furthermore, we employ local binary pattern operators to extract appearance and motion features from the horizontal and vertical projections. Intensive experiments are conducted on three publicly available micro-expression databases to evaluate the performance of the method. Experimental results demonstrate that the new spatiotemporal descriptor achieves promising performance in micro-expression recognition.

Appearance feature descriptor

Motion feature descriptor
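The projection step can be sketched as follows, under simple assumptions: difference images of consecutive frames are projected horizontally and vertically, and a basic 1D LBP is applied to the projections. This is a rough approximation of STLBP-IP, not the published implementation.

```python
import numpy as np

def difference_integral_projections(frames):
    """Horizontal and vertical integral projections of frame-difference images.

    frames: T x H x W grayscale sequence. Differencing consecutive frames removes
    the static face appearance, so the projections mostly carry the subtle motion.
    """
    diffs = np.abs(np.diff(frames, axis=0))   # (T-1) x H x W
    horiz = diffs.sum(axis=2)                 # horizontal projection H(y): sum over columns, one value per row
    vert = diffs.sum(axis=1)                  # vertical projection V(x): sum over rows, one value per column
    return horiz, vert

def lbp_1d(signal, radius=1):
    """Simple 1D LBP: compare each sample with its left and right neighbours."""
    c = signal[radius:-radius]
    left, right = signal[:-2 * radius], signal[2 * radius:]
    return ((left >= c).astype(int) << 1) | (right >= c).astype(int)

def stlbp_ip_histogram(frames):
    horiz, vert = difference_integral_projections(frames)
    codes = np.concatenate([lbp_1d(row) for row in horiz] + [lbp_1d(col) for col in vert])
    h = np.bincount(codes, minlength=4).astype(float)
    return h / h.sum()

# Toy usage on a random 30-frame sequence
video = np.random.rand(30, 64, 64)
print(stlbp_ip_histogram(video))
```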

3.3 Discriminative Spatiotemporal Local Binary Pattern with Improved Integral Projection [4, 6]

Recently, there has been increasing interest in inferring micro-expressions from facial image sequences. Because the facial movements of micro-expressions are so subtle, feature extraction is a critical issue for spontaneous facial micro-expression recognition. Recent work has commonly used spatiotemporal local binary patterns for micro-expression analysis. However, the commonly used spatiotemporal local binary pattern represents face images by dynamic texture information and misses their shape attributes. Moreover, these works extract spatiotemporal features from the global face region, ignoring the discriminative information between micro-expression classes. These problems seriously limit the application of spatiotemporal local binary patterns to micro-expression recognition. We propose a discriminative spatiotemporal local binary pattern based on an improved integral projection to resolve these problems. Firstly, we develop an improved integral projection that preserves the shape attributes of micro-expressions. Furthermore, the improved integral projection is combined with local binary pattern operators across the spatial and temporal domains; specifically, we extract novel spatiotemporal features that incorporate shape attributes into spatiotemporal texture features. To increase the discrimination between micro-expressions, we propose a new feature selection based on the Laplacian method to extract the discriminative information for facial micro-expression recognition. Intensive experiments are conducted on three publicly available micro-expression databases: CASME, CASME2, and SMIC. We compare our method with state-of-the-art algorithms, and the experimental results demonstrate that our proposed method achieves promising performance for micro-expression recognition.
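For the feature selection step, the sketch below uses the standard unsupervised Laplacian score as a stand-in: features that best preserve the local neighbourhood structure of the data get the smallest scores and are kept. The paper's selection is a discriminative variant, so the graph construction, kernel width, and threshold here are assumptions.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_scores(X, n_neighbors=5, t=1.0):
    """Classic Laplacian score for each column (feature) of X; lower = more locality-preserving."""
    # kNN affinity graph with heat-kernel weights
    dist = kneighbors_graph(X, n_neighbors, mode="distance", include_self=False).toarray()
    W = np.where(dist > 0, np.exp(-dist ** 2 / t), 0.0)
    W = np.maximum(W, W.T)                    # symmetrise the graph
    D = np.diag(W.sum(axis=1))
    L = D - W
    ones = np.ones(X.shape[0])
    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        f = X[:, r]
        f_tilde = f - (f @ D @ ones) / (ones @ D @ ones) * ones
        denom = f_tilde @ D @ f_tilde
        scores[r] = (f_tilde @ L @ f_tilde) / denom if denom > 1e-12 else np.inf
    return scores

# Toy usage: keep the 100 features with the smallest scores
X = np.random.rand(150, 500)   # e.g. 150 micro-expression clips, 500-d spatiotemporal features
selected = np.argsort(laplacian_scores(X))[:100]
print(X[:, selected].shape)    # (150, 100)
```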


3.4 Motion magnification for micro-expression recognition [5, 7]

We investigated methods for spotting micro-expressions in video data, to be used, e.g., as a preprocessing step prior to recognition. In [5, 7], we magnified micro-expression videos using motion magnification.
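A very simplified linear magnification sketch is shown below: spatially smoothed intensities are temporally band-passed with an FFT, amplified, and added back to the frames. The actual pipeline in [5, 7] builds on Eulerian video magnification with a proper spatial decomposition; the band limits, amplification factor, and smoothing used here are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def magnify_motion(frames, fps, f_lo=0.4, f_hi=4.0, alpha=10.0, sigma=5.0):
    """Very simplified linear Eulerian-style magnification of subtle motion.

    frames: T x H x W float array in [0, 1]. Spatially smoothed intensities are
    temporally band-passed and the band-passed signal is amplified and added back.
    """
    smoothed = np.stack([gaussian_filter(f, sigma) for f in frames])
    spectrum = np.fft.rfft(smoothed, axis=0)
    freqs = np.fft.rfftfreq(frames.shape[0], d=1.0 / fps)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    spectrum[~band] = 0                                # keep only the micro-motion frequency band
    motion = np.fft.irfft(spectrum, n=frames.shape[0], axis=0)
    return np.clip(frames + alpha * motion, 0.0, 1.0)

# Toy usage on a random 2-second clip at 25 fps
clip = np.random.rand(50, 64, 64)
print(magnify_motion(clip, fps=25).shape)  # (50, 64, 64)
```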

The work has been highlighted in international media such as MIT Technology Review and Daily Mail.

Magnified micro-expression demo: original video and magnified video (SMIC database, class: surprise)

Reference:

[1] Xiaobai Li, Tomas Pfister, Xiaohua Huang, Guoying Zhao, and Matti Pietikäinen. A spontaneous micro-expression database: inducement, collection and baseline. IEEE International Conference on Automatic Face and Gesture Recognition, pp. 1-6, 2013.

[2] Xiaohua Huang, Guoying Zhao, Xiaopeng Hong, Wenming Zheng and Matti Pietikäinen. Spontaneous facial micro-expression analysis using spatiotemporal completed local quantized patterns, Neurocomputing, 2016.

[3] Xiaohua Huang, Guoying Zhao, Xiaopeng Hong, Wenming Zheng and Matti Pietikäinen. Texture description with completed local quantized patterns. The 18th Scandinavian Conference on Image Analysis, pp. 1-10, 2013.

[4] Xiaohua Huang, Sujing Wang, Xin Liu, Guoying Zhao, Xiaoyi Feng and Matti Pietikäinen. Spontaneous Facial Micro-Expression Recognition using Discriminative Spatiotemporal Local Binary Pattern with an Improved Integral Projection. arXiv, 2016.

[5] Xiaobai Li, Xiaopeng Hong, Antti Moilanen, Xiaohua Huang, Tomas Pfister, Guoying Zhao and Matti Pietikäinen. Reading hidden emotions: spontaneous micro-expression spotting and recognition. arXiv:1511.00423v1, November 2015.

[6] Xiaohua Huang, Sujing Wang, Xin Liu, Guoying Zhao, Xiaoyi Feng and Matti Pietikäinen. Discriminative Spatiotemporal Local Binary Pattern with revisited Integral Projection for Spontaneous Facial Micro-Expression Recognition. IEEE Transactions on Affective Computing, 2017.

[7] Xiaobai Li, Xiaopeng Hong, Antti Moilanen, Xiaohua Huang, Tomas Pfister, Guoying Zhao, Matti Pietikäinen. Towards reading hidden emotions: spontaneous micro-expression spotting and recognition. IEEE Transactions on Affective Computing, 2017.


4. Multi-modal Emotion Recognition

The ongoing mega-trend of ubiquitous computing has made research on human-centred computing systems a major topic of interest in both academia and industry. This means that future technological solutions for human-computer interaction (HCI) must have better capabilities for understanding human behaviour than is currently the case. This need has created a new field of research, affective computing, which aims to improve HCI by incorporating the interpretation of human emotions into the functionality of machines. By adapting to the user's emotional state, more appropriate responses can be given.

In affective computing, one of the main goals is the automatic recognition of human emotions. Emotions are fundamental for humans, impacting everyday activities such as perception, communication, learning, and decision-making. In addition to speech, they are expressed through gestures, facial expressions, and other non-verbal cues. Many physiological signals also carry information about the emotional state of humans. Various sensor signals can therefore be used when developing automatic emotion recognition capabilities for computers.

The main objective of the project is to develop and combine new methods in machine vision and biosignal processing to detect and classify the emotional state of humans. The proposed multimodal approach [1, 2] to automatic emotion recognition is unique in Finland and very rare globally. The project aims to create technology applicable to measuring emotional responses to advertisements and to the medical diagnosis of affective disorders.

The general goals of the project are:

  • To develop a firm technological and scientific basis for future applications of affective computing

  • To develop a unique approach to multi-modal affective computing that combines visual analysis of affective communicative behaviour of humans and physiological signal processing

  • To demonstrate the developed technology with realistic data

The project is a joint effort between the Center for Machine Vision Research and the Biosignal Processing Team at the University of Oulu, and it is funded by Tekes – the Finnish Funding Agency for Technology and Innovation.
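As a toy illustration of combining the two modalities, the sketch below trains one classifier on facial-expression features and one on EEG features and fuses their posteriors at decision level. The feature dimensions, labels, and fusion weight are assumptions; the fusion strategies actually studied in [1, 2] differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: facial-expression and EEG features for the same trials
rng = np.random.default_rng(4)
X_face = rng.normal(size=(200, 150))     # e.g. spatiotemporal texture features from face video
X_eeg  = rng.normal(size=(200, 40))      # e.g. band-power features from EEG channels
y      = rng.integers(0, 2, size=200)    # binary valence label

face_clf = LogisticRegression(max_iter=1000).fit(X_face, y)
eeg_clf  = LogisticRegression(max_iter=1000).fit(X_eeg, y)

def fuse_decision(face_x, eeg_x, w_face=0.6):
    """Decision-level fusion: weighted average of the two modality posteriors."""
    p_face = face_clf.predict_proba(face_x.reshape(1, -1))[0]
    p_eeg  = eeg_clf.predict_proba(eeg_x.reshape(1, -1))[0]
    p = w_face * p_face + (1 - w_face) * p_eeg
    return int(np.argmax(p)), p

label, prob = fuse_decision(rng.normal(size=150), rng.normal(size=40))
print(label, prob.round(3))
```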

Reference:

[1] Kortelainen J, Huang X, Li X, Laukka S, Pietikäinen M & Seppänen T (2012) Multimodal emotion recognition by combining physiological signals and facial expressions: a preliminary study. Proc. the 34th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'12), San Diego, CA, 5238-5241

[2] Huang X, Kortelainen J, Zhao G, Li X, Moilanen A, Seppänen T and Pietikäinen M. Multi-modal emotion analysis from facial expressions and electroencephalogram. Computer Vision and Image Understanding, 2016.


5. Group-level Emotion Recognition

Automatic emotion analysis and understanding has received much attention over the years in affective computing. Recently, there has been increasing interest in inferring the emotional intensity of a group of people. At the same time, millions of images and videos have been uploaded to the Internet (e.g., to YouTube and Flickr), enabling us to explore images from social events such as family parties. However, until recently, relatively little research had examined group emotion in an image. To advance affective computing research, it is therefore of interest to understand and model the affect exhibited by a group of people in images.

For group emotional intensity analysis, feature extraction and the group expression model are two critical issues. We propose a new method [1] to estimate the happiness intensity of a group of people in an image. Firstly, we combine the Riesz transform and the local binary pattern descriptor into the Riesz-based volume local binary pattern (RVLBP), which considers neighbouring changes not only in the spatial domain of a face but also across the different Riesz faces. RVLBP is shown in Figure 5.1. Secondly, we exploit continuous conditional random fields to construct a new group expression model that considers global and local attributes. Finally, we use this model with RVLBP features to estimate group happiness intensity. Results obtained on the HAPPEI database [2] with the proposed method are presented in Figure 5.2.

Figure 5.1 Framework of RVLBP based on Riesz transform and LBP descriptor

Figure 5.2 Two examples on HAPPEI for happiness intensity estimation
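A drastically simplified stand-in for the group expression model is sketched below: a regressor predicts happiness intensity per face, and the group-level intensity is a weighted average, with face size acting as a crude global attribute. The continuous conditional random field and the RVLBP features of [1] are not reproduced; all sizes and features are synthetic assumptions.

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical per-face features with happiness intensity labels in [0, 5]
rng = np.random.default_rng(2)
X_faces_train = rng.normal(size=(400, 300))
y_intensity   = rng.uniform(0, 5, size=400)
face_regressor = SVR(C=1.0).fit(X_faces_train, y_intensity)

def group_happiness(face_features, face_sizes):
    """Weighted average of per-face intensity predictions; larger (closer) faces weigh more."""
    pred = face_regressor.predict(face_features)
    w = np.asarray(face_sizes, dtype=float)
    w = w / w.sum()
    return float(np.dot(w, pred))

# A test image with 4 detected faces of different pixel areas
faces = rng.normal(size=(4, 300))
print(round(group_happiness(faces, face_sizes=[900, 1600, 400, 2500]), 2))
```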

Additionally, we proposed a multi-modal framework to perceive the affect of a group of people [3]. In challenging environments, faces alone may not provide enough information for group-level emotion recognition (GER), and relatively few studies have investigated multi-modal GER. We therefore propose a novel multi-modal approach based on a new feature description for understanding the emotional state of a group of people in an image. Firstly, we exploit three kinds of rich information in a group-level image: face, upper body, and scene. Furthermore, in order to integrate the information of multiple persons in a group-level image, we propose an information aggregation method that generates one feature each for face, upper body, and scene. We then fuse the face, upper-body, and scene information to make GER robust against challenging environments. Intensive experiments are performed on two challenging group-level emotion databases to investigate the roles of face, upper body, and scene, as well as of the multi-modal framework. Experimental results demonstrate that our framework achieves very promising performance for GER.
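The aggregation idea can be sketched as follows, under the assumption of simple mean pooling: per-person face and upper-body features are pooled into one vector each, concatenated with the scene feature, and fed to a classifier. The actual aggregation and fusion in [3] are more elaborate, and all feature dimensions here are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def aggregate_image(face_feats, upperbody_feats, scene_feat):
    """Aggregate per-person cues into one image-level descriptor (mean pooling stand-in)."""
    face = np.mean(face_feats, axis=0)          # one vector per image from all detected faces
    body = np.mean(upperbody_feats, axis=0)     # same for upper bodies
    return np.concatenate([face, body, scene_feat])

# Hypothetical training data: each image contains a variable number of people
rng = np.random.default_rng(3)
def random_image():
    n = rng.integers(2, 6)
    return aggregate_image(rng.normal(size=(n, 128)),   # 128-d face features
                           rng.normal(size=(n, 64)),    # 64-d upper-body features
                           rng.normal(size=256))        # 256-d scene feature

X = np.stack([random_image() for _ in range(300)])
y = rng.integers(0, 3, size=300)                        # e.g. negative / neutral / positive group affect
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(X[:5]))
```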

Reference:

[1] Huang, X, Dhall A, Zhao G, Goecke R, Pietikäinen, M. Riesz-based Volume Local Binary Pattern and A Novel Group Expression Model for Group Happiness Intensity Analysis. BMVC 2015.

[2] Dhall, A, Goecke R, Gedeon, T. Automatic Group Happiness Intensity Analysis. IEEE Transaction on Affective Computing, 2015.

[3] Huang, X, Dhall,Goecke R, Pietikäinen, M, Zhao, G. Multi-modal framework for analyzing the affect of a group of people. IEEE Transactions on Multimedia, 2018.