During my time in the Intelligent Behaviour Understanding Group (iBUG) I extended the sequential methods (Conditional Random Fields) I had started working on at LAAS and applied them to human sentiment analysis based on facial expressions. More specifically, I concentrated on inter-personal (dis)agreement intensity analysis based on visual and audio inputs.
Moreover, I gained experience as a teaching assistant for the Machine Learning course (C395).
My work was funded by the SEWA project, whose goal is to exploit state-of-the-art machine learning approaches for the analysis of facial, vocal and verbal behaviour, which can be combined and applied to realise more natural human-computer interaction.
My first publication [Rakicevic et al. 2015] proposes a novel approach to the automatic estimation of the (dis)agreement intensity levels a person expresses during a dyadic conversation.
Neural CORF model for the estimation of agreement intensity levels. The inputs to the model are the facial points extracted from the i-th input image sequence. These features are projected via f(·) onto an ordinal line separating the different levels, and their temporal dynamics are modelled at the top level using a first-order transition model.
The data used for this purpose was obtained from the MAHNOB-Mimicry database [Sun et al. 2011]. From this database, I used 34 video-recorded sessions, each 15 minutes long, of 38 different participants discussing various topics (money, television, books, smoking, etc.). The recording setup is shown in the upper figure on the right. The videos have a frame rate of 58 frames per second. For the first model, only 5 subjects were used. The input features are the positions of the tracked facial points (49 points) obtained with the face tracker from [Asthana et al. 2014].
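As a concrete illustration of the feature representation, the sketch below (Python, with assumed shapes and function names; the tracker's actual output format may differ) flattens the 49 tracked points of each frame into a per-frame feature vector and stacks them into a sequence:

```python
import numpy as np

# Illustrative only: 'points' stands in for the 49 (x, y) facial landmarks
# returned by the tracker for one frame; shapes and names are assumptions.
def frame_feature(points):
    """Flatten 49 tracked facial points into a 98-dimensional feature vector."""
    points = np.asarray(points, dtype=float)   # shape (49, 2)
    return points.reshape(-1)                  # shape (98,)

def sequence_features(frames):
    """Stack per-frame features for one video segment into a (T, 98) array."""
    return np.vstack([frame_feature(p) for p in frames])

# Example with dummy tracker output for a 10-frame segment.
dummy_frames = [np.random.rand(49, 2) for _ in range(10)]
print(sequence_features(dummy_frames).shape)   # (10, 98)
```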
The (dis)agreement intensity labels were defined according to the Likert ordinal scale [Likert 1932].
The annotation was performed per frame by an expert annotator. The final distribution of the intensity levels is shown on the left.
Examples of facial expressions with tracked facial points, linked to the corresponding annotated frames in a sample sequence of annotations vs. predictions. The predictions shown were obtained with the NCORF model using 10 hidden nodes.
Performance comparison of different models applied to (dis)agreement intensity level estimation.
After appropriate feature pre-processing (alignment and normalisation), down-sampling and segmentation, 329 sequences were obtained, on average 80 frames long. The compared models were evaluated using 5-fold subject-independent cross-validation.
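A subject-independent split of this kind can be illustrated with scikit-learn's GroupKFold; the arrays, labels and subject ids below are dummy stand-ins for the pre-processed sequences, not the actual MAHNOB-Mimicry data:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Dummy stand-ins for the 329 pre-processed sequences (~80 frames, 98 features)
# and the subject id of the person in each sequence.
n_sequences = 329
X = [np.random.randn(80, 98) for _ in range(n_sequences)]
y = [np.random.randint(0, 5, size=80) for _ in range(n_sequences)]
subjects = np.arange(n_sequences) % 38          # assumed subject assignment

cv = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(cv.split(X, groups=subjects)):
    # Subject-independent: no subject appears in both splits of a fold.
    assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test sequences")
```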
The model we proposed, the Neural Conditional Ordinal Random Field (NCORF), performs non-linear feature extraction from face images using a Neural Network (NN), while also modelling the temporal and ordinal relationships between agreement levels by means of a Conditional Ordinal Random Field (CORF) [Kim and Pavlovic 2010]. The model structure is presented in the figure below.
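Schematically, and assuming the standard probit-based linear-chain CORF formulation (the paper's exact parameterisation may differ), the model can be written as:

```latex
% Assumed schematic form of the NCORF sequence model (standard linear-chain CORF
% with a probit-style ordinal node potential); notation is illustrative.
\begin{align}
P(\mathbf{h}\mid\mathbf{x})
  &= \frac{1}{Z(\mathbf{x})}\,
     \exp\!\Big(\sum_{t=1}^{T}\Psi_{n}(h_t,\mathbf{x}_t)
               + \sum_{t=1}^{T-1}\Psi_{e}(h_t,h_{t+1})\Big), \\
\Psi_{n}(h_t{=}c,\mathbf{x}_t)
  &= \log\Big[\Phi\Big(\tfrac{b_c - f(\mathbf{x}_t)}{\sigma}\Big)
            - \Phi\Big(\tfrac{b_{c-1} - f(\mathbf{x}_t)}{\sigma}\Big)\Big],
  \qquad
  \Psi_{e}(h_t{=}c,\,h_{t+1}{=}c') = u_{c,c'},
\end{align}
```

where f(x_t) is the NN projection of the frame features onto the ordinal line, b_0 < b_1 < ... < b_L are the level thresholds, Φ is the standard normal CDF, and u_{c,c'} are the first-order transition weights.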
The model's performance was compared against static (NN and Support Vector Machine) and sequential (CRF) baselines, as well as state-of-the-art models (CORF and Kernel CORF). Moreover, we cross-validated the architecture of the NN part, and the best results were obtained with 10 hidden nodes. Our experiments show that the proposed approach outperforms existing methods for modelling sequential data. The measures used were the F1 score, Mean Absolute Error (MAE) and Intra-Class Correlation Coefficient (ICC).
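As an illustration, these measures could be computed per frame as below; ICC(3,1) is one common variant in behaviour analysis and the paper may use a different ICC formulation, and the labels here are dummies:

```python
import numpy as np
from sklearn.metrics import f1_score, mean_absolute_error

def icc_3_1(y_true, y_pred):
    """ICC(3,1), treating ground truth and predictions as two 'raters'.

    One common ICC variant; treat this as an illustrative sketch.
    """
    Y = np.column_stack([y_true, y_pred]).astype(float)   # n targets x k=2 raters
    n, k = Y.shape
    grand = Y.mean()
    ss_targets = k * ((Y.mean(axis=1) - grand) ** 2).sum()
    ss_raters = n * ((Y.mean(axis=0) - grand) ** 2).sum()
    ss_error = ((Y - grand) ** 2).sum() - ss_targets - ss_raters
    bms = ss_targets / (n - 1)                 # between-target mean square
    ems = ss_error / ((n - 1) * (k - 1))       # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems)

# Dummy per-frame labels and predictions on a 5-level ordinal scale.
y_true = np.random.randint(0, 5, size=1000)
y_pred = np.clip(y_true + np.random.randint(-1, 2, size=1000), 0, 4)
print("F1 :", f1_score(y_true, y_pred, average="macro"))
print("MAE:", mean_absolute_error(y_true, y_pred))
print("ICC:", icc_3_1(y_true, y_pred))
```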
The multimodal extension employs facial features, as before, and also audio features extracted from the active speech segments of the person in focus. The 130 audio features used are the 65 described in [Schuller et al. 2013] and their derivatives. The extraction was done using the openSMILE toolkit [Eyben et al. 2013]. The second contribution is a two-phase, decoupled joint optimisation of the NN and CORF parameters. This decoupled approach is proposed to avoid the slow learning that occurs when all the parameters are optimised together. Moreover, in this work, all 38 participants were included.
A diagram of the proposed model is presented in the figure above.
A sample sequence depicting tracked facial points in the corresponding images, together with the audio signals. The inputs are passed through non-linear NN feature extractors, which also perform fusion of the two modalities. The output of the NN is then passed through ordinal functions f(·) that map it onto an ordinal line, classifying the target signal into the different agreement intensity levels h_t = 1, ..., L. Temporal dependencies between consecutive target levels (h_t and h_{t+1}) are also modelled to smooth out the predicted intensity.
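The two-phase decoupled optimisation described above can be illustrated with a toy alternating scheme. This is not the original implementation: the CORF stage is replaced by a plain linear head, and all names, shapes and the schedule are assumptions made purely to show the idea of updating the two parameter groups in separate phases.

```python
import torch
import torch.nn as nn

# Assumed dimensions: 98 visual + 130 audio features, 10 hidden nodes, 5 levels.
visual_dim, audio_dim, hidden, n_levels = 98, 130, 10, 5

feature_net = nn.Sequential(                    # non-linear NN extractor / fusion
    nn.Linear(visual_dim + audio_dim, hidden),
    nn.Tanh(),
)
structured_head = nn.Linear(hidden, n_levels)   # stand-in for the CORF parameters

loss_fn = nn.CrossEntropyLoss()
opt_nn = torch.optim.Adam(feature_net.parameters(), lr=1e-3)
opt_head = torch.optim.Adam(structured_head.parameters(), lr=1e-3)

x = torch.randn(256, visual_dim + audio_dim)    # dummy fused frame features
y = torch.randint(0, n_levels, (256,))          # dummy intensity levels

for phase in range(10):
    # Phase A: update the NN feature extractor with the head fixed.
    for _ in range(5):
        opt_nn.zero_grad()
        loss_fn(structured_head(feature_net(x)), y).backward()
        opt_nn.step()
    # Phase B: update the head (the CORF parameters in the real model) with the NN fixed.
    for _ in range(5):
        opt_head.zero_grad()
        loss_fn(structured_head(feature_net(x)), y).backward()
        opt_head.step()
```

In the actual model the second phase would update the CORF thresholds and transition weights rather than a softmax head.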
For more information please refer to the original paper.
The agreement level distribution with corresponding example images and tracked facial points.
Some of the results compared to state-of-the-art methods.
Performance comparison of different models using multimodal data, applied to (dis)agreement intensity level estimation.
Performance comparison of different methods using unimodal data, applied to (dis)agreement intensity level estimation.