Research Overview

Mobile Emotion Recognition via Multiple Physiological Signals using Convolution-augmented Transformer

Recognising and monitoring emotional states play a crucial role in mental health and well-being management. Importantly, with the widespread adoption of smart mobile and wearable devices, it has become easier to collect long-term, granular, emotion-related physiological data passively, continuously, and remotely. This creates new opportunities to help individuals manage their emotions and well-being in a less intrusive manner using off-the-shelf, low-cost devices. Pervasive emotion recognition based on physiological signals is, however, still challenging due to the difficulty of efficiently extracting high-order correlations between physiological signals and users' emotional states. In this paper, we propose a novel end-to-end emotion recognition system based on a convolution-augmented transformer architecture. Specifically, it recognises users' emotions on the dimensions of arousal and valence by learning both the global and local fine-grained associations and dependencies within and across multimodal physiological data (including blood volume pulse, electrodermal activity, heart rate, and skin temperature). We extensively evaluated the performance of our model on the K-EmoCon dataset, which was acquired during naturalistic conversations using off-the-shelf devices and contains spontaneous emotion data. Our results demonstrate that our approach outperforms the baselines and achieves state-of-the-art or competitive performance. We also demonstrate the effectiveness and generalizability of our system on another affective dataset that used affect inducement and commercial physiological sensors. [paper]
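
Below is a minimal PyTorch sketch of the core idea behind the model: a transformer block augmented with a depthwise convolution, so that global self-attention and local convolutional patterns are captured in the same layer, applied to windows of the four physiological channels. The layer sizes, window length, and mean-pooling head are illustrative assumptions, not the exact configuration from the paper.

```python
# Sketch of a convolution-augmented transformer block for multimodal
# physiological signals (BVP, EDA, HR, skin temperature). Illustrative only.
import torch
import torch.nn as nn

class ConvAugmentedBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, kernel_size=7, dropout=0.1):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2,
                      groups=d_model),          # depthwise conv: local patterns
            nn.BatchNorm1d(d_model),
            nn.SiLU(),
            nn.Conv1d(d_model, d_model, 1),     # pointwise conv: mix channels
        )
        self.ff_norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                nn.Dropout(dropout), nn.Linear(4 * d_model, d_model))

    def forward(self, x):                        # x: (batch, time, d_model)
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # global dependencies
        h = self.conv_norm(x).transpose(1, 2)                # (batch, d_model, time)
        x = x + self.conv(h).transpose(1, 2)                 # local fine-grained patterns
        return x + self.ff(self.ff_norm(x))

class EmotionRecognizer(nn.Module):
    """Maps windows of 4 physiological channels to arousal/valence logits."""
    def __init__(self, n_channels=4, d_model=64, n_blocks=2):
        super().__init__()
        self.embed = nn.Linear(n_channels, d_model)
        self.blocks = nn.ModuleList(ConvAugmentedBlock(d_model) for _ in range(n_blocks))
        self.head = nn.Linear(d_model, 2)        # one logit each for arousal and valence

    def forward(self, x):                        # x: (batch, time, n_channels)
        x = self.embed(x)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x.mean(dim=1))          # pool over time, then classify

# Example: a batch of 8 ten-second windows sampled at 4 Hz.
logits = EmotionRecognizer()(torch.randn(8, 40, 4))   # -> shape (8, 2)
```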

Behavioral and Physiological Signals-Based Deep Multimodal Approach for Mobile Emotion Recognition

With the rapid development of mobile and wearable devices, it is increasingly possible to access users' affective data in a more unobtrusive manner. On this basis, researchers have proposed various systems to recognize users' emotional states. However, most of these studies rely on traditional machine learning techniques and a limited number of signals, leading to systems that either do not generalize well or frequently lack sufficient information for emotion detection in realistic scenarios. In this paper, we propose a novel attention-based LSTM system that uses a combination of sensors from a smartphone (front camera, microphone, touch panel) and a wristband (photoplethysmography, electrodermal activity, and infrared thermopile sensor) to accurately determine users' emotional states. We evaluated the proposed system in a user study with 45 participants. Using the collected behavioral (facial expression, speech, keystroke) and physiological (blood volume, electrodermal activity, skin temperature) affective responses induced by visual stimuli, our system achieved an average accuracy of 89.2% for binary positive and negative emotion classification under leave-one-participant-out cross-validation. Furthermore, we investigated the effectiveness of different combinations of data signals to cover different scenarios of signal availability. [paper]
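
The sketch below illustrates the general fusion pattern: one LSTM encoder per signal, followed by an attention layer that weights the modality embeddings before a binary classifier. The feature dimensions and the single-layer attention form are illustrative assumptions rather than the published architecture.

```python
# Attention-based fusion of per-modality LSTM encoders. Illustrative sketch only.
import torch
import torch.nn as nn

class AttentionLSTMFusion(nn.Module):
    def __init__(self, modality_dims, hidden=64):
        super().__init__()
        # One LSTM per modality (e.g., facial features, speech features,
        # keystrokes, blood volume pulse, electrodermal activity, skin temperature).
        self.encoders = nn.ModuleList(
            nn.LSTM(d, hidden, batch_first=True) for d in modality_dims)
        self.attn = nn.Linear(hidden, 1)        # scores each modality embedding
        self.classifier = nn.Linear(hidden, 2)  # positive vs. negative emotion

    def forward(self, inputs):                  # list of (batch, time_m, dim_m) tensors
        embs = []
        for x, enc in zip(inputs, self.encoders):
            _, (h, _) = enc(x)                  # last hidden state summarises the sequence
            embs.append(h[-1])                  # (batch, hidden)
        embs = torch.stack(embs, dim=1)         # (batch, n_modalities, hidden)
        weights = torch.softmax(self.attn(embs), dim=1)   # attention over modalities
        fused = (weights * embs).sum(dim=1)     # weighted sum of modality embeddings
        return self.classifier(fused)

# Example: three modalities with different feature sizes and sequence lengths.
model = AttentionLSTMFusion([32, 40, 4])
logits = model([torch.randn(8, 50, 32), torch.randn(8, 30, 40), torch.randn(8, 100, 4)])
```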

Benchmarking commercial emotion detection systems using realistic distortions of facial image datasets

Currently, there are several widely used commercial cloud-based services that attempt to recognize an individual's emotions based on their facial expressions. Most research into facial emotion recognition has used high-resolution, front-oriented, full-face images. However, when images are collected in naturalistic settings (e.g., using a smartphone's front-facing camera), they are likely to be far from ideal due to camera positioning, lighting conditions, and camera shake. The impact these conditions have on the accuracy of commercial emotion recognition services has not been studied in detail. To fill this gap, we selected five prominent commercial emotion recognition systems (Amazon Rekognition, Baidu Research, Face++, Microsoft Azure, and Affectiva) and evaluated their performance via two experiments. In Experiment 1, we compared the systems' accuracy at classifying images drawn from three standardized facial expression databases. In Experiment 2, we first identified several common scenarios (e.g., a partially visible face) that can lead to poor-quality pictures during smartphone use, and manipulated the same set of images used in Experiment 1 to simulate these scenarios. We then used the manipulated images to again compare the systems' classification performance, finding that the systems varied in how well they handled manipulated images that simulate realistic image distortion. Based on our findings, we offer recommendations for developers and researchers who would like to use commercial facial emotion recognition technologies in their applications. [paper]
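
As a rough illustration of the kind of manipulation used in Experiment 2, the snippet below derives a few poor-quality variants of a facial image with Pillow (blurring for camera shake, darkening for low light, cropping for a partially visible face). The parameter values and file names are hypothetical, not the ones used in the study.

```python
# Simulate realistic smartphone capture conditions on a facial image before
# submitting it to a commercial emotion recognition API. Illustrative sketch only.
from PIL import Image, ImageFilter, ImageEnhance

def simulate_distortions(path):
    img = Image.open(path)
    w, h = img.size
    return {
        "camera_shake": img.filter(ImageFilter.GaussianBlur(radius=4)),  # blurred capture
        "low_light": ImageEnhance.Brightness(img).enhance(0.4),          # under-exposed capture
        "partial_face": img.crop((0, 0, w, int(h * 0.6))),               # lower face cut off
    }

# Each distorted variant can then be sent to a service (e.g., Amazon Rekognition,
# Face++) and the returned emotion labels compared against those for the original image.
variants = simulate_distortions("face.jpg")
for name, image in variants.items():
    image.save(f"face_{name}.jpg")
```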

Hybrid Attention based Multimodal Network for Spoken Language Classification

We examine the utility of linguistic content and vocal characteristics for multimodal deep learning in human spoken language understanding. We present a deep multimodal network with both feature attention and modality attention to classify utterance-level speech data. The proposed hybrid attention architecture helps the system focus on learning informative representations for both modality-specific feature extraction and model fusion. The experimental results show that our system achieves state-of-the-art or competitive results on three published multimodal datasets. We also demonstrate the effectiveness and generalizability of our system on a medical speech dataset from an actual trauma scenario. Furthermore, we provide a detailed comparison and analysis of traditional approaches and deep learning methods for both feature extraction and fusion. [paper]
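
A compact sketch of the hybrid-attention idea is shown below: feature attention re-weights individual input dimensions inside each modality-specific encoder, and modality attention weights the resulting embeddings at fusion time. The encoders, dimensions, and attention formulations are simplified assumptions, not the exact published model.

```python
# Feature attention within each modality encoder plus modality attention at fusion.
# Illustrative sketch only.
import torch
import torch.nn as nn

class FeatureAttentionEncoder(nn.Module):
    """Re-weights individual input features before summarising an utterance."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.feature_gate = nn.Sequential(nn.Linear(in_dim, in_dim), nn.Sigmoid())
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)

    def forward(self, x):                        # x: (batch, time, in_dim)
        x = x * self.feature_gate(x)             # feature attention over input dimensions
        _, h = self.rnn(x)
        return h[-1]                             # (batch, hidden)

class HybridAttentionClassifier(nn.Module):
    """Fuses text and audio encoders with attention over the two modalities."""
    def __init__(self, text_dim=300, audio_dim=40, hidden=64, n_classes=4):
        super().__init__()
        self.text_enc = FeatureAttentionEncoder(text_dim, hidden)
        self.audio_enc = FeatureAttentionEncoder(audio_dim, hidden)
        self.modality_attn = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, text, audio):
        embs = torch.stack([self.text_enc(text), self.audio_enc(audio)], dim=1)
        w = torch.softmax(self.modality_attn(embs), dim=1)   # modality attention
        return self.out((w * embs).sum(dim=1))

# Example: word embeddings (20 tokens x 300 dims) and acoustic frames (120 x 40 dims).
logits = HybridAttentionClassifier()(torch.randn(4, 20, 300), torch.randn(4, 120, 40))
```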

Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment

Multimodal affective computing, learning to recognize and interpret human affect and subjective information from multiple data sources, is still challenging because: (i) it is hard to extract informative features that represent human affect from heterogeneous inputs; and (ii) current fusion strategies only fuse different modalities at abstract levels, ignoring time-dependent interactions between modalities. To address these issues, we introduce a hierarchical multimodal architecture with attention and word-level fusion to classify utterance-level sentiment and emotion from text and audio data. Our model outperforms state-of-the-art approaches on published datasets, and we demonstrate that its synchronized attention over modalities offers visual interpretability. [paper]
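
The following sketch illustrates word-level fusion under the assumption that a forced aligner provides each word's span of acoustic frames: frames are pooled per word, concatenated with the word embedding, and an attention layer then weights words for utterance-level classification. The dimensions and pooling choices are illustrative, not the paper's exact design.

```python
# Word-level fusion of text and audio with attention over words. Illustrative sketch only.
import torch
import torch.nn as nn

class WordLevelFusionClassifier(nn.Module):
    def __init__(self, word_dim=300, acoustic_dim=40, hidden=64, n_classes=4):
        super().__init__()
        self.fuse = nn.Linear(word_dim + acoustic_dim, hidden)
        self.word_attn = nn.Linear(hidden, 1)       # attention over words in the utterance
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, word_embs, frames, spans):
        # word_embs: (n_words, word_dim); frames: (n_frames, acoustic_dim)
        # spans: list of (start_frame, end_frame) per word from forced alignment
        pooled = torch.stack([frames[s:e].mean(dim=0) for s, e in spans])
        fused = torch.tanh(self.fuse(torch.cat([word_embs, pooled], dim=-1)))
        w = torch.softmax(self.word_attn(fused), dim=0)     # which words matter most
        return self.out((w * fused).sum(dim=0))             # utterance-level logits

# Example: 5 words aligned to 100 acoustic frames.
spans = [(0, 18), (18, 35), (35, 60), (60, 80), (80, 100)]
logits = WordLevelFusionClassifier()(torch.randn(5, 300), torch.randn(100, 40), spans)
```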