Research Interests

Natural Language Processing (NLP), Question Answering, and Deep Learning.

Research Overview

Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment

Multimodal affective computing, which learns to recognize and interpret human affect and subjective information from multiple data sources, remains challenging for two reasons: (i) it is hard to extract informative features that represent human affect from heterogeneous inputs; (ii) current fusion strategies fuse modalities only at abstract levels, ignoring time-dependent interactions between them. To address these issues, we introduce a hierarchical multimodal architecture with attention and word-level fusion to classify utterance-level sentiment and emotion from text and audio data. Our model outperforms state-of-the-art approaches on published datasets, and we demonstrate that its synchronized attention over modalities offers visual interpretability.
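As a rough illustration of word-level fusion followed by attention, the sketch below aligns text and audio features by word index, concatenates them per word, and pools the words into a single utterance vector using softmax attention weights. It is a minimal sketch under stated assumptions, not the architecture itself: plain concatenation as the fusion step, the dot-product scoring against a context vector, and all names (word_level_fusion, attention_pool, context) are hypothetical choices made for illustration, and the feature values are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_level_fusion(text_feats, audio_feats):
    """Concatenate text and audio features word by word.

    Both inputs hold one feature vector per word, aligned by word index.
    Concatenation is an assumed fusion step chosen for simplicity.
    """
    if len(text_feats) != len(audio_feats):
        raise ValueError("text and audio must be aligned to the same words")
    return [t + a for t, a in zip(text_feats, audio_feats)]


def attention_pool(word_feats, context):
    """Pool word vectors into one utterance vector with attention.

    Each word is scored by its dot product with a context vector, a
    hypothetical stand-in for learned attention parameters.
    """
    scores = [sum(c * x for c, x in zip(context, w)) for w in word_feats]
    weights = softmax(scores)
    dim = len(word_feats[0])
    pooled = [sum(wt * w[d] for wt, w in zip(weights, word_feats))
              for d in range(dim)]
    return pooled, weights


# Toy utterance of three words: 2-d text features, 1-d audio features.
text_feats = [[0.1, 0.3], [0.8, -0.2], [0.0, 0.5]]
audio_feats = [[0.4], [0.9], [-0.1]]

fused_words = word_level_fusion(text_feats, audio_feats)
utterance_vec, weights = attention_pool(fused_words, context=[1.0, 0.0, 1.0])
```

The per-word attention weights are what make this kind of model inspectable: plotting them over the words of an utterance shows which words, together with their aligned audio, contributed most to the prediction.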

Hybrid Attention based Multimodal Network for Spoken Language Classification

We examine the utility of linguistic content and vocal characteristics for multimodal deep learning in human spoken language understanding. We present a deep multimodal network with both feature attention and modality attention to classify utterance-level speech data. The proposed hybrid attention architecture helps the system focus on learning informative representations for both modality-specific feature extraction and model fusion. Experimental results show that our system achieves state-of-the-art or competitive results on three published multimodal datasets. We also demonstrate the effectiveness and generalization of our system on a medical speech dataset from an actual trauma scenario. Furthermore, we provide a detailed comparison and analysis of traditional approaches and deep learning methods for both feature extraction and fusion.
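The idea of modality attention can be sketched as a weighted sum of per-modality representations, where the weights come from a softmax over modality scores. This is only an illustrative sketch under assumptions: the linear scoring function, the fixed scoring_weights (a stand-in for learned parameters), the function name modality_attention_fusion, and the toy vectors are all hypothetical and not taken from the system described above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modality_attention_fusion(modality_vectors, scoring_weights):
    """Fuse equal-length modality vectors using attention over modalities.

    modality_vectors: one feature vector per modality (e.g. text, audio).
    scoring_weights: vector used to score each modality with a dot product;
    a hypothetical stand-in for parameters a real model would learn.
    """
    scores = [sum(w * x for w, x in zip(scoring_weights, vec))
              for vec in modality_vectors]
    weights = softmax(scores)
    dim = len(modality_vectors[0])
    fused = [sum(wt * vec[d] for wt, vec in zip(weights, modality_vectors))
             for d in range(dim)]
    return fused, weights


# Toy utterance-level representations for two modalities.
text_vec = [0.2, 0.9, -0.1]
audio_vec = [0.5, -0.3, 0.4]

fused, modality_weights = modality_attention_fusion(
    [text_vec, audio_vec], scoring_weights=[1.0, 0.5, -0.5]
)
```

Because the modality weights sum to one, they can be read directly as the relative contribution of each modality to the fused representation for a given utterance.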

Human Conversation Analysis Using Attentive Multimodal Networks with Hierarchical Encoder-Decoder

Human conversation analysis is challenging because meaning can be expressed through words, tone of voice, and even body language and facial expressions. We introduce a hierarchical encoder-decoder structure with an attention mechanism for conversation analysis. The hierarchical encoder learns word-level features from video, audio, and text data, which are then formulated into conversation-level features. The hierarchical decoder can predict different attributes at given time instances. To integrate multiple sensory inputs, we introduce a novel fusion strategy with modality attention. We evaluate the proposed system on published emotion recognition, sentiment analysis, and speaker trait analysis datasets. Our system outperforms previous state-of-the-art research on classification and regression tasks on three datasets. It also outperforms previous work in generalization tests on two commonly used datasets. For co-existing labels, the proposed single model achieves performance comparable to that of multiple individual models. In addition, the easily visualized modality and temporal attention demonstrate that the proposed attention mechanisms help feature selection and improve model interpretability.