PhD Candidate in Speech Technology -
building AI that understands subtle human communication:
tone, semantics, and visual cues.
How can humans and machines communicate more naturally? My work looks at how we use not just words but also tone of voice, facial expressions, and context to convey meaning, especially in tricky cases like irony and sarcasm. By combining insights from linguistics, cognitive science, and computer science, I study how machines can pick up on these subtle cues and understand language the way people actually use it: emotional, cultural, and shaped by social interaction.
🎓 Google Scholar 🌐 LinkedIn 📗 ORCID
Title: Improving sarcasm detection from speech and text through attention-based fusion exploiting the interplay of emotions and sentiments
Authors: X. Gao, S. Nayak, and M. Coler
Presented at: 186th Meeting of the Acoustical Society of America and the Canadian Acoustical Association.
Outcome: A multimodal fusion approach (audio + text + emotion + sentiment) improves sarcasm detection, outperforming SOTA by +4.79% F1 on MUStARD++.
[PDF]
SarcEmotiq is a deep learning-based tool for recognizing sarcasm in English audio.
Ships with models pre-trained on the MUStARD++ dataset, with the option to retrain on your own data.
Integrates acoustic, textual, emotional, and sentiment cues into a unified attention-based fusion model.
[Github]
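The core idea behind attention-based fusion can be sketched in a few lines of plain Python: each modality embedding is scored against a query vector, the scores are softmax-normalized into weights, and the fused representation is the weighted sum of the modality embeddings. The vectors and query below are toy illustrative values, not SarcEmotiq's learned neural embeddings.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_fuse(modality_vectors, query):
    """Score each modality embedding against the query (scaled dot
    product), softmax the scores into weights, and return the weighted
    sum as the fused representation, plus the weights themselves."""
    d = len(query)
    scores = [dot(query, v) / math.sqrt(d) for v in modality_vectors]
    weights = softmax(scores)
    fused = [sum(w * v[i] for w, v in zip(weights, modality_vectors))
             for i in range(d)]
    return fused, weights

# Toy 3-dimensional embeddings for the four cue types.
audio     = [0.9, 0.1, 0.0]
text      = [0.2, 0.8, 0.1]
emotion   = [0.1, 0.3, 0.9]
sentiment = [0.4, 0.4, 0.2]
query     = [1.0, 0.0, 0.0]   # hypothetical learned "sarcasm" query

fused, weights = attention_fuse([audio, text, emotion, sentiment], query)
```

With these toy values the audio embedding aligns best with the query, so it receives the largest attention weight; in a trained model these weights shift per utterance, letting tone of voice dominate when the words alone sound sincere.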
Most existing sarcasm research has focused on English. MCSD 1.0 fills this gap with the first high-quality multimodal Chinese sarcasm dataset, enabling cross-lingual and cross-cultural studies in sarcasm detection.
Size: 10.57 hours of video
Language: Mandarin
Inter-annotator agreement: Fleiss’ κ = 0.74 (unweighted), 0.79 (certainty-weighted)
Validation: SVM baseline achieves 76.64% F1-score
[Download]
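Fleiss' κ, the agreement statistic reported above, corrects raw inter-annotator agreement for the agreement expected by chance. A minimal pure-Python sketch for a fixed number of raters per item (the labels below are illustrative toy data, not MCSD 1.0 annotations):

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for categorical annotations.

    ratings: list of items, each a list with one label per annotator
    (every item must have the same number of annotators)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    label_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        label_totals.update(counts)
        # Fraction of agreeing rater pairs on this item.
        agree = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agree / (n_raters * (n_raters - 1)))
    p_bar = sum(per_item_agreement) / n_items          # observed agreement
    total = n_items * n_raters
    p_e = sum((c / total) ** 2 for c in label_totals.values())  # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Two toy items, three annotators each: one unanimous, one split 2-1.
toy = [["sarcastic", "sarcastic", "literal"],
       ["literal", "literal", "literal"]]
kappa = fleiss_kappa(toy)
```

A κ of 0.74 on MCSD 1.0 indicates substantial agreement well above chance, which matters for a label as subjective as sarcasm.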
Focus
Pragmatic language understanding in Human-Machine Interaction (HMI)
Sarcasm as a test case due to its inherent complexity
Meaning beyond literal text: tone, facial expression, gestures, discourse context
Methods
Multimodal fusion of textual, audio, and visual cues
Leveraging linguistics, cognitive science, and machine learning
Attention-based and graph-based fusion of pragmatic and affective signals
Impact
Toward human-centered AI that interprets emotionally charged, socially embedded language
Cross-lingual & cross-cultural generalization
Improved HMI systems that recognize subtlety, irony, and pragmatic intent