Expanded description:
We introduce AMuSeD (Attentive Multimodal Sarcasm Detection), a deep neural network architecture enhanced by a novel two-phase bi-modal data augmentation strategy.
In the first phase, we generate diverse sarcastic text samples through Back Translation using multiple secondary languages. In the second phase, we synthesize sarcastic audio using a fine-tuned FastSpeech 2 model designed to retain sarcastic intonation, alongside a cloud-based TTS service. Together, these methods create aligned text–audio pairs that significantly expand the available training data.
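The back-translation phase can be sketched as follows. This is a minimal illustration of the idea, not the paper's pipeline: the `PIVOT_TABLES` lookup stands in for real machine-translation models, and the pivot languages and phrases are invented for the example.

```python
# Phase one of the augmentation idea: paraphrase sarcastic text by
# translating it into a secondary language and back. A real system would
# call an MT model; `translate` here is a stand-in lookup table.

PIVOT_TABLES = {
    "de": {"great job": "tolle arbeit", "tolle arbeit": "wonderful work"},
    "fr": {"great job": "beau travail", "beau travail": "nice work"},
}

def translate(text, table):
    return table.get(text, text)

def back_translate(text, pivot):
    """English -> pivot language -> English, yielding a paraphrase."""
    table = PIVOT_TABLES[pivot]
    foreign = translate(text, table)
    return translate(foreign, table)

def augment(sample, pivots=("de", "fr")):
    variants = {back_translate(sample, p) for p in pivots}
    variants.discard(sample)  # keep only genuine paraphrases
    return sorted(variants)
```

Using several pivot languages, as the paper does, yields multiple distinct paraphrases per utterance, multiplying the sarcastic training samples.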
We further evaluate different attention mechanisms for integrating modalities and find that self-attention is the most effective in capturing the interplay between text and audio. Experiments on the MUStARD dataset show that AMuSeD achieves an F1-score of 81.0% using only text and audio, surpassing even some models that incorporate all three modalities (text, audio, visual).
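The self-attention fusion finding can be illustrated with a minimal sketch: stack text and audio feature vectors into one sequence and let scaled dot-product attention mix them, so a word embedding can attend to acoustic frames and vice versa. The dimensions and random features below are illustrative assumptions, not AMuSeD's actual architecture.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over a fused token sequence.

    x: (seq_len, d) rows are text and audio feature vectors stacked
    together, so attention weights can link words to acoustic frames.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # (seq, seq) similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ x                               # context-mixed features

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(4, 8))    # e.g. 4 word embeddings
audio_feats = rng.normal(size=(6, 8))   # e.g. 6 acoustic frames
fused = self_attention(np.vstack([text_feats, audio_feats]))
```

The point of the stacked sequence is that no modality is privileged: every attention weight is learned jointly over the text-audio interplay the paper highlights.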
Expanded description:
Generating natural sarcastic speech remains a hard problem for AI: emotional subtleties hinge on prosody, timing, and tone, and large annotated datasets of sarcastic speech are lacking. In this paper, we tackle these challenges by integrating a bi-modal sarcasm detector directly into the speech synthesis pipeline.
First, we fine-tune a pretrained TTS model (FastSpeech 2) on diverse conversational speech, which gives it expressive capability. Then, we refine it further using a carefully curated sarcastic speech dataset. Crucially, we add a feedback loss from a sarcasm detection model, trained on both audio and text, to encourage the synthesized speech to sound sarcastic.
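The feedback-loss idea can be sketched as a weighted sum: the usual TTS reconstruction loss plus a penalty that is low when the sarcasm detector is confident the synthesized clip sounds sarcastic. The weight `lam` and the stand-in loss functions below are illustrative assumptions, not the paper's exact formulation.

```python
import math

def tts_loss(pred_mel, target_mel):
    """Mean squared error between predicted and reference mel frames."""
    n = len(pred_mel)
    return sum((p - t) ** 2 for p, t in zip(pred_mel, target_mel)) / n

def sarcasm_feedback_loss(detector_prob):
    """Cross-entropy against the 'sarcastic' label: near zero when the
    detector is confident the synthesized clip sounds sarcastic."""
    return -math.log(max(detector_prob, 1e-9))

def total_loss(pred_mel, target_mel, detector_prob, lam=0.3):
    # reconstruction term keeps speech natural; feedback term pushes
    # the synthesizer toward audio the detector labels as sarcastic
    return tts_loss(pred_mel, target_mel) + lam * sarcasm_feedback_loss(detector_prob)
```

Because the feedback term shrinks as the detector's sarcasm probability rises, gradient descent on the total loss trades off fidelity against perceived sarcasm, which matches the paper's stated goal.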
Objective and subjective evaluations show strong improvements: the synthesizer not only sounds more natural, but listeners also perceive sarcasm more clearly. This work marks a significant step toward emotionally intelligent and context-aware speech generation.
Expanded description:
Sarcasm, while central to everyday communication, presents challenges for both interpersonal interaction and human–machine systems. Prosodic cues play a crucial role in signaling sarcastic intent, yet most computational research has focused narrowly on text. This review addresses that gap by providing the first systematic synthesis of speech-based sarcasm recognition research.
The review charts the field’s progression from early unimodal approaches to current multimodal fusion techniques, covering datasets, feature extraction strategies, and classification methods. Findings highlight several key trends:
Scarcity and limitations of existing sarcasm-in-speech datasets.
Evolution of feature extraction from handcrafted acoustic features toward deep learning-based representations.
Advancements in classification methods, from unimodal models to multimodal architectures that integrate audio, text, and visual cues.
By connecting insights across linguistics, cognitive science, and AI, this review identifies crucial gaps, including the need for cross-cultural and multilingual sarcasm recognition, and emphasizes the importance of treating sarcasm as a multimodal phenomenon, not merely a text-based problem.
This work provides both a roadmap for researchers and a foundation for developing human-centered, culturally inclusive speech technologies that can better understand the subtleties of human communication.
Expanded description:
Data scarcity has limited progress in building robust sarcastic speech detectors, as manual annotation is expensive and time-consuming.
In this work, we introduce a novel annotation pipeline that leverages large language models (LLMs) to bootstrap sarcastic speech labeling. Using podcast audio as a source, over 11,000 utterances with transcripts were first annotated by GPT-4o and LLaMA 3. These automated labels were then verified and refined by human annotators, creating a reliable yet scalable dataset at a fraction of the usual cost.
The resulting dataset, PodSarc, enabled us to train sarcasm detection models that reached a 73.63% F1 score, significantly improving over previous baselines. Beyond its immediate use for sarcasm detection, PodSarc demonstrates how LLMs can effectively accelerate dataset creation in domains where annotated speech is scarce.
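The label-then-verify shape of such a pipeline can be sketched as below. This is only an illustration of the workflow, not the authors' code: the two `label_with_*` functions are hypothetical stand-ins for calls to GPT-4o and LLaMA 3, and the agreement rule is an assumed heuristic.

```python
# Two LLM annotators label each utterance; agreements are accepted
# provisionally, disagreements are routed to human review.

def label_with_model_a(utterance):
    # hypothetical stand-in for an LLM API call
    return "sarcastic" if "oh, great" in utterance.lower() else "literal"

def label_with_model_b(utterance):
    # hypothetical stand-in for a second LLM
    return "sarcastic" if "great" in utterance.lower() else "literal"

def bootstrap_labels(utterances):
    auto, needs_human = [], []
    for u in utterances:
        a, b = label_with_model_a(u), label_with_model_b(u)
        if a == b:
            auto.append((u, a))    # provisional label, spot-checked later
        else:
            needs_human.append(u)  # disagreement -> human annotator
    return auto, needs_human

auto, queue = bootstrap_labels([
    "Oh, great, another Monday.",
    "That talk was great.",
    "See you tomorrow.",
])
```

Routing only the disagreements to humans is what makes the approach cheap: annotators verify and refine rather than label from scratch.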
Expanded description:
Sarcasm is signaled by subtle changes in pitch, timing, and expression, a nuance well studied in Indo-European languages but largely unexplored in Mandarin, a tonal language where pitch also defines word meaning. To bridge this gap, we developed the Multimodal Chinese Sarcasm Dataset (MCSD): a carefully curated, multimodal corpus spanning 10.57 hours of video, with manual annotations that incorporate annotator certainty. Our annotation framework delivered a Fleiss’ κ of 0.74 (unweighted) and an even stronger 0.79 using a certainty-weighted metric. A baseline SVM sarcasm detection model trained on MCSD achieved a commendable 76.64% F1, demonstrating the dataset’s robustness and value as a multilingual benchmark in sarcasm research.
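For reference, the unweighted Fleiss' kappa behind the reported 0.74 can be computed as follows; this is the standard formula for fixed-size rater panels, not MCSD-specific code.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a ratings matrix.

    counts: rows are items, columns are categories; each cell holds the
    number of raters who assigned that category to that item.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    total = n_items * n_raters
    # per-category proportions across the whole study
    p_cat = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    # per-item observed agreement among rater pairs
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_cat)        # chance agreement
    return (p_bar - p_e) / (1 - p_e)
```

The certainty-weighted variant reported in the paper (0.79) would additionally scale agreement by annotator confidence, a detail not reproduced here.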
Expanded description:
Detecting sarcasm requires more than combining modality signals: it demands recognizing how subtle intramodal cues (like tone or facial expression) interact with emotional contradictions across modalities. In this work, we leverage Graph Attention Networks (GATs) to model both intra-modal relationships and inter-modal emotional incongruities, for instance, cheerful words spoken with a deadpan tone. By structuring modalities as graph nodes and attending to their connections, the model can learn when a visual cue contradicts an audio cue, signaling sarcasm.
Our results demonstrate that this graph-based approach surpasses standard fusion models, improving accuracy in multimodal sarcasm detection and contributing to more context-aware systems that better capture the emotional complexity and subtlety of human communication.
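A single graph attention layer over a three-node modality graph can be sketched as below. The projection sizes, random features, and fully connected topology are illustrative assumptions; the sketch follows the standard GAT formulation (LeakyReLU-scored pairwise attention, softmax over neighbors) rather than the paper's exact architecture.

```python
import numpy as np

def gat_layer(h, W, a, neg_slope=0.2):
    """One graph attention layer over a fully connected modality graph.

    h: (n_nodes, d_in) node features (e.g. text, audio, visual vectors)
    W: (d_in, d_out) shared projection; a: (2 * d_out,) attention vector
    """
    z = h @ W                          # project node features
    n = z.shape[0]
    e = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            s = a @ np.concatenate([z[i], z[j]])
            e[i, j] = s if s > 0 else neg_slope * s   # LeakyReLU score
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)             # softmax over neighbors
    return att @ z                     # attention-weighted neighbor mix

rng = np.random.default_rng(1)
modalities = rng.normal(size=(3, 16))   # text, audio, visual embeddings
out = gat_layer(modalities, rng.normal(size=(16, 8)), rng.normal(size=(16,)))
```

Because the attention scores are learned per node pair, the layer can upweight exactly the cross-modal edges (say, audio to visual) where an incongruity signals sarcasm.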
Expanded description:
While prior research has shown that combining text, audio, and visual data improves sarcasm detection, many existing models treat these modalities separately and fail to capture their interactions.
In this work, we propose an approach that synergizes audio, text, sentiment, and emotion cues to improve sarcasm recognition. Sarcastic audio is first transcribed using Automatic Speech Recognition (ASR), producing text that can be analyzed with sentiment classifiers. At the same time, emotion recognition algorithms extract affective signals from the audio stream. These complementary sources of information are then fused, allowing the model to capture both alignment and incongruity between modalities, for example, positive words delivered in a flat or negative tone.
When evaluated on the MUStARD++ dataset (audio only), our method outperformed the previous state-of-the-art by +4.79% F1-score.
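The incongruity cue at the heart of this fusion can be shown with a toy example: score sentiment from the ASR transcript, take an emotion valence from the audio, and treat their mismatch as a sarcasm signal. The word lists and scoring functions are hypothetical stand-ins for real sentiment and emotion models.

```python
POSITIVE = {"great", "love", "wonderful", "fantastic"}
NEGATIVE = {"terrible", "hate", "awful"}

def text_sentiment(transcript):
    """Crude lexicon score standing in for a sentiment classifier."""
    words = transcript.lower().replace(".", "").split()
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

def incongruity(transcript, audio_valence):
    """Positive words over negative-sounding audio (or vice versa)
    yield a large incongruity score; aligned signals yield zero."""
    s = text_sentiment(transcript)
    return abs(s - audio_valence) if s * audio_valence < 0 else 0.0

flat_delivery = -1.0   # pretend an emotion model rated the audio negative
score = incongruity("What a fantastic meeting.", flat_delivery)
```

A fused classifier can then use such mismatch features alongside the raw modality embeddings, capturing exactly the "positive words, flat tone" pattern the description mentions.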
Expanded description:
Sarcasm often relies on a delicate balance between prosody (tone, pitch, rhythm) and semantics (word meaning). This study investigates how these cues interact to shape sarcastic intent. Using a dataset of sarcastic utterances from television shows, we analyzed prosodic features across three types of sarcasm (embedded, propositional, and illocutionary), which vary in the degree of semantic cues they provide.
The results reveal a functional trade-off. When sarcastic meaning was highly salient in the semantics (e.g., phrases that are obviously ironic), speakers relied less on prosodic modulation. Conversely, when semantic signals were weaker, prosodic cues like pitch shifts, intonation, and rhythm became more important in signaling sarcasm.
These findings highlight that sarcasm is not uniformly expressed but emerges through an interaction between prosody and semantics. At the phrase level, speakers adaptively weight these cues, reducing reliance on one when the other is strong.
Expanded description:
In this study, we present the first attempt at sarcastic speech synthesis in low-resource conditions. We leverage transfer learning, fine-tuning a pre-trained speech synthesis model on a dataset that includes multiple speaking styles, with a subset of sarcastic speech. The resulting system was able to generate sarcastic intonation, though synthesized output retained some robotic artifacts, reflecting both the promise and limitations of the approach.
Our results demonstrate that transfer learning offers a viable path forward for sarcasm synthesis in data-poor settings, achieving moderate performance gains despite limited resources. This proof-of-concept opens the door to more advanced methods, such as multimodal modeling, to further improve the expressiveness and naturalness of generated sarcastic speech.
Expanded description:
Sarcasm is widely used in everyday conversation, signaled through both acoustic cues (pitch, intonation, intensity) and visual cues (facial expression, eye gaze). While these markers are well-studied, attempts to build systems that automatically detect sarcasm in speech remain rare.
In this work, we address this gap by applying inductive transfer learning (ITL) with deep convolutional neural networks (DCNNs) for sarcasm detection in speech. Using the multimodal MUStARD dataset, we evaluate two pre-trained CNN architectures: VGGish (trained on large audio datasets) and Xception (trained on image datasets).
Our experiments show that VGGish, when used as an audio feature extractor, outperforms Xception, highlighting the importance of relevance between source and target datasets in transfer learning. Both models, however, significantly surpass the traditional SVM baseline, yielding +7% and +5% F-score improvements in unimodal sarcasm detection.
This work demonstrates the promise of transfer learning in adapting models trained on broad speech or vision tasks to the niche domain of sarcasm detection, opening pathways for more effective multimodal sarcasm recognition systems.
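The transfer-learning pattern the paper uses, a frozen pretrained network as feature extractor with a light classifier on top, can be sketched as follows. Here `pretrained_embed` is a stand-in for VGGish and the nearest-centroid classifier stands in for the SVM; a real pipeline would load the released VGGish model and feed it log-mel patches.

```python
def pretrained_embed(audio):
    # stand-in: summary statistics in place of learned 128-d embeddings
    mean = sum(audio) / len(audio)
    var = sum((x - mean) ** 2 for x in audio) / len(audio)
    return (mean, var)

def fit_centroids(clips, labels):
    """Train a light classifier on frozen embeddings: one centroid per class."""
    sums, counts = {}, {}
    for clip, y in zip(clips, labels):
        e = pretrained_embed(clip)
        s = sums.setdefault(y, [0.0, 0.0])
        s[0] += e[0]; s[1] += e[1]
        counts[y] = counts.get(y, 0) + 1
    return {y: (s[0] / counts[y], s[1] / counts[y]) for y, s in sums.items()}

def predict(clip, centroids):
    e = pretrained_embed(clip)
    return min(centroids, key=lambda y: (e[0] - centroids[y][0]) ** 2
                                        + (e[1] - centroids[y][1]) ** 2)

# toy data: "sarcastic" clips with exaggerated pitch swings (high variance)
clips = [[0, 10, 0, 10], [1, 9, 1, 9], [5, 5, 5, 5], [4, 6, 4, 6]]
labels = ["sarc", "sarc", "lit", "lit"]
centroids = fit_centroids(clips, labels)
```

The paper's finding that VGGish beats Xception corresponds, in this sketch, to choosing an `pretrained_embed` whose training domain (audio) matches the target domain.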
Expanded description:
Multimodal sarcasm detection requires resolving pragmatic incongruity across textual, acoustic, and visual cues through cross-modal reasoning. To enable robust sarcasm reasoning with foundation models, we propose SarcasmMiner, a reinforcement-learning-based post-training framework that resists hallucination in multimodal reasoning. We reformulate sarcasm detection as structured reasoning and adopt a dual-track distillation strategy: high-quality teacher trajectories initialize the student model, while the full set of trajectories trains a generative reward model (GenRM) to evaluate reasoning quality. The student is optimized with group relative policy optimization (GRPO) using decoupled rewards for accuracy and reasoning quality. On MUStARD++, SarcasmMiner raises F1 from 59.83% (zero-shot) and 68.23% (supervised fine-tuning) to 70.22%. These findings suggest that reasoning-aware reward modeling enhances both performance and multimodal grounding.
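The group-relative core of GRPO with decoupled rewards can be sketched as below: each sampled reasoning trace receives a combined reward (an accuracy term plus a reasoning-quality term, here from the GenRM), and advantages are normalized within the group of samples drawn for the same input. The combination weight is an illustrative assumption, not the paper's setting.

```python
def grpo_advantages(accuracy_r, quality_r, w_quality=0.5):
    """Group-relative advantages from decoupled reward terms.

    accuracy_r, quality_r: per-sample rewards for the same input's
    group of sampled reasoning traces.
    """
    # combine the decoupled reward terms per trace
    rewards = [a + w_quality * q for a, q in zip(accuracy_r, quality_r)]
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0            # guard against a degenerate group
    # normalize within the group: better-than-average traces get
    # positive advantage, worse-than-average traces negative
    return [(r - mean) / std for r in rewards]
```

These advantages then weight the policy-gradient update on the student model, so traces that are both correct and well reasoned are reinforced relative to their group.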