Expanded description:
We introduce AMuSeD (Attentive Multimodal Sarcasm Detection), a deep neural network architecture enhanced by a novel two-phase bi-modal data augmentation strategy.
In the first phase, we generate diverse sarcastic text samples through Back Translation using multiple secondary languages. In the second phase, we synthesize sarcastic audio using a fine-tuned FastSpeech 2 model designed to retain sarcastic intonation, alongside a cloud-based TTS service. Together, these methods create aligned text–audio pairs that significantly expand the available training data.
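The back-translation phase can be sketched as follows. This is a minimal illustration of the idea, not the paper's pipeline: the `PIVOT_TABLES` lookup stands in for real machine-translation models, and the pivot languages and phrases are invented for the example.

```python
# Phase one of the augmentation idea: paraphrase sarcastic text by
# translating it into a secondary language and back. A real system would
# call an MT model; `translate` here is a stand-in lookup table.

PIVOT_TABLES = {
    "de": {"great job": "tolle arbeit", "tolle arbeit": "wonderful work"},
    "fr": {"great job": "beau travail", "beau travail": "nice work"},
}

def translate(text, table):
    return table.get(text, text)

def back_translate(text, pivot):
    """English -> pivot language -> English, yielding a paraphrase."""
    table = PIVOT_TABLES[pivot]
    foreign = translate(text, table)
    return translate(foreign, table)

def augment(sample, pivots=("de", "fr")):
    variants = {back_translate(sample, p) for p in pivots}
    variants.discard(sample)  # keep only genuine paraphrases
    return sorted(variants)
```

Using several pivot languages, as the paper does, yields multiple distinct paraphrases per utterance, multiplying the sarcastic training samples.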
We further evaluate different attention mechanisms for integrating modalities and find that self-attention is the most effective in capturing the interplay between text and audio. Experiments on the MUStARD dataset show that AMuSeD achieves an F1-score of 81.0% using only text and audio, surpassing even some models that incorporate all three modalities (text, audio, visual).
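The self-attention fusion finding can be illustrated with a minimal sketch: stack text and audio feature vectors into one sequence and let scaled dot-product attention mix them, so a word embedding can attend to acoustic frames and vice versa. The dimensions and random features below are illustrative assumptions, not AMuSeD's actual architecture.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over a fused token sequence.

    x: (seq_len, d) rows are text and audio feature vectors stacked
    together, so attention weights can link words to acoustic frames.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # (seq, seq) similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ x                               # context-mixed features

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(4, 8))    # e.g. 4 word embeddings
audio_feats = rng.normal(size=(6, 8))   # e.g. 6 acoustic frames
fused = self_attention(np.vstack([text_feats, audio_feats]))
```

The point of the stacked sequence is that no modality is privileged: every attention weight is learned jointly over the text-audio interplay the paper highlights.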
Expanded description:
Generating natural sarcastic speech remains a hard problem for AI: emotional subtleties hinge on prosody, timing, and tone, and large annotated datasets of sarcastic speech are lacking. In this paper, we tackle these challenges by integrating a bi-modal sarcasm detector directly into the speech synthesis pipeline.
First, we fine-tune a pretrained TTS model (FastSpeech 2) on diverse conversational speech, which gives it expressive capability. Then, we refine it further using a carefully curated sarcastic speech dataset. Crucially, we add a feedback loss from a sarcasm detection model, trained on both audio and text, to encourage the synthesized speech to sound sarcastic.
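The feedback-loss idea can be sketched as a weighted sum: the usual TTS reconstruction loss plus a penalty that is low when the sarcasm detector is confident the synthesized clip sounds sarcastic. The weight `lam` and the stand-in loss functions below are illustrative assumptions, not the paper's exact formulation.

```python
import math

def tts_loss(pred_mel, target_mel):
    """Mean squared error between predicted and reference mel frames."""
    n = len(pred_mel)
    return sum((p - t) ** 2 for p, t in zip(pred_mel, target_mel)) / n

def sarcasm_feedback_loss(detector_prob):
    """Cross-entropy against the 'sarcastic' label: near zero when the
    detector is confident the synthesized clip sounds sarcastic."""
    return -math.log(max(detector_prob, 1e-9))

def total_loss(pred_mel, target_mel, detector_prob, lam=0.3):
    # reconstruction term keeps speech natural; feedback term pushes
    # the synthesizer toward audio the detector labels as sarcastic
    return tts_loss(pred_mel, target_mel) + lam * sarcasm_feedback_loss(detector_prob)
```

Because the feedback term shrinks as the detector's sarcasm probability rises, gradient descent on the total loss trades off fidelity against perceived sarcasm, which matches the paper's stated goal.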
Objective and subjective evaluations show strong improvements: the synthesizer not only sounds more natural, but listeners also perceive sarcasm more clearly. This work marks a significant step toward emotionally intelligent and context-aware speech generation.
Expanded description:
Sarcasm, while central to everyday communication, presents challenges for both interpersonal interaction and human–machine systems. Prosodic cues play a crucial role in signaling sarcastic intent, yet most computational research has focused narrowly on text. This review addresses that gap by providing the first systematic synthesis of speech-based sarcasm recognition research.
The review charts the field’s progression from early unimodal approaches to current multimodal fusion techniques, covering datasets, feature extraction strategies, and classification methods. Findings highlight several key trends:
Scarcity and limitations of existing sarcasm-in-speech datasets.
Evolution of feature extraction from handcrafted acoustic features toward deep learning-based representations.
Advancements in classification methods, from unimodal models to multimodal architectures that integrate audio, text, and visual cues.
By connecting insights across linguistics, cognitive science, and AI, this review identifies crucial gaps, including the need for cross-cultural and multilingual sarcasm recognition, and emphasizes the importance of treating sarcasm as a multimodal phenomenon, not merely a text-based problem.
This work provides both a roadmap for researchers and a foundation for developing human-centered, culturally inclusive speech technologies that can better understand the subtleties of human communication.
Expanded description:
Data scarcity has limited progress in building robust sarcastic speech detectors, as manual annotation is expensive and time-consuming.
In this work, we introduce a novel annotation pipeline that leverages large language models (LLMs) to bootstrap sarcastic speech labeling. Using podcast audio as a source, over 11,000 utterances with transcripts were first annotated by GPT-4o and LLaMA 3. These automated labels were then verified and refined by human annotators, creating a reliable yet scalable dataset at a fraction of the usual cost.
The resulting dataset, PodSarc, enabled us to train sarcasm detection models that reached a 73.63% F1 score, significantly improving over previous baselines. Beyond its immediate use for sarcasm detection, PodSarc demonstrates how LLMs can effectively accelerate dataset creation in domains where annotated speech is scarce.
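The label-then-verify shape of such a pipeline can be sketched as below. This is only an illustration of the workflow, not the authors' code: the two `label_with_*` functions are hypothetical stand-ins for calls to GPT-4o and LLaMA 3, and the agreement rule is an assumed heuristic.

```python
# Two LLM annotators label each utterance; agreements are accepted
# provisionally, disagreements are routed to human review.

def label_with_model_a(utterance):
    # hypothetical stand-in for an LLM API call
    return "sarcastic" if "oh, great" in utterance.lower() else "literal"

def label_with_model_b(utterance):
    # hypothetical stand-in for a second LLM
    return "sarcastic" if "great" in utterance.lower() else "literal"

def bootstrap_labels(utterances):
    auto, needs_human = [], []
    for u in utterances:
        a, b = label_with_model_a(u), label_with_model_b(u)
        if a == b:
            auto.append((u, a))    # provisional label, spot-checked later
        else:
            needs_human.append(u)  # disagreement -> human annotator
    return auto, needs_human

auto, queue = bootstrap_labels([
    "Oh, great, another Monday.",
    "That talk was great.",
    "See you tomorrow.",
])
```

Routing only the disagreements to humans is what makes the approach cheap: annotators verify and refine rather than label from scratch.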
Expanded description:
Sarcasm is signaled by subtle changes in pitch, timing, and expression, a nuance well studied in Indo-European languages but largely unexplored in Mandarin, a tonal language where pitch also defines word meaning. To bridge this gap, we developed the Multimodal Chinese Sarcasm Dataset (MCSD): a carefully curated, multimodal corpus spanning 10.57 hours of video, with manual annotations that incorporate annotator certainty. Our annotation framework delivered a Fleiss’ κ of 0.74 (unweighted) and an even stronger 0.79 using a certainty-weighted metric. A baseline SVM sarcasm detection model trained on MCSD achieved a commendable 76.64% F1, demonstrating the dataset’s robustness and value as a multilingual benchmark in sarcasm research.
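For reference, the unweighted Fleiss' kappa behind the reported 0.74 can be computed as follows; this is the standard formula for fixed-size rater panels, not MCSD-specific code.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a ratings matrix.

    counts: rows are items, columns are categories; each cell holds the
    number of raters who assigned that category to that item.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    total = n_items * n_raters
    # per-category proportions across the whole study
    p_cat = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    # per-item observed agreement among rater pairs
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_cat)        # chance agreement
    return (p_bar - p_e) / (1 - p_e)
```

The certainty-weighted variant reported in the paper (0.79) would additionally scale agreement by annotator confidence, a detail not reproduced here.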
Expanded description:
Detecting sarcasm requires more than combining modality signals: it demands recognizing how subtle intramodal cues (like tone or facial expression) interact with emotional contradictions across modalities. In this work, we leverage Graph Attention Networks (GATs) to model both intra-modal relationships and inter-modal emotional incongruities, for instance, cheerful words spoken with a deadpan tone. By structuring modalities as graph nodes and attending to their connections, the model can learn when a visual cue contradicts an audio cue, signaling sarcasm.
Our results demonstrate that this graph-based approach surpasses standard fusion models, improving accuracy in multimodal sarcasm detection and contributing to more context-aware systems that better capture the emotional complexity and subtlety of human communication.
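A single graph attention layer over a three-node modality graph can be sketched as below. The projection sizes, random features, and fully connected topology are illustrative assumptions; the sketch follows the standard GAT formulation (LeakyReLU-scored pairwise attention, softmax over neighbors) rather than the paper's exact architecture.

```python
import numpy as np

def gat_layer(h, W, a, neg_slope=0.2):
    """One graph attention layer over a fully connected modality graph.

    h: (n_nodes, d_in) node features (e.g. text, audio, visual vectors)
    W: (d_in, d_out) shared projection; a: (2 * d_out,) attention vector
    """
    z = h @ W                          # project node features
    n = z.shape[0]
    e = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            s = a @ np.concatenate([z[i], z[j]])
            e[i, j] = s if s > 0 else neg_slope * s   # LeakyReLU score
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)             # softmax over neighbors
    return att @ z                     # attention-weighted neighbor mix

rng = np.random.default_rng(1)
modalities = rng.normal(size=(3, 16))   # text, audio, visual embeddings
out = gat_layer(modalities, rng.normal(size=(16, 8)), rng.normal(size=(16,)))
```

Because the attention scores are learned per node pair, the layer can upweight exactly the cross-modal edges (say, audio to visual) where an incongruity signals sarcasm.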
Expanded description:
While prior research has shown that combining text, audio, and visual data improves sarcasm detection, many existing models treat these modalities separately and fail to capture their interactions.
In this work, we propose an approach that synergizes audio, text, sentiment, and emotion cues to improve sarcasm recognition. Sarcastic audio is first transcribed using Automatic Speech Recognition (ASR), producing text that can be analyzed with sentiment classifiers. At the same time, emotion recognition algorithms extract affective signals from the audio stream. These complementary sources of information are then fused, allowing the model to capture both alignment and incongruity between modalities, for example, positive words delivered in a flat or negative tone.
When evaluated on the MUStARD++ dataset (audio only), our method outperformed the previous state-of-the-art by +4.79% F1-score.
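The incongruity cue at the heart of this fusion can be shown with a toy example: score sentiment from the ASR transcript, take an emotion valence from the audio, and treat their mismatch as a sarcasm signal. The word lists and scoring functions are hypothetical stand-ins for real sentiment and emotion models.

```python
POSITIVE = {"great", "love", "wonderful", "fantastic"}
NEGATIVE = {"terrible", "hate", "awful"}

def text_sentiment(transcript):
    """Crude lexicon score standing in for a sentiment classifier."""
    words = transcript.lower().replace(".", "").split()
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

def incongruity(transcript, audio_valence):
    """Positive words over negative-sounding audio (or vice versa)
    yield a large incongruity score; aligned signals yield zero."""
    s = text_sentiment(transcript)
    return abs(s - audio_valence) if s * audio_valence < 0 else 0.0

flat_delivery = -1.0   # pretend an emotion model rated the audio negative
score = incongruity("What a fantastic meeting.", flat_delivery)
```

A fused classifier can then use such mismatch features alongside the raw modality embeddings, capturing exactly the "positive words, flat tone" pattern the description mentions.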
Expanded description:
Sarcasm often relies on a delicate balance between prosody (tone, pitch, rhythm) and semantics (word meaning). This study investigates how these cues interact to shape sarcastic intent. Using a dataset of sarcastic utterances from television shows, we analyzed prosodic features across three types of sarcasm (embedded, propositional, and illocutionary), which vary in the degree of semantic cues they provide.
The results reveal a functional trade-off. When sarcastic meaning was highly salient in the semantics (e.g., phrases that are obviously ironic), speakers relied less on prosodic modulation. Conversely, when semantic signals were weaker, prosodic cues like pitch shifts, intonation, and rhythm became more important in signaling sarcasm.
These findings highlight that sarcasm is not uniformly expressed but emerges through an interaction between prosody and semantics. At the phrase level, speakers adaptively weight these cues, reducing reliance on one when the other is strong.
Expanded description:
In this study, we present the first attempt at sarcastic speech synthesis in low-resource conditions. We leverage transfer learning, fine-tuning a pre-trained speech synthesis model on a dataset that includes multiple speaking styles, with a subset of sarcastic speech. The resulting system was able to generate sarcastic intonation, though synthesized output retained some robotic artifacts, reflecting both the promise and limitations of the approach.
Our results demonstrate that transfer learning offers a viable path forward for sarcasm synthesis in data-poor settings, achieving moderate performance gains despite limited resources. This proof-of-concept opens the door to more advanced methods, such as multimodal modeling, to further improve the expressiveness and naturalness of generated sarcastic speech.
Expanded description:
Sarcasm is widely used in everyday conversation, signaled through both acoustic cues (pitch, intonation, intensity) and visual cues (facial expression, eye gaze). While these markers are well-studied, attempts to build systems that automatically detect sarcasm in speech remain rare.
In this work, we address this gap by applying inductive transfer learning (ITL) with deep convolutional neural networks (DCNNs) for sarcasm detection in speech. Using the multimodal MUStARD dataset, we evaluate two pre-trained CNN architectures: VGGish (trained on large audio datasets) and Xception (trained on image datasets).
Our experiments show that VGGish, when used as an audio feature extractor, outperforms Xception, highlighting the importance of relevance between source and target datasets in transfer learning. Both models, however, significantly surpass the traditional SVM baseline, yielding +7% and +5% F-score improvements in unimodal sarcasm detection.
This work demonstrates the promise of transfer learning in adapting models trained on broad speech or vision tasks to the niche domain of sarcasm detection, opening pathways for more effective multimodal sarcasm recognition systems.
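The transfer-learning pattern the paper uses, a frozen pretrained network as feature extractor with a light classifier on top, can be sketched as follows. Here `pretrained_embed` is a stand-in for VGGish and the nearest-centroid classifier stands in for the SVM; a real pipeline would load the released VGGish model and feed it log-mel patches.

```python
def pretrained_embed(audio):
    # stand-in: summary statistics in place of learned 128-d embeddings
    mean = sum(audio) / len(audio)
    var = sum((x - mean) ** 2 for x in audio) / len(audio)
    return (mean, var)

def fit_centroids(clips, labels):
    """Train a light classifier on frozen embeddings: one centroid per class."""
    sums, counts = {}, {}
    for clip, y in zip(clips, labels):
        e = pretrained_embed(clip)
        s = sums.setdefault(y, [0.0, 0.0])
        s[0] += e[0]; s[1] += e[1]
        counts[y] = counts.get(y, 0) + 1
    return {y: (s[0] / counts[y], s[1] / counts[y]) for y, s in sums.items()}

def predict(clip, centroids):
    e = pretrained_embed(clip)
    return min(centroids, key=lambda y: (e[0] - centroids[y][0]) ** 2
                                        + (e[1] - centroids[y][1]) ** 2)

# toy data: "sarcastic" clips with exaggerated pitch swings (high variance)
clips = [[0, 10, 0, 10], [1, 9, 1, 9], [5, 5, 5, 5], [4, 6, 4, 6]]
labels = ["sarc", "sarc", "lit", "lit"]
centroids = fit_centroids(clips, labels)
```

The paper's finding that VGGish beats Xception corresponds, in this sketch, to choosing an `pretrained_embed` whose training domain (audio) matches the target domain.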
Expanded description:
Multimodal sarcasm detection requires resolving pragmatic incongruity across textual, acoustic, and visual cues through cross-modal reasoning. To enable robust sarcasm reasoning with foundation models, we propose SarcasmMiner, a reinforcement-learning-based post-training framework that resists hallucination in multimodal reasoning. We reformulate sarcasm detection as structured reasoning and adopt a dual-track distillation strategy: high-quality teacher trajectories initialize the student model, while the full set of trajectories trains a generative reward model (GenRM) to evaluate reasoning quality. The student is optimized with group relative policy optimization (GRPO) using decoupled rewards for accuracy and reasoning quality. On MUStARD++, SarcasmMiner raises F1 from 59.83% (zero-shot) and 68.23% (supervised fine-tuning) to 70.22%. These findings suggest that reasoning-aware reward modeling enhances both performance and multimodal grounding.
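The group-relative core of GRPO with decoupled rewards can be sketched as below: each sampled reasoning trace receives a combined reward (an accuracy term plus a reasoning-quality term, here from the GenRM), and advantages are normalized within the group of samples drawn for the same input. The combination weight is an illustrative assumption, not the paper's setting.

```python
def grpo_advantages(accuracy_r, quality_r, w_quality=0.5):
    """Group-relative advantages from decoupled reward terms.

    accuracy_r, quality_r: per-sample rewards for the same input's
    group of sampled reasoning traces.
    """
    # combine the decoupled reward terms per trace
    rewards = [a + w_quality * q for a, q in zip(accuracy_r, quality_r)]
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0            # guard against a degenerate group
    # normalize within the group: better-than-average traces get
    # positive advantage, worse-than-average traces negative
    return [(r - mean) / std for r in rewards]
```

These advantages then weight the policy-gradient update on the student model, so traces that are both correct and well reasoned are reinforced relative to their group.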