Special Session@IEEE SLT 2026
Background
With the rapid advancement of deep learning and generative AI, creating and manipulating media has become increasingly accessible. In the field of speech processing, audio/speech editing algorithms such as VoiceBox [2], A3T [3], SpeechX [4], and VoiceCraft [5] have enabled inexperienced users to generate highly realistic audio content with minimal effort [6][7]. These tools allow users to modify specific segments of existing speech without altering the rest of the recording, eliminating the need to re-record entire utterances. For instance, users can correct mispronunciations by simply editing the relevant portion, rather than regenerating the full audio. Although speech editing has valuable applications, it also poses significant risks of malicious use, such as manipulating/editing speech from public figures, misleading voice biometrics, and committing fraud. The presence of unaltered, real segments in edited speech can mislead the detector, making the identification of manipulations more challenging.
Justification
Despite the growing capabilities of audio/speech editing, several challenges remain from both the synthesizer and defender perspectives.
From the synthesizer’s point of view, a major challenge lies in ensuring that the untouched portions of the original speech remain acoustically and perceptually unchanged. Existing evaluation protocols are limited; for example, UniCATS [8] only supports short editing spans, making it insufficient for real-world use cases that require editing longer multi-word phrases. Although RealEdit [5] improves evaluation by leveraging ground-truth waveforms for acoustic-level metrics beyond WER, it is restricted to pre-existing data and cannot be applied when the ground truth is unavailable or comes from a different speaker. Moreover, leveraging linguistically enriched inputs, such as prosody-annotated text, for context-aware editing remains underexplored [9]. Few unified frameworks integrate speech editing with zero-shot text-to-speech, especially for tasks such as continuation or infilling. In addition, controlling attributes of the edited parts beyond content (such as emotion [10], prosody [11], background [12], and sound scenes [13]) has received little attention.
On the defender’s side, although audio/speech editing opens up exciting opportunities, it also poses serious risks when misused, such as in spreading misinformation or breaching security systems. Edited audio becomes increasingly difficult to detect when only small portions are modified, as the unedited parts can mislead both human listeners and automated decision systems. This calls for advanced defense strategies, such as localization [14] and diarization [15], that go beyond binary detection. Although partial deepfakes have been discussed in the literature [14][16][17][18][19], very few studies have explored emerging neural speech editing [20]. Neural speech editing uses neural networks to smooth the transitions between edited and unedited parts, and it enables fine-grained modifications not only to content but also to emotional attributes [10], making the edited parts even more challenging to detect. Furthermore, proactive defenses, such as watermarking of fake regions, remain underdeveloped.
Objectives
This special session aims to explore the emerging challenges of partially edited audio/speech/music/singing and to foster collaboration between the synthesis and defense communities. For more details, please refer to the call-for-papers page.
Acknowledgment
We would like to thank Prof. Junichi Yamagishi for his valuable comments.
References
[1] Kässmann, Tobias, Yining Liu, and Danni Liu. "Speech Editing--a Summary." arXiv preprint arXiv:2407.17172 (2024).
[2] Le, Matthew, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, et al. "VoiceBox: Text-guided multilingual universal speech generation at scale." Advances in Neural Information Processing Systems, vol. 36, pp. 14005–14034, 2023.
[3] Bai, He, Renjie Zheng, Junkun Chen, Mingbo Ma, Xintong Li, and Liang Huang. "A³T: Alignment-aware acoustic and text pretraining for speech synthesis and editing." In International Conference on Machine Learning (ICML), pp. 1399–1411. PMLR, 2022.
[4] Wang, Xiaofei, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, and Takuya Yoshioka. "SpeechX: Neural codec language model as a versatile speech transformer." IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3355–3364, 2024.
[5] Peng, Puyuan, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, and David Harwath. "VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild." In Proceedings of ACL, pp. 12442–12462, 2024.
[6] Descript. https://www.descript.com/
[7] Morrison, Max, Lucas Rencker, Zeyu Jin, Nicholas J. Bryan, Juan-Pablo Caceres, and Bryan Pardo. "Context-aware prosody correction for text-based speech editing." In Proceedings of ICASSP 2021, pp. 7038–7042. IEEE, 2021.
[8] Du, Chenpeng, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen, Shuai Wang, Hui Zhang, and Kai Yu. "UniCATS: A unified context-aware text-to-speech framework with contextual VQ-diffusion and vocoding." In Proceedings of AAAI, vol. 38, no. 16, pp. 17924–17932, 2024.
[9] Mohammad, Baher, Magauiya Zhussip, and Stamatios Lefkimmiatis. "Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba." arXiv preprint arXiv:2510.04738, 2025.
[10] Liu, Rui, Pu Gao, Jiatian Xi, Berrak Sisman, Carlos Busso, and Haizhou Li. "Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset." Proc. Interspeech 2025, 4803-4807.
[11] Morrison, Max, Cameron Churchwell, Nathan Pruyne, and Bryan Pardo. "Fine-grained and interpretable neural speech editing." Proc. Interspeech 2024, 187-191.
[12] Chen, Kuan-Yu, Jeng-Lin Li, and Jian-Jiun Ding. "SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement." arXiv preprint arXiv:2505.14066 (2025).
[13] Ellis, Daniel PW, Eduardo Fonseca, Ron J. Weiss, Kevin Wilson, Scott Wisdom, Hakan Erdogan, John R. Hershey, Aren Jansen, R. Channing Moore, and Manoj Plakal. "Recomposer: Event-roll-guided generative audio editing." arXiv preprint arXiv:2509.05256 (2025).
[14] Zhang, Lin, Xin Wang, Erica Cooper, Junichi Yamagishi, Jose Patino, and Nicholas Evans. "An Initial Investigation for Detecting Partially Spoofed Audio." Proc. Interspeech 2021, 4264-4268.
[15] Zhang, Lin, Xin Wang, Erica Cooper, Mireia Diez, Federico Landini, Nicholas Evans, and Junichi Yamagishi. "Spoof Diarization: 'What Spoofed When' in Partially Spoofed Audio." Proc. Interspeech 2024, 502-506.
[16] Zhang, Lin, Xin Wang, Erica Cooper, Nicholas Evans, and Junichi Yamagishi. "The Partialspoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance." IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 813–825, 2022.
[17] Luong, Hieu-Thi, Haoyang Li, Lin Zhang, Kong Aik Lee, and Eng Siong Chng. "LlamaPartialspoof: An LLM-driven fake speech dataset simulating disinformation generation." In Proceedings of ICASSP 2025, pp. 1–5. IEEE, 2025.
[18] Yi, Jiangyan, Chu Yuan Zhang, Jianhua Tao, Chenglong Wang, Xinrui Yan, Yong Ren, Hao Gu, and Junzuo Zhou. "ADD 2023: Towards Audio Deepfake Detection and Analysis in the Wild." arXiv preprint arXiv:2408.04967, 2024.
[19] He, Jiayi, Jiangyan Yi, Jianhua Tao, Siding Zeng, and Hao Gu. "Manipulated Regions Localization For Partially Deepfake Audio: A Survey." arXiv preprint arXiv:2506.14396, 2025.
[20] Zhang, Y., Tian, B., Zhang, L., Duan, Z. "PartialEdit: Identifying Partial Deepfakes in the Era of Neural Speech Editing." In Proceedings of Interspeech 2025, pp. 5353–5357.
Call for Papers
Synthesis:
Techniques for partially editing content/background/emotion/prosody/object/etc. of audio/speech/music/singing or audio-visual media.
Methods to ensure acoustic and perceptual consistency after editing
Datasets, benchmarks, and toolkits for partial audio/speech editing
Unified models for zero-shot TTS (continuation) and speech editing (infilling)
Partial audio/speech editing for more complicated scenarios, such as long-form and/or multi-speaker conversations, noisy backgrounds, and multilingual editing.
Fairness, biases, harms, risks and socio-ethical failures of partial editing.
Defense:
Detection, localization, and diarization of partially edited audio
Proactive protection against partial edits, such as watermarking
Adaptation and generalization methods for identifying edits
Human vs. machine performance in detecting partially edited audio
Reasoning, explainability, interpretability, and transparency techniques for defense against partial edits in speech
Ethics of data collection, annotation, and use of data for speech editing.
Fairness and biases in defending against audio/speech/music/singing editing.
Joint defense against partial editing with other downstream tasks, such as ASV and ASR.
Other novel topics related to audio/speech/music/singing editing
Submission Link: Submit Paper
Instruction: We are following the same Author Instructions as INTERSPEECH 2026.
Organizers
Johns Hopkins University
The University of Texas at Austin
National Institute of Informatics
University of Rochester