Grand Challenge at ICASSP 2026
Barcelona, Spain
4-8 May 2026
Imagine sitting comfortably on your couch, watching TV late at night, when suddenly you hear the sharp sound of a fire alarm piercing through the quiet. At first, you might ignore it, thinking it’s just a scene from a movie or a false alarm outside. But then, you begin to hear hurried footsteps echoing in the hallway, followed by distant voices and the unmistakable wail of an emergency siren. Your heart starts racing. Without thinking twice, you jump off the couch, grab your keys, and rush out the door. However, what if those sounds were not real at all? What if they were artificially generated by AI models?
In recent years, rapid advances in audio generation models have made it increasingly feasible to produce highly realistic environmental sounds, such as alarms, footsteps, and sirens. These deepfake environmental sounds can be used in creative media production, such as film, gaming, and virtual reality. However, they also pose significant risks when misused to fabricate misleading content that can manipulate public perception or cause panic. Developing effective methods for detecting deepfake environmental sounds has therefore become a growing concern in both the academic and security communities.
To advance the field of Environmental Sound Deepfake Detection (ESDD), we proposed EnvSDD [1], the first large-scale curated dataset for ESDD, comprising 45.25 hours of real and 316.74 hours of fake audio. Building on EnvSDD, we are now launching the ICASSP 2026 Environmental Sound Deepfake Detection Challenge [4], the first challenge dedicated to this emerging and critical research area.
To address the key challenges encountered in real-life scenarios, we have designed two tracks: ESDD in Unseen Generators (Track 1) and Black-Box Low-Resource ESDD (Track 2). Track 1 examines how well detectors generalize to unseen Text-to-Audio (TTA) [2] and Audio-to-Audio (ATA) [3] generators. Track 2 presents a more challenging scenario, simulating real-world deepfake detection under extreme uncertainty and limited data.
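Although the challenge does not prescribe any particular detection approach, both tracks ultimately ask a system to score each audio clip as real or fake. The sketch below is a minimal illustration of such a scorer, assuming PyTorch and torchaudio; the architecture, the class name SimpleESDDNet, and all hyperparameters are hypothetical choices for illustration and are not the official challenge baseline.

import torch
import torch.nn as nn
import torchaudio

class SimpleESDDNet(nn.Module):
    """Toy CNN mapping a log-mel spectrogram to a fake-probability (illustrative only)."""

    def __init__(self, n_mels: int = 64):
        super().__init__()
        # Log-mel frontend; sample rate and FFT settings are arbitrary choices here.
        self.frontend = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=1024, hop_length=320, n_mels=n_mels
        )
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool over both frequency and time
        )
        self.classifier = nn.Linear(32, 1)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        mel = self.frontend(waveform)                  # (batch, n_mels, frames)
        logmel = torch.log(mel + 1e-6).unsqueeze(1)    # (batch, 1, n_mels, frames)
        embedding = self.encoder(logmel).flatten(1)    # (batch, 32)
        return torch.sigmoid(self.classifier(embedding)).squeeze(1)

if __name__ == "__main__":
    model = SimpleESDDNet()
    clips = torch.randn(2, 16000 * 4)  # two random 4-second "clips"
    print(model(clips))                # one fake-probability per clip

Whatever model participants choose, the interface stays the same: a waveform in, a single real-versus-fake score out.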
We warmly invite researchers from both academia and industry to participate in this challenge and explore robust, effective solutions to these critical deepfake detection tasks.
References:
[1] Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Haohe Liu, Wenwu Wang, and Mark D Plumbley, “EnvSDD: Benchmarking environmental sound deepfake detection,” in Proc. Interspeech, 2025.
[2] Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley, “AudioLDM: Text-to-audio generation with latent diffusion models,” in Proc. International Conference on Machine Learning (ICML), 2023, pp. 21450–21474.
[3] Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D Plumbley, “AudioLDM 2: Learning holistic audio generation with self-supervised pretraining,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2871–2883, 2024.
[4] Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, and Ting Dang, “ESDD 2026: Environmental Sound Deepfake Detection Challenge Evaluation Plan,” arXiv preprint arXiv:2508.04529, 2025.