Abstract
Audio recorded in real-world environments often contains a mixture of foreground speech and background sounds. We define such audio as comprising two components: (i) speech, i.e., linguistically meaningful utterances produced by the primary speaker, and (ii) environmental sound, i.e., any non-speech background audio as well as speech from non-target speakers.
With rapid advances in text-to-speech, voice conversion, and other generative models, either component can now be modified independently: for example, the background can be replaced while the foreground speech is left unchanged, or the speech content can be altered while the background is preserved. Such component-level manipulations are harder to detect, since the remaining unaltered component can mislead systems designed to detect fully synthesized deepfake audio, and the results often sound more natural to human listeners.
To address this gap, we launch the ICME 2026 Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), which focuses on component-level spoofing, where either or both of the speech and environmental-sound components may be manipulated or synthesized, creating a more challenging and realistic detection scenario.