As shown in Figure 1, this track encourages participants to develop robust ESDD systems for unseen text-to-audio (TTA) and audio-to-audio (ATA) generators. The primary objective is to promote detection models that can effectively identify synthetic environmental sounds produced by TTA and ATA systems that were not encountered during training. By focusing on the unseen-generator scenario, this track simulates practical use cases in which the source and generation models of deepfake audio are not seen during training. We seek to advance research in generalized deepfake detection and promote the design of models with strong cross-generator robustness and adaptability.
Figure 1: Overview of Track 1 (ESDD in Unseen Generators).
Figure 2 gives an overview of Track 2. The term “black-box” refers to the condition where we have no prior knowledge of the specific generation methods used in testing, which may include generative paradigms beyond TTA and ATA. The “low-resource” setting indicates that the amount of available black-box training data is severely limited, constituting only 1% of the total training data. Together, these two conditions present a realistic and challenging scenario that simulates practical deepfake detection under extreme uncertainty and data scarcity.
Figure 2: Overview of Track 2 (Black-Box Low-Resource ESDD).
We employ the equal error rate (EER) as the metric to evaluate ESDD performance. Participants are expected to submit a score file in TXT format containing a confidence score for each segmented sound clip. Each score indicates the system's confidence that the given sound clip is a real recording rather than a generated one. In practical applications, users may set a decision threshold to convert these scores into binary outputs. As the threshold increases, the false acceptance rate decreases, but the false rejection rate increases. The EER is the error rate at the threshold where these two rates are equal, providing a reliable and threshold-independent measure of system performance. A lower EER indicates better detection performance.
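To make the metric concrete, the following is a minimal sketch of one common way to estimate the EER from a list of scores. It is not the official scoring script: the function name `compute_eer`, the label convention (1 for real clips, 0 for generated clips), and the choice to average the two rates at the closest threshold, rather than interpolate between thresholds, are all assumptions made for illustration.

```python
import numpy as np


def compute_eer(scores, labels):
    """Estimate the equal error rate (hypothetical helper, not the official scorer).

    scores: higher means the system is more confident the clip is real.
    labels: 1 for real clips, 0 for generated clips (assumed convention).
    Returns (eer, threshold) at the threshold where FAR and FRR are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    real = scores[labels == 1]
    fake = scores[labels == 0]

    # Candidate thresholds: every distinct score, in ascending order.
    thresholds = np.unique(scores)

    # A clip is accepted as real when its score is >= the threshold.
    # FAR: fraction of generated clips wrongly accepted as real.
    far = np.array([(fake >= t).mean() for t in thresholds])
    # FRR: fraction of real clips wrongly rejected.
    frr = np.array([(real < t).mean() for t in thresholds])

    # With finitely many clips the two rates may never be exactly equal,
    # so take the threshold where they are closest and average them.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2.0
    return float(eer), float(thresholds[idx])


if __name__ == "__main__":
    # Toy scores for four clips: two real, two generated.
    eer, thr = compute_eer([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
    print(f"EER = {eer:.2f} at threshold {thr:.2f}")
```

In the toy example, one real clip scores below one generated clip, so no threshold separates the classes; the two error rates meet when half of each class is misclassified. Scoring tools that interpolate between adjacent thresholds can return slightly different values on small test sets.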