For track 1, we directly use data from EnvSDD. Specifically, we first collect real samples from various open-source datasets, including UrbanSound8K, TAU UAS 2019 Open Dev, TUT SED 2016, TUT SED 2017, DCASE 2023 Task 7 Dev, and Clotho, covering a wide range of real-life scenarios. All audio recordings are resampled to 16 kHz and split into 4 s clips. Then, as shown in Table 1, we use five text-to-audio (TTA) models and two audio-to-audio (ATA) models to generate the deepfake sound clips. Table 2 summarizes the statistics of the track 1 dataset. For the training and validation splits, the deepfake data are generated with G01 to G04, while G05 to G07 are reserved for evaluation and test. The evaluation set allows participants to select their models during the progress phase, while the test set determines the final ranking in the evaluation phase. Specifically, the evaluation set is a randomly sampled subset of the test set, so performance measured on it remains aligned with the test distribution while the integrity of the final assessment is preserved.
Table 1: TTA and ATA models included in the ESDD challenge
Table 2: Statistics of data in track 1
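To make the preprocessing pipeline concrete, the snippet below is a minimal sketch that resamples a recording to 16 kHz and cuts it into non-overlapping 4 s clips. It assumes torchaudio is available; the mono mixdown, the dropping of the trailing remainder, and the file paths are illustrative choices rather than part of the EnvSDD specification.

```python
import torchaudio

TARGET_SR = 16_000           # target sample rate: 16 kHz
CLIP_LEN = 4 * TARGET_SR     # 4 s clips at 16 kHz

def to_clips(in_path: str):
    """Resample one recording to 16 kHz and yield non-overlapping 4 s clips."""
    wav, sr = torchaudio.load(in_path)       # wav: (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)      # assumption: mix down to mono
    if sr != TARGET_SR:
        wav = torchaudio.functional.resample(wav, sr, TARGET_SR)
    # Cut into consecutive 4 s segments; the trailing remainder is dropped
    # (how EnvSDD treats remainders is not specified here).
    for i in range(wav.shape[1] // CLIP_LEN):
        yield wav[:, i * CLIP_LEN : (i + 1) * CLIP_LEN]

# Example usage: write the clips of one (hypothetical) file to disk.
for idx, clip in enumerate(to_clips("example.wav")):
    torchaudio.save(f"example_clip{idx:03d}.wav", clip, TARGET_SR)
```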
For track 2, we combine the training and validation sets from track 1 with the black-box data, where the latter accounts for only 1% of the combined data. Table 3 shows the statistics of the black-box data, whose scale is much smaller than that of the track 1 data.
Table 3: Statistics of the black-box data in track 2
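As a rough illustration of how the stated 1% proportion translates into sample counts, the sketch below assembles a combined track 2 pool from two file lists. The function name, the fixed seed, and the use of plain file lists are hypothetical; the composition of the released track 2 data is fixed by the organizers, so this only mirrors the ratio arithmetic.

```python
import random

def build_track2_pool(track1_files: list[str], blackbox_files: list[str],
                      blackbox_frac: float = 0.01, seed: int = 0) -> list[str]:
    """Mix track 1 data with black-box data so that the latter makes up
    `blackbox_frac` of the combined pool (hypothetical helper)."""
    # If black-box data is a fraction f of the pool, then
    # n_bb / (n_t1 + n_bb) = f  =>  n_bb = n_t1 * f / (1 - f).
    n_bb = round(len(track1_files) * blackbox_frac / (1.0 - blackbox_frac))
    n_bb = min(n_bb, len(blackbox_files))    # cannot sample more than exists
    rng = random.Random(seed)
    pool = track1_files + rng.sample(blackbox_files, n_bb)
    rng.shuffle(pool)
    return pool
```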