We study four widely adopted models in our work: RawNet2 [1], AASIST [2], Res-TSSDNet [3], and SAMO [4]. To strengthen the credibility of our evaluation, we use the models released by their respective authors. You can access and download them via the following links; a minimal loading sketch follows the list:
RawNet2: https://www.asvspoof.org/asvspoof2021/pre_trained_DF_RawNet2.zip
AASIST: https://github.com/clovaai/aasist/blob/main/models/weights/AASIST.pth
SAMO: https://github.com/sivannavis/samo/blob/main/models/samo.pt
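The checkpoints are standard PyTorch weight files. Below is a minimal loading sketch, assuming the model object is first instantiated from the corresponding authors' codebase and that the checkpoint stores a plain state dict:

```python
import torch

def load_released_model(model: torch.nn.Module, ckpt_path: str) -> torch.nn.Module:
    """Load released weights into a model built from the authors' code."""
    # Assumption: the file holds a plain state dict; if a checkpoint wraps
    # the weights (e.g., under a "model" key), unwrap it before loading.
    state_dict = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state_dict)
    return model.eval()  # evaluation mode for inference
```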
In the tables presented below, a higher False Acceptance Rate (FAR) indicates a greater likelihood that an attacker can evade the detector with manipulation attacks. Likewise, a lower F1 score corresponds to poorer detector performance.
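For clarity, here is a minimal sketch of how these two metrics can be computed from binary labels. The conventions below (FAR as the fraction of deepfake samples accepted as real, F1 computed for the deepfake class) are one plausible reading; see the repository for the exact implementation.

```python
import numpy as np

def far_and_f1(y_true: np.ndarray, y_pred: np.ndarray) -> tuple[float, float]:
    """Compute FAR and F1 from binary labels (1 = real, 0 = deepfake)."""
    fake = y_true == 0
    # FAR: deepfake samples that the detector accepted as real.
    far = float(np.mean(y_pred[fake] == 1))

    # F1 for the deepfake class (assumption: deepfake = positive class).
    tp = np.sum((y_pred == 0) & (y_true == 0))  # deepfakes caught
    fp = np.sum((y_pred == 0) & (y_true == 1))  # real audio flagged as fake
    fn = np.sum((y_pred == 1) & (y_true == 0))  # deepfakes missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return far, f1
```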
To reproduce our results, please refer to the instructions in our repository: https://github.com/CLAD23/CLAD.
1. For environmental noise injection, the abbreviations WD, FS, BR, CO, RA, CT, and SN stand for wind, footsteps, breathing, coughing, rain, clock tick, and sneezing, respectively (sketches of all three manipulations follow this list).
2. For echo addition, "1k/ .2" means the delay is 1,000 samples and the attenuation factor is 0.2; the remaining settings follow the same notation.
3. For fading, ".5/ L" means the fade ratio is set to 0.5 with a linear fade shape. L, E, Q, H, and Lo denote linear, exponential, quarter-sinusoidal, half-sinusoidal, and logarithmic fade shapes, respectively.
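For illustration, minimal sketches of the three manipulations are given below. These encode simplifying assumptions (e.g., that the noise is mixed at a target SNR, and that the fade ratio is applied to both fade-in and fade-out via torchaudio's Fade transform); the repository contains the exact implementations.

```python
import torch
import torchaudio

def inject_noise(x: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix an environmental noise clip (wind, rain, ...) into x at a target SNR (dB)."""
    noise = noise[..., : x.shape[-1]]                # assume noise is long enough
    sig_pow = x.pow(2).mean()
    noise_pow = noise.pow(2).mean().clamp_min(1e-10)
    scale = torch.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return x + scale * noise

def add_echo(x: torch.Tensor, delay: int, attenuation: float) -> torch.Tensor:
    """Add a copy of x delayed by `delay` samples and scaled by `attenuation`,
    matching the "1k/ .2" notation above (delay = 1000, attenuation = 0.2)."""
    y = x.clone()
    if delay > 0:
        y[..., delay:] += attenuation * x[..., :-delay]
    return y

def apply_fade(x: torch.Tensor, ratio: float, shape: str = "linear") -> torch.Tensor:
    """Fade in and out over `ratio` of the waveform; `shape` is one of
    torchaudio's fade shapes: "linear", "exponential", "quarter_sine",
    "half_sine", "logarithmic" (the L/E/Q/H/Lo abbreviations above)."""
    n = int(ratio * x.shape[-1])
    fade = torchaudio.transforms.Fade(fade_in_len=n, fade_out_len=n, fade_shape=shape)
    return fade(x)
```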
Here, we provide the waveform and spectrogram of an audio deepfake sample, together with the same representations under different manipulation settings.
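These views can be reproduced along the following lines; the file name and plotting parameters below are placeholders, not the exact figure settings.

```python
import torch
import torchaudio
import matplotlib.pyplot as plt

waveform, sr = torchaudio.load("sample.flac")        # placeholder file name
spec = torchaudio.transforms.Spectrogram(n_fft=512)(waveform)
log_spec = 10 * torch.log10(spec.clamp_min(1e-10))   # power spectrogram in dB

fig, (ax_wave, ax_spec) = plt.subplots(2, 1, figsize=(8, 4))
ax_wave.plot(waveform[0].numpy())
ax_wave.set_ylabel("Amplitude")
ax_spec.imshow(log_spec[0].numpy(), origin="lower", aspect="auto")
ax_spec.set_xlabel("Frame")
ax_spec.set_ylabel("Frequency bin")
plt.tight_layout()
plt.show()
```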
The rightmost column displays the probability scores from the detection models, indicating the likelihood that the audio is real. All models correctly identified the original audio as a deepfake, yet these state-of-the-art models failed to detect the manipulated audio under certain manipulation settings.
In this section, we showcase audio samples manipulated with various methods and parameters, together with the detection results from the different detection models. In the table, a cross indicates that the model classifies the audio as a deepfake, while a checkmark indicates that the audio is classified as real, i.e., the attacker successfully evaded the detection model.
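A minimal sketch of how a per-audio score and the corresponding table symbol can be derived is shown below. The logit order [spoof, bona fide] follows the AASIST codebase and is an assumption for the other models; the 0.5 threshold is illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def real_probability(model: torch.nn.Module, waveform: torch.Tensor) -> float:
    """Score a waveform of shape (1, num_samples) with a two-class detector."""
    out = model(waveform)
    if isinstance(out, tuple):  # e.g., AASIST returns (embedding, logits)
        out = out[-1]
    # Assumption: logits ordered [spoof, bona fide]; index 1 = probability of real.
    return F.softmax(out, dim=-1)[0, 1].item()

def table_symbol(p_real: float, threshold: float = 0.5) -> str:
    """Checkmark (successful evasion) if the score crosses the threshold."""
    return "✓" if p_real >= threshold else "✗"
```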
[1] Hemlata Tak, Jose Patino, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans, and Anthony Larcher. 2021. End-to-End Anti-Spoofing with RawNet2. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6369–6373.
[2] Jee-weon Jung, Hee-Soo Heo, Hemlata Tak, Hye-jin Shim, Joon Son Chung, Bong-Jin Lee, Ha-Jin Yu, and Nicholas Evans. 2022. AASIST: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6367–6371.
[3] Guang Hua, Andrew Beng Jin Teoh, and Haijian Zhang. 2021. Towards End-to-End Synthetic Speech Detection. IEEE Signal Processing Letters 28 (2021), 1265–1269.
[4] Siwen Ding, You Zhang, and Zhiyao Duan. 2023. SAMO: Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5. https://doi.org/10.1109/ICASSP49357.2023.10094704