Following EnvSDD, we develop two baseline systems: AASIST and BEATs+AASIST.
AASIST is an end-to-end system that uses a novel heterogeneous stacking graph attention mechanism to learn acoustic features; it has been applied in various speech and singing voice deepfake detection challenges.
Building on AASIST, BEATs+AASIST incorporates BEATs as the front-end to extract high-level acoustic representations, achieving better performance than AASIST on EnvSDD.
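The front-end + back-end design above can be sketched as follows. This is a minimal illustration only, not the released implementation: `FrontEndBackEnd`, `toy_front_end`, and the classifier head are hypothetical stand-ins (the real system uses a pretrained BEATs encoder and the AASIST graph-attention back-end).

```python
import torch
import torch.nn as nn

class FrontEndBackEnd(nn.Module):
    """Hypothetical sketch: a front-end encoder (standing in for BEATs)
    produces frame-level embeddings, which a classifier head (standing
    in for AASIST) maps to real-vs-fake logits."""

    def __init__(self, front_end: nn.Module, embed_dim: int = 768):
        super().__init__()
        self.front_end = front_end            # e.g. a frozen BEATs encoder
        self.back_end = nn.Sequential(        # placeholder for AASIST
            nn.Linear(embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 2),                # logits: [real, fake]
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        feats = self.front_end(wav)           # (batch, frames, embed_dim)
        pooled = feats.mean(dim=1)            # temporal average pooling
        return self.back_end(pooled)

# Toy front-end for demonstration only (not BEATs): reshape a 1 s clip
# into 100 frames of 160 samples, then project each frame to embed_dim.
toy_front_end = nn.Sequential(
    nn.Unflatten(1, (100, 160)),
    nn.Linear(160, 768),
)
model = FrontEndBackEnd(toy_front_end)
logits = model(torch.randn(4, 16000))         # batch of 4 one-second clips
print(tuple(logits.shape))                    # (4, 2)
```

Swapping `toy_front_end` for the pretrained encoder is the only change the two baselines differ by: AASIST operates on raw waveforms end-to-end, while BEATs+AASIST consumes the encoder's high-level representations.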
Results of the baselines are shown in the table below. BEATs+AASIST outperforms AASIST; however, both systems show a significant drop in performance on the test sets, highlighting the challenge that ESDD poses in unseen scenarios.
Code and models are available at: https://github.com/apple-yinhan/ESDD-Challenge
Table: Performance of the baselines on Track 1 and Track 2