I'm a linguist on a team of data scientists, and together we're researching how linguistic theory can be used to strategically improve audio deepfake detection models.
Before I joined the team at UMBC, it had found that hand-coded annotations by expert linguists can improve deepfake detection model predictions (Khanjani et al., 2023). However, the annotations were coarse, sorted into just seven loosely defined categories:
Khanjani et al. (2023, October). In 2023 IEEE International Conference on Intelligence and Security Informatics (ISI) (pp. 1-6).
A top priority for the data scientists on the team was a taxonomy that clearly defined these categories, which became my first order of business. In building the new taxonomy from the previously existing categories, I addressed three problems with the earlier scheme:
Unclear feature categories. The rules for how audio features should be categorized were not clearly defined, leading to inconsistent interpretations across individual annotators.
Equal weight on all features. Some features are more reliable predictors of deepfaked audio than others, but the previous model weighted all anomalous speech features equally. For example, the feature list previously included "anomalous presence/absence of burst," a feature the linguists found difficult to identify and one that did not improve prediction accuracy, so I recommended to the team that we drop it.
Poor feature definitions. The previous taxonomy described features imprecisely. For example, "compressed vowel space" was previously defined as "squashed," a descriptor not used in acoustic phonetics research. Imprecise definitions like this make our annotation methods unclear to other researchers working on deepfake detection.
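To illustrate the weighting point above: rather than counting every anomalous-feature flag equally, a model can learn one weight per feature, so reliable cues dominate the score. This is only a sketch with simulated binary annotations (not our data), using a plain logistic regression fit by gradient descent; the feature setup and the idea that feature 0 is the strong cue are my own assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary annotations for six linguistic features per audio clip.
# By construction, feature 0 is a strong deepfake cue; the rest are weak or noise.
n = 400
X = rng.integers(0, 2, size=(n, 6)).astype(float)
y = ((X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.3, n)) > 0.6).astype(float)

# Plain logistic regression: one learned weight per feature instead of
# treating every anomalous-feature flag as equally predictive.
w, b = np.zeros(6), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probability of "fake"
    w -= 0.5 * (X.T @ (p - y) / n)           # gradient step on weights
    b -= 0.5 * np.mean(p - y)                # gradient step on bias

print(np.round(w, 2))  # the reliable feature ends up with much the largest weight
```

In a real pipeline the weights would come from annotated training data, but the mechanism is the same: the model, not the taxonomy, decides how much each feature counts.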
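"Compressed vowel space" is one descriptor that can be made precise. A common quantitative stand-in is the area of the polygon formed by a speaker's mean (F1, F2) formant values across vowels: centralized (compressed) vowels yield a smaller area. The formant values below are made up for illustration, not measurements from our corpus.

```python
import numpy as np

def vowel_space_area(formants):
    """Shoelace area (Hz^2) of the polygon whose vertices are mean
    (F1, F2) points, one per vowel, listed in order around the space."""
    pts = np.asarray(formants, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# Hypothetical mean formants (F1, F2) in Hz for the corner vowels /i a u/
natural   = [(300, 2300), (800, 1200), (350, 800)]
synthetic = [(400, 2000), (700, 1300), (450, 1000)]  # centralized vowels

ratio = vowel_space_area(synthetic) / vowel_space_area(natural)
print(f"synthetic/natural area ratio: {ratio:.2f}")  # < 1 indicates compression
```

A definition like "vowel space area, measured as the convex hull of mean formant points" is reproducible by other researchers in a way that "squashed" is not.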
The current taxonomy still has important areas for improvement. Some feature definitions remain impressionistic, like "tinny," and I am coordinating with the lead linguist on the team to eliminate descriptors like these. One reason I was hired onto this project is my knowledge of acoustic phonetics and signal processing: the lead linguist has a background in qualitative variationist sociolinguistics, so I translate her ideas into quantifiable metrics for modeling.
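As one example of that translation work: the impression of "tinny" audio roughly corresponds to excess high-frequency energy, which can be measured with the spectral centroid (the magnitude-weighted mean frequency of the spectrum). This sketch uses synthetic sine tones, not speech, and is only one candidate metric, not our settled definition.

```python
import numpy as np

def spectral_centroid(signal, sr):
    """Magnitude-weighted mean frequency (Hz) of the signal's spectrum,
    one quantitative stand-in for the impression of 'tinny'."""
    mag = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return np.sum(freqs * mag) / np.sum(mag)

sr = 16000
t = np.arange(sr) / sr  # one second of samples
warm  = np.sin(2 * np.pi * 220 * t)                # low-frequency tone
tinny = warm + 0.8 * np.sin(2 * np.pi * 4000 * t)  # added high partial

print(round(spectral_centroid(warm, sr)), round(spectral_centroid(tinny, sr)))
# prints: 220 1900
```

For real recordings the centroid would be computed per frame and compared against a baseline for natural speech, but even this toy version shows how a vague auditory label becomes a number a model can use.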
Another area of central importance is making sure that new features we add to the model are strategic. Some areas of speech synthesis are improving rapidly compared to others, so resources are best spent on linguistic features where synthetic speech still falls short. For my planned next steps, read more here.