Abstract: Speech and speaker recognition systems are employed in a variety of applications, from personal assistants to telephony surveillance and biometric authentication. The wide deployment of these systems has been made possible by the improved accuracy of neural networks. As with other systems based on neural networks, recent research has demonstrated that speech and speaker recognition systems are vulnerable to attacks using manipulated inputs. However, as we demonstrate in this paper, the end-to-end architecture of speech and speaker recognition systems and the nature of their inputs make attacks and defenses against them substantially different from those in the image space. We demonstrate this first by systematizing existing research in this space and providing a taxonomy through which the community can evaluate future work. We then demonstrate experimentally that attacks against these models almost universally fail to transfer. In so doing, we argue that substantial additional work is required to provide adequate mitigations in this space.
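
For context, "transfer" here means that an adversarial example crafted against one model also fools a different model the attacker never accessed. The sketch below illustrates the basic shape of such a check; the model objects and the craft_attack routine are hypothetical placeholders, not code from the paper or from any specific toolkit.

    # Minimal sketch of a cross-model transferability check (hypothetical interfaces).
    def attack_transfers(source_asr, target_asr, audio, target_phrase, craft_attack):
        """Return True if audio perturbed against source_asr also fools target_asr."""
        # Craft the adversarial audio with full (white-box) access to the source model only.
        adversarial_audio = craft_attack(source_asr, audio, target_phrase)

        # Sanity check: the attack should succeed against the model it was crafted on.
        if source_asr.transcribe(adversarial_audio) != target_phrase:
            return False

        # The attack transfers only if the unseen target model is fooled as well.
        return target_asr.transcribe(adversarial_audio) == target_phrase

In the image domain such checks often succeed; the paper's experiments show that for speech and speaker recognition they almost universally fail.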

Link to Full Paper: https://arxiv.org/abs/2007.06622

Note: Below, we provide tables systematizing the attacks and defenses in the space of speech and speaker recognition systems. We will continue to update these tables frequently. Please contact Hadi Abdullah if you have any questions.

To cite our work:

@INPROCEEDINGS{abdullah2020sok,
  author={Abdullah, Hadi and Warren, Kevin and Bindschaedler, Vincent and Papernot, Nicolas and Traynor, Patrick},
  booktitle={IEEE Symposium on Security and Privacy (IEEE S\&P)},
  title={{SoK: The Faults in our ASRs: An Overview of Attacks against Automatic Speech Recognition and Speaker Identification Systems}},
  year={2021}
}

Taxonomy (Attacks)

Tables II and III: These tables summarize the current progress of adversarial attacks against ASR and SI systems. "?": the authors provide no information in the paper. "✓": will work. "P, W, S" = Phoneme, Word, Sentence. "L, A, T" = Over-Line, Over-Air, Over-Telephony-Network. We emailed the authors of each of the papers above regarding their work and have incorporated their responses into the tables using the following symbols. "?": the authors did not test it and are not sure whether it will work. "+": the authors did not test it and believe it will work. "-": the authors did not test it and believe it will not work. "✗": the authors did not respond to our correspondence, but we believe it will not work.

Taxonomy (Defenses)

Table IV: This table provides an overview of the current defenses for ASR and SI systems. "?": the authors provide no information in the paper. "✗": does not work or has not been demonstrated. "✓": will work. "P, W, S" = Phoneme, Word, Sentence. "L, A, T" = Over-Line, Over-Air, Over-Telephony-Network. "ADPT" = effective against an adaptive attacker; "N-ADPT" = effective against a non-adaptive attacker.