The growing popularity of automatic speech recognition (ASR) systems creates an increasing need to improve their accessibility. Handling stuttering speech is an important capability for accessible ASR systems. To improve the accessibility of ASR systems for people who stutter, we need to expose and analyze the failures these systems make on stuttering speech. Although speech datasets recorded from people who stutter can be used for this purpose, they are not diverse enough to expose most failures. A methodology for generating stuttering speech as test inputs is therefore needed to test and analyze the performance of ASR systems. However, generating valid test inputs in this scenario is challenging: the generated inputs must mimic how people who stutter speak, yet also be diverse enough to trigger more failures. To address this challenge, we propose Aster, a technique for automatically testing the accessibility of ASR systems. Aster generates valid test cases by injecting five different types of stuttering, producing audio that both simulates realistic stuttering speech and exposes failures in ASR systems. Moreover, Aster further improves the quality of the test cases with a multi-objective optimization-based seed-updating algorithm. We implemented Aster as a framework and evaluated it on four open-source ASR models and three commercial ASR systems. Our comprehensive evaluation shows that Aster significantly increases the word error rate, match error rate, and word information loss of the evaluated ASR systems. Additionally, our user study demonstrates that the generated stuttering audio is indistinguishable from real-world stuttering audio clips.
This component first determines the timing of each word in the seed audio file to locate and separate the words. Using the word timing, it then identifies the syllable timing within each word.
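The second step above can be sketched as follows. This is a simplified illustration, not Aster's actual alignment logic: it assumes word timings (in seconds) have already been obtained from a forced aligner, and it divides each word's span evenly across its syllables, whereas a real aligner would produce uneven syllable boundaries. The word list and syllable counts are hypothetical.

```python
def syllable_timings(word_start, word_end, n_syllables):
    """Split a word's time interval into per-syllable intervals.

    Simplification: the word span is divided evenly; real alignment
    tools yield uneven syllable boundaries.
    """
    span = (word_end - word_start) / n_syllables
    return [(word_start + i * span, word_start + (i + 1) * span)
            for i in range(n_syllables)]

# Hypothetical word timings (start, end, syllable count) for "hello world"
words = [("hello", 0.00, 0.62, 2), ("world", 0.70, 1.25, 1)]
timing = {w: syllable_timings(s, e, n) for w, s, e, n in words}
```

Given these timings, a mutation component can later target an individual syllable rather than a whole word.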
This component applies five different mutation strategies that inject stuttering into the original audio file, producing the test cases.
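A minimal sketch of such mutation strategies is shown below. The source does not name Aster's five stuttering types, so the strategies here follow common clinical categories (sound repetition, prolongation, block, interjection, word repetition) and are assumptions; syllable strings stand in for audio segments, whereas the real tool splices waveform samples.

```python
def inject_repetition(sylls, idx, times=2):
    """Insert `times` extra copies of the syllable at idx (sound repetition)."""
    return sylls[:idx] + [sylls[idx]] * times + sylls[idx:]

def inject_prolongation(sylls, idx, factor=3):
    """Stretch the syllable at idx by repeating its content (prolongation)."""
    return sylls[:idx] + [sylls[idx] * factor] + sylls[idx + 1:]

def inject_block(sylls, idx, pause="_"):
    """Insert a silent pause (block) before the syllable at idx."""
    return sylls[:idx] + [pause] + sylls[idx:]

def inject_interjection(sylls, idx, filler="um"):
    """Insert a filler word (interjection) before the syllable at idx."""
    return sylls[:idx] + [filler] + sylls[idx:]

def inject_word_repetition(sylls, start, end):
    """Repeat a whole word given its syllable span [start, end)."""
    return sylls[:end] + sylls[start:end] + sylls[end:]
```

Each strategy returns a new syllable sequence, so strategies can be composed to build more heavily stuttered variants of one seed.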
This component keeps the test cases that remain similar to the original audio but trigger different execution results from the ASR systems under test. Finally, it uses the distance between each transcript and the original speech text, supplemented by manual checks, as the test oracle to capture failures of the ASR systems.
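The automated half of this oracle can be sketched as a word error rate check against the ground-truth text. This is a generic WER implementation, not Aster's exact distance metric, and the `threshold` parameter is a hypothetical knob; the manual-check step described above is not modeled here.

```python
def word_error_rate(ref, hyp):
    """Word-level Levenshtein distance, normalized by reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # deletions
    for j in range(len(h) + 1):
        d[0][j] = j          # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

def is_failure(original_text, transcript, threshold=0.0):
    """Flag a failure when the transcript drifts from the ground truth."""
    return word_error_rate(original_text, transcript) > threshold
```

A stricter or looser threshold trades off how many borderline transcripts are forwarded to the manual check.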