Description of the special session
Large-scale automatic speech recognition (ASR) models have made significant strides in performance, achieving low error rates for high-resource languages. This success is largely attributable to semi-supervised and unsupervised training on datasets that predominantly feature a concentrated set of languages [1]. With this session, we aim to encourage researchers to propose approaches that improve the performance of large-scale pretrained ASR models on under-represented languages.
This session spotlights languages with limited representation in the training data of large-scale ASR models. While performance on high-resource languages can be improved continuously by scaling datasets and models [2,3,4], the same approach cannot readily be applied to under-represented languages, which are typically low-resource. Greater emphasis on technical advances is therefore needed to achieve low error rates for under-represented languages.
Building on the success of previous Interspeech benchmarks such as SUPERB [5] and ML-SUPERB [6], this session extends the exploration to the challenges posed by languages that lack prominence in the training data of pretrained models. Notably, performance varies substantially across languages on multilingual benchmarks such as Fleurs, as reported in [2]. We strongly encourage researchers to concentrate their efforts on under-represented languages and to contribute technical innovations that improve ASR performance in these linguistic contexts.
We also invite researchers to contribute new datasets in under-represented languages, fostering collaboration within the research community and enabling substantial advances in this domain.
References
[1] Rouditchenko, A., Khurana, S., Thomas, S., Feris, R., Karlinsky, L., Kuehne, H., Harwath, D., Kingsbury, B., Glass, J. (2023) “Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages.” Proc. INTERSPEECH 2023, 2268-2272, doi: 10.21437/Interspeech.2023-1061
[2] Radford, Alec, et al. "Robust speech recognition via large-scale weak supervision." International Conference on Machine Learning. PMLR, 2023.
[3] Pratap, Vineel, et al. "Scaling speech technology to 1,000+ languages." arXiv preprint arXiv:2305.13516 (2023).
[4] Zhang, Yu, et al. "Google USM: Scaling automatic speech recognition beyond 100 languages." arXiv preprint arXiv:2303.01037 (2023).
[5] Yang, S.-w., Chi, P.-H., Chuang, Y.-S., Lai, C.-I.J., Lakhotia, K., Lin, Y.Y., Liu, A.T., Shi, J., Chang, X., Lin, G.-T., Huang, T.-H., Tseng, W.-C., Lee, K.-t., Liu, D.-R., Huang, Z., Dong, S., Li, S.-W., Watanabe, S., Mohamed, A., Lee, H.-y. (2021) “SUPERB: Speech Processing Universal PERformance Benchmark.” Proc. Interspeech 2021, 1194-1198
[6] Shi, J., Berrebbi, D., Chen, W., Hu, E.-P., Huang, W.-P., Chung, H.-L., Chang, X., Li, S.-W., Mohamed, A., Lee, H.-y., Watanabe, S. (2023) “ML-SUPERB: Multilingual Speech Universal PERformance Benchmark.” Proc. INTERSPEECH 2023, 884-888, doi: 10.21437/Interspeech.2023-1316
Timeline (as per Interspeech 2024 - https://interspeech2024.org/)
Paper Submission Deadline 2 March 2024
Paper Update Deadline 9 March 2024
Paper Acceptance Notification 6 June 2024