IEEE ICASSP 2024 workshop

Self-supervision in Audio, Speech and Beyond

14th of April 2024, Seoul, South Korea

Workshop description

Self-Supervised Learning (SSL) of latent representations is transforming deep-learning-powered technologies. In the speech and audio domains, most state-of-the-art systems rely on large Transformer neural networks pretrained on thousands of hours of signal with various methods such as contrastive or multitask learning.

Recent top-tier conferences in the field have seen an exponential increase in the number of accepted articles mentioning self-supervised learning techniques, yet many challenges still prevent a wider adoption of these techniques in real-life speech and audio technologies.

In fact, SSL models currently suffer from critical complexity issues, the lack of a standardized and widely adopted evaluation protocol, dramatic bias and robustness concerns, as well as a disconnection from other closely related modalities (e.g. text or video).

Through a schedule that maximizes interaction with the audience via multiple panels and a poster session, the Self-supervision in Audio, Speech and Beyond (SASB) workshop aims to foster exchanges across the whole SSL community, including experts from different modalities.

SASB will act as a dedicated venue for the SSL community to properly frame the development of a technology that currently appears to be a groundbreaking solution for the audio, speech and beyond communities.

More details

The ongoing success of deep learning techniques depends on the quality of the representations automatically discovered from data [4]. These representations must capture important underlying structures of the raw input, e.g. intermediate concepts, features, or latent variables that are useful for the downstream task. While supervised learning on large annotated corpora can produce useful representations, collecting large amounts of annotated examples is costly, time-consuming, and not always feasible. This is particularly problematic for a large variety of applications. In the speech domain, for instance, there are many low-resource languages where progress is dramatically slower than in high-resource languages such as English. Moreover, annotations are often underspecified for many potential downstream applications, and the resulting supervised representations might be biased towards the task they are trained on, limiting their transferability to other applications [25].


Natural ways to mitigate these issues are unsupervised [5] and self-supervised learning [12, 19, 20, 15]. Following its increasing popularity within the computer vision community, many attempts have been made to extend self-supervised learning to the discovery of audio and speech representations [18, 11, 21, 23, 22, 24, 3, 16]. Recent systems, including wav2vec 2.0, HuBERT and WavLM [3, 16, 9], have achieved unprecedented performance on highly competitive tasks including speech and speaker recognition, speech translation, emotion recognition, intent detection and many others. Nevertheless, applying self-supervised learning to speech remains particularly challenging. Speech signals are not only high-dimensional, long, and variable-length sequences, but also entail a complex hierarchical structure that is difficult to infer without supervision (e.g. phonemes, syllables, words). Moreover, speech is characterized by substantial variability due to different speaker identities, accents, recording conditions and noises, which greatly increases the level of complexity. Hence, novel architectures must continually be invented to push the state of the art further and to give low-resource languages highly competitive speech technologies, and we believe that the SASB workshop will play a key role in addressing these directions.
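As a concrete illustration of the workflow described above, the following minimal sketch extracts frame-level representations from a pretrained wav2vec 2.0 model [3] with the Hugging Face transformers library; the checkpoint name and the random waveform are placeholders rather than a recommended setup.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Any pretrained wav2vec 2.0 checkpoint would do; this one is only an example.
checkpoint = "facebook/wav2vec2-base"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2Model.from_pretrained(checkpoint)
model.eval()

# One second of dummy 16 kHz audio standing in for a real utterance.
waveform = torch.randn(16000)

inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Frame-level latent representations of shape (batch, frames, hidden_size),
# which a downstream ASR, speaker or emotion recognition model can consume.
print(outputs.last_hidden_state.shape)
```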


Moreover, the complexity of speech and audio signals is reflected not only in the neural architectures, which must be highly tailored to the considered domain, but also in the extreme compute resources needed to train such models. For instance, it is quite common to see the largest available models trained for weeks or months on hundreds of high-end GPUs, each worth many thousands of euros [1, 13]. Therefore, and perhaps unfortunately, speech systems are rapidly moving away from accessible paradigms towards niche foundation models [6] that only a few extremely large companies can create, due to the huge requirements in compute power and data. Hence, and despite an astonishing short-term jump in performance, large-scale SSL models could quickly become a major barrier for academic research, as it is already impossible for the vast majority of institutions to train them, leaving the field dependent on two or three companies. Very few attempts have been made to solve this issue [10], and we hope that the SASB workshop will foster interest in the efficiency of SSL models, a critical topic in a world facing climate change.


The evaluation of SSL models also suffers from critical issues that remain to be solved. In particular, and in contrast to traditional speech tasks, e.g. speech recognition, no protocol for assessing the universality of an SSL representation is widely accepted by the community. Recently, SUPERB and LeBenchmark [27, 13] proposed benchmarks to normalize the evaluation protocol. SUPERB, in particular, is being increasingly adopted by the community. Unfortunately, both benchmarks suffer from a lack of complexity in the adopted datasets, as most of the latest SSL models achieve near-perfect results, making it very hard to distinguish their performance in a potential real-world scenario. Finally, and as demonstrated with SUPERB and LeBenchmark, the current trend is to evaluate SSL models solely on their downstream performance, hence necessitating potentially dozens or hundreds of costly fine-tunings. Another direction, represented by a very scarce literature [28], proposes to measure the quality and robustness of a given representation without downstream fine-tuning, drastically speeding up the development process. The SASB workshop will offer a venue for the SSL community to actively discuss how our models should be evaluated.
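A lightweight example of the downstream-probing style of evaluation discussed above is sketched below: the SSL encoder is frozen and only a small classifier is trained per task, which is the general spirit of SUPERB-like protocols [27]. The encoder, dimensions and task are illustrative stand-ins, not the actual benchmark code.

```python
import torch
import torch.nn as nn

class DummyEncoder(nn.Module):
    """Illustrative stand-in for a pretrained SSL model mapping waveforms to frame-level features."""
    def __init__(self, feature_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(320, feature_dim)  # 320-sample frames, purely for illustration

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        frames = wav.unfold(1, 320, 320)         # (batch, frames, 320)
        return self.proj(frames)                 # (batch, frames, feature_dim)

class LinearProbe(nn.Module):
    """Freeze the SSL encoder and train only a linear classifier on pooled features."""
    def __init__(self, encoder: nn.Module, feature_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():      # downstream training never updates the SSL weights
            p.requires_grad = False
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, waveforms: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.encoder(waveforms)      # (batch, frames, feature_dim)
        return self.classifier(feats.mean(dim=1))  # simple temporal mean pooling

probe = LinearProbe(DummyEncoder(), feature_dim=768, num_classes=10)
logits = probe(torch.randn(4, 16000))            # four one-second utterances at 16 kHz
print(logits.shape)                              # torch.Size([4, 10])
```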


Social and technical biases in SSL models applied to natural language processing (NLP) are an active field of research [7]. For instance, gender biases have been found in machine translation tasks, as well as in facial recognition systems [8] and ASR [14]. Interestingly, however, and despite the clear growing adoption of SSL in the speech community, the inclusiveness and robustness of state-of-the-art models remain a completely open question. More precisely, speech SSL architectures currently struggle to capture the information carried by diverse populations and acoustic environments, making them potentially unfair, as already observed for large NLP SSL models [7], or unreliable in realistic conditions (e.g. noise, multiple speakers, a variety of accents, gender imbalance) [17].


Finally, the speech signal itself might not be sufficient to achieve the sought-after universal representation. Data2vec [2], for instance, has demonstrated that combining speech, image and text within a single SSL pre-training can lead to massive improvements on a wide variety of tasks from all three domains. It is indeed natural to consider multimodality as the next step for SSL. The latter could be done either at the pre-training level, as in Data2vec, or at the fine-tuning stage, by combining different representations in a final model [26]. Nevertheless, achieving multimodal SSL is a long-term goal that we should tackle step by step as a community. With SASB, we hope to encourage original research in the direction of SSL for audio or speech combined with another modality, such as audio-visual SSL or audio-text SSL.
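To make the fine-tuning-stage option concrete, the sketch below fuses pooled embeddings from two independently pretrained SSL encoders (e.g. one for audio, one for text) in a small classification head, in the spirit of [26]; the dimensions and the random inputs are purely illustrative.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Concatenate pooled audio and text SSL embeddings and classify the pair."""
    def __init__(self, audio_dim: int = 768, text_dim: int = 768, num_classes: int = 4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(audio_dim + text_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([audio_emb, text_emb], dim=-1))

# Random stand-ins for pooled embeddings produced by a speech and a text SSL model.
logits = LateFusionClassifier()(torch.randn(8, 768), torch.randn(8, 768))
print(logits.shape)  # torch.Size([8, 4])
```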



 [1] Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, et al. XLS-R: Self-supervised cross-lingual speech representation learning at scale. arXiv preprint arXiv:2111.09296, 2021.

[2] Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. Data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555, 2022.

[3] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.

[4] Yoshua Bengio. Deep learning of representations for unsupervised and transfer learning. In Proceedings of ICML workshop on unsupervised and transfer learning, pages 17–36, 2012.

[5] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in neural information processing systems, pages 153–160, 2007.

[6] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

[7] Rishi Bommasani, Drew A. Hudson, and et al. On the opportunities and risks of foundation models. CoRR, abs/2108.07258, 2021.

[8] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Sorelle A. Friedler and Christo Wilson, editors, Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, pages 77–91. PMLR, 23–24 Feb 2018.

[9] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 2022.

[10] Po-Han Chi, Pei-Hung Chung, Tsung-Han Wu, Chun-Cheng Hsieh, Yen-Hao Chen, Shang-Wen Li, and Hung-yi Lee. Audio ALBERT: A lite BERT for self-supervised learning of audio representation. In 2021 IEEE Spoken Language Technology Workshop (SLT), pages 344–350. IEEE, 2021.

[11] Jan Chorowski, Ron J Weiss, Samy Bengio, and Aäron van den Oord. Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM transactions on audio, speech, and language processing, 27(12):2041–2053, 2019.

[12] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2051–2060, 2017.

[13] Solène Evain, Ha Nguyen, et al. Task agnostic and task specific self-supervised learning from speech with LeBenchmark. In NeurIPS 2021, August 2021.

[14] Mahault Garnerin, Solange Rossato, and Laurent Besacier. Gender Representation in French Broadcast Corpora and Its Impact on ASR Performance. In the 1st International Workshop, pages 3–9, Nice, France, October 2019. ACM Press.

[15] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2018.

[16] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.

[17] Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, and Michael Auli. Robust wav2vec 2.0: Analyzing domain shift in self-supervised pre-training. ArXiv, abs/2104.01027, 2021.

[18] Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel PW Ellis, Shawn Hershey, Jiayang Liu, R Channing Moore, and Rif A Saurous. Unsupervised learning of semantic audio representations. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 126–130. IEEE, 2018.

[19] Nikos Komodakis and Spyros Gidaris. Unsupervised representation learning by predicting image rotations. 2018.

[20] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pages 527–544. Springer, 2016.

[21] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[22] Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, and Yoshua Bengio. Learning problem-agnostic speech representations from multiple self-supervised tasks. Proc. of Interspeech, 2019.

[23] Mirco Ravanelli and Yoshua Bengio. Learning speaker representations with mutual information. Proc. of Interspeech, 2019.

[24] Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, and Yoshua Bengio. Multi-task self-supervised learning for robust speech recognition. Proc. of ICASSP, 2020.

[25] Michael T Rosenstein, Zvika Marx, Leslie Pack Kaelbling, and Thomas G Dietterich. To transfer or not to transfer. In NIPS 2005 workshop on transfer learning, volume 898, pages 1–4, 2005.

[26] Shamane Siriwardhana, Tharindu Kaluarachchi, Mark Billinghurst, and Suranga Nanayakkara. Multimodal emotion recognition with transformer-based self supervised feature fusion. IEEE Access, 8:176274–176285, 2020.

[27] Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y Lin, Andy T Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, et al. SUPERB: Speech processing universal performance benchmark. arXiv preprint arXiv:2105.01051, 2021.

[28] Salah Zaiem, Titouan Parcollet, Slim Essid, and Abdelwahab Heba. Pretext tasks selection for multitask self-supervised audio representation learning. IEEE Journal of Selected Topics in Signal Processing, pages 1–15, 2022.