Invited Talks

Keynote talks will provide insights into current trends in SSL for audio, speech, and beyond, delivered by renowned and impactful scientists from the community. Each keynote will last 30 minutes.

To maximize discussion among attendees, we decided to turn the invited talks into two panels. More precisely, each panel will feature three invited speakers, each giving a 20-minute presentation. After the three presentations, 30 minutes will be devoted to an open discussion among the audience, the organizers, and the three experts.

Hung-Yi Lee - National Taiwan University (keynote speaker)

Advancing Universal Speech Models Through Self-Supervised Learning: Progress, Challenges, and Future Directions

Abstract: Speech, teeming with rich and hierarchical information, poses different requirements for diverse tasks. Certain tasks, like speech recognition, necessitate extracting content while discarding speaker-specific information. Conversely, tasks like speaker recognition demand extracting speaker information while discarding content. But is it possible to develop a universal speech model capable of addressing a myriad of speech tasks? This talk starts with an introduction to the Speech Processing Universal PERformance Benchmark (SUPERB), a benchmark and leaderboard designed to gauge the performance of Self-Supervised Learning (SSL) models across an extensive array of speech-processing tasks. Empirical results on SUPERB indicate that SSL representations generalize across various speech-processing tasks. I will also shed light on the most recent advances and discoveries in the realm of SSL models for speech processing. Lastly, the talk will venture into prospective directions for further advancements in this field.

Biography: Hung-yi Lee (李宏毅) is an associate professor in the Department of Electrical Engineering of National Taiwan University (NTU), with a joint appointment in the Department of Computer Science & Information Engineering of the university. His recent research focuses on developing technology that can reduce the requirement for annotated data in speech processing (including voice conversion and speech recognition) and natural language processing (including abstractive summarization and question answering). He won the Salesforce Research Deep Learning Grant in 2019, the AWS ML Research Award in 2020, the Outstanding Young Engineer Award from the Chinese Institute of Electrical Engineering in 2018, the Young Scholar Innovation Award from the Foundation for the Advancement of Outstanding Scholarship in 2019, the Ta-You Wu Memorial Award from the Ministry of Science and Technology of Taiwan in 2019, and the 59th Ten Outstanding Young Person Award in Science and Technology Research & Development of Taiwan. He runs a YouTube channel teaching deep learning in Mandarin with about 100k subscribers.

David Harwath - University of Texas at Austin (keynote speaker)

Multimodal and Multilingual Self-Supervised Learning for Speech and Audio

Abstract: In this talk, I will give an overview of my group's recent work on developing self-supervised models for speech and audio processing. I will first present Transformer-based models capable of discovering structure (words and sub-word units) in the speech signal by utilizing a visual grounding objective that trains the model to associate input speech waveforms with visual images that they describe. Next, I will demonstrate multilingual variants of these models that can also be fused with pre-trained image-text models such as CLIP in order to semantically align speech in one language with text in another language without the need for parallel language data. Finally, I will present our work on efficient training of self-supervised models for audio event detection, where we achieve more than a 3x speedup over a self-supervised version of the popular Audio Spectrogram Transformer (AST) model.

Biography: David Harwath is an assistant professor in the computer science department at UT Austin. His research focuses on multimodal, self-supervised learning algorithms for speech, audio, vision, and text. Under the supervision of James Glass, his doctoral thesis introduced models for the joint perception of speech and vision. This work was awarded the 2018 George M. Sprowls Award for best computer science PhD thesis at MIT. He holds a B.S. in electrical engineering from UIUC (2010), an S.M. in computer science from MIT (2013), and a Ph.D. in computer science from MIT (2018).


Ankur Bapna - Google Brain (keynote speaker)

Improving Self-Supervised Models of Speech by Learning from Text and NLP

Abstract: Despite the differences between the text and speech modalities, self-supervised speech models have seen remarkable improvements through cross-modal transfer learning from text and the application of successful self-supervised learning techniques from NLP. This talk will delve into our explorations aimed at improving self-supervised speech representations by building joint models of speech and text, also covering some of the techniques and intuitions from NLP that carried over well to speech, and others that have not. Next, I will explore some practical applications of these models, with a focus on Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) for very low-resource languages, learned metrics for speech synthesis evaluation, multimodal representation analysis, and their parallels to analogous work in the text setting. Lastly, I will attempt to draw connections between contemporary research trends in speech SSL and NLP, and share my perspective on future research.

Biography: Ankur is a Software Engineer at Google Brain working on multilingual Natural Language Processing and Speech Understanding. Previously he was on the Google Translate team, building 'zero-resource' translation models for under-represented languages. Before joining Google, Ankur completed his Master's in Electrical Engineering at Stanford, with a focus on Machine Learning and Optimization. His current research interests include NLP and speech processing for under-represented languages and joint models for speech and text understanding and generation.


Karen Livescu - Toyota Technological Institute at Chicago (panel speaker)

What Do Self-Supervised Speech Representation Models Know? A Layer-Wise Analysis

Abstract: Self-supervised speech representations have become ubiquitous in speech processing over the past few years.  They have both improved the state of the art and made it feasible to learn speech models with very little labeled data.  However, it is not well understood what linguistic information is encoded in pre-trained models and how best to apply them to downstream tasks. In this talk I will describe recent work that begins to build an understanding of the layer-wise information learned by pre-trained speech models.  We consider a number of popular pre-trained models and investigate the extent to which their layers encode spectral, phonetic, and word-level information.  The results of these analyses also suggest some ways to improve or simplify the application of pre-trained models for downstream tasks.

Biography: Karen Livescu is a Professor at TTI-Chicago. She completed her PhD at MIT in 2005. She is an ISCA Fellow and an IEEE Distinguished Lecturer.  She has recently served as a program chair for ICLR, Interspeech, and ASRU, and is an Associate Editor for TACL and IEEE T-PAMI. She works on a variety of topics in speech and language processing and machine learning.  Her group's recent work includes multi-view and self-supervised representation learning, acoustic word embeddings, spoken language understanding, and work on low-resource languages including spoken, written, and signed languages.


Odette Scharenborg - Delft University of Technology (panel speaker)

Building speech technology for unwritten languages using visual information

Abstract: In this talk, I will provide an overview of several speech technology applications developed in recent years for unwritten languages, i.e., languages that do not have a common writing system. These applications bypass the need for textual information, instead using other information sources, mainly images, to learn the mapping between speech and images. I will focus on two questions: 1) What are the possibilities and limitations of such technologies? 2) What information do the speech representations learned using visual grounding contain?

Biography: Odette Scharenborg is an Associate Professor and Delft Technology Fellow at Delft University of Technology. She has an interdisciplinary background in automatic speech recognition and psycholinguistics, and uses knowledge of how humans process speech to develop inclusive automatic speech recognition systems that are able to recognise speech from everyone, irrespective of how they speak or the language they speak. Since 2017, she has been on the Board of the International Speech Communication Association (ISCA), where she currently serves as Vice-President. From 2018 to 2021, she was on the IEEE Speech and Language Processing Technical Committee. Since 2018, she has been a Senior Associate Editor of IEEE Signal Processing Letters. In 2025, she will be the General Chair of Interspeech Rotterdam. Odette is an active proponent of diversity and inclusion and is involved in many national and international initiatives to promote diversity, gender equality, and inclusion.

Emmanuel Dupoux - Ecole des Hautes Etudes en Sciences Sociales (panel speaker)

Biography: Emmanuel Dupoux is a full professor at the Ecole des Hautes Etudes en Sciences Sociales (EHESS), directs the Cognitive Machine Learning team at the Ecole Normale Supérieure (ENS) in Paris and INRIA (www.syntheticlearner.com), and is currently a part-time scientist at Facebook AI Research. His education includes a PhD in Cognitive Science (EHESS), an MA in Computer Science (Orsay University), and a BA in Applied Mathematics (Pierre & Marie Curie University, ENS). He is the recipient of an Advanced ERC grant, the organizer of the Zero Resource Speech Challenge (2015, 2017, 2019) and the Intuitive Physics Benchmark (2019).

Themos Stafylakis - Omilia - Conversational Intelligence (panel speaker)

Extracting speaker and emotion information from self-supervised speech models

Abstract: The advent of pretrained self-supervised models has impacted speaker and emotion recognition. In this talk, we will present some recently proposed methods for extracting speaker and emotion information from such pretrained models. We will also consider the use of multilayer neural nets (e.g. TDNNs) as backend models for self-supervised models, and whether shallower backends can perform equally well. Finally, we will discuss self-training approaches, where the self-supervised model is trained from scratch on the target dataset.

Biography: Dr. Themos Stafylakis received his PhD in Speaker Diarization for Broadcast News from the National Technical University of Athens, Greece, in 2011, his M.Sc. in Communication and Signal Processing from Imperial College London, UK, in 2005, and his B.Eng. from the National Technical University of Athens, Greece, in 2004. In 2011 he joined CRIM (Canada) as a post-doc researcher working on speaker recognition. In 2016 he joined the Computer Vision Laboratory at the University of Nottingham (UK) as a Marie Curie Research Fellow. He is currently head of Machine Learning and Voice Biometrics at Omilia - Conversational Intelligence (Greece) and a visiting researcher at Brno University of Technology (Czechia). His main research interests are audiovisual speech and speaker recognition, spoken language understanding, and machine learning.

Shinji Watanabe - Carnegie Mellon University (panel speaker)

Attempts to reproduce large pre-trained models on an academic computing scale

Abstract: The use of large pre-trained models in speech processing has become increasingly promising with the success of self-supervised learning (SSL) models such as wav2vec 2.0 and HuBERT, as well as web-scale supervised speech models like Whisper. SSL has proven to outperform existing models in various speech processing tasks, while Whisper excels in prompting abilities with its high ASR and speech translation performance. However, these models are not easily reproducible by other institutions due to the lack of public implementations or corpora and the scale of resources required: they demand vast amounts of training data and large model sizes, making it challenging for institutions with fewer computing resources to pre-train them. In this talk, we will discuss our efforts to reproduce large pre-trained models like HuBERT and Whisper on an academic computing scale. We reviewed multiple implementations and successfully reproduced a model similar to HuBERT Large, achieving comparable SUPERB scores. Additionally, we will share our ongoing approach to reproducing Whisper by addressing both implementation and data collection perspectives.

Biography: Shinji Watanabe is an Associate Professor at Carnegie Mellon University, Pittsburgh, PA. He received his B.S., M.S., and Ph.D. (Dr. Eng.) degrees from Waseda University, Tokyo, Japan. He was a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011, a visiting scholar at the Georgia Institute of Technology, Atlanta, GA, in 2009, and a senior principal research scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, from 2012 to 2017. Prior to moving to Carnegie Mellon University, he was an associate research professor at Johns Hopkins University, Baltimore, MD, USA, from 2017 to 2020. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published more than 300 papers in peer-reviewed journals and conferences and received several awards, including the best paper award from IEEE ASRU in 2019. He serves as a Senior Area Editor of the IEEE Transactions on Audio, Speech, and Language Processing. He was/has been a member of several technical committees, including the APSIPA Speech, Language, and Audio Technical Committee (SLA), the IEEE Signal Processing Society Speech and Language Technical Committee (SLTC), and the Machine Learning for Signal Processing Technical Committee (MLSP).

Sanjeev Khudanpur - Johns Hopkins University (panel speaker)

What Will It Take to Get Past the SSL Hype?

Abstract: All fields in science and technology occasionally go through periods of transformative change. Speech and audio processing clearly appears to be in the middle of one such period today, due to the use of "all neural" model architectures and innovative "learning" techniques for almost all signal transformation tasks: speech enhancement, transcription, paralinguistic labeling (e.g. speaker, language, emotion), audio event detection, and so on. Such periods also offer an opportunity to reflect on what questions need to be answered to get past the unavoidable hype and start focusing on research to address the challenges that remain. For instance, (i) Can we clearly articulate what we do and don't understand about SSL representations? (ii) What theoretical and empirical tools do we need to fully understand SSL representations? (iii) Is the accepted paradigm of training-tuning-testing partitions (including the SUPERB benchmark) obsolete? (iv) What important applications are not well served by existing SSL representations, and why? This presentation does not aim to answer these questions, only to frame them and hear what audience members and the other panelists think!

Biography: Sanjeev Khudanpur is a founding member of the Johns Hopkins University Human Language Technology Center of Excellence. He has a secondary appointment in the Department of Computer Science. Since 2022, Sanjeev has been the Center Director of AI2AI, the JHU + Amazon Initiative for Interactive AI. His research interests are in the application of information-theoretic and statistical methods to human language technologies, including automatic speech recognition, machine translation, information retrieval, and natural language processing. He organizes the annual Johns Hopkins Summer Workshops to advance the greater research agenda of this field. Sanjeev received a B.Tech in Electrical Engineering from the Indian Institute of Technology, Bombay, in 1988, and a Ph.D. in Electrical Engineering from the University of Maryland, College Park, in 1997. Since 1996, he has been on the faculty of Johns Hopkins University. Until June 2001, he was an Associate Research Scientist in the Center for Language and Speech Processing and, from July 2001 to June 2008, an Assistant Professor in the Department of Electrical and Computer Engineering and the Department of Computer Science; he became an Associate Professor in July 2008.