IEEE ICASSP 2024 Workshop

Invited Speakers

SASB 2024 - Seoul, South Korea

Keynote talks from renowned scientists in the community will provide insights into current trends in SSL for audio, speech, and beyond. Each keynote will last 30 minutes.

To maximize discussion among attendees, we will also host a panel featuring three invited speakers, each giving a 20-minute presentation. After these three talks, 30 minutes will be devoted to an open discussion among the audience, the organizers, and the three experts.

Too shy to ask for the microphone? This year you can also submit your questions via a form [click here].

Nancy F. Chen - A*STAR (keynote speaker)

SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning

Abstract: We present SeaEval, a benchmark for multilingual foundation models. In addition to characterizing how these models understand and reason with natural language, we also investigate how well they comprehend cultural practices, nuances, and values. Alongside standard accuracy metrics, we examine the brittleness of foundation models along the dimensions of semantics and multilinguality. Our investigation encompasses both open-source and proprietary models, shedding light on their behavior in classic NLP tasks, reasoning, and cultural contexts. Notably, (1) most models respond inconsistently to paraphrased instructions; (2) exposure bias is pervasive, evident in both standard NLP tasks and cultural understanding; (3) for questions rooted in factual, scientific, or commonsense knowledge, consistent responses are expected across semantically equivalent multilingual queries, yet many models intriguingly perform inconsistently on such queries; and (4) models trained multilingually still lack "balanced multilingual" capabilities. Our findings underscore the need for more generalizable semantic representations and enhanced multilingual contextualization. SeaEval can serve as a launchpad for in-depth investigations of multilingual and multicultural evaluation.
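For readers who want a concrete feel for the consistency issue in point (3), below is a minimal, purely illustrative sketch (not the SeaEval implementation) of one way to score cross-lingual answer consistency; the function name and toy data are hypothetical.

```python
# Illustrative sketch (not the official SeaEval code): one way to quantify the
# cross-lingual consistency issue described in point (3) of the abstract.
# Given a model's answers to semantically equivalent questions asked in several
# languages, we measure how often the answers agree across languages.
from collections import Counter
from typing import Dict, List


def cross_lingual_consistency(answers_per_question: List[Dict[str, str]]) -> float:
    """Fraction of language-pair agreements, averaged over questions.

    `answers_per_question[i]` maps a language code to the model's answer
    for question i (assumed to be the same question in every language).
    """
    scores = []
    for answers in answers_per_question:
        langs = list(answers)
        pairs = [(a, b) for i, a in enumerate(langs) for b in langs[i + 1:]]
        if not pairs:
            continue
        agree = sum(answers[a] == answers[b] for a, b in pairs)
        scores.append(agree / len(pairs))
    return sum(scores) / len(scores) if scores else 0.0


# Toy usage: a model that answers a factual question consistently in only two of
# three languages scores 1/3 for that question.
print(cross_lingual_consistency([
    {"en": "Paris", "id": "Paris", "zh": "Lyon"},
]))
```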

Biography: Nancy F. Chen heads the generative AI group at A*STAR. She has served as Program Chair of ICLR 2023, as a 2023 IEEE SPS Distinguished Lecturer, and as an ISCA Board Member, and was listed among Singapore's 100 Women in Tech in 2021. She has won numerous professional awards, including the 2020 Procter & Gamble (P&G) Connect + Develop Open Innovation Award, the 2019 L'Oréal Singapore For Women in Science National Fellowship, recognition as an A*STAR Fellow, and best paper awards from EMNLP, MICCAI, SIGDIAL, APSIPA, and IEEE ICASSP. Speech evaluation technology developed by her team was deployed by the Ministry of Education in Singapore to support home-based learning during the COVID-19 pandemic. Nomopai, a spin-off company, uses technology from her lab to make customer agents more confident and empathetic. Prior to working at A*STAR, Dr. Chen worked at MIT Lincoln Laboratory on multilingual speech processing.

Ann Lee - Meta (keynote speaker)

Self-Supervised Learning in Real-life Speech and Audio Technology 

Abstract: In this talk, I will go over two real-life systems that we released: Seamless, the first massively multilingual, real-time, and expressive speech-to-speech translation (S2ST) system, and Audiobox, an audio generative model unifying the speech, sound, and music modalities. I will explain how self-supervised learning (SSL) is critical in enabling such large-scale systems. SSL is applied in these systems in three ways. First, for Seamless, speech encoder pre-training allows us to support translation from many low-resource languages for which limited labeled training data is available. Second, using discrete SSL units as the target for the S2ST system makes both model training and inference more efficient. Lastly, for Audiobox, generative audio pre-training allows us to scale up the training data, which is key to generalization, and to build a foundation model that is flexible and can be fine-tuned for any downstream audio generation task.
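As background on the discrete SSL units mentioned above, here is a minimal sketch under the common assumption that units are obtained by matching SSL encoder features to a k-means codebook; it is not the Seamless implementation, and the feature and centroid arrays are hypothetical stand-ins.

```python
# Minimal sketch, not the Seamless implementation: discretizing SSL speech
# features into unit IDs with pre-trained k-means centroids, so a translation
# model can predict a short sequence of discrete targets instead of raw audio.
import numpy as np


def features_to_units(features: np.ndarray, centroids: np.ndarray) -> list[int]:
    """Map each SSL feature frame (T, D) to its nearest centroid index (K, D)."""
    # Squared Euclidean distance from every frame to every centroid.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    units = dists.argmin(axis=1)
    # Collapse consecutive repeats, as is common when using units as targets.
    deduped = [int(units[0])] + [int(u) for prev, u in zip(units, units[1:]) if u != prev]
    return deduped


# Toy usage with random "features" and 100 hypothetical unit centroids.
rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 768))       # e.g., 50 frames of SSL encoder output
centroids = rng.normal(size=(100, 768))  # e.g., a k-means codebook learned offline
print(features_to_units(feats, centroids)[:10])
```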

Biography: Ann Lee is a research manager at AI at Meta (FAIR), supporting teams that have built and released speech translation technology, including the first speech translator for a real-world unwritten language and the first real-time expressive speech translator, both of which received mainstream media coverage globally. Her current research interests are in speech translation and expressive speech generation. Before joining Meta, she received her Ph.D. degree from the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) in 2016.

Sanjeev Khudanpur - Johns Hopkins University (keynote speaker)

What Will It Take to Get Past the SSL Hype?

Abstract: All fields in science and technology occasionally go through periods of transformative change. Speech and audio processing clearly appears to be in the middle of one such period today, due to the use of "all neural" model architectures and innovative "learning" techniques for almost all signal transformation tasks: speech enhancement, transcription, paralinguistic labeling (e.g., speaker, language, emotion), audio event detection, and so on. Such periods also offer an opportunity to reflect on the questions that may need to be answered to get past the unavoidable hype and to start focusing on research that addresses the challenges that remain. For instance: (i) Can we clearly articulate what we do and don't understand about SSL representations? (ii) What theoretical and empirical tools do we need to fully understand SSL representations? (iii) Is the accepted paradigm of training-tuning-testing partitions (including the SUPERB benchmark) obsolete? (iv) What important applications are not well served by existing SSL representations, and why? This presentation does not aim to answer these questions, only to frame them and hear what audience members think!

Biography: Sanjeev Khudanpur is a founding member of the Johns Hopkins University Human Language Technology Center of Excellence, with a secondary appointment in the Department of Computer Science. Since 2022, he has been the Center Director of AI2AI, the JHU + Amazon Initiative for Interactive AI. His research interests are in the application of information-theoretic and statistical methods to human language technologies, including automatic speech recognition, machine translation, information retrieval, and natural language processing. He organizes the annual Johns Hopkins Summer Workshops to advance the broader research agenda of this field. Sanjeev received a B.Tech. in Electrical Engineering from the Indian Institute of Technology, Bombay, in 1988, and a Ph.D. in Electrical Engineering from the University of Maryland, College Park, in 1997. He has been on the faculty of Johns Hopkins University since 1996: until June 2001 as an Associate Research Scientist in the Center for Language and Speech Processing, from July 2001 to June 2008 as an Assistant Professor in the Department of Electrical and Computer Engineering and the Department of Computer Science, and since July 2008 as an Associate Professor.

Joon Son Chung - School of Electrical Engineering, KAIST (panel speaker)

Multi-Modal Learning of Audio Representations

Abstract: Supervised learning with deep neural networks has brought phenomenal advances to audio and speech recognition systems, but such systems rely heavily on annotated training datasets. Humans, on the other hand, naturally develop an understanding of the world through multiple senses, even without explicit supervision. We attempt to mimic this human ability by leveraging the natural co-occurrence between the audio and visual modalities. For example, a video of someone playing a guitar co-occurs with the sound of a guitar. Similarly, a person's appearance is related to their voice characteristics, and the words they speak are correlated with their lip motion. Our work utilizes unlabeled audio and video for self-supervised learning of audio and speech representations. The learnt representations can be used for speech-related downstream tasks such as speech recognition, speaker recognition, and lip reading, as well as for general sound retrieval and localization.
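To make the audio-visual co-occurrence idea concrete, here is a minimal sketch of a contrastive objective over paired audio and video embeddings. It is an illustrative InfoNCE-style loss, not the speaker's exact method, and the encoder outputs are replaced by random tensors.

```python
# A minimal sketch (not the speaker's exact method) of the audio-visual
# co-occurrence idea: embeddings of an audio clip and the video it co-occurs
# with are pulled together by a contrastive (InfoNCE-style) loss, while
# mismatched audio/video pairs within the batch act as negatives.
import torch
import torch.nn.functional as F


def audio_visual_contrastive_loss(audio_emb: torch.Tensor,
                                  video_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, video_emb: (batch, dim); row i of each comes from the same clip."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))     # positives lie on the diagonal
    # Symmetric loss: match audio->video and video->audio.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Toy usage with random embeddings standing in for encoder outputs.
loss = audio_visual_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```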

Biography: Joon Son Chung is an assistant professor at the School of Electrical Engineering, KAIST, where he is directing the Multimodal AI Lab. Previously, he was a research lead at Naver Corporation, where he managed the development of speech recognition models. He received his bachelor’s and PhD degrees from the University of Oxford. He has published at top conferences such as ICASSP, Interspeech, ICCV, ECCV and CVPR, and he has been the recipient of best paper awards at Interspeech and ACCV. His research interests include speech recognition, speaker recognition and multimodal learning.

Mark A. Hasegawa-Johnson - University of Illinois (panel speaker)

Unsupervised and Self-Supervised Learning in Theory and Practice

Abstract: In speech applications, unsupervised learning has come to refer to the task of learning a mapping from modality X to modality Y using only unpaired examples. If there is a one-to-one mapping between acoustic units and phonemes, and if the language model sufficiently distinguishes phonemes, then unsupervised speech recognition can learn the mapping using classical code-breaking techniques. This talk will recap a key theoretical result showing that, with probability 1, natural languages have phonotactic language models that can be learned using unsupervised speech recognition. The requirement that acoustic units have a one-to-one mapping with phonemes has not yet been satisfied, but several interesting advances have been made. First, it is possible to use unsupervised ASR to learn an unsupervised text-to-speech system. Second, similar methods may be used to learn an unsupervised translation from speech to spoken-order sign language. Third, several experiments with several types of data (multilingual data, child data, and data with a variety of speech disabilities) show that self-supervised pre-training can be improved by reducing the divergence between the data distributions observed during pre-training and at test time.
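To illustrate the code-breaking analogy at toy scale, the sketch below searches over all one-to-one unit-to-phoneme mappings for the one that maximizes a tiny, made-up bigram phonotactic model. It is not the actual unsupervised ASR pipeline discussed in the talk; the inventory, bigram probabilities, and unit sequences are hypothetical.

```python
# A toy illustration (not the speaker's actual system) of the "code-breaking"
# view of unsupervised ASR: if acoustic units map one-to-one onto phonemes,
# we can search for the mapping that makes decoded unit sequences most
# probable under a phonotactic (here, bigram) language model.
from itertools import permutations
from math import log

PHONES = ["a", "b", "c"]   # tiny hypothetical phoneme inventory
UNITS = [0, 1, 2]          # discovered acoustic-unit labels

# Hypothetical bigram log-probabilities log P(next | prev) over PHONES.
BIGRAM = {
    ("a", "b"): log(0.7), ("a", "c"): log(0.2), ("a", "a"): log(0.1),
    ("b", "a"): log(0.6), ("b", "c"): log(0.3), ("b", "b"): log(0.1),
    ("c", "a"): log(0.8), ("c", "b"): log(0.1), ("c", "c"): log(0.1),
}


def lm_score(phones: list[str]) -> float:
    """Log-probability of a phone sequence under the bigram model."""
    return sum(BIGRAM[(p, q)] for p, q in zip(phones, phones[1:]))


def best_mapping(unit_seqs: list[list[int]]) -> dict[int, str]:
    """Exhaustively try every one-to-one unit->phoneme mapping."""
    best, best_score = None, float("-inf")
    for perm in permutations(PHONES):
        mapping = dict(zip(UNITS, perm))
        score = sum(lm_score([mapping[u] for u in seq]) for seq in unit_seqs)
        if score > best_score:
            best, best_score = mapping, score
    return best


# Unlabeled "acoustic unit" sequences; the search recovers the mapping whose
# decodings best match the language model's phonotactics.
print(best_mapping([[0, 1, 0, 2, 0], [2, 0, 1]]))
```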

Biography: Mark A. Hasegawa-Johnson received his Ph.D. degree from MIT, was an NIH post-doctoral fellow at UCLA from 1996 to 1999, and joined the Illinois ECE faculty in 1999. His research converts facts about speech production into low-resource transfer learning algorithms that can be used to make speech technology more fair, more inclusive, and more accessible. His research has recently been featured in the WSJ's The Future of Everything, in the Young Onset Parkinson's Network, and in a special report by the OECD. Dr. Hasegawa-Johnson is a Fellow of the IEEE, of the Acoustical Society of America, and of the International Speech Communication Association, and he is currently Deputy Editor of the IEEE Transactions on Audio, Speech, and Language Processing.

Kyungmin Lee - Samsung Research (panel speaker)

Self-Supervised Learning for Commercial Voice Recognition Systems

Abstract: Self-Supervised Learning (SSL) is an unsupervised learning approach that can pre-train speech recognition models using an audio-only corpus, and it has recently become popular in the field of speech recognition. In particular, it is being actively researched at large scale, using millions of hours of data, because training data can be collected at a much lower cost than for supervised learning methods. This talk introduces what to expect from SSL in commercial server-side and on-device Automatic Speech Recognition (ASR) systems and how to apply it to such systems. First, we briefly review how SSL techniques have evolved over the years and examine the models with a focus on their applicability in commercial settings. Among the SSL algorithms, BEST-RQ (BERT-based Speech pre-Training with Random-projection Quantizer) is reviewed, and its application to a shallow non-causal Conformer is explained. Afterward, we conclude the talk by discussing the considerations when using SSL from both the data and efficiency perspectives.
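For readers unfamiliar with BEST-RQ, here is a minimal sketch of its random-projection quantizer idea: a frozen random projection and a frozen random codebook turn speech features into discrete masked-prediction targets. This is an illustration only, not the speaker's production system, and the dimensions and inputs are arbitrary toy values.

```python
# Minimal sketch of the random-projection quantizer idea behind BEST-RQ
# (not the speaker's production system): speech features are projected by a
# frozen random matrix and matched to a frozen random codebook; the resulting
# codeword indices serve as BERT-style masked-prediction targets, so no
# learned quantizer is needed during pre-training.
import numpy as np


class RandomProjectionQuantizer:
    def __init__(self, feat_dim: int, proj_dim: int = 16,
                 codebook_size: int = 8192, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.projection = rng.normal(size=(feat_dim, proj_dim))   # frozen
        codebook = rng.normal(size=(codebook_size, proj_dim))     # frozen
        self.codebook = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)

    def __call__(self, features: np.ndarray) -> np.ndarray:
        """features: (T, feat_dim) -> target IDs: (T,)."""
        proj = features @ self.projection
        proj = proj / np.linalg.norm(proj, axis=1, keepdims=True)
        # Nearest codeword (by cosine similarity) gives the pre-training label.
        return (proj @ self.codebook.T).argmax(axis=1)


# Toy usage: 100 frames of 80-dim filterbank-like features.
quantizer = RandomProjectionQuantizer(feat_dim=80)
print(quantizer(np.random.default_rng(1).normal(size=(100, 80)))[:10])
```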

Biography: Kyungmin Lee received his B.S. degree in Computer Software from Myongji University in 2010 and his M.S. degree in Medicine from Seoul National University in 2012. After his Master's, he joined Samsung Electronics in 2012. He earned his Ph.D. degree in Engineering from Seoul National University in 2022 through part-time academic training supported by the company. He is currently a Staff Engineer at Samsung Research, where his research focuses on developing neural speech recognition systems for both large-scale cloud services and lightweight on-device solutions.