ICASSP 2024 Satellite Workshop (4/15 Monday)

Hands-free Speech Communication and Microphone Arrays
(HSCMA 2024)

Efficient and Personalized Speech Processing through Data Science

The Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA) represents an ambitious effort to bridge gaps between researchers and industry practitioners, striving to foster innovation, knowledge exchange, and synergies in a collaborative, congenial environment. The genesis of this workshop can be traced back to 2005, when it originated as a joint endeavor of the AASP TC and the SLTC. For nearly two decades, the workshop has sought out and explored common ground between these two communities, bridging the gap and promoting interdisciplinary collaboration between two often disparate fields. Over the years, it has grown into a unique platform for sharing cutting-edge ideas and breakthrough practical solutions aimed at overcoming prevalent challenges in the industry. It focuses on the essential roles of hands-free speech communication and microphone arrays in myriad real-world applications, including voice-controlled assistants, teleconferencing systems, and speech recognition in acoustically challenging conditions. The effective capture and processing of speech signals in these settings are fraught with difficulties owing to environmental noise, reverberation, variations in human speech, and other complex factors. Through this workshop, HSCMA continues to fortify the links between theory and practice while encouraging the development of innovative strategies for overcoming the complexities inherent to hands-free speech communication.

Recently, both communities have faced challenges and opportunities in making powerful neural network models available for "efficient" on-device processing, while adapting to each individual user's "personal" speech traits (e.g., dialects and non-nativeness) and acoustic environments. In addition, the tradeoff between performance improvement and increased use of personal data has become trickier from a privacy-preservation perspective. The workshop will be a great opportunity to push the limits of speech processing systems in various hands-free application scenarios.

HSCMA complements ICASSP 2024's main track by providing a dedicated platform to explore the unique complexities and requirements of hands-free speech communication. While the main conference covers broader aspects of speech and audio processing, this workshop focuses on the specific challenges and techniques of hands-free scenarios. For hands-free applications to work seamlessly across the whole voice communication pipeline, audio signal processing technology is pivotal to improving the quality of the acquired speech signal, while the backend ASR system also needs to be integrated with the front-end modules. In such an integrated system, other important research questions emerge: system complexity can grow quickly, while achieving greater personalization without breaching privacy remains the utmost goal. By focusing on this specific domain, the workshop enhances the overall conference program and provides valuable insights to researchers and practitioners working in the field.

The workshop's main topics encompass various areas of research, including hands-free speech recognition, microphone array processing, beamforming techniques, acoustic echo cancellation, source separation, and multi-channel speech enhancement. These areas represent active research directions that require careful consideration of computational efficiency, as increasingly complex model architectures can hinder on-device deployment of neural network-based speech processing methods. Moreover, growing concerns about privacy breaches limit the usability of cloud-based models, where communication can become the bottleneck to the secure handling of speech data. We call for innovative approaches and novel techniques, especially those based on data science and learning algorithms. The workshop will provide a forum for participants to discuss the latest advancements, share experimental results, and explore potential collaborations, ultimately leading to improved hands-free speech communication systems.

Given the multifaceted challenges in hands-free speech communication, interdisciplinary work is crucial for making significant progress. This workshop encourages interdisciplinary collaboration by welcoming contributions that bridge multiple technical areas, such as signal processing, acoustics, machine learning, human-computer interaction, and telecommunications. By fostering synergies among researchers from diverse backgrounds, the workshop promotes the cross-pollination of ideas and the development of comprehensive solutions to hands-free speech communication challenges. The workshop is co-sponsored by both the AASP TC and the SLTC. In addition, it will be part of the Data Science Initiative as an entry in the IEEE Data Science and Learning Workshop series.

To facilitate practical understanding and showcase the state-of-the-art in hands-free speech communication, the workshop welcomes demonstrations of experimental systems and prototypes. Participants will have the opportunity to present and interact with tangible implementations of their research, fostering discussions on implementation challenges, performance evaluation, and real-world applications. These demonstrations contribute to the workshop's interactive and hands-on atmosphere, providing valuable insights and encouraging future directions in the field.

Topics of Interest

Keynote Speakers

Yi Luo Tencent

Title: Towards training robust and versatile speech front-end systems: data simulation, model design, and task definition

Abstract: Training speech front-end systems to be robust and versatile in real-world scenarios is always challenging. It involves highly matched data simulation pipelines that cover all the possible data patterns, powerful model designs that perform consistently well across different conditions, and carefully defined tasks that better represent the target application. In this talk, I will share some of the recent progress at Tencent AI Lab on these three aspects, including FRAM-RIR, an efficient room impulse response generator for fast and realistic on-the-fly data simulation and augmentation; band-split RNN (BSRNN), a versatile model architecture that has proven effective and powerful in various speech and music processing tasks; and a range of real-world tasks and problems on which BSRNN shows its advantages. Demos of several real-world scenarios will also be presented.

Bio: Dr. Yi Luo is currently a Senior Research Scientist at Tencent AI Lab. He holds a Ph.D. degree in electrical engineering from Columbia University and a bachelor’s degree in computer science from Fudan University. His research focuses on speech and audio processing and understanding, including source separation, speech front-end processing, music signal processing, microphone array processing, and deep neural networks. His work on the AI-based speech separation system Conv-TasNet received the 2021 IEEE Signal Processing Society Best Paper Award.

Nancy F. Chen A*STAR

Title: Multimodal, Multilingual Conversational AI Technology for Enhancing Inclusivity in Education  

Abstract: End-to-end modeling and deep learning have significantly advanced human language technology; recent examples include large language models. However, to ensure inclusivity and extend such benefits to more people, substantial work remains ahead. In this talk, we investigate how to make EdTech more inclusive for language learning applications along four dimensions: (1) Language Diversity, (2) Student Age Groups, (3) Human-Computer Interaction Styles, and (4) Subjectivity in Expert Evaluations. We illustrate how speech science and statistical machine learning can elegantly blend with neural modelling approaches to address technical challenges such as data sparsity, feature bias, and explainability. We will share our experience in developing AI technology to help students learn English, Mandarin Chinese, Malay, and Tamil. These languages span various linguistic families and possess varying degrees of linguistic resources suitable for computational approaches. Our endeavours have led to government deployment and commercial spin-offs, serving as valuable case studies for AI's role in cultural and linguistic heritage preservation.

Bio: Nancy F. Chen is a fellow, senior principal scientist, principal investigator, and group leader at I2R (Institute for Infocomm Research), A*STAR (Agency for Science, Technology and Research), Singapore. Her group works on generative AI in speech, language, and conversational technology. Her research has been applied to education, defense, healthcare, and media/journalism. Dr. Chen has published 100+ papers and supervised 100+ students/staff. She has won awards from IEEE, Microsoft, NIH, P&G, UNESCO, L’Oréal, SIGDIAL, APSIPA, and MICCAI. She is an IEEE SPS Distinguished Lecturer (2023-2024), a Program Chair of ICLR 2023, an A*STAR Fellow (2023), a Board Member of ISCA (2021-2025), and one of Singapore's 100 Women in Tech (2021). Technology from her team has led to commercial spin-offs and government deployment. Prior to A*STAR, she worked at MIT Lincoln Laboratory while doing her PhD at MIT and Harvard. For more info: http://alum.mit.edu/www/nancychen

Panel Discussion

Berrak Sisman University of Texas at Dallas

Title: Expressive Speech Synthesis: Applications and Future Directions

Abstract: Expressive speech synthesis plays a crucial role in artificial intelligence, encompassing the transformation of text into speech and the manipulation of speech properties such as voice identity, emotion, and accent. In this talk, Dr. Sisman will discuss recent advancements in speech synthesis and voice conversion, their potential, applications, and future directions.

Bio: Dr. Sisman received her PhD degree in Electrical and Computer Engineering from the National University of Singapore. She is currently a tenure-track Assistant Professor in the Department of Electrical and Computer Engineering at the Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, where she leads the Speech & Machine Learning Lab. Her research focuses on machine learning, speech synthesis, voice conversion, and emotional intelligence. Prior to joining UT Dallas, she was a faculty member at the Singapore University of Technology and Design (2020-2022), where she taught machine learning and deep learning courses. She was a Postdoctoral Research Fellow at the National University of Singapore (2019-2020), a Visiting Researcher at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh (2019), and a Visiting Researcher at the RIKEN Advanced Intelligence Project in Japan (2018). She plays leadership roles in conference organization and is active in technical committees. She has served as the General Coordinator of the Student Advisory Committee (SAC) of the International Speech Communication Association (ISCA), as an Area Chair at INTERSPEECH 2021, INTERSPEECH 2022, and IEEE SLT 2022, and as the Publication Chair at ICASSP 2022. She has been elected as a member of the IEEE Speech and Language Processing Technical Committee (SLTC) in the area of Speech Synthesis for the term from Jan. 2022 to Dec. 2024.

Wontak Kim Amazon Lab126

Abstract: For Alexa as a virtual assistant, a personalized AI approach is highly desirable so that it can tailor its responses to each user's specific needs. There are many use cases for this, but I'll touch upon the kinds of sensing and detection problems relevant to ambient intelligence and how those can unlock more personalized home assistant features.

Bio: Wontak Kim has 20+ years of experience developing audio processing algorithms and hardware systems for consumer products. In his current role as Senior Research Manager of the audio and data team at Amazon, he manages a team of scientists and engineers developing signal processing and deep learning models for Alexa and Echo products and features. These include mic array design, beamforming, AEC, and noise-reduction-based speech processing for Alexa ASR as well as voice communication. His team also develops spatial audio processing algorithms for music playback, such as upmixing, binauralization, and crosstalk cancellation. A third focus area is ambient-intelligence-related features: user and device localization, sound classification, room detection, and ultrasound-based presence detection. All of these algorithms are based on a mix of DSP and ML and must be implemented on local devices, presenting challenges of limited compute and memory footprint. Another challenge is the need to deploy them across many different hardware designs at scale; to that end, his team develops sophisticated audio simulation techniques to enable virtual hardware and algorithm development. Synthetic data generation for ML training is a major focus area in the world of AI. Before Amazon, he was at Bose for 16 years, serving various roles in engineering and management for automotive, home, and hearable applications. His technical contributions include virtual audio techniques for remote hearing, smart assistant voice processing, and micro-speaker acoustic design. He holds an MS in Acoustics from Penn State and a BS from American University. He has given various talks to facilitate academic and industry collaboration, most recently a keynote at the ACM IASA Workshop 2022.

Bhiksha Raj CMU LTI

Title: Privacy and Security in Speech

Abstract: With the increasing popularity and ubiquity of speech-based applications and user interfaces, the privacy and security challenges arising from their use have become increasingly concerning. Each time we use a voice-based service, we expose ourselves to abuse: undesired inferences made from our voice, tracking, impersonation, and worse. In this talk, I will briefly go over these challenges and some of the work we have done in this area over the years.

Bio: Bhiksha Raj is a professor in the School of Computer Science at Carnegie Mellon University, USA, with a primary affiliation to the Language Technologies Institute and secondary affiliations to the Machine Learning and Electrical and Computer Engineering departments. He is currently also a visiting professor at the Mohamed bin Zayed University of Artificial Intelligence in Abu Dhabi. Prof. Raj’s research interests span speech and audio processing, theoretical aspects of machine learning and deep learning, and privacy and security issues in speech processing. He has authored over 300 peer-reviewed papers, several patents, and multiple edited books in these areas. Prof. Raj is a Fellow of the IEEE and a Fellow of ISCA.

The Technical Program

The Venue

COEX, Seoul, Korea. The workshop shares the same venue as ICASSP 2024.

Location: Room 205

The Workshop Day 

4/15 Monday (full day, 08:30-17:30)

Tentative Technical Program

(08:30-08:45) Opening remarks

(08:45-08:50) Oral Session Preparation

(08:50-10:10) Oral Session I

(10:10-10:30) Coffee Break

(10:30-10:50) Oral Session II

(11:00-12:00) Keynote by Nancy F. Chen 

(12:00-13:30) Lunch Break 

(13:30-14:30) Keynote by Yi Luo 

(14:30-15:40) Panel Discussion

(15:40-16:00) Break/poster preparation

(16:00-17:30) Poster Session

(17:30-18:00) Location change: CHiME-8 pitching session (Room 206B)

Posters:

Information for Authors

Important Dates

12/20/2023
01/10/2024 submission deadline 

01/12/2024 reviews are assigned

01/31/2024 review submission deadline

02/01/2024 author notification

02/07/2024 camera-ready submission deadline 

Publications follow a two-track structure: an archival track and a non-archival track.

The Archival Track Information

The archival track accepts novel paper contributions, which will be published in IEEE Xplore. Please submit your paper using this link. Papers must follow ICASSP 2024's formatting guidelines (4 pages plus 1 additional page for references) to be published in IEEE Xplore. All submissions will go through a peer-review process.

The Non-Archival Track Information

We also accept non-archival track papers, which will NOT be published through IEEE Xplore, although we will still post them on this website. The non-archival track accepts work in progress, already published work, or demonstrations of systems, for those who want to share their research ideas with like-minded audiences. Please note that the papers should still be technically sound and follow ICASSP 2024's formatting guidelines, although we allow a longer page limit of up to 8 pages (plus one references-only page). For this track, please submit your paper using this link.

Organizers

Minje Kim

University of Illinois at Urbana-Champaign
Amazon Lab126

Website: https://minjekim.com

Minje Kim is an Associate Professor at the University of Illinois at Urbana-Champaign and a visiting academic at Amazon Lab126, specializing in machine learning models for audio signal processing. Before that, he was an associate professor at Indiana University. He obtained his Ph.D. from the University of Illinois at Urbana-Champaign (2016) and worked as a researcher at ETRI, a national lab in Korea (2006-2011). Minje Kim's contributions and expertise have been recognized through various awards, including the NSF CAREER Award, the IU Trustees Teaching Award, the IEEE SPS Best Paper Award, and the Richard T. Cheng Endowed Fellowship from UIUC. In addition, he holds editorial roles in journals, serving as a Senior Area Editor for IEEE/ACM Transactions on Audio, Speech, and Language Processing, an Associate Editor for the EURASIP Journal on Audio, Speech, and Music Processing, and a Consulting Associate Editor for the IEEE Open Journal of Signal Processing. He is an IEEE Senior Member and has served on the IEEE Audio and Acoustic Signal Processing Technical Committee as a member (2018-2023) and as Vice Chair (2024). He was the general co-chair of WASPAA 2023. He actively participates as a reviewer, program committee member, and area chair for major machine learning and signal processing venues.


Paola Garcia

Johns Hopkins University

Website: https://www.clsp.jhu.edu/faculty/paola-garcia/

Dr. Leibny Paola Garcia Perera is affiliated with Johns Hopkins University. With a strong background in both academia and industry, including Agnitio and Nuance Communications, Dr. Garcia Perera brings extensive research experience to her current role. Dr. Garcia Perera played a vital role in organizing the Self-supervision in Audio, Speech and Beyond (SASB) workshop at ICASSP 2023, fostering interactions within the SSL community and promoting the adoption of SSL techniques in real-life speech and audio technologies. She also led a team of over 20 researchers from renowned laboratories worldwide during the JHU summer workshop 2019, focusing on far-field speech diarization and speaker recognition. Previously, Dr. Garcia Perera worked as a researcher at Tec de Monterrey, Mexico, for ten years and held the role of Marie Curie researcher for the Iris project in Spain. She has also been a visiting scholar at the Georgia Institute of Technology and Carnegie Mellon University. She actively contributes to cutting-edge research as part of the JHU CHiME5, CHiME6, SRE18, SRE19, SRE20, and SRE21 teams, collaborating with DARCLE.org and CCWD. Dr. Garcia Perera's research interests span various areas, including diarization, speech recognition, speaker recognition, machine learning, and language processing, while her recent research activities include children's speech analysis, specifically in child speech recognition and diarization. Her diverse background and involvement in high-profile projects highlight her expertise and dedication to advancing speech and audio processing. At JHU, she continues to contribute to the research community and pursue innovative techniques in speech and language processing.

Jonah Casebeer

Adobe Research

Website: https://jmcasebeer.github.io

Dr. Jonah Casebeer is a Research Scientist at Adobe Research. He holds a Ph.D. and a bachelor's degree in Computer Science from the University of Illinois at Urbana-Champaign, where he was advised by Paris Smaragdis. His research interests are broadly in machine learning and signal processing for audio. During his studies, he received multiple excellence fellowships from UIUC, as well as funding from Adobe, Apple, and Amazon.

Sponsored by
Speech and Language Processing Technical Committee (IEEE Signal Processing Society)
Audio and Acoustic Signal Processing Technical Committee (IEEE Signal Processing Society)
Data Science Initiative (IEEE Signal Processing Society)
International Speech Communication Association (ISCA)