Speech-to-text (STT) is the task of directly mapping the speech signal to text, covering both transcription and translation. Automatic STT is increasingly deployed in real application scenarios, including meeting transcription/translation and summarization, medical documentation, and the recognition of product names/acronyms, contact-list names, music/song/media titles, digital assistant commands, and place names/locations.
In these scenarios, context information, e.g., a description of the targeted named entities (NEs) such as person names, is available to customize the STT system. The user expects high accuracy on these NEs, even if they rarely appear in the STT model's training data.
The quality of the customization is typically measured by precision/recall or error-rate reduction on the targeted words, along with error-rate changes on all (general) words.
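As a rough illustration of these metrics (not part of the session description), the sketch below computes NE-word precision/recall and overall word error rate for reference/hypothesis pairs; the example utterances, the bag-of-words matching of NE occurrences, and the helper functions are hypothetical and only meant to make the metrics concrete.

```python
# Hypothetical sketch: scoring contextual customization with NE precision/recall
# and overall word error rate (WER). All names and example data are illustrative.
from collections import Counter


def word_error_rate(ref, hyp):
    """Levenshtein word edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return d[len(h)] / max(len(r), 1)


def ne_precision_recall(refs, hyps, ne_words):
    """Bag-of-words precision/recall of the targeted NE words."""
    ne_words = {w.lower() for w in ne_words}
    tp = fp = fn = 0
    for ref, hyp in zip(refs, hyps):
        r = Counter(w for w in ref.lower().split() if w in ne_words)
        h = Counter(w for w in hyp.lower().split() if w in ne_words)
        tp += sum((r & h).values())
        fn += sum((r - h).values())
        fp += sum((h - r).values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


refs = ["please call yifan gong about the asru session"]
hyps = ["please call yi fan gong about the asru session"]
print(ne_precision_recall(refs, hyps, {"yifan", "gong"}))  # (1.0, 0.5)
print(word_error_rate(refs[0], hyps[0]))                   # 2 errors / 8 words = 0.25
```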
Automatic STT performs adequately on general transcription/translation tasks. However, despite typically being trained on a large number (and variety) of speech datasets, STT in contextual, NE-critical scenarios remains challenging and often inadequate. Errors on NE words can severely degrade the performance of downstream applications and have become one of the major obstacles to real-world deployment.
The past few years have seen effective modeling strategies developed for STT contextual customization, with multiple sessions at ICASSP/SLT devoted to customization, personalization, and adaptation, spanning: shallow fusion; neural-network-based biasing of the encoder, the decoder, or both; customization with text-only data or paired text and audio; training data and recipes; customization with other types of context; LLM prompting; and evaluation methods and benchmarks.
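As a minimal, hypothetical sketch of the shallow-fusion-with-biasing idea mentioned above (not an implementation from any of these sessions), the snippet below boosts the decoding score of a candidate word that extends a word-level prefix of a user-supplied bias phrase, on top of a generic STT-plus-LM shallow-fusion score; the function names, weights, and score inputs are assumptions made for illustration.

```python
# Hypothetical sketch of shallow fusion with contextual biasing during decoding.
# Model and LM log-probabilities are passed in as plain floats; weights are illustrative.

def build_bias_prefixes(bias_phrases):
    """All word-level prefixes of the bias phrases, e.g. ('yifan',) and ('yifan', 'gong')."""
    prefixes = set()
    for phrase in bias_phrases:
        words = phrase.lower().split()
        for i in range(1, len(words) + 1):
            prefixes.add(tuple(words[:i]))
    return prefixes


def biased_score(token, history, stt_log_prob, lm_log_prob, prefixes,
                 lm_weight=0.3, bias_bonus=2.0):
    """Score one candidate word: STT score + weighted LM score + a bonus if the
    word extends a bias-phrase prefix ending at the current hypothesis."""
    score = stt_log_prob + lm_weight * lm_log_prob
    for k in range(len(history) + 1):
        if tuple(history[len(history) - k:] + [token]) in prefixes:
            score += bias_bonus
            break
    return score


prefixes = build_bias_prefixes(["Yifan Gong", "ASRU"])
# "gong" extends the matched prefix ("yifan",), so it receives the bias bonus.
print(biased_score("gong", ["please", "call", "yifan"],
                   stt_log_prob=-2.1, lm_log_prob=-3.0, prefixes=prefixes))
```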
The objective of this special session is to provide researcher/engineer participants with a focused forum for discussing and developing effective methods for contextual customization of end-to-end STT, audio/speech LLMs, and cascaded STT + LLM systems.
Contextual customization for speech-to-text
Context-aware speech processing
Contextual biasing for speech foundation models
Personalization
Neural biasing
Rare word recognition
Attention-based biasing
Submission Link: Submit Paper
Instructions: We follow the same Author Instructions as ASRU 2025. When submitting, please be sure to select "SS3. Contextual Customization for Speech-to-Text" as your primary subject area to ensure your paper is properly considered for inclusion.
All deadlines are Anywhere on Earth (AoE), unless otherwise specified.
Paper submissions open: March 28, 2025
Paper submissions due: May 28, 2025
Paper revision due: June 4, 2025
Acceptance notification: August 6, 2025
Camera-ready deadline: August 13, 2025
Microsoft
Yifan Gong is a Principal Scientist Architect with Microsoft AI Services, developing multi-modality machine learning and speech-to-text modeling technologies and tools for customers across scenarios/tasks, languages, and acoustic environments, for both cloud and mobile devices. As a Principal Scientist Architect, Yifan provides expertise and leadership in research, architecture, and collaboration. Products he and his group have delivered include the Microsoft Azure AI Speech API, meeting captioning/transcription and translation, multi-channel speaker attribution and diarization, and digital voice assistant applications including Cortana/Xbox. Prior to joining Microsoft in 2004, he worked as a Senior Research Scientist at the National Center of Scientific Research (CNRS, France) and then as a Senior Member of Technical Staff at Texas Instruments. Yifan received his Ph.D. in Computer Science from the Department of Mathematics and Computer Science, University of Nancy I, France. He has authored and co-authored 300+ publications in books, journals, and conferences, and holds ∼100 granted patents.
An IEEE Fellow, Yifan has served on the Senior Editorial Board of the IEEE Signal Processing Magazine, as a member of the IEEE Signal Processing Society Technical Directions Board, as a member of the IEEE Signal Processing Society Speech and Language Processing Technical Committee (SLTC) for several terms (1998-2002, 2012-2016, 2017-2019, 2022-2024), and as Chair of the SLTC (2023-2024).
Amazon
Jing Liu is an Applied Science Manager with Amazon Artificial General Intelligence, where he develops efficient algorithms for multi-modal generative AI systems. Jing provides technical leadership, vision, and design for next-generation architectures optimized for neural efficiency in Amazon foundation models. Products he and his team have delivered include Alexa all-neural ASR with attention-based contextual biasing for cloud and edge systems, as well as speculative decoding and long context for Amazon Nova multi-modal foundation models. Prior to joining Amazon in 2017, Jing worked for three years in quantitative research on Wall Street at JPMorgan. He earned his Ph.D. from the Department of Mathematical Sciences at Carnegie Mellon University in Pittsburgh, PA.