For any queries, email: s4p.daiict@gmail.com
Talk: Representation Learning of Speech and its Applications
Abstract: Speech representation learning is a crucial component in various speech processing applications, including speech recognition (ASR), speaker identification, emotion recognition, and language identification. This field focuses on encoding speech signals into compact, informative representations (either dense embeddings or discrete tokens) that serve as input for downstream tasks. In this presentation, we explore the evolution of speech encoding techniques, beginning with a widely adopted unsupervised or self-supervised learning approach. We then delve into a series of advancements that have significantly improved the performance of speech representation models (or speech encoders): task-aware training, knowledge distillation, dual-mode encoding capabilities, and language-aware encoding through attention mechanisms. These innovations have collectively enhanced the accuracy of speech-based applications. Furthermore, we examine the effects of integrating these advanced speech encoders with multimodal large language models (LLMs). The talk concludes with a comparative analysis of our developed models against other external ASR models.
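To make one of the listed techniques concrete, the sketch below illustrates knowledge distillation between speech encoders in the generic form of matching frame-level embeddings. It is only an illustrative toy (the encoder class, dimensions, and plain MSE objective are assumptions), not the speaker's implementation.

# Minimal sketch (illustrative only): distilling frame-level embeddings from a
# "teacher" speech encoder into a "student". TinyEncoder is a placeholder for
# real pretrained/compact encoders.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy convolutional speech encoder producing frame embeddings."""
    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Conv1d(1, dim, kernel_size=10, stride=5)
        self.proj = nn.Linear(dim, dim)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> frames: (batch, time, dim)
        x = self.conv(wav.unsqueeze(1)).transpose(1, 2)
        return self.proj(x)

teacher = TinyEncoder(dim=256)   # stands in for a large pretrained encoder
student = TinyEncoder(dim=256)   # model trained to mimic the teacher
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

wav = torch.randn(4, 16000)      # a batch of 1-second dummy waveforms
with torch.no_grad():
    target = teacher(wav)        # teacher embeddings are kept fixed
optimizer.zero_grad()
loss = nn.functional.mse_loss(student(wav), target)  # distillation objective
loss.backward()
optimizer.step()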
Dr. Sri Garimella is the Director of Applied Science, Bengaluru, heading the Amazon Artificial General Intelligence (AGI, Speech Understanding) teams in India and Europe. His organization is responsible for building and advancing the core multimodal LLM capabilities and speech and natural language understanding technologies at AGI. He has been associated with Amazon for more than 12 years. He obtained his PhD from the Department of Electrical and Computer Engineering, Center for Language and Speech Processing, at Johns Hopkins University, Baltimore, USA, in 2012, and a Master of Engineering in Signal Processing from the Indian Institute of Science (IISc), Bengaluru, India, in 2006.
Talk: Audio & Speech Processing | What have we been doing?
Abstract: In this talk, we will explore some of the more recent work that has been happening at TCS in the area of audio and speech signal processing. We will look at what is required to enable a voice user interface for the emergent user. Some of the things we will cover are (a) how knowledge of the microphone location impacts the performance of an ASR system, (b) a novel data augmentation method for enabling robust ASR, and (c) the importance of choosing an appropriate vocabulary-size hyperparameter in an end-to-end (e2e) ASR system. Time permitting, we will look at a spoken grammar assessment tool, the need for a new metric for audio captioning, and some experiments around text-to-speech.
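For readers unfamiliar with data augmentation for robust ASR, here is a minimal sketch of the standard baseline of mixing noise into clean speech at a target SNR. This is a generic illustration, not the novel method described in the talk.

# Generic additive-noise augmentation at a target SNR (illustrative baseline).
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested signal-to-noise ratio."""
    # Tile or truncate the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Gain such that speech_power / (gain^2 * noise_power) = 10^(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)     # stand-in for a 1-second utterance
babble = rng.standard_normal(8000)     # stand-in for recorded noise
noisy = mix_at_snr(clean, babble, snr_db=10.0)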
Dr. Sunil Kumar Kopparapu, a Senior Member of IEEE and ACM in India, holds a doctoral degree in Electrical Engineering from IIT Bombay, India. His career includes roles at CSIRO, Australia, and Aquila Technologies Pvt. Ltd., India, before he joined Tata Infotech Limited's Cognitive Systems Research Lab (CSRL). Presently, as a Principal Scientist at TCS Research & Innovations Labs - Mumbai, he focuses on speech, script, and natural language processing, aiming to create practical solutions for Indian conditions. His contributions include co-authored books, patents, and publications.
Talk: Advances in Speech Large Language Models for Recognition and Translation
Abstract: This talk explores the latest progress in Speech Large Language Models (Speech LLMs), with a focus on their use in automatic speech recognition (ASR) and automatic speech translation (AST). Unlike traditional cascaded systems, Speech LLMs enable end-to-end modeling of spoken language, offering improved contextual understanding and more natural human-like interactions. We examine recent architectural innovations integrating speech and language through unified tokenization and adaptation mechanisms. While ASR datasets are increasingly available, there remains a critical lack of high-quality AST resources for Indian languages. To bridge this gap, we introduce IndicST, a new dataset for training and evaluating Speech LLMs in the Indian linguistic context. Additionally, we analyze how component-level changes within Speech LLMs impact ASR and AST outcomes across Indian languages. The session concludes by addressing key challenges and future opportunities for real-world deployment of these models.
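One common pattern for integrating a speech encoder with an LLM (a general technique, not necessarily the architecture used in the talk or in the IndicST experiments) is a lightweight adapter that downsamples encoder frames and projects them into the LLM's embedding space. A minimal sketch, with illustrative dimensions and a 4x frame-stacking factor:

# Minimal sketch of a speech-to-LLM adapter: stack neighbouring encoder frames
# to shorten the sequence, then project into the LLM embedding space.
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    def __init__(self, enc_dim: int = 512, llm_dim: int = 2048, stack: int = 4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, enc_dim)
        b, t, d = frames.shape
        t = t - (t % self.stack)                           # drop leftover frames
        x = frames[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(x)                                # (batch, time/stack, llm_dim)

adapter = SpeechAdapter()
speech_frames = torch.randn(2, 100, 512)    # dummy speech-encoder output
llm_tokens = adapter(speech_frames)         # ready to prepend to text embeddings
print(llm_tokens.shape)                     # torch.Size([2, 25, 2048])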
Dr. Nagaraj Adiga leads multimodal LLM development as a Senior Principal Research Scientist at Outcomes AI, delivering solutions for healthcare applications. He has deep expertise in audio language models, speech recognition, synthesis, and enhancement technologies. He earned his Ph.D. from IIT Guwahati and completed postdoctoral research at the University of Crete, Greece. His industry experience includes positions at Nokia, Apple, Zapr Media Labs, Samsung, and Krutrim.AI prior to joining Outcomes AI. His research interests include machine learning, signal processing, and speech processing.
Talk: Towards Universal Audio Understanding: A Unified Encoder for Speech and Audio Tasks
Abstract: Recent advancements in speech and audio encoders have drawn significant attention due to their integration with Large Language Models for diverse acoustic tasks. While most research has focused on developing specialized encoders for either speech or audio domains, with limited solutions addressing streaming constraints, there remains a critical gap in unified approaches. This presentation introduces a novel universal audio-speech encoder designed to process the complete spectrum of acoustic inputs, from human speech to environmental sounds. Our encoder generates robust representations that seamlessly interface with large language models for multiple downstream tasks, including automatic speech recognition, speech translation, audio captioning, and event detection. We address the fundamental challenges of unifying traditionally separate speech and audio encoding paradigms while effectively handling both streaming and non-streaming applications. Through our analysis of existing foundation models, we identify their limitations and present innovative techniques to bridge these gaps. Experimental results demonstrate that our universal encoder achieves comparable or superior performance to specialized models across various benchmarks, marking a significant step toward a truly unified audio processing framework.
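One widely used way to handle both streaming and non-streaming operation in a single encoder (a general technique, offered here only as background and not as a description of this encoder's design) is chunk-wise causal attention masking, where each frame attends within its chunk and to all past chunks; with a chunk as long as the input, this reduces to full offline attention.

# Illustrative chunk-wise attention mask for dual streaming/non-streaming use.
import torch

def chunk_causal_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    """Boolean mask where True marks allowed attention positions."""
    idx = torch.arange(seq_len)
    query_chunk = idx.unsqueeze(1) // chunk_size   # (seq_len, 1)
    key_chunk = idx.unsqueeze(0) // chunk_size     # (1, seq_len)
    return key_chunk <= query_chunk                # attend to current + past chunks

streaming_mask = chunk_causal_mask(seq_len=8, chunk_size=2)
offline_mask = chunk_causal_mask(seq_len=8, chunk_size=8)   # all True: full attention
# Note: torch.nn.MultiheadAttention treats True in `attn_mask` as *blocked*,
# so pass `~mask` if feeding these masks to it.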
Dr. Debmalya Chakrabarty is currently working as a Senior Applied Scientist on the Amazon AGI Alexa Machine Learning team, with expertise in speech processing, auditory scene analysis, and machine learning for acoustic modeling. He holds a Ph.D. in Electrical and Computer Engineering from Johns Hopkins University (JHU), USA. Debmalya has contributed significantly to improving Alexa's speech recognition frameworks catering to Indic language recognition, working on projects involving CTC-based acoustic models, semi-supervised learning, and Transformer/Conformer architectures with a focus on multilingual speech recognition. He has published extensively in prestigious venues such as ICASSP and SLT on topics including ASR, temporal-dynamics-augmented acoustic scene analysis, and speaker verification frameworks.
Talk: Automatic Speech Recognition: From Problem to Research to Productisation
Abstract: Automatic speech recognition (ASR) is an essential part of Samsung Galaxy AI features such as transcript assist, live translate, and call transcript. The real-time usefulness of such features requires the ASR system to be robust against various practical challenges, such as background noise, room reverberation, effects of recorded/playback sound, latency, and on-device memory constraints. This talk discusses how the Galaxy AI team identifies such challenges, overcomes them, and provides product-based solutions for speech from network calls and listening-mode applications.
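Room reverberation of the kind mentioned above is commonly simulated during training by convolving clean speech with a room impulse response (RIR). The sketch below uses a synthetic, exponentially decaying RIR purely as a stand-in; it describes general practice, not the Galaxy AI pipeline.

# Simulating reverberation by convolving speech with a room impulse response.
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
sr = 16000
speech = rng.standard_normal(sr)                    # dummy 1-second utterance

rir_len = int(0.3 * sr)                             # ~300 ms reverberation tail
rir = rng.standard_normal(rir_len) * np.exp(-np.linspace(0, 8, rir_len))
rir /= np.max(np.abs(rir)) + 1e-12                  # normalise the peak

reverberant = fftconvolve(speech, rir, mode="full")[:len(speech)]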
Dr. Premjeet Singh is a Lead Engineer at Samsung Research Institute Bengaluru, India. He received his Ph.D. in Speech Signal Processing from the Indian Institute of Technology (IIT), Kharagpur. His Ph.D. research centres on developing mathematical models to improve the extraction of nuanced emotional cues from speech and on using them, along with deep learning models, for improved speech emotion recognition. His notable publications involve modulation spectral features, constant-Q filterbank-based representations, and deep scattering networks for speech emotion recognition, presented at esteemed conferences such as INTERSPEECH, SPCOM, and EUSIPCO.
Senior Research Scientist, Sony Research India, Bengaluru.
Talk: Generative Modeling for Emotional Speech Synthesis: Progress, Pitfalls, and Possibilities
Abstract: Emotional speech synthesis aims to generate human-like speech that conveys not only linguistic content but also expressive emotional states. As emotionally rich speech becomes critical for applications in entertainment, virtual storytelling, and immersive user experiences, this domain holds the potential to bridge the emotional gap between humans and machines. Understanding and replicating emotions in speech is not just a technical challenge, but a step toward more natural and empathetic human-computer interaction. This talk presents a comprehensive overview of the emotional speech synthesis field, highlighting key developments and challenges in modeling emotional prosody and style. We will explore the role of generative modeling techniques in advancing emotional Text-to-Speech (TTS), covering how these methods enable more controllable, diverse, and realistic synthesis. The talk will also discuss fundamental issues such as emotion representation, emotion controllability, and the evaluation of emotional expressiveness. Emphasis will be placed on current trends and open research problems, offering insights into future directions for building emotionally aware and socially intelligent speech systems. Finally, I will briefly share insights from our recent paper, which introduces a method for emotion intensity regularization in emotional voice conversion, contributing toward finer emotional control in synthesized speech.
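The paper mentioned at the end concerns emotion intensity control in emotional voice conversion. Without reproducing that method, a common way to expose an intensity knob is to scale a learned emotion embedding before it conditions the decoder; the toy sketch below illustrates only this general idea, and all module names and dimensions are placeholders.

# Toy illustration of intensity-controllable emotion conditioning.
import torch
import torch.nn as nn

class EmotionConditioner(nn.Module):
    def __init__(self, num_emotions: int = 5, emo_dim: int = 64, dec_dim: int = 256):
        super().__init__()
        self.emotion_table = nn.Embedding(num_emotions, emo_dim)
        self.to_decoder = nn.Linear(emo_dim, dec_dim)

    def forward(self, decoder_states, emotion_id, intensity):
        # decoder_states: (batch, time, dec_dim); intensity: (batch,) in [0, 1]
        emo = self.emotion_table(emotion_id)            # (batch, emo_dim)
        emo = emo * intensity.unsqueeze(-1)             # scale by desired intensity
        return decoder_states + self.to_decoder(emo).unsqueeze(1)

cond = EmotionConditioner()
states = torch.randn(2, 50, 256)                        # dummy decoder states
out = cond(states, emotion_id=torch.tensor([1, 3]), intensity=torch.tensor([0.3, 0.9]))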
Dr. Nirmesh J. Shah received his Ph.D. in voice conversion in 2019 from DAU (formerly DA-IICT), Gandhinagar. He has been working as a Senior Research Scientist at Sony Research India since 2020, where he primarily leads speech-related activities. He has received several awards at Sony Research India, including Director's Most Valuable Performer and Star Team Player, for his numerous contributions to several projects over the past four years. He was elevated to IEEE Senior Member in 2023. He has published 30+ research papers in top conferences and peer-reviewed journals. He serves as a reviewer for many IEEE journals and top conferences, namely ICASSP and INTERSPEECH. He received IEEE SPS travel grants to present his research papers at ICASSP 2014 and ICASSP 2017. He also received an ISCA travel grant to present his research paper at MLSLP 2018, a satellite event of INTERSPEECH 2018. He completed his Ph.D. and Master's in the domain of speech synthesis under the guidance of Prof. Hemant A. Patil at DAU, Gandhinagar. During his Master's and Ph.D., he was also associated with the consortium project on Development of Text-to-Speech (TTS) Systems in Indian Languages, Phase II, from May 2012 to December 2015. He has contributed as a volunteer for most of the summer/winter schools organized by the Speech Research Lab at DAU, Gandhinagar.
Senior AI Scientist, Uniphore, Bengaluru.
Talk: Beyond Supervision: Leveraging Pseudo-Labels and LLMs for Domain-Specific ASR
Abstract: Training high-quality automatic speech recognition (ASR) models typically requires extensive transcribed data, posing a significant challenge for domain adaptation in scenarios with limited or no labeled audio. In this talk, I will introduce a unified approach for training transducer-based ASR models in a semi-supervised setting, leveraging large unlabeled corpora and minimal or zero manual annotation. This work explores the generation and refinement of pseudo-labels using outputs from multiple ASR models, enhanced through consensus mechanisms, large language models (LLMs), and speech-based LLMs (SpeechLLMs). We propose a flexible framework that combines model prompting, multi-system alignment, and filtering techniques based on consensus voting, named entity recognition (NER), and error-rate prediction. Our experiments across diverse datasets, including call-center and conversational corpora, demonstrate that these strategies not only improve the quality of pseudo-labels but also enable scalable training of ASR models with significantly reduced reliance on human-annotated data. The findings demonstrate the potential of semi-supervised pipelines to democratize ASR development, especially in low-resource or domain-specific settings.
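As a simplified illustration of the consensus idea (not the proposed framework itself), one can keep an utterance's pseudo-label only when hypotheses from different ASR systems agree closely, for example measured by the word error rate between them:

# Simplified consensus filter for pseudo-labels: keep an utterance only if the
# hypotheses from two ASR systems agree closely (low WER between them).

def word_error_rate(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

def consensus_filter(hyps_a, hyps_b, max_wer: float = 0.1):
    """Return indices of utterances where the two systems agree."""
    return [i for i, (a, b) in enumerate(zip(hyps_a, hyps_b))
            if word_error_rate(a, b) <= max_wer]

keep = consensus_filter(
    ["please reset my password", "the call dropped twice"],
    ["please reset my password", "the call dropped price"],
)
print(keep)   # [0] -- only the first utterance passes the agreement threshold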
Dr. Bidisha Sharma received her Ph.D. degree from the Indian Institute of Technology (IIT) Guwahati in 2018. Her Ph.D. work focused on improving the quality of synthesized speech obtained from a text-to-speech synthesis (TTS) system. After completing her Ph.D., she worked as a Research Fellow, first in the Sound and Music Computing (SMC) laboratory and then in the Human Language Technology (HLT) laboratory at the National University of Singapore until September 2021, where she worked on projects related to automatic speech recognition and spoken language understanding. Dr. Bidisha Sharma is currently working as a Senior AI Scientist at Uniphore, with a passion for advancing conversational audio solutions. Her research interests lie in speech processing, encompassing areas such as automatic speech recognition, text-to-speech synthesis, speech enhancement, end-to-end spoken language understanding, and music processing. Dr. Bidisha was an active member of the organizing committees of the IEEE ASRU 2019, SIGDIAL 2021, IWSDS 2021, and COCOSDA 2021 conferences. She was a co-chair of the Young Female Researchers Mentoring (YFRM) event at ASRU 2019 and a postdoctoral mentor at the mentoring event at INTERSPEECH 2019.
Speech Recognition Engineer, Augnito, Mumbai, India.
Talk: Next-Generation Speech Recognition: Scaling Self-Supervised Learning, Multimodal Fusion, and LLM-Augmented ASR for Real-World Deployment
Abstract: The field of automatic speech recognition (ASR) has undergone transformative changes with the rise of self-supervised learning (SSL), large language models (LLMs), and multimodal foundation models. In this talk, we explore how these advancements address persistent challenges in multi-accent generalization, noise robustness, and real-time ambient speech processing. We begin by examining how self-supervised pretraining (e.g., wav2vec 3.0, Whisper-v3) has reduced reliance on labeled data while improving cross-accent generalization. Next, we discuss innovations in noise-robust ASR, including dynamic acoustic adaptation and neural dereverberation techniques that leverage visual or contextual cues for improved performance in chaotic environments. A key focus is the integration of LLMs into end-to-end ASR systems, enabling not just transcription but semantic disambiguation, speaker-adaptive correction, and task-aware contextualization (e.g., for medical or legal domains). We also highlight emerging work on "ASR as a sensor"—using speech recognition for ambient intelligence in healthcare, education, and human-computer interaction.
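One simple form of LLM-augmented ASR is second-pass rescoring of the recognizer's N-best hypotheses with a language model score. The sketch below shows only the interpolation step; `llm_log_prob` is a hypothetical hook for whatever LLM scorer is available, and the interpolation weight is arbitrary. This is offered as background, not as any particular system's method.

# Sketch of second-pass N-best rescoring with an (external) language model.

def llm_log_prob(text: str) -> float:
    """Placeholder: return the LLM log-probability of `text`."""
    raise NotImplementedError("hook up a real LLM scorer here")

def rescore_nbest(nbest, lm_weight: float = 0.3):
    """nbest: list of (hypothesis, acoustic_log_score). Returns the best text."""
    def combined(item):
        hyp, am_score = item
        return (1.0 - lm_weight) * am_score + lm_weight * llm_log_prob(hyp)
    return max(nbest, key=combined)[0]

# Example usage (once llm_log_prob is implemented):
# best = rescore_nbest([("the patient has a fever", -12.3),
#                       ("the patient has a favor", -11.9)])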
Dipesh K. Singh earned his B.E. degree in Electrical and Electronics Engineering in 2018. He completed his M.Tech. degree in Information and Communication Technology from DAU (formerly DA-IICT), Gandhinagar, in 2021. Currently, he is a Speech Recognition Engineer at Augnito, Mumbai, India, specializing in healthcare. Dipesh was a pivotal member of the team that achieved top performance in the language diarization task of the DISPLACE challenge held during INTERSPEECH 2023 in Dublin, Ireland. His research interests include speech processing for healthcare, children's ASR, and voice privacy. He has authored numerous papers published in journals and conferences.
Speech Solutions Architect, Gnani.ai, Bengaluru.
Talk: Building Real-Time Voice-to-Voice LLMs: Toward Expressive, Multilingual, and Task-Aware AI
Abstract: Voice-to-voice systems are redefining the boundaries of conversational AI by enabling direct, real-time speech interactions—where input speech is understood, reasoned over, and responded to with expressive, human-like output speech. These systems go beyond simple transcription or translation, aiming to capture the richness of human communication across language, emotion, and intent. This talk explores the core components and design considerations behind such systems: modeling multilingual speech, preserving speaker affect and prosody, and enabling intelligent, task-aware responses. Multitask training approaches that combine speech recognition, language identification, and emotion modeling help develop representations that are robust across speakers, dialects, and use cases. Speech generation models are guided not only by linguistic content but also by cues from the speaker’s tone, emotion, and conversational rhythm. A critical part of enabling natural interaction is low-latency end-of-utterance detection, which determines when a system should respond. This timing mechanism plays a vital role in creating responsive, turn-based dialogues, especially in real-time environments where naturalness is key. Use cases such as real-time voice-to-voice translation, spoken task execution, and interactive AI agents illustrate the promise of voice-native systems that can understand and act through speech alone. Rather than treating speech simply as a carrier for text, these systems view it as the primary interface for reasoning, action, and expression. This talk offers a look at the current capabilities and limitations of voice-to-voice systems, and the emerging research directions driving them toward more natural, multilingual, and emotionally intelligent communication.
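To ground the end-of-utterance detection mentioned above, here is a minimal energy-based endpointer: it declares the utterance finished once speech has been observed and is followed by a fixed stretch of silence. Production systems typically rely on learned voice activity detection and semantic endpointing; this is only a sketch, and all thresholds are illustrative.

# Minimal energy-based end-of-utterance detector (illustrative only).
import numpy as np

def detect_end_of_utterance(audio: np.ndarray, sr: int = 16000,
                            frame_ms: int = 20, energy_thresh: float = 1e-3,
                            hangover_ms: int = 400):
    """Return the sample index where the utterance is judged complete, or None."""
    frame = int(sr * frame_ms / 1000)
    needed_silence = hangover_ms // frame_ms
    seen_speech, silent_frames = False, 0
    for start in range(0, len(audio) - frame + 1, frame):
        energy = float(np.mean(audio[start:start + frame] ** 2))
        if energy > energy_thresh:
            seen_speech, silent_frames = True, 0       # speech resets the hangover
        elif seen_speech:
            silent_frames += 1
            if silent_frames >= needed_silence:
                return start + frame                   # end of the silence run
    return None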
Thoshith S leads Speech AI at gnani.ai, where he drives the development and deployment of cutting-edge voice technologies at scale. With over seven years of experience in the field, he has contributed across the evolution of speech systems—from classic HMM-GMM and TDNN pipelines, through multilingual streaming RNNT and Conformer, to today’s end-to-end voice-to-voice large language models that listen, reason, and reply in real time. His expertise spans the entire lifecycle: data strategy, model architecture and adaptation, large-scale training, cloud infrastructure, and post-deployment optimization. At gnani.ai, Thoshith has spearheaded end-to-end Speech AI initiatives across industries such as banking, insurance, automotive, and customer service. The ASR systems he has architected now handle over 10 million voice interactions per day, delivering high-accuracy transcription, real-time intent recognition, and secure voice biometrics in multiple languages. He brings a combination of technical depth and product acumen, ensuring that research-driven innovation translates into reliable, scalable solutions in production environments. With a strong focus on real-time performance, linguistic diversity, and regulatory readiness, Thoshith is committed to building speech systems that are accurate, robust, and enterprise-ready—advancing the future of voice-first human-machine interaction.