Tara N. Sainath
Principal Research Scientist at Google Research, USA.
Tara N. Sainath received her S.B., M.Eng., and Ph.D. in Electrical Engineering and Computer Science (EECS) from the Massachusetts Institute of Technology (MIT), Cambridge, MA, USA. The main focus of her Ph.D. work was acoustic modeling for noise-robust speech recognition. After her Ph.D., she spent five years in the Speech and Language Algorithms group at the IBM T. J. Watson Research Center before joining Google Research. She served as a Program Chair for the International Conference on Learning Representations (ICLR) in 2017 and 2018. She has also co-organized numerous special sessions and workshops at speech and machine learning conferences, including INTERSPEECH 2010, the International Conference on Machine Learning (ICML) 2013, INTERSPEECH 2016, ICML 2017, INTERSPEECH 2019, and Neural Information Processing Systems (NIPS) 2020. In addition, she has served as a member of the IEEE Speech and Language Processing Technical Committee (SLTC) and as an Associate Editor for the IEEE/ACM Transactions on Audio, Speech, and Language Processing. She is the recipient of the 2021 IEEE SPS Industrial Innovation Award and the 2022 IEEE SPS Signal Processing Magazine Best Paper Award. She is currently a Principal Research Scientist at Google, working on applications of deep neural networks for automatic speech recognition. She is a Fellow of the IEEE and a Fellow of ISCA.
Title: Evaluating LLMs on Languages Beyond English: Challenges and Opportunities
Abstract: The assessment of the capabilities and limitations of Large Language Models (LLMs) through evaluation has emerged as a significant area of study. In this talk, I will discuss our research over the last 1.5 years on evaluating LLMs in a multilingual context, highlighting the lessons we have learned and the general trends observed across various models. I will also discuss our recent efforts to evaluate Indic LLMs using a hybrid approach of human and LLM evaluators. Lastly, I will touch upon the challenges that remain in both advancing evaluation research and improving multilingual models.
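As an illustration of the hybrid evaluation setting mentioned in the abstract, here is a minimal sketch (not taken from the talk) of how agreement between human and LLM evaluators might be quantified, assuming both rate each model output on the same 1-5 scale; the scores and scale below are hypothetical.

```python
# Minimal sketch of quantifying human-LLM evaluator agreement.
# The 1-5 rating scale and the scores below are hypothetical.
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings for five model outputs in one target language.
human_scores = [5, 3, 4, 2, 4]
llm_scores = [5, 3, 3, 2, 4]

# Quadratic weighting penalizes a 1-vs-5 disagreement more than a 4-vs-5 one.
kappa = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"Human-LLM agreement (quadratically weighted kappa): {kappa:.2f}")
```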
Sunayana Sitaram has been a Senior Researcher at Microsoft Research India in Bangalore since 2017. Prior to that, she completed her Master's and Ph.D. at the Language Technologies Institute, Carnegie Mellon University (CMU), where she worked on various aspects of speech processing. Her Ph.D. thesis was on pronunciation modeling for speech synthesis of low-resource languages, advised by Prof. Alan W. Black. Her research interests span various aspects of speech and natural language processing, particularly for multilingual communities, and her focus at Microsoft Research is on building speech systems that can handle code-switching. She has served on the ISCA student board in the past.
Harish Arsikere is a Principal Applied Scientist at Amazon and a member of the Alexa Speech Recognition team in Bangalore. His research interests span several areas of speech technology, including acoustic modeling and adaptation for speech recognition, multilingual modeling, prosody, human-computer interaction, and end-to-end/all-neural systems. Before joining Amazon, Harish spent two years with Xerox Research Center, India, where he contributed to their speech recognition and analytics platform. He holds a Ph.D. in Electrical Engineering from the University of California, Los Angeles (UCLA) and a master's degree in Electrical Engineering from the Indian Institute of Technology Kanpur. Harish has published actively in flagship speech conferences such as INTERSPEECH and ICASSP and in reputed journals such as the Journal of the Acoustical Society of America (JASA), Speech Communication, and IEEE Signal Processing Letters.
Senior Research Scientist,
Institute for Infocomm Research (I2R), A*STAR, Singapore.
Title: Representation Learning for Speech: From Unimodal to Multimodal
Abstract: In recent years, the field of representation learning has significantly advanced our ability to process and understand speech. This talk will provide a comprehensive overview of representation learning for speech, tracing its evolution from traditional unsupervised learning methods to cutting-edge self-supervised techniques. We will begin by delving into the historical context, examining the foundational principles of unsupervised learning using neural networks. The discussion will then transition to the emergence of self-supervised learning, a paradigm shift that leverages large-scale unlabeled data to create powerful speech representations. Initially, representation learning focused on specific tasks, such as Automatic Speech Recognition (ASR). However, with the advent of foundation models, a single pretrained model can now address multiple tasks. Key milestones and models, including contrastive learning, predictive coding, and transformer-based architectures, will be highlighted. We will also discuss the practical implications of these advancements, such as improvements in ASR and in paralinguistic tasks including speaker identification, emotion detection, and deception detection. The talk will conclude with an overview of multimodal approaches, including the integration of large language models (LLMs), showcasing how these advancements are shaping the future of speech representation learning.
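To make the contrastive learning idea mentioned in the abstract concrete, below is a minimal sketch of an InfoNCE-style objective of the kind used in self-supervised speech representation learning; it is an illustrative example rather than the speaker's implementation, and the batch size and embedding dimension are hypothetical.

```python
# Illustrative InfoNCE-style contrastive loss: each context vector must
# pick out its matching target embedding among in-batch distractors.
import torch
import torch.nn.functional as F

def info_nce_loss(context, targets, temperature=0.1):
    """context, targets: (batch, dim) embeddings of paired speech frames."""
    context = F.normalize(context, dim=-1)
    targets = F.normalize(targets, dim=-1)
    logits = context @ targets.t() / temperature    # (batch, batch) similarities
    labels = torch.arange(context.size(0), device=context.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage with random stand-in embeddings (hypothetical sizes).
ctx, tgt = torch.randn(8, 256), torch.randn(8, 256)
print(info_nce_loss(ctx, tgt).item())
```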
Hardik B. Sailor has been working as a Senior Research Scientist at the Institute for Infocomm Research (I2R), A*STAR, Singapore, since May 2022. He worked as a Chief Engineer at Samsung Research Institute Bangalore (SRI-B) from March 2020 to April 2022, where he also mentored college students under Samsung's industry-academic program PRISM. Prior to this, he was a postdoctoral researcher in the Speech and Hearing Research Group at the University of Sheffield, UK (from Feb. 2019 to Feb. 2020). He completed his Ph.D. at the Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar, India, in January 2019. He was a project staff member on the consortium project "Automatic Speech Recognition for Agricultural Commodities Phase-II", sponsored by the Ministry of Electronics and Information Technology (MeitY), New Delhi, Govt. of India. At DA-IICT, he was also a project staff member on the MeitY-sponsored consortium project "Development of Text-to-Speech (TTS) Synthesis Systems for Indian Languages Phase-II". His research areas include Automatic Speech Recognition (ASR), representation learning, auditory processing, speech analysis, and sound classification. He received his M.Tech. (ICT) degree with specialization in Communication Systems from DA-IICT in January 2014. He received his B.E. degree in Electronics and Communications Engineering from Government Engineering College, Surat, India, in 2010.
Research Scientist,
Samsung Research Institute, Bengaluru, India.
Title: Speech AI: From Command Recognition to Live-Call Translation
Abstract: ASR is most commonly used in smartphone voice-enabled applications such as Bixby, Alexa, and Google Assistant. In these applications, ASR is used for command recognition. Recent advances in ASR technology make it possible to extend speech AI applications from command-level recognition to conversational speech recognition. Subsequently, Live Call Translate, Interpreter, and Transcript Assist applications have been introduced in Samsung's recent Galaxy AI. To develop such applications, several practical challenges need to be addressed, including data preparation, background noise, multi-speaker and multilingual conditions, memory, and inference time. This talk gives an overview of the incremental changes and the various challenges involved in the journey from command-based voice assistants to recent live-call translation.
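As a rough illustration of the live-call translation setting described in the abstract, the sketch below shows one common way such a system can be composed as a cascade of conversational ASR, machine translation, and TTS applied chunk by chunk; this is an assumption for illustration, not Samsung's implementation, and the component functions are hypothetical placeholders.

```python
# Illustrative cascade for live-call translation, applied chunk by chunk:
# streaming ASR -> machine translation -> TTS. The component functions
# (transcribe_chunk, translate, synthesize) are hypothetical placeholders.
def live_call_translate(audio_chunks, transcribe_chunk, translate, synthesize,
                        src_lang="ko", tgt_lang="en"):
    for chunk in audio_chunks:                        # incoming call audio
        text = transcribe_chunk(chunk, lang=src_lang)  # conversational ASR
        if not text:                                  # skip silent or empty chunks
            continue
        translated = translate(text, src=src_lang, tgt=tgt_lang)
        yield synthesize(translated, lang=tgt_lang)   # audio to play to the other party
```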
Vikram C. Mathad received the B.E. degree in Electronics and Communication Engineering from PESIT, South Campus, Bengaluru, India, in 2011, the M.Tech. degree in Biomedical Signal Processing and Instrumentation from the Sri Jayachamarajendra College of Engineering, Mysuru, India, in 2013, and a Ph.D. degree in Electronics and Electrical Engineering from the Indian Institute of Technology Guwahati, India, in 2019. He was a Postdoctoral Researcher with the College of Health Solutions, Arizona State University, Tempe, AZ, USA, and a Research Scientist with Zapr Media Labs, Bengaluru, India. Presently, he is with Samsung Research Institute Bangalore. His research interests include speech signal processing, biomedical signal processing, and machine learning.
Senior Research Scientist,
Sony Research, India.
Title: Evolution of Speech Foundation Models and Their Applications in Speech AI
Abstract: In the last couple of years, Large Language Models (LLMs) have brought remarkable advancements to the field of Generative AI, enabling machines to comprehend and generate human-like text. As a result, the research focus has shifted to developing foundation models for the vision and speech modalities. In this talk, we will discuss various aspects of speech foundation models that are revolutionizing the landscape of speech recognition and synthesis for future applications of Speech AI. We will specifically discuss the core principles and transformative potential of speech foundation models in Speech AI. We will explore how these models are pre-trained on vast amounts of speech data to learn contextually relevant speech representations, and their potential applications in speech-to-text, text-to-speech synthesis, language translation, and more. Finally, we will examine the challenges and future directions in the development of Speech GPT, such as improving robustness to diverse accents and dialects, mitigating biases, and addressing the ethical considerations surrounding AI-driven speech technologies.
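As a small illustration of reusing a pretrained speech foundation model for a downstream task, the sketch below assumes the Hugging Face transformers library and a public Whisper checkpoint; the model name and audio file are example choices, not ones specified in the talk.

```python
# Illustrative reuse of a pretrained speech foundation model for ASR,
# assuming the Hugging Face transformers library is installed.
from transformers import pipeline

# "openai/whisper-small" is an example public checkpoint.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("sample_utterance.wav")   # hypothetical local audio file
print(result["text"])
```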
Nirmesh J. Shah received his Ph.D. in voice conversion from DA-IICT, Gandhinagar, in 2019. He has been working as a Senior Research Scientist at Sony Research India since 2020, where he primarily leads speech-related activities. He has received several awards at Sony Research India, including Director's Most Valuable Performer and Star Team Player, for his numerous contributions to several projects over the past four years. He was elevated to IEEE Senior Member in 2023. He has published 30+ research papers in top conferences and peer-reviewed journals. He has been serving as a reviewer for many IEEE journals and top conferences, including ICASSP and INTERSPEECH. He received IEEE SPS travel grants to present his research papers at ICASSP 2014 and ICASSP 2017, and an ISCA travel grant to present his research paper at MLSLP 2018, a satellite event of INTERSPEECH 2018. He completed his Ph.D. and Master's in the domain of speech synthesis under the guidance of Prof. Hemant A. Patil at DA-IICT, Gandhinagar. During his Master's and Ph.D., he was also associated with the consortium project on the Development of Text-to-Speech (TTS) Systems in Indian Languages, Phase-II, from May 2012 to December 2015. He has contributed as a volunteer for most of the summer/winter schools organized by the Speech Research Lab at DA-IICT, Gandhinagar.
Principal Scientist,
TCS Research & Innovations Labs - Mumbai
Title: Audio & Speech Processing | What have we been doing?
Abstract: The availability of pre-trained acoustic models has narrowed the boundary between a speech researcher and a speech solution developer. Today, thanks to the wide availability of seemingly robust pre-trained models, it is not necessary to understand the fundamentals of speech production or speech perception to build meaningful speech solutions, be it for speech-to-text, text-to-speech, or anything in between that might be required to build a solution operating primarily on a speech signal. In this talk, we will dwell on some recent and a few ongoing activities in the Audio & Speech Processing team. To give a 360-degree view of the things we do, we will not restrict ourselves to automatic speech recognition alone.
Dr. Sunil Kumar Kopparapu, a Senior Member of IEEE and ACM in India, holds a doctoral degree in Electrical Engineering from IIT Bombay, India. His career has spanned roles at CSIRO, Australia, and Aquila Technologies Pvt. Ltd., India, before he joined Tata Infotech Limited's Cognitive Systems Research Lab (CSRL). Presently, as a Principal Scientist at TCS Research & Innovations Labs - Mumbai, he focuses on speech, script, and natural language processing, aiming to create practical solutions for Indian conditions. His contributions include co-authored books, patents, and publications.
Chief Executive Officer, Digital India Bhashini Division (BHASHINI),
Digital India Corporation,
Ministry of Electronics & Information Technology (MeitY), New Delhi, INDIA.
Title: Audio & Speech Processing | What have we been doing?
Abstract: BHASHINI aims to transcend language barriers, ensuring that every citizen can effortlessly access digital services in their own language. Using voice as a medium, BHASHINI has the potential to bridge the language as well as the digital divide. Launched by Honourable PM Shri Narendra Modi in July 2022 under the National Language Translation Mission, BHASHINI aims to provide technology translation services in 22 scheduled Indian languages. BHASHINI provides a full array of services to overcome language barriers and improve accessibility. It comprises an easy-to-use web service portal, a mobile app in beta version, a dataset repository, specialized services for Speech to Speech, Text to Text, Speech to Text, OCR, and Transliteration, and a crowd-sourcing platform known as Bhasha Daan, which invites active data contributions for AI model training. This diverse strategy promotes inclusion, supports different languages, and encourages creativity in products and services.
Amitabh Nag, in his role as the CEO of Digital India's Bhashini Division, holds the responsibility of spearheading the implementation of the National Language Translation Mission (NLTM) aimed at breaking language barriers. With an extensive career spanning more than 30 years, he is a seasoned business leader with expertise in Business Management, Sales and Marketing, and Project Execution within the Information Technology sector. His track record includes successfully managing business units responsible for executing IT-driven transformation programs. Amitabh Nag has a rich professional history, with notable tenures at renowned companies such as Coforge, HP Inc., and TCS. During his career, he has consistently demonstrated his ability to oversee large project deliveries, collaborate with cross-functional teams, and work in partnership with external stakeholders to achieve successful outcomes.
Speech Recognition Engineer,
Augnito, Mumbai, India.
Title: Advancements in Multi-Accent and Noise Robust ASR Using Semi-Supervised Learning and Multimodal Approaches
Abstract: In this talk, we explore cutting-edge techniques and advancements in the realm of ASR, focusing on multi-accent ASR, noise robustness, semi-supervised data generation, and multimodal integration. We begin by delving into the challenges posed by diverse accents and environmental noise in ASR. Leveraging recent developments in semi-supervised learning, we discuss novel approaches to efficiently generate labeled data and improve model performance across varying accent distributions. Next, we address the critical issue of noise robustness in ASR systems. Drawing on insights from recent research, we examine how multimodal approaches, integrating audio and contextual information, enhance the robustness of ASR models in noisy environments. We discuss methods such as noisy language embedding and multimodal pretraining, which enable ASR systems to maintain accuracy even under challenging acoustic conditions. Furthermore, we explore the emerging field of ambient speech recognition, where the goal is to transcribe speech from everyday environments with high accuracy. We analyze recent advancements in this area, including the integration of large language models (LLMs) and the adaptation of transformer architectures for real-time ASR tasks. Finally, we reflect on the broader implications of these technological advancements, particularly in domains such as ultrasound radiology report generation, where accurate and timely transcription of medical professionals' speech is crucial. By the end of the talk, attendees will gain a comprehensive understanding of the state-of-the-art techniques driving the evolution of ASR systems across diverse applications.
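As a concrete example of the semi-supervised data generation mentioned in the abstract, here is a minimal sketch of confidence-filtered pseudo-labeling; it is illustrative only, and the seed_model interface (returning a hypothesis and a confidence score) is hypothetical rather than the speaker's pipeline.

```python
# Illustrative confidence-filtered pseudo-labeling for semi-supervised
# ASR data generation. `seed_model.transcribe` is a hypothetical interface
# returning a (hypothesis, confidence) pair for an audio file.
def generate_pseudo_labels(seed_model, unlabeled_audio_paths, threshold=0.9):
    pseudo_labeled = []
    for path in unlabeled_audio_paths:
        text, confidence = seed_model.transcribe(path)
        if confidence >= threshold:            # keep only confident hypotheses
            pseudo_labeled.append((path, text))
    return pseudo_labeled                      # (audio, transcript) pairs for training
```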
Dipesh K. Singh earned his B.E. degree in Electrical and Electronics Engineering in 2018. He completed his M.Tech. degree in Information and Communication Technology from the Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar, in 2021. Currently, he is a Speech Recognition Engineer at Augnito, Mumbai, India, specializing in healthcare. Dipesh was a pivotal member of the team that achieved top performance in the language diarization task of the DISPLACE challenge held during INTERSPEECH 2023 in Dublin, Ireland. His research interests include speech processing for healthcare, children's ASR, and voice privacy. He has authored numerous papers published in journals and conferences.
Microsoft Research Bengaluru, India.
Title: Voice Privacy in the Age of AI and Big Data
Abstract: In the era of artificial intelligence (AI) and big data, the protection of voice privacy has become a critical issue. As voice-activated technologies and automatic speech recognition (ASR) systems proliferate, the risk of unauthorized access to and misuse of voice data has escalated. This talk will explore the multifaceted dimensions of voice privacy, addressing the unique challenges posed by voice data compared to other personal information. A significant focus will be on anonymization techniques for voice. We will delve into methods such as voice synthesis, perturbation, and obfuscation, which can effectively anonymize speech while maintaining its utility for applications such as virtual assistants and automated transcription services. We will discuss the strengths and limitations of these techniques, and how they can be integrated into existing systems to protect individual privacy without compromising functionality. This talk will give insights into the balance between innovation and privacy, and the roles that individuals, organizations, and policymakers must play to protect voice privacy in an increasingly connected world.
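To illustrate the perturbation family of techniques mentioned in the abstract, the following is a minimal sketch, assuming the librosa and soundfile libraries and a hypothetical input recording; a simple pitch shift masks some speaker characteristics, whereas practical anonymization systems are considerably more elaborate.

```python
# Illustrative perturbation-based anonymization: a simple pitch shift.
# "input.wav" is a hypothetical recording; the shift of 4 semitones is arbitrary.
import librosa
import soundfile as sf

audio, sr = librosa.load("input.wav", sr=16000)                  # load mono audio at 16 kHz
shifted = librosa.effects.pitch_shift(audio, sr=sr, n_steps=4)   # raise pitch by 4 semitones
sf.write("anonymized.wav", shifted, sr)                          # save the perturbed speech
```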
Gauri Prajapati earned her M.Tech. in ICT from the Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT) in 2021. Presently, she serves as a Data Scientist at Microsoft Research Bengaluru, India. Her academic journey was marked by contributions to voice privacy initiatives at the Speech Research Lab, DA-IICT. This active involvement led to numerous research papers presented at international conferences, journal publications, and book chapters. Her research interests encompass a wide spectrum within speech technology, including Interactive Voice Response (IVR), voice biometrics, voice privacy, and multimodal sentiment analysis.
Digital India Corporation,
Ministry of Electronics & Information Technology (MeitY), New Delhi, INDIA.
Title: Audio & Speech Processing | What have we been doing?
Abstract: BHASHINI aims to transcend language barriers, ensuring that every citizen can effortlessly access digital services in their own language. Using voice as a medium, BHASHINI has the potential to bridge the language as well as the digital divide. Launched by Honourable PM Shri Narendra Modi in July 2022 under the National Language Translation Mission, BHASHINI aims to provide technology translation services in 22 scheduled Indian languages. BHASHINI provides a full array of services to overcome language barriers and improve accessibility. It comprises an easy-to-use web service portal, a mobile app in beta version, a dataset repository, specialized services for Speech to Speech, Text to Text, Speech to Text, OCR, and Transliteration, and a crowd-sourcing platform known as Bhasha Daan, which invites active data contributions for AI model training. This diverse strategy promotes inclusion, supports different languages, and encourages creativity in products and services.
Ajay Rajawat is a professional specializing in AI/ML, Natural Language Processing, and Innovation Management. Since joining the Digital India Bhashini Division as a Manager in 2023, he has been dedicated to advancing the innovation ecosystem in India and breaking language barriers through technology. With extensive experience serving multiple government organizations, he focuses on propelling India's innovation landscape forward. His efforts are centered on building Bhashini as a key element of national growth, ensuring that digital services are accessible in multiple Indian languages. He has a proven track record of managing national-level programs such as the Smart India Hackathon. His work with the Innovation Cell has been pivotal in promoting innovation and the startup ecosystem in India. He has been responsible for managing IT systems and cloud services for the Ministry's Innovation Cell, designing and scaling Hackathon-as-a-Service (HaaS) systems for both national and international hackathons, and collaborating with ministries to identify IT and data analysis challenges and conduct hackathons to address them. Furthermore, he has nurtured startups in the NLP ecosystem, bringing in elite use cases such as the live speech-to-speech solution VaniAnuvaad. His passion lies in fostering a robust startup ecosystem and driving innovation to new heights.