Hynek Hermansky
Professor Emeritus,
Department of Electrical and Computer Engineering, Johns Hopkins University, USA.
Title: Why should we ask why?
Abstract: We often present advances in automatic speech recognition (ASR) by describing the most successful configuration of available open-source software processing modules, sometimes adding new elements, and reporting the accuracy of the obtained results. So, what is being reported to the community is HOW the work was done and WHAT the output has been. That is understandable, since reviewers evaluate our papers by checking whether the work is replicable (the HOW element) and whether progress is demonstrated (the WHAT element). However, one can argue that more scientific progress could be made if the report also contained an explanation of WHY the processing was effective. Some attempts to follow this advice in our own work are discussed in the talk.
Hynek Hermansky received an M.S. in Electrical Engineering (1972) from the Technical University of Brno, Czech Republic, and a Ph.D. in Electrical Engineering (1983) from the University of Tokyo, Japan. He has been at the forefront of groundbreaking research in human hearing and speech technology for more than three decades, both in industrial research labs and in academia. The main focus of Hermansky's research is on using bio-inspired methods to recognize information in speech-related signals. Hermansky currently holds the position of Research Professor at the Brno University of Technology in the Czech Republic. He is Julian S. Smith Professor Emeritus at Johns Hopkins University, where for ten years he led an internationally acclaimed group of Johns Hopkins faculty, students, and visiting researchers at the Center for Language and Speech Processing (CLSP), which comprises one of the largest and most prestigious speech- and language-oriented academic groups in the world. His past affiliations include Director of Research at the IDIAP Research Institute, Martigny, Switzerland (2003-2008), Titular Professor at the Swiss Federal Institute of Technology in Lausanne, Switzerland (2005-2008), Professor at the Oregon Health and Science University (previously Oregon Graduate Institute), Senior Member of the Research Staff at U.S. WEST Advanced Technologies in Boulder, CO, and Research Engineer at Panasonic Technologies in Santa Barbara, California.
His achievements include more than 300 peer-reviewed papers with more than 20,000 citations, and 13 patents, with another eight applications pending on topics such as a method for identifying keywords in machine recognition of speech based on the detection and classification of sparse speech sound events, a system for speech recognition on cell phones, and an auditory model for detecting speech corrupted by background noise. Hermansky's scientific contributions were recognized by the Institute of Electrical and Electronics Engineers (IEEE), which awarded him the 2021 James L. Flanagan Speech and Audio Processing Medal, and by the International Speech Communication Association (ISCA), which awarded him its highest honor, the Medal for Scientific Achievement, in 2013. Hermansky's service to the field is extensive and noteworthy. He is a Life Fellow of the IEEE, a Fellow of ISCA, and an External Fellow of the International Computer Science Institute. Highly sought after by industry for his expertise, he is a current member of the advisory board of Germany's Hearing4All Scientific Consortium Center of Excellence in Hearing Research, and he has served on advisory boards for Amazon, Audience, Inc., and VoiceBox Inc. His professional memberships include IEEE and ISCA, where he was twice elected as a board member. He is a member of the editorial board of Speech Communication, was an associate editor for the IEEE Transactions on Speech and Audio Processing, and is a former member of the editorial board of Phonetica. Hermansky serves in leadership roles for the field's key workshops and conferences, presents invited lectures and keynote presentations around the globe, and has lectured worldwide as a Distinguished Lecturer for both ISCA and IEEE. Hermansky was the General Chair of INTERSPEECH 2021 in Brno, Czech Republic, a General Chair of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), and chair of the technical committee for ICASSP 2000. In addition to leading several Hopkins CLSP workshops, he was also on the organizing committees of ASRU 2017, ASRU 2013, and ASRU 2005, for ten years was the executive chair of the annual ISCA-sponsored workshops on Text, Speech, and Dialogue in the Czech Republic, and was a tutorial speaker at INTERSPEECH 2015.
Bhuvana Ramabhadran
Speech Recognition Researcher,
Google Research, USA.
Bhuvana Ramabhadran received her Ph.D. degree in electrical engineering from the University of Houston. Currently, she leads a team of researchers at Google focusing on semi-supervised learning for speech recognition and multilingual speech recognition. Previously, she was a Distinguished Research Staff Member and Manager in IBM Research AI at the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, where she led a team of researchers in the Speech Technologies Group and coordinated activities across IBM's worldwide laboratories in the areas of speech recognition, synthesis, and spoken term detection. She served as the principal investigator on two major international projects, the National Science Foundation (NSF) sponsored Multilingual Access to Large Spoken Archives (MALACH) project and the European Union (EU) sponsored TC-STAR project, and was IBM's lead for the Spoken Term Detection evaluation in 2006. She was responsible for acoustic and language modeling research for both commercial and government projects, ranging from voice search and transcription tasks to spoken term detection in multiple languages and expressive synthesis for IBM Watson. She served two terms since 2010 as an elected member of the IEEE Signal Processing Society (SPS) Speech and Language Technical Committee (SLTC), most recently 2015-2017, as its elected Vice Chair and Chair (2014-2016), and currently serves as an Advisory Member. She has served as an Area Chair for ICASSP (2011-2018) and INTERSPEECH (2012, 2014-2016), on the editorial board of the IEEE Transactions on Audio, Speech, and Language Processing (2011-2015), and on the IEEE SPS conference board (2017-2018), during which she also served as the conference board's liaison with the ICASSP organizing committees, and as Regional Director-At-Large (2018-2020), where she coordinated work across all US IEEE chapters. She also organized the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) in 2011.
She currently serves as the Chair of the IEEE Flanagan Speech & Audio Award Committee and as a Member-at-Large of the IEEE SPS Board of Governors (BoG). She serves on the International Speech Communication Association (ISCA) board in her capacity as ISCA Vice President (2023-2025). In addition to organizing several workshops at the International Conference on Machine Learning (ICML), HLT-NAACL, and Neural Information Processing Systems (NIPS), she has also served as an adjunct professor at Columbia University, where she co-taught a graduate course on speech recognition. She has served as the (Co-)Principal Investigator on several projects funded by the NSF, EU, and Intelligence Advanced Research Projects Activity (IARPA), spanning speech recognition, information retrieval from spoken archives, and keyword spotting in many languages. She has published over 150 papers and been granted over 40 U.S. patents. Her research interests include speech recognition and synthesis algorithms, statistical modeling, signal processing, and machine learning. Some of her recent work has focused on the use of speech synthesis to improve core speech recognition performance and on self-supervised learning. She is a Fellow of the IEEE and a Fellow of ISCA.
Mathew Magimai Doss
Senior Researcher,
Idiap Research Institute, Martigny, Switzerland.
Title: Fundamentals of Automatic Speech Recognition – A Symbolic Perspective
Abstract: Over the decades, automatic speech recognition (ASR) approaches have evolved from more knowledge-driven to data-driven. A question that arises is whether these approaches are really so different from each other. In this series of presentations, I will present a symbolic perspective on the ASR problem through which I will provide links between the (a) knowledge-based approach, (b) instance-/template-based approach, and (c) statistical ASR approach, and show that these approaches are not all that different.
Talk 1: In this lecture, I will present an abstract formulation of the ASR problem, in which ASR can be seen as a combination of language generation (generation of word hypotheses) and matching of word hypotheses with the observed speech signal. Based on that formulation, I will elucidate the knowledge-based and instance-based approaches and discuss their relevance in the deep learning-based ASR era.
Talk 2: This lecture will extend the abstract formulation to the statistical ASR approach. In that direction, I will focus on the Bayesian formulation of the ASR problem and will then largely deal with the likelihood-based ASR approach, more precisely, the hidden Markov model-based approach. I will delve into different aspects such as (a) different types of statistical estimators, (b) pronunciation modeling, and (c) end-to-end learning.
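For reference, the Bayesian (maximum a posteriori) decoding rule that underlies the statistical formulation discussed in this lecture can be written as follows; this is the standard textbook form, not material taken from the talk itself:

```latex
% MAP decoding rule for statistical ASR: choose the word sequence W*
% that maximizes the posterior probability of W given acoustics X.
\begin{align}
  W^{*} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{p(X \mid W)\, P(W)}{p(X)}
        = \arg\max_{W} p(X \mid W)\, P(W),
\end{align}
% where p(X | W) is the acoustic (likelihood) model, e.g. an HMM,
% P(W) is the language model, and p(X) does not depend on W.
```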
Talk 3: The third lecture will continue with the statistical ASR approach, where the focus will be on the "posterior-based" approach. I will present an HMM-based approach in which the HMM states are parameterized by categorical distributions. I will demonstrate how such an approach allows us (a) to overcome some of the limitations of the conventional HMM-based approach (presented in Talk 2), (b) to unify the instance-based and HMM-based approaches, (c) to model different types of subword units and phonological representations, (d) to deal with data-scarcity issues, and (e) to holistically deal with speech recognition and speech assessment.
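As a rough illustration of what a posterior-based local score can look like, the sketch below compares a neural network's phone posterior vector for one frame against the categorical distribution attached to an HMM state using a KL divergence; this is one common choice and is not necessarily the exact criterion used in these lectures:

```python
# Toy posterior-based local score: KL divergence between an HMM state's
# categorical distribution and a DNN posterior vector for one frame.
import numpy as np

def kl_local_score(posterior, state_distribution, eps=1e-10):
    """D(state || posterior); lower values indicate a better match."""
    p = np.clip(state_distribution, eps, 1.0)
    q = np.clip(posterior, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

state = np.array([0.80, 0.15, 0.05])   # categorical parameters of one state
frame = np.array([0.70, 0.20, 0.10])   # DNN posterior over the same 3 classes
print(kl_local_score(frame, state))
```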
Mathew Magimai Doss received the Bachelor of Engineering (B.E.) in Instrumentation and Control Engineering from the University of Madras, India in 1996; the Master of Science (M.S.) by Research in Computer Science and Engineering from the Indian Institute of Technology, Madras, India in 1999; and the PreDoctoral diploma and the Docteur ès Sciences (Ph.D.) from the Ecole polytechnique fédérale de Lausanne (EPFL), Switzerland in 2000 and 2005, respectively. He was a postdoctoral fellow at the International Computer Science Institute (ICSI), Berkeley, USA from April 2006 until March 2007. He is now a Senior Researcher at the Idiap Research Institute, Martigny, Switzerland. He is also a lecturer at EPFL. His main research interests lie in signal processing, statistical pattern recognition, artificial neural networks, and computational linguistics, with applications to speech and audio processing, sign language processing, and multimodal signal processing. He is a member of IEEE, ISCA, and Sigma Xi. He is an Associate Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing. He is the coordinator of the SNSF Sinergia project SMILE-II, which focuses on continuous sign language recognition, generation, and assessment. He was the coordinator of the recently completed H2020 Marie Sklodowska-Curie Actions ITN-ETN project TAPAS, which focused on pathological speech processing. He has published over 180 journal and conference papers. The Speech Communication paper "End-to-end acoustic modeling using convolutional neural networks for HMM-based automatic speech recognition", co-authored by him and published in 2019, received the 2023 EURASIP Best Paper Award for the Speech Communication journal and the ISCA Award for the Best Paper published in Speech Communication (2017-2021). The INTERSPEECH 2015 paper "Objective Intelligibility Assessment of Text-to-Speech Systems through Utterance Verification" received one of the three best student paper awards.
Chng Eng Siong
Associate Professor,
Nanyang Technological University (NTU), Singapore.
Title Talk 1: The NTU Speech Team's Experience in Adapting Whisper
Abstract: Whisper is a speech recognition model released by OpenAI at the end of 2022. It is now one of the most impactful transformer-based models for the speech community. Whisper was trained on 680K hours of data and is capable of transcribing speech in 96 languages and translating these languages into English. Due to its open-sourced nature and state-of-the-art (SOTA) performance, the speech research community has widely adopted Whisper as a foundation model. Researchers have further enhanced it for various applications, including adaptation to accented, under-resourced, and code-switched speech, as well as for streaming and real-time transcription. In this talk, we present our team's efforts at Nanyang Technological University (NTU) to leverage the Whisper model and transformer architectures for enhancing automatic speech recognition (ASR) capabilities. Specifically, we focus on the following contributions: 1) Code-switched speech recognition: by fine-tuning and modifying the language prompts in Whisper, our team demonstrated the model's ability to perform code-switched transcription, achieving state-of-the-art results on the SEAME (South-East Asia Code-switch Mandarin English) corpus. Our experimental results show that Whisper can effectively handle code-switching between multiple languages within the same utterance. 2) Speaker-aware decoding: the vanilla Whisper model is speaker-agnostic, designed to be robust against variations in speaker identity, accent, and noise. However, research indicates that recognition accuracy can be further improved with target-speaker information. For instance, using speaker adaptation data or speaker identity allows for fine-tuning or conditioning the model. In our previous work, we demonstrated that incorporating a speaker-identity vector into the transformer encoder's key-value input makes the model speaker-aware. Experiments on the LibriSpeech, Switchboard, and AISHELL-1 ASR tasks showed that our proposed model achieved relative word error rate (WER) reductions of 4.7% to 12.5%.
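As a minimal sketch of the kind of language prompting that is being adapted in this line of work (using the Hugging Face transformers interface to Whisper; the checkpoint name and settings are illustrative, and the code-switching fine-tuning described in the talk goes well beyond this):

```python
# Minimal sketch: transcription with a pre-trained Whisper checkpoint via
# Hugging Face transformers, setting the decoder language/task prompt.
# Fine-tuning for code-switched speech would additionally update the weights.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def transcribe(audio, language="en"):
    """`audio` is a 16 kHz mono waveform as a 1-D float numpy array."""
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    forced_ids = processor.get_decoder_prompt_ids(language=language,
                                                  task="transcribe")
    with torch.no_grad():
        ids = model.generate(inputs.input_features,
                             forced_decoder_ids=forced_ids)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```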
Title Talk 2: Enabling LLMs for ASR
Abstract: Decoder-only LLMs such as ChatGPT were originally developed to accept only text as input. Recent advances have extended them to other modalities, such as audio, video, and images. Our focus in this talk is the integration of the speech modality into LLMs. For this task, the research community has proposed various innovative approaches, e.g., applying discrete representations, integrating pre-trained encoders into existing LLM decoder architectures (e.g., Qwen), multitask learning, and multimodal pretraining. In the talk, I will (a) review recent approaches to the ASR task using LLMs, and (b) introduce two works from NTU's speech lab on this task: i) "Hyporadise": applying an LLM to the N-best hypotheses generated by traditional ASR models to improve the top-1 ASR transcription. Our results show that the LLM not only exceeds the performance of traditional LM rescoring, but can also recover and generate correct words not found in the N-best hypotheses; we call this ability GER (Generative Error Correction). ii) Leveraging LLMs for ASR and noise-robust ASR: in this work, we extend the Hyporadise approach to incorporate hypothesis (language-level) noise information into the LLM. Our insight is that under low-SNR conditions, the N-best hypotheses are more diverse due to higher decoding uncertainty. This diversity can be captured and represented as an embedding vector called the noisy language embedding, which can then be exploited as a prompt. With fine-tuning on a training set, the LLM shows improved performance on the GER task.
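A minimal sketch of the prompt construction behind generative error correction over an N-best list; the actual Hyporadise templates, the noisy language embedding, and the LLM fine-tuning described above are different and more involved, and the LLM call itself is left abstract here:

```python
# Toy generative error correction (GER) prompt built from an N-best list.
# The resulting string would be passed to an instruction-following LLM;
# the real Hyporadise templates and fine-tuning differ from this sketch.
from typing import List

def build_ger_prompt(nbest: List[str]) -> str:
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (
        "Below are N-best hypotheses from a speech recognizer for one "
        "utterance. They may contain errors. Infer the most likely true "
        "transcription, even if it contains words absent from every "
        "hypothesis, and output only that transcription.\n"
        f"{hyps}\nTranscription:"
    )

nbest = ["the cat sat on the mat",
         "the cats sat on the mat",
         "the cat sad on the mat"]
print(build_ger_prompt(nbest))
```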
Chng Eng Siong is currently an Associate Professor in the College of Computing and Data Science (CCDS) at Nanyang Technological University (NTU) in Singapore. Prior to joining NTU in 2003, he worked at Knowles Electronics (USA), Lernout and Hauspie (Belgium), the Institute for Infocomm Research (I2R) in Singapore, and RIKEN in Japan. He received a PhD and a BEng (Hons) from the University of Edinburgh, U.K., in 1996 and 1991, respectively, specializing in digital signal processing. His areas of expertise include speech research, large language models, machine learning, and speech enhancement. He currently serves as the Principal Investigator (PI) of the AI-Singapore Speech Lab from 2023 to 2025. Throughout his career, he has secured research grants from various institutions, including the Alibaba ANGEL Lab, NTU-Rolls Royce, MINDEF, MOE, and A*STAR. These grants, totaling over S$18 million, were awarded under the "Speech and Language Technology Program (SLTP)" in the School of Computer Science and Engineering (SCSE) at NTU. In recognition of his expertise, he was awarded the Tan Chin Tuan fellowship in 2007 to conduct research at Tsinghua University in Prof. Fang Zheng's lab. Additionally, he received a JSPS travel grant award in 2008 to visit the Tokyo Institute of Technology in Prof. Furui's lab. He has supervised the graduation of over 19 PhD students and 13 Master's students. His publication record includes 2 edited books and over 200 journal and conference papers. He has also contributed to the academic community by serving as the publication chair for 5 international conferences, including Human Agent Interaction 2016, INTERSPEECH 2014, APSIPA 2010, APSIPA 2011, and ISCSLP 2006. Furthermore, he has served on the organizing committees of ASRU 2019 (Singapore), ICAICTA 2024 (General Co-chair), and SLT 2024 (General Co-chair).
Srikanth Madikeri
Lecturer,
University of Zurich, Switzerland.
Title Talk 1: Loss functions for Training Automatic Speech Recognition Systems
This talk will focus on commonly used loss functions for training neural network-based speech recognition systems in a supervised fashion. We will cover the fundamental cross-entropy loss, the popular connectionist temporal classification (CTC) loss, and discriminative training with MMI. Finally, we will look at the most recent Transducer-based approach to training ASR. The goal of this talk is to build on the content from other lectures and to understand the similarities and differences among these approaches.
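As a concrete reference point for one of the losses covered here, a minimal PyTorch sketch of the CTC loss on random data (toy shapes only; in practice the log-probabilities come from the acoustic model):

```python
# Minimal PyTorch sketch of the CTC loss on random data.
import torch
import torch.nn as nn

T, N, C, S = 50, 4, 20, 12   # frames, batch, classes (incl. blank), target len

log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, S), dtype=torch.long)   # 0 is the blank label
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(5, S + 1, (N,), dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()              # gradients would flow into the acoustic model
print(loss.item())
```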
Title Talk 2: Publicly available open-source ASR models and their applications
Nowadays, the most common approach to training an ASR system for a custom domain or a new language involves fine-tuning a bootstrap model trained on thousands to millions of hours of data in a supervised, weakly supervised, or self-supervised fashion. In this talk, we will look at the different open-source options available for such bootstrapping, such as wav2vec 2.0, HuBERT, and WavLM, and understand their architectural differences, advantages, and limitations.
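A minimal sketch of how one of these publicly available models can be loaded as a feature extractor (Hugging Face transformers interface; the checkpoint name is one of several available options, and fine-tuning for ASR would add a CTC or Transducer head on top):

```python
# Minimal sketch: extract frame-level representations from a pre-trained
# wav2vec 2.0 checkpoint using Hugging Face transformers.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

waveform = torch.randn(16000).numpy()   # 1 s of 16 kHz audio (placeholder)
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, n_frames, 768)
print(hidden.shape)
```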
Title Talk 3: Training ASR with limited resources
In this talk, we will look at different strategies, often complementary to each other, for leveraging pre-trained acoustic models under limited-resource conditions. We will address situations involving hardware and data constraints. First, we will look at semi-supervised learning for low-resource conditions. To address hardware constraints, we will introduce parameter-efficient fine-tuning methods such as Low-Rank Adapters (LoRA) and their variants.
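A minimal sketch of parameter-efficient fine-tuning with Low-Rank Adapters using the peft library; the base model, target modules, and rank below are illustrative choices, not a recipe from the talk:

```python
# Minimal sketch: attach Low-Rank Adapters (LoRA) to a pre-trained model
# with the peft library so that only the small adapter matrices are trained.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_cfg = LoraConfig(
    r=8,                                   # adapter rank
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # only the LoRA weights are trainable
# ...then fine-tune `model` with the usual training loop or Trainer.
```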
Srikanth Madikeri obtained his Ph.D. in Computer Science and Engineering from the Indian Institute of Technology (IIT) Madras, India, in 2013. During his Ph.D., he worked on automatic speaker recognition and spoken keyword spotting. He worked as a postdoctoral researcher and research associate at the Idiap Research Institute (Martigny, Switzerland) in the Speech Processing group from 2013 to 2024. He is currently a Lecturer in Language Technology at the Department of Computational Linguistics, University of Zurich, Switzerland. His research interests include automatic speech recognition for low-resource languages, automatic speaker and language recognition, speaker diarization, and spoken dialog systems.
Bayya Yegnanarayana
INSA Honorary Scientist,
International Institute of Information Technology, Hyderabad, India.
Title: Processing Phase of Speech Signals
Abstract: In this somewhat provocative talk, I would like to discuss the need to process only the phase spectra of signals in general, and of speech signals in particular. I will show that the phase spectrum has all the information, whereas the magnitude spectrum has limited information, which in principle can be derived from the phase spectrum. In order to understand and exploit the phase spectral information in signals, it is necessary to obtain the true phase without wrapping. Recently, I have proposed a method to obtain the phase without the need for phase unwrapping. I will show that many of the speech production features can be derived from the phase representation of the speech signal. It appears that speech information needs to be represented only through the phase spectrum, rather than through the magnitude spectrum, for most speech applications. I will give the necessary signal processing background to appreciate the points I will be making in this talk. Most of this work is not published yet, but I would like to take this opportunity to introduce these new ideas for the first time to the workshop audience.
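To make the wrapping issue concrete, the sketch below computes the wrapped FFT phase of a short frame and the conventional unwrapped version; this only illustrates the standard difficulty and is not the (unpublished) method referred to in the abstract:

```python
# Conventional view of the problem: the FFT yields the phase only modulo
# 2*pi (wrapped); np.unwrap is the usual post-hoc fix. This is NOT the
# speaker's unpublished method; it only illustrates the wrapping issue.
import numpy as np

fs = 16000
t = np.arange(0, 0.02, 1 / fs)                      # 20 ms frame
x = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

X = np.fft.rfft(x * np.hamming(len(x)))
magnitude = np.abs(X)                               # magnitude spectrum
wrapped_phase = np.angle(X)                         # restricted to (-pi, pi]
unwrapped_phase = np.unwrap(wrapped_phase)          # conventional unwrapping

print(wrapped_phase[:5])
print(unwrapped_phase[:5])
```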
Bayya Yegnanarayana is currently an INSA Honorary Scientist at IIIT Hyderabad. He was Professor Emeritus at the BITS-Pilani Hyderabad Campus during 2016. He was an Institute Professor from 2012 to 2016 and Professor & Microsoft Chair from 2006 to 2012 at IIIT Hyderabad. He was a professor at IIT Madras (1980 to 2006), a visiting associate professor at Carnegie Mellon University, Pittsburgh, USA (1977 to 1980), and a member of the faculty at the Indian Institute of Science (IISc), Bangalore (1966 to 1978). His research interests are in signal processing, speech, image processing, and neural networks. He is a Fellow of INAE, INSA, IASc, IEEE, ISCA, and APAS, and a Life Fellow of IIT Kharagpur. He has published over 400 papers and supervised over 80 theses at the Master's and PhD levels. He is also an adjunct faculty member at IIT Tirupati.
Samudravijaya K
Professor,
Koneru Lakshmaiah Education Foundation, Vaddeswaram, Andhra Pradesh, India.
Title Talk 1: An Overview of Traditional Approaches to Automatic Speech Recognition
Over the past decade, deep learning-based approaches to speech recognition have yielded excellent results. However, deep learning models lack transparency, unlike traditional models, where the decision-making process is explainable. In this talk, I will give an overview of the salient features of traditional models such as DTW (sequential pattern matching), GMM and HMM (statistical models), and DNN-HMM (hybrid models).
Title Talk 2: ASR of low resource Indian languages
Training deep learning models generally requires hundreds of hours of annotated speech data. In India, there are 121 languages, each spoken by at least 10,000 speakers. Many of these languages lack the speech data needed to train deep learning models. However, there are significant similarities among the phones of these languages. Exploiting such similarities via a 'common label set' for the phonemes of Indian languages permits us to use the speech data of many Indian languages to train the acoustic model of a low-resource Indian language. I will talk about one such attempt, the 'Indian Language Speech Label', and its utility in building speech systems for Indian languages.
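A minimal sketch of the common-label-set idea; the phone symbols and mappings below are hypothetical placeholders, not the actual 'Indian Language Speech Label' inventory discussed in the talk:

```python
# Toy illustration of pooling transcriptions across languages via a common
# label set; the symbols and mappings are hypothetical placeholders.
COMMON_LABELS = {
    # (language, native phone label) -> common label
    ("hindi", "k"): "k",
    ("tamil", "k"): "k",
    ("telugu", "k"): "k",
    ("hindi", "aa"): "aa",
    ("tamil", "aa"): "aa",
}

def to_common_labels(language, phone_sequence):
    """Map a language-specific phone sequence onto the common label set,
    so that data from many languages can train a single acoustic model."""
    return [COMMON_LABELS[(language, p)] for p in phone_sequence]

print(to_common_labels("tamil", ["k", "aa"]))   # ['k', 'aa']
```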
Samudravijaya K is a Professor at the Koneru Lakshmaiah Education Foundation, Vaddeswaram, Andhra Pradesh, India. He carried out research in the area of spoken language processing at the Tata Institute of Fundamental Research, Mumbai, for three decades. Later, he was a Visiting Faculty at IIT Guwahati during 2016-2020 and an Adjunct Faculty at IIT Dharwad during 2021-2022. He received awards such as the Best Ph.D. Thesis Award in 1986, a UNDP Fellowship for research at Carnegie Mellon University, USA, in 1988, and the Sir C. V. Raman Award from the Acoustical Society of India in 2003. His research interests include speech and speaker recognition, voice-enabled information access systems, and spoken language resources. He has extensively used the Sphinx, HTK, and Kaldi toolkits for speech recognition. During his visit to CMU in 1988, he implemented 'Dhwani', the first continuous speech recognition system for an Indian language (Hindi). He has served as a speaker at winter and summer schools, as PI/co-PI of several sponsored projects, as a member of project review committees, and as an industrial consultant.
C. V. Jawahar
Dean of Research and Development,
International Institute of Information Technology, Hyderabad, India.
Title: Seeing is Listening
Abstract: Understanding how humans perceive the signals around them has always been fascinating. Traditionally, computer vision and speech processing have identified themselves as areas with very little to share. Though there have been cognitive studies on the relationship between these two modalities of perception, the computational approaches have been very different. In recent years, we have been seeing more convergence in the computational methods. Moreover, we are seeing the emergence of audio-visual methods where one modality helps or catalyzes the perception of the other modality. In this talk, we discuss some of the recent works and directions.
C. V. Jawahar is a professor and Head of the Centre for Visual Information Technology (CVIT) at the International Institute of Information Technology, Hyderabad (IIITH), India. At IIIT Hyderabad, he leads the research group focusing on computer vision, machine learning, and multimedia systems. In recent years, he has been actively involved in research questions in computer vision with emphasis on mobility, healthcare, and Indian language computing. He is also interested in large-scale multimedia systems with a special focus on assistive technology solutions. Prof. Jawahar is an elected Fellow of the Indian National Academy of Engineering (INAE) and the International Association for Pattern Recognition (IAPR). His research is globally recognized in the artificial intelligence and computer vision research community, with more than 200 publications in top-tier conferences and journals in computer vision, robotics, and document image processing to his credit, and over 18,000 citations. He was awarded the ACM India Outstanding Contribution to Computing Education (OCCE) Award in 2021. He is actively engaged with several government agencies, ministries, and leading companies on innovating at scale through research.
Sriram Ganapathy
Associate Professor of Electrical Engineering, Indian Institute of Science, Bangalore
& Google Research, Bangalore, India
Title: Beyond the Frame: Multi-Scale Self-Supervised Speech Representation Learning
Abstract: This talk delves into the exciting field of self-supervised learning (SSL) for speech processing, specifically focusing on capturing the rich, multi-scale information embedded within speech signals. While conventional SSL approaches primarily target frame-level representations (20-30 ms), capturing semantic content, speech inherently encompasses information at various levels: utterance-level non-semantic cues and even recording session-specific channel/ambient characteristics. I will review key aspects of prior works on speech representation learning at frame and utterance levels that are prevalent in the field.
This talk will showcase our group's efforts in developing novel techniques for factorized representation learning across these multiple scales, leading to improved performance in various downstream speech processing tasks. The first part of the talk will introduce our approach to self-supervised representation learning directly from raw audio using a hidden unit clustering (HUC) framework. This computationally efficient method leverages convolutional neural networks (CNNs) for initial time-frequency representation extraction, followed by processing with long short-term memory (LSTM) layers. We will delve into the techniques employed to enhance speaker invariance in these learned representations. The efficacy of our approach will be demonstrated through its application in two distinct settings: completely unsupervised speech tasks within the ZeroSpeech 2021 challenge and semi-supervised automatic speech recognition (ASR) on the TIMIT and GramVaani challenge Hindi datasets. Notably, our method achieves state-of-the-art results for various ZeroSpeech tasks (as of 2023). The second part will shift focus to our recent "Learn2Diss" framework, designed for learning disentangled speech representations. We will discuss its architecture, comprising separate frame-level and utterance-level encoder modules, and detail the disentanglement process using a mutual information-based criterion. Through comprehensive evaluations on various downstream tasks, including those from the SUPERB challenge, we demonstrate the superior performance of Learn2Diss. Finally, we will touch upon related work in zero-shot emotion conversion and conclude by outlining future research avenues for these promising research streams.
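As a rough sketch of the hidden-unit-clustering idea referred to above, the snippet below clusters frame-level embeddings to obtain discrete pseudo-labels; the actual HUC framework (CNN + LSTM encoder, speaker-invariance techniques) and the Learn2Diss objectives are considerably more involved:

```python
# Rough sketch: derive discrete "hidden unit" pseudo-labels by clustering
# frame-level representations; these pseudo-labels can then serve as
# self-supervised targets. Placeholder random embeddings are used here.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 256))      # stand-in frame-level embeddings

kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(frames)
pseudo_labels = kmeans.labels_              # one discrete unit per frame
print(pseudo_labels[:20])
```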
Sriram Ganapathy is an Associate Professor in the Department of Electrical Engineering, Indian Institute of Science, Bangalore, where he heads the activities of the Learning and Extraction of Acoustic Patterns (LEAP) lab. He is also a visiting research scientist at Google Research India, Bangalore. His research interests include signal processing, machine learning methodologies for speech and speaker recognition, and auditory neuroscience. Prior to joining the Indian Institute of Science, he was a research staff member at the IBM Watson Research Center, Yorktown Heights. He received his Doctor of Philosophy from the Center for Language and Speech Processing, Johns Hopkins University. He obtained his Bachelor of Technology from the College of Engineering, Trivandrum, India, and his Master of Engineering from the Indian Institute of Science, Bangalore. He has also worked as a Research Assistant at the Idiap Research Institute, Switzerland. Over the past 15 years, he has published more than 120 peer-reviewed journal/conference publications in the areas of deep learning and speech/audio processing. Dr. Ganapathy currently serves as the IEEE SigPort Chief Editor and a member of the IEEE Education Board, and functions as a subject editor for the Elsevier Speech Communication journal. He is also a recipient of several awards, including the Department of Science and Technology (DST) Early Career Award in India, the Department of Atomic Energy (DAE) India Young Scientist Award, and the Verisk AI Faculty Award. He is a senior member of the IEEE Signal Processing Society and a member of the International Speech Communication Association (ISCA).
Preethi Jyothi
Associate Professor of Computer Science and Engineering,
Indian Institute of Technology Bombay, India.
Title: Text-only Adaptation of End-to-End Speech Recognition Models
Abstract: End-to-end Automatic Speech Recognition (ASR) systems are a mainstay in modern speech applications. Text-only adaptation of end-to-end ASR systems to new target domains is of great practical relevance since in many domains it is easier to get text-only data compared to the corresponding speech. However, this is a challenging problem for end-to-end ASR that learns a joint mapping from speech to text without any explicit decoupling of acoustic and language models. In this talk, we will discuss three approaches that address the problem of text-only adaptation of end-to-end ASR in three very different settings. First, we present TOLSTOI, where we impute speech representations for text-only data in a target domain and perform in-situ adaptation without incurring any runtime overheads during decoding. Next, we present PRISM, an inference-time technique to adapt an ASR system to predefined dictionaries at test-time with no additional training. Finally, we introduce SALSA, a new lightweight framework that allows the coupling of decoder layers of a pretrained ASR and a text-only large language model (LLM) to improve ASR for a diverse set of low-resource languages.
Preethi Jyothi is an Associate Professor in the CSE department, IIT Bombay. She was a Beckman Postdoctoral Fellow at the University of Illinois at Urbana-Champaign from 2013-2016. She received her Ph.D. in computer science from The Ohio State University. She obtained a B.Tech from the National Institute of Technology, Calicut in 2006, where she was awarded the gold medal for being the top graduating student in computer science. Her research interests are broadly in the areas of machine learning as applied to speech and exploring the interaction between speech and text. Her Ph.D. thesis dealt with statistical learning methods for pronunciation models. Her work on this topic received a Best Student Paper Award at INTERSPEECH, 2012. She co-organised a research project on probabilistic transcriptions at the 2015 Jelinek Summer Workshop on Speech and Language Technology. For this work, her team received a Speech and Language Processing Student Paper Award at ICASSP 2016. Since joining IITB, she was awarded a Google Faculty Research Award 2017 for her proposal on accented speech recognition. She also led a team that received the "Best Project" award at Microsoft Research India’s Summer Workshop on Artificial Social Intelligence in 2017. She currently serves on the ISCA SIGML board, and is a member of the Editorial Board of Computer Speech and Language, Elsevier.
Aparna Walanj
Senior Manager, Medical Research Department,
Kokilaben Dhirubhai Ambani Hospitals and Research Center, Mumbai.
Title: Ethics in Research
Abstract: Ethics is an important component of any research, be it academic or clinical research. As the world today takes giant strides in science, technology, and pioneering research, the credibility of the research community and the willingness of the public to accept new results depend firmly on the authenticity, accuracy, and reliability of the results that are published. It is crucial for researchers to be aligned and updated with the different guidelines and regulations to be followed when undertaking any research. This presentation will try to throw some light on the different guidelines in research and the role of Research Ethics Committees, and will provide insight into the process of submitting documents to the Ethics Committees.
Aparna Walanj received the MBBS, DCH, PGDCR, and PGDBA degrees. Presently, she is Senior Manager at the Medical Research Department, Kokilaben Dhirubhai Ambani Hospitals and Research Center, Mumbai. She is also visiting faculty for several clinical research institutes in Mumbai. She has 15 years of clinical research experience, supervising and assisting in the conduct of clinical research studies in various organizations, such as HCAH, Sarathi, Unichem Labs, Ethika Clinical Research, and Sapphire Hospitals. She supports the training and guidance of research coordinators on research processes, guidelines, and regulations; reviews patient documents and approves patients against set eligibility criteria for research programs; and coordinates with consultants and research experts from CROs and sponsors conducting domestic and global clinical studies. She has expertise in developing Investigator Site and Ethics Committee SOPs, and has conducted trainings and workshops on ICH GCP for investigators and research site staff. She supervises the research team for all activities from study site feasibility to study close-out, and oversees quality audits, sponsor visits, and various accreditation audits related to clinical research. She is a member of the Indian Society for Clinical Research and the Rotary Club of Thane Green City (RCTGC), and has served RCTGC in various capacities, such as President, Secretary, E-Administrator, and Environment Director.
Hemant A. Patil
Professor,
DA-IICT Gandhinagar, India.
Title: Dysarthric ASR: Assistive Speech Technology
Abstract: Dysarthria is a speech disorder stemming from difficulties in controlling the muscles involved in the natural speech production mechanism, and it thus poses formidable challenges to dysarthric patients for effective communication. This disorder can occur for various reasons, such as brain injury, brain tumour, stroke, and nervous system disorders including cerebral palsy, Parkinson's disease, and Amyotrophic Lateral Sclerosis (ALS). Assistive technologies such as dysarthric ASR can help convert the words spoken by patients into text that is easier for others to understand, and thus assist patients in communicating and participating in conversations. This talk will first present various challenges associated with the processing of dysarthric speech, in particular spectrographic vs. Linear Prediction (LP) analysis and shifts in formants and their -3 dB bandwidths. Generally, formants are shifted to higher frequencies due to a decrease in the effective length of the vocal tract system, stemming from the compromised contraction and relaxation of muscles in dysarthric patients. Further, as part of the ongoing efforts of the National Language Translation Mission (NLTM) consortium sponsored by MeitY, Govt. of India, the talk will review various dysarthric ASR systems reported in the literature, including recent work on noise-robust Whisper features using different classifier models, such as LSTM, BiLSTM, and BiGRU. Finally, the talk will also discuss the significance of a dysarthric severity-level classification system (as pre-processing) that invokes severity-specific ASR models to improve the performance of the dysarthric ASR system.
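As a small illustration of the LP analysis mentioned above, the sketch below estimates formant frequencies and -3 dB bandwidths from the roots of an LP polynomial; the LP order, thresholds, and synthetic frame are textbook defaults, not the settings used in the work described in the talk:

```python
# Textbook-style LP-based formant estimation: fit an LP model to a frame and
# read formant frequencies and -3 dB bandwidths off the roots of the LP
# polynomial. The synthetic frame and LP order are illustrative defaults.
import numpy as np
import librosa

def lpc_formants(frame, fs=16000, order=12):
    a = librosa.lpc(frame.astype(float), order=order)     # LP coefficients
    roots = np.array([r for r in np.roots(a) if np.imag(r) > 0])
    freqs = np.angle(roots) * fs / (2 * np.pi)             # formant frequencies
    bws = -fs / np.pi * np.log(np.abs(roots))               # -3 dB bandwidths
    keep = (freqs > 90) & (bws < 400)                       # drop spurious roots
    idx = np.argsort(freqs[keep])
    return freqs[keep][idx], bws[keep][idx]

fs = 16000
t = np.arange(0, 0.03, 1 / fs)
frame = np.sin(2 * np.pi * 500 * t) + 0.6 * np.sin(2 * np.pi * 1500 * t)
print(lpc_formants(frame * np.hamming(len(frame)), fs))
```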
Hemant A. Patil received a Ph.D. degree from the Indian Institute of Technology (IIT) Kharagpur, India, in July 2006. Since 14 February 2007, he has been a faculty member at DA-IICT Gandhinagar, India, where he developed the Speech Research Lab, recognized as one of the ISCA Speech Labs. He has published or submitted more than 320 research publications in international conferences, journals, and book chapters. He visited the Department of ECE, University of Minnesota, Minneapolis, USA (May-July 2009) as a short-term scholar. He has been associated (as PI) with three MeitY-sponsored projects in ASR, TTS, and QbE-STD. He was co-PI for a DST-sponsored project on India Digital Heritage (IDH)-Hampi. His research interests include speech and speaker recognition, analysis of spoofing attacks, audio deepfake detection, TTS, and assistive speech technologies, such as infant cry and dysarthric speech classification and recognition. He received the DST Fast Track Award for Young Scientists for infant cry analysis. He has co-edited four books with Dr. Amy Neustein (EIC, IJST, Springer) with the titles Forensic Speaker Recognition (Springer, 2011), Signal and Acoustic Modeling for Speech and Communication Disorders (DE GRUYTER, 2018), Voice Technologies for Speech Reconstruction and Enhancement (DE GRUYTER, 2020), and Acoustic Analysis of Pathologies from Infant to Young Adulthood (DE GRUYTER, 2020). Recently, he was selected as an Associate Editor for the IEEE Signal Processing Magazine (2021-2023). Prof. Patil has also served as a PRSG member for three MeitY-sponsored projects, namely, "Speech-to-Speech Translation & Performance Measurement Platform for Broadcast Speeches and Talks (e.g., Mann Ki Baat)", "Indian Languages Speech Resources Development for Speech Applications", and "Integration of 13 Indian Languages TTS Systems with Screen Readers for Windows, Linux, and Android Platforms".
Dr. Patil has taken a lead role in organizing several ISCA-supported events at DA-IICT, such as summer/winter schools and CEP workshops. Dr. Patil has supervised 8 doctoral and 56 M.Tech. theses (all in the speech processing area). Presently, he is mentoring one doctoral scholar and one M.Tech. student. Dr. Patil also co-supervises undergraduate and master's students as part of the Samsung PRISM program at DA-IICT. He offered a joint tutorial with Prof. Haizhou Li (IEEE Fellow and ISCA Fellow) during APSIPA ASC 2017 and INTERSPEECH 2018. He offered a joint tutorial with Prof. Hideki Kawahara (IEEE Fellow and ISCA Fellow) on the topic "Voice Conversion: Challenges and Opportunities" during APSIPA ASC 2018, Honolulu, USA. He spent his sabbatical leave at the Samsung R&D Institute, Bengaluru, from May 2019 to August 2019. He was selected as an APSIPA Distinguished Lecturer (DL) for 2018-2019 and has delivered 25+ APSIPA DLs in four countries, namely India, Singapore, China, and Canada. Recently, he was selected as an ISCA Distinguished Lecturer (DL) for 2020-2022 and has delivered 28+ ISCA DLs in India, USA, and Malaysia.