For any queries email: s4p.daiict@gmail.com
Prof. (Dr.) Akihiko K. Sugiyama
Founder, Damascus Corporation, Tokyo, Japan
Professor, Kansai University, Japan
Talk 1: Mechanical Noise Suppression: Debut Of Phase In Signal Enhancement After 30 Years Of Silence
Abstract: This talk presents challenges, solutions, and applications in commercial products of mechanical noise suppression. The topic has become more important with the dissemination of consumer products that process environmental signals in addition to human speech. Three typical types of mechanical noise signals with small, medium, and large signal power, represented by feature phones and camcorders, digital cameras, and standard and tablet PCs, respectively, are covered. Mechanical noise suppression for small-power signals is performed by continuous spectral template subtraction with a noise template dictionary. Medium-power mechanical noise is suppressed in a similar manner, but only when its presence is notified by the parent system such as the digital camera. When the power is large, explicit detection of the mechanical noise based on phase information determines the suppression timing. In all three scenarios, the phase of the input noisy signal is randomized to make the residual noise inaudible in frequency bins where noise is dominant. The phase had been left unaltered in the 30 years since Lim's work; thus, these suppression algorithms opened the door to a new era of signal enhancement. Sound demonstrations before and after suppression highlight the effect of the algorithms and make the talk engaging.
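To make the phase-randomization idea concrete, here is a minimal, illustrative sketch (not the speaker's production algorithm) of spectral template subtraction on one STFT frame, with the phase randomized only in bins where the noise template dominates; all parameter values are assumptions chosen for illustration.

    import numpy as np

    def suppress_frame(noisy_spec, noise_template, floor=0.05, dominance=2.0, rng=None):
        # noisy_spec: complex STFT of one frame; noise_template: magnitude template of the mechanical noise
        rng = rng or np.random.default_rng()
        mag, phase = np.abs(noisy_spec), np.angle(noisy_spec)
        # Subtract the template magnitude, keeping a small spectral floor
        enhanced_mag = np.maximum(mag - noise_template, floor * mag)
        # Bins where the template dominates the enhanced magnitude are treated as noise-dominant
        noise_dominant = noise_template > dominance * enhanced_mag
        # Randomize the phase only in those bins so the residual noise becomes less audible
        phase = np.where(noise_dominant, rng.uniform(-np.pi, np.pi, size=phase.shape), phase)
        return enhanced_mag * np.exp(1j * phase)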
Talk 2: Personal Information Devices: Portable To Wearable, Stand-alone To Connected, Players To Sensors
Abstract: This talk presents a history of personal information devices. The origin is an audio player dating back to the 1990s, born at the intersection of audio coding algorithms providing sufficient subjective audio quality and a sufficient memory size on a single chip. LSI technology was indispensable to its birth, which had a revolutionary impact on the hardware business. The audio-only device was naturally extended to include video signals to cover the multimedia applications commonly encountered today in our daily life. Integration with a mobile phone brought us continuous extensions to wearables, connected operations, and sensing functions.
Talk 3: Phase-Based Time-Frequency Filtering as an Alternative to the Classical Beamforming
Abstract: This talk presents phase-based time-frequency filtering as an alternative to classical beamforming. Classical beamforming is decomposed into direction-of-arrival estimation and direction-based attenuation. This decomposition makes the design of the directivity pattern independent of the sensor arrangement, enabling a sharp beam with a small number of sensors. Audio beamforming for PC applications is presented as an example, together with a design technique for a constant beam width across frequency in multiple channels. Successful evaluation results confirm the constant beam-width design.
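As a rough illustration of the decomposition into per-bin direction estimation followed by direction-based attenuation, the sketch below builds a two-microphone time-frequency mask from inter-microphone phase differences; the geometry, attenuation factor, and beam width are assumptions, not the design presented in the talk.

    import numpy as np

    def phase_based_mask(X1, X2, mic_distance, fs, target_doa_deg=0.0, half_width_deg=20.0, c=343.0):
        # X1, X2: complex STFTs (n_bins x n_frames) from two microphones
        n_bins = X1.shape[0]
        freqs = np.linspace(0.0, fs / 2, n_bins)                 # bin center frequencies
        ipd = np.angle(X1 * np.conj(X2))                         # inter-microphone phase difference
        with np.errstate(divide="ignore", invalid="ignore"):
            sin_theta = ipd * c / (2 * np.pi * freqs[:, None] * mic_distance)
        sin_theta = np.clip(np.nan_to_num(sin_theta), -1.0, 1.0)
        doa = np.degrees(np.arcsin(sin_theta))                   # per-bin direction estimate
        mask = np.where(np.abs(doa - target_doa_deg) <= half_width_deg, 1.0, 0.1)
        return mask * X1                                         # attenuate bins outside the beam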
Talk 4: Linear Microphone Array Parallel to the Driving Direction for In-Car Speech Enhancement
Abstract: This talk presents a linear microphone array parallel to the driving direction for in-car speech enhancement. In contrast to other linear microphone arrays in the car cabin reported in the literature or implemented in commercial products, the array axis is arranged parallel to the driving direction. Thanks to the 90°-rotated array axis under the microphone-position constraints specific to the car environment, the mirror image of the directivity toward the talker with respect to the array axis is no longer projected in the direction of the interference but is redirected to a direction with no interference. As a result, the talker's speech can be discriminated from the interference by directivity, leading to good interference reduction with little speech distortion. Simulation results confirm this claim.
Talk 5: IEEE Fellow Elevation: Keys to Success
Abstract: This talk presents how to prepare a nomination when one is nominated for IEEE Fellow. There are some key considerations to maximize the chances of success when a nomination is prepared. IEEE has clear guidelines on writing an effective nomination, which most nominators do not refer to. Nominations in line with the guidelines make the nominator and nominee confident about the nomination and simultaneously make the evaluators comfortable in the evaluation process through easy understanding and comparison. The talk is based on the presenter's experiences as a member of the IEEE Fellow Committee, which makes the final decision, and of a Society Fellow Evaluation Committee, which performs the initial evaluation, as well as a nominator, reference, and endorser. The considerations covered in the talk, such as the items to be included, the description of the accomplishments, and the order of presentation, are also useful on other occasions when one would like to present accomplishments for Senior Member elevation, award nominations, and promotion within one's affiliation.
Prof. (Dr.) Akihiko Sugiyama has 40 years of experience developing telecommunications, speech, and audio signal processing systems for consumer and network system products. In addition to a proven record of technology adoption in products and international standards, as well as publications and granted patents, his marketing and sales experience, developing over 300 new contacts worldwide in two years and running proof-of-concept (PoC) evaluations with world-leading companies for technical licensing, is unique for a research engineer. Having represented Japan in ISO/IEC MPEG Audio standardization, including serving as Interim Chair of the Audio Subgroup at the Angra dos Reis meeting in Brazil, his standardization experience extends to ITU and 3GPP as a delegate. In 1994, his team developed the world's first all-solid-state portable audio player, a precursor of the iPod/iPhone, which was widely reported in media such as Time Magazine. At Expo 2005 in Aichi, he demonstrated, for the first time in the world, the feasibility of speech recognition in a noisy exhibit environment through a personal robot, PaPeRo. He has established a career bridging industry and academia through 25 years of teaching experience at universities and the supervision of 75 internship students. Through his guidance and encouragement, 5 research engineers reporting directly to him received a D.Eng. or Ph.D. degree for work accomplished under his supervision without being enrolled in a university course. He has contributed 17 book chapters, drafted 4 Japanese Industrial Standards, delivered 198 invited talks in 91 cities in 31 countries, and received 23 awards. He is the sole inventor or a co-inventor of 273 registered patents in Japan and overseas as well as 6 registered trademarks. He is a Fellow of IEEE as well as an Honorary Member and Fellow of IEICE. He served as an IEEE Distinguished Lecturer for the Signal Processing Society (2014-2015) and the Consumer Electronics Society (2017-2018), and as a Distinguished Industry Speaker for the Signal Processing Society (2020-2021). He was recognized as a Renowned Distinguished Speaker (The Rock Star) in 2020 by the IEEE Consumer Electronics Society. He served as a member of the IEEE/RSE James Clerk Maxwell Medal Committee (2022-2024) and the IEEE Fellow Committee (2018-2020, 2022-2023).
Professor, The University of Sheffield, UK.
Talk 1: Selecting Data for Semi-Supervised ASR
Abstract: Training of ASR models has long followed the path of multi-style training, i.e., more diverse data is better data. Labelled data is still hard to come by, hence semi-supervised training is often used. In contrast, the amount of unlabelled data available today can be vast, and the question of data selection becomes important again. In this talk we briefly review standard strategies for semi-supervised training and data selection. We then move on to present recent work on data selection using new methods for word error rate estimation and present results on ASR training.
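As a minimal sketch of the kind of selection step discussed here (the actual methods in the talk are more involved), the snippet below keeps only those automatically transcribed utterances whose estimated word error rate falls below a threshold before they are used for semi-supervised training; the wer_estimator callable and the threshold are hypothetical.

    def select_for_training(utterances, wer_estimator, threshold=0.15):
        # utterances: list of (audio, pseudo_transcript) pairs produced by a seed ASR system
        selected = []
        for audio, hypothesis in utterances:
            estimated_wer = wer_estimator(audio, hypothesis)   # error-prediction model, assumed given
            if estimated_wer < threshold:                      # keep only reliable pseudo-labels
                selected.append((audio, hypothesis))
        return selected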
Talk 2: Multilingual speech recognition - Modelling the relationship between languages
Abstract: Multilingual speech recognition is now commonly used in ASR systems such as Whisper or foundation models such as XLSR. Models are simply trained using data from many languages, possibly with joint tokenization of different writing scripts. In this talk we briefly review the history of such modelling, followed by recent work on understanding the relationship between languages, with the aim of making progress towards representing the 7,000 languages of the world, most of which are under-resourced. Simple mapping models are shown to help better understand and model relationships between languages and thus allow better generalisation to new low-resource languages.
Talk 3: Self-supervised Models for Robust Speech Content Representations
Abstract: Self-supervised models have revolutionised language and speech processing. The fact that unlabelled data can be used to inform and bootstrap models for a vast range of tasks has given rise to a completely different view of speech technology. However, even though most models are trained on large amounts of data, domain generalisation can be poor. Model training is typically very costly, and fine-tuning to a task may not lead to good results because of domain mismatch. In this presentation, some properties of SSL-derived methods are explored, leading to novel ways to fine-tune models to a domain and to content-oriented task types. Instead of model-specific loss functions, a generic alignment loss allows fast fine-tuning at much lower computational cost.
Prof. (Dr.) Thomas Hain is Professor of Computer Science at the University of Sheffield. He holds the degree `Dipl.-Ing' in Electrical and Communication Engineering from the University of Technology, Vienna, and a PhD in Information Engineering from Cambridge University (2002). He has worked on speech processing for more than 15 years in the Speech and Hearing Group, University of Sheffield, Sheffield, U.K. His main interests are in speech processing, machine learning, and natural man/machine interfaces. During his undergraduate studies he received a scholarship to conduct part of his studies at RWTH Aachen, Germany. After receiving a first-class degree from the University of Technology Vienna (top 5%), he worked at Philips Speech Processing, Vienna, which he left as Senior Technologist in 1997 to join the Cambridge University Engineering Department as a PhD student and, unusually, a Research Associate at the same time. Shortly before completing his PhD, he was appointed Lecturer at Cambridge University in 2001. Prof Hain moved to Sheffield University in 2004 to become a member of the Speech and Hearing Research Group (SpandH). After a series of intermediate promotions, he was appointed Full Professor in 2013. Since 2009 he has led the 15-strong subgroup on Machine Intelligence for Natural Interfaces, and in 2016 he took on the role of Head of SpandH - a group consisting of 7 academics, 10 postdoctoral researchers, and more than 30 PhD students. In 2016 Prof Hain also became a member of the Machine Learning Research group.
He has more than 170 publications on machine learning and speech recognition topics (Google Scholar citations 9.5k, h-index 33). In addition to membership of many technical committees, including repeated appointments as area chair at ICASSP, Interspeech, and ICPR, he has been an organising committee member of Interspeech 2009 and IEEE ASRU 2011 and 2013. He is also an ISCA Fellow. He was one of the key organisers and the designated technical chair of Interspeech 2019. He was an Associate Editor of ACM Transactions on Speech and Language Processing, and is currently a member of the editorial board of Computer Speech & Language. He also served as Area Chair for Speech and Language Technology at ICASSP 2017 and was elected to the IEEE Speech Technical Committee for a second term.
Prof. Hain has been an investigator on more than 20 projects, funded by FP6, FP7, EPSRC, DARPA, and industry, with a cumulative research budget of £ 9M (£ 4M as PI). Until recently he served as PI at Sheffield for the EPSRC programme grant NST (total budget £ 6.2M, Sheffield £ 2.2M), and currently works on industrial projects from Google, dstl, and ITSLanguage. Current projects include MAUDIE (Innovate UK), Tuto (industry) and BiT (University). Prof. Hain is currently setting up a centre funded by industry that will host an additional 10 researchers in speech and language processing.
Talk 1: ASR - from input data to industrial applications
Abstract: This lecture examines the key challenges involved in deploying Automated Speech Recognition (ASR) systems, covering the entire pipeline from data collection to model training, execution and iterative assessment. It will address critical considerations such as data transcription strategies, commonly used tools in both research and industry, data privacy concerns, licensing of available resources, and the trade-offs between offline and streaming solutions. The lecture will also explore ASR performance metrics and discuss scalability challenges in real-world applications.
Talk 2: ASR - from HMM/GMMs to LLM-based engines
Abstract: This lecture offers a high-level overview of the key challenges in achieving high-accuracy Automatic Speech Recognition (ASR) systems. It will begin by introducing foundational ASR concepts widely used in recent decades, such as HMM/GMM-based approaches, and progress toward the latest advancements involving the integration of ASR with large language models (LLMs). The session will also highlight recent developments in speech pre-processing, including voice activity detection, handling multi-speaker scenarios, speaker diarization, and mapping of the ASR output to additional information available for a given use case. Finally, the lecture will showcase specific industrial applications where ASR technologies play a critical role.
Talk 3: ASR - contextualisation of ASR systems
Abstract: This lecture will focus on current approaches for contextualizing ASR output using prior knowledge. ASR systems are often tailored for specific applications where auxiliary data—containing relevant context or domain-specific information—is available to enhance recognition accuracy. The session will conclude with a demonstration of how ASR can be integrated into real-world applications, with a particular focus on air traffic management.
Talk 4: ASR - recognition of a priori unknown words, detection of rare word entities
Abstract: This lecture will address a common requirement from ASR users: how to incorporate new words or named entities into ASR output without retraining the entire system. These terms are typically not present in the original training data and were unknown during the initial training phase. The lecture will explore earlier methods used in traditional hybrid ASR systems, as well as more recent techniques developed for end-to-end architectures. The primary goal is to improve recognition accuracy, particularly for rare or out-of-vocabulary words.
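One simple flavour of this idea, shown purely for illustration (the lecture covers more principled hybrid and end-to-end techniques), is to rescore n-best ASR hypotheses with a bonus for user-supplied rare words or named entities; the scoring scheme below is an assumption, not a method from the lecture.

    def rescore_with_biasing(nbest, bias_words, bonus=2.0):
        # nbest: list of (log_score, transcript) pairs from the ASR decoder
        rescored = []
        for log_score, text in nbest:
            hits = sum(1 for word in bias_words if word in text.split())
            rescored.append((log_score + bonus * hits, text))   # boost hypotheses containing bias words
        return sorted(rescored, key=lambda pair: pair[0], reverse=True)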
Talk 5: ASR - data selection and learning using weakly labeled data, performance monitoring
Abstract: This lecture will explore methods for iteratively training ASR systems using data drawn from large, readily available sources. While the volume of data is typically not a limiting factor, its quality can vary significantly, and in most cases it lacks manual annotations. The lecture will cover strategies for effective data selection, techniques for iterative learning that mitigate catastrophic forgetting, and approaches for training with weakly labeled or noisy data. Finally, the lecture will also consider performance monitoring, including the generation of reliable confidence scores as part of the ASR output.
Dr. Petr Motlicek received his M.Sc. in Electrical Engineering and Ph.D. in Computer Science from Brno University of Technology (BUT), Czech Republic, in 1999 and 2003, respectively. In 2000, he conducted research on very low bit-rate speech coding at the École Supérieure d’Ingénieurs en Électrotechnique et Électronique (ESIEE) in Paris, France. From 2001 to 2002, he was a research intern at the Oregon Graduate Institute (OGI) in Portland, USA, where he collaborated with Prof. Hynek Hermansky on distributed speech recognition. Dr. Motlicek currently serves as an Associate Professor at the Faculty of Information Technology, BUT. Since 2005, he has been a Senior Researcher at the Idiap Research Institute in Martigny, Switzerland, where he leads a research group focused on voice intelligence. His R&D work spans speech and speaker recognition, with emphasis on advancing technologies for language understanding. He is also an external lecturer at the École Polytechnique Fédérale de Lausanne (EPFL).
Extracting feature representations is an important step in machine learning-based applications. Conventionally, speech processing relied on signal processing methods and the application of prior knowledge to extract feature representations. Over the past two to three decades, there has been a push towards developing feature representation extraction methods that combine data and prior knowledge using machine learning, which has eventually led to the development of the self-supervised learning-based speech foundation model framework. In this series of presentations, I will show how feature engineering in speech processing has evolved and discuss the pros and cons of these approaches as well as their similarities and dissimilarities. The complete presentation is organized into four parts, namely:
Talk 1: From spectral feature representations to supervised learning-based feature representations
Abstract: In the first part, I will start with an overview of feature extraction using signal processing techniques and the modeling of these features by different distribution modeling methods for automatic speech recognition, and show how this led to different supervised learning-based feature representations, such as tandem features and auto-association/auto-encoder features.
Talk 2: End-to-end acoustic modeling
Abstract: In the second part, I will present an end-to-end acoustic modeling method pioneered at Idiap, where the raw waveform is directly modeled by a neural network in a task-dependent manner. I will provide links to conventional signal processing techniques and show how these kinds of neural networks can be analyzed to gain insight into the information captured by them.
Talk 3: Self-supervised learning (SSL) based representation learning for speech processing
Abstract: In this part, I will start with an overview of self-supervised learning-based feature representation learning methods. I will then present recent work at Idiap on self-supervised feature-representation-based speech synthesis and voice conversion to demonstrate how this leads to new directions where speech synthesis and speech recognition/assessment can be put in a loop. Specifically, the talk will focus on (a) multispeaker speech synthesis, (b) unsupervised rhythm and voice conversion for improving dysarthric speech recognition, and (c) children's voice privacy.
Talk 4: Statistical interpretation of the SSL-based representation learning
Abstract: In this talk, I will present ongoing work at Idiap to show the link between classical approaches to modeling feature distributions and self-supervised learning-based models. Through this link, I will provide a statistical interpretation of SSL models and show how different pre-trained models like wav2vec2, HuBERT, WavLM, and Whisper can be analyzed and distinguished, and how the information learned by them can be interpreted. The talk will conclude by drawing parallels between past and current methods from a statistical pattern recognition point of view and providing suggestions for future research.
Mathew Magimai Doss received the Bachelor of Engineering (B.E.) in Instrumentation and Control Engineering from the University of Madras, India, in 1996; the Master of Science (M.S.) by Research in Computer Science and Engineering from the Indian Institute of Technology, Madras, India, in 1999; and the PreDoctoral diploma and the Docteur ès Sciences (Ph.D.) from the Ecole polytechnique fédérale de Lausanne (EPFL), Switzerland, in 2000 and 2005, respectively. He was a postdoctoral fellow at the International Computer Science Institute (ICSI), Berkeley, USA, from April 2006 till March 2007. He is now a Senior Researcher at the Idiap Research Institute, Martigny, Switzerland. He is also a lecturer at EPFL. His main research interests lie in signal processing, statistical pattern recognition, artificial neural networks, and computational linguistics, with applications to speech and audio processing, sign language processing, and multimodal signal processing. He is a member of IEEE, ISCA, and Sigma Xi. He is an Associate Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing. He is the coordinator of the SNSF Sinergia project SMILE-II, which focuses on continuous sign language recognition, generation, and assessment. He was the coordinator of the recently completed H2020 Marie Sklodowska-Curie Actions ITN-ETN project TAPAS, which focused on pathological speech processing. He has published over 180 journal and conference papers. The Speech Communication paper "End-to-end acoustic modeling using Convolution Neural Networks for HMM-based Automatic Speech Recognition", co-authored by him and published in 2019, received the 2023 EURASIP Best Paper Award for the Speech Communication journal and the ISCA Award for the Best Paper published in Speech Communication (2017-2021). The Interspeech 2015 paper "Objective Intelligibility Assessment of Text-to-Speech Systems through Utterance Verification" received one of the three best student paper awards.
Research Fellow (Professor)/ Deputy Director,
Research Center for Information Technology Innovation, Academia Sinica, Taiwan.
Talk: Neural Speech Enhancement and Assessment and Their Applications in Assistive Oral Communication Technologies
Abstract: This presentation is divided into three parts. First, we will discuss our recent advancements in neural speech enhancement (SE), a critical element in various speech-related applications. The primary objective of SE is to enhance speech signals by mitigating distortions caused by additive and convolutive noises, thereby improving the efficacy of human-human and human-machine communication. We will delve into the system architecture and fundamental theories behind neural SE approaches, as well as explore important directions aimed at achieving better performance. Moving on to the second part, we will focus on our recent progress in neural speech assessment (SA), which aims to effectively evaluate the quality and intelligibility of spoken audio, a crucial aspect of numerous speech-related applications. Traditionally, the evaluation process relies on listening tests involving human participants, which can be both resource-intensive and impractical due to the need for a large number of listeners. To address this challenge, neural SA metrics have garnered notable attention. We will discuss the fundamental systems of neural SA, highlight several factors influencing performance, and explore emerging trends in this domain. Finally, we will present some applications of neural SE and SA in assistive oral communication technologies. These applications include impaired speech transformation and noise reduction for assistive hearing and speaking devices. Through these discussions, our aim is to illustrate the potential impact of neural-based approaches in improving communication accessibility for individuals with oral communication disorders.
Yu Tsao (Senior Member, IEEE) received his B.S. and M.S. degrees in Electrical Engineering from National Taiwan University, Taipei, Taiwan, in 1999 and 2001, respectively, and his Ph.D. degree in Electrical and Computer Engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2008. From 2009 to 2011, he was a Researcher at the National Institute of Information and Communications Technology, Tokyo, Japan, where he worked on research and product development for automatic speech recognition in multilingual speech-to-speech translation. He is currently a Research Fellow (Professor) and the Deputy Director at the Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan. Additionally, he serves as a Jointly Appointed Professor in the Department of Electrical Engineering, Chung Yuan Christian University, Taoyuan, Taiwan. Dr. Tsao's research interests include assistive oral communication technologies, audio coding, and bio-signal processing. He is an Associate Editor for both IEEE Transactions on Consumer Electronics and IEEE Signal Processing Letters. In recognition of his contributions, he received the Outstanding Research Award from the National Science and Technology Council (NSTC), the most prestigious research honor in Taiwan, in 2023. His other accolades include the Academia Sinica Career Development Award in 2017, multiple National Innovation Awards (2018-2021 and 2023), the Future Tech Breakthrough Award in 2019, the Outstanding Elite Award from the Chung Hwa Rotary Educational Foundation (2019-2020), and the NSTC FutureTech Award in 2022. Additionally, he is the corresponding author of a paper that won the 2021 IEEE Signal Processing Society (SPS) Young Author Best Paper Award.
Generative AI Group Leader,
A*STAR - Agency for Science, Technology and Research, Singapore.
Talk: Multimodal, Multilingual Generative AI: From Multicultural Contextualization to Empathetic Reasoning
Abstract: In this seminar, we will take a historical view of large language models through a speech technology lens and draw R&D examples from initiatives such as MeraLion (Multimodal Empathetic Reasoning and Learning In One Network), our generative AI effort in Singapore's National Multimodal Large Language Model Programme. Speech and audio information provides a more comprehensive understanding of spatial and temporal reasoning, as well as social dynamics, going beyond the semantics derived from text alone. Cultural nuances and multilingual peculiarities add another layer of complexity to understanding human interactions. In addition, we will draw on use cases in education to highlight research endeavors, technology deployment experience, and application opportunities.
Dr. Nancy F. Chen received her Ph.D. from MIT and Harvard in 2011. She conducted her Ph.D. research on multilingual speech processing at MIT Lincoln Laboratory. She is currently a fellow, senior principal scientist, group leader, and principal investigator at the Institute for Infocomm Research (I2R) and CFAR (Centre for Frontier AI Research), A*STAR (Agency for Science, Technology and Research), Singapore. She leads research efforts in generative AI, conversational AI, natural language generation, and machine learning, with applications in education, healthcare, journalism, and defense. Speech evaluation technology developed by her team was deployed by the Ministry of Education in Singapore to support home-based learning during the COVID-19 pandemic. Dr. Chen also led a cross-continent team for low-resource spoken language processing, which was one of the top performers in the NIST Open Keyword Search Evaluations (2013-2016), funded by the IARPA Babel program.
Dr. Chen has received numerous awards, including 2023 IEEE SPS Distinguished Lecturer, the EMNLP Outstanding Paper Award (2023), Singapore 100 Women in Tech (2021), the Young Scientist Best Paper Award at MICCAI 2021, the Best Paper Award at SIGDIAL 2021, the 2020 P&G Connect + Develop Open Innovation Award, the 2019 UNESCO L’Oréal Singapore For Women in Science National Fellowship, Best Paper at APSIPA ASC (2016), the MOE Outstanding Mentor Award (2012), the IEEE Spoken Language Processing Grant (2011): Microsoft-sponsored Outstanding Paper Award at ICASSP, and the NIH (National Institutes of Health) Ruth L. Kirschstein National Research Award (2004-2008).
Dr. Chen has given international keynotes at natural language processing, machine learning, and speech technology venues, including the 2023 International Natural Language Generation Conference (INLG) and Special Interest Group on Discourse and Dialogue (SIGDIAL), the 2023 Asian Conference on Machine Learning (ACML), the 2023 Workshop on Speech and Language Technology for Education (SLaTE), the 2023 Conference on Computational Linguistics and Speech Processing, and the 2023 Workshop on NLP for Medical Conversations @ IJCNLP (International Joint Conference on NLP).
Dr. Chen has been active in the international research community with the following services: program chair of IEEE CAI (Conference on AI), APSIPA Board of Governors (2024-2026), program chair of ICLR (International Conference on Learning Representations) 2023, elected board member of ISCA (International Speech Communication Association) 2021-2025, senior area chair of AACL (2022), area chair of ACL and EMNLP (2021), elected member of the IEEE Speech and Language Technical Committee (2016-2018, 2019-2021), senior editor of IEEE/ACM Transactions on Audio, Speech, and Language Processing (2023-present), senior area editor of Signal Processing Letters (2021-2023), associate editor (2020-2023) of IEEE/ACM Transactions on Audio, Speech, and Language Processing, associate editor of Neurocomputing (2020-2021), Computer Speech and Language (2021-present), and IEEE Signal Processing Letters (2019-2021), and guest editor for the special issue on “End-to-End Speech and Language Processing” in the IEEE Journal of Selected Topics in Signal Processing (2017).
In addition to her academic endeavors, technology from her team has also resulted in spin-off companies such as nomopai to help engage customers with confidence and empathy. Dr. Chen has also consulted for various companies ranging from startups to multinational corporations in the areas of climate change (social impact startup normal), emotional intelligence (Cogito Health), EdTech (Novo Learning), speech recognition (Vlingo, acquired by Nuance), and defense and aerospace (BAE Systems).
Professor, Kyoto University, Japan.
Talk: Universal Speech Recognition Using IPA And Articulatory Features
Abstract: While the end-to-end framework has achieved remarkable advancements in automatic speech recognition (ASR), it is heavily optimized for the training dataset and lacks flexibility for multilingual ASR, particularly in low-resource languages. The problem is significantly mitigated, but not completely solved, by SSL-pretrained models such as XLSR. An alternative route to universal ASR is to adopt language-independent tokens such as IPA (the International Phonetic Alphabet). Since IPA symbols are defined in terms of articulatory features, it is possible to incorporate knowledge of articulatory features during training. This talk addresses several approaches in this direction.
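As a toy illustration of why articulatory features help (the symbols and features below are standard IPA facts, but the encoding is only a sketch, not the talk's model), phones can be described by shared features so that a phone unseen in a low-resource language can borrow statistics from acoustically similar phones of other languages.

    # Each IPA symbol is described by (voicing, place, manner)
    ARTICULATORY_FEATURES = {
        "p": ("voiceless", "bilabial", "plosive"),
        "b": ("voiced",    "bilabial", "plosive"),
        "t": ("voiceless", "alveolar", "plosive"),
        "d": ("voiced",    "alveolar", "plosive"),
        "s": ("voiceless", "alveolar", "fricative"),
        "z": ("voiced",    "alveolar", "fricative"),
    }

    def shared_features(phone_a, phone_b):
        # Count how many articulatory features two IPA symbols share
        fa, fb = ARTICULATORY_FEATURES[phone_a], ARTICULATORY_FEATURES[phone_b]
        return sum(1 for x, y in zip(fa, fb) if x == y)

    print(shared_features("p", "b"))   # 2: same place and manner, different voicing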
Prof. Tatsuya Kawahara received the B.E., M.E., and Ph.D. degrees in information science from Kyoto University, Kyoto, Japan, in 1987, 1989, and 1995, respectively. From 1995 to 1996, he was a Visiting Researcher with Bell Laboratories, Murray Hill, NJ, USA. He was also an Invited Researcher with ATR and NICT. He is currently a Professor with the School of Informatics, Kyoto University. From 2020 to 2023, he was the Dean of the School. He has authored or coauthored more than 450 academic papers on automatic speech recognition, spoken language processing, and spoken dialogue systems. He has been conducting several projects, including the open-source speech recognition software Julius, the automatic transcription system deployed in the Japanese Parliament (Diet), and the autonomous android ERICA. He was the recipient of the Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology (MEXT) in 2012. From 2003 to 2006, he was a member of the IEEE SPS Speech Technical Committee. He was a General Chair of IEEE ASRU 2007 and of SIGdial 2024. He was also a Tutorial Chair of INTERSPEECH 2010, a Local Arrangement Chair of ICASSP 2012, and a General Chair of APSIPA ASC 2020. He was an editorial board member of the Elsevier Journal of Computer Speech and Language and of the IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING. From 2018 to 2021, he was the Editor-in-Chief of APSIPA Transactions on Signal and Information Processing. He is an IEEE Fellow, President of APSIPA, and Secretary General of ISCA.
Emeritus Professor,
International Institute of Information Technology (IIIT), Hyderabad, India.
Talk1: Challenges in Processing Natural Signals Like Speech
Abstract: Will update soon.
Talk2 : Speech Signal Processing Using Single Frequency Filtering (SFF)
Abstract: Signal processing in general, and speech signal processing in particular, is normally performed using block processing methods such as the discrete Fourier transform. Frame-based block processing of signals has some disadvantages, especially in processing the phase spectral component. Filtering-based methods can be explored as an alternative for processing speech signals. In this presentation, we will discuss the single frequency filtering (SFF) method for speech signal processing, especially for extracting speech production information from the phase component. Starting with the basics of signals and systems for discrete-time signals, this talk presents the main ideas of SFF that are useful in extracting time-varying formant and pitch-harmonic contours from speech signals. The results will be demonstrated on speech signals from different types of voices.
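For readers new to SFF, here is a minimal sketch of the envelope computation following the commonly published formulation: the component at the desired frequency is shifted to fs/2 and extracted by a single-pole filter with its pole close to the unit circle; the pole radius used here is an assumed typical value.

    import numpy as np

    def sff_envelope(x, f_k, fs, r=0.995):
        # Amplitude envelope of signal x at frequency f_k (Hz) using single frequency filtering
        n = np.arange(len(x))
        w_shift = np.pi - 2 * np.pi * f_k / fs           # shift f_k to pi (i.e., to fs/2)
        x_shifted = x * np.exp(1j * w_shift * n)         # complex frequency-shifted signal
        y = np.zeros(len(x), dtype=complex)
        y[0] = x_shifted[0]
        for i in range(1, len(x)):
            y[i] = -r * y[i - 1] + x_shifted[i]          # single-pole filter, pole at z = -r
        return np.abs(y)                                 # sample-wise envelope at f_k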
Prof. (Dr.) Bayya Yegnanarayana is currently an INSA Honorary Scientist at IIIT Hyderabad. He was Professor Emeritus at the BITS-Pilani Hyderabad Campus during 2016. He was an Institute Professor from 2012 to 2016 and Professor & Microsoft Chair from 2006 to 2012 at IIIT Hyderabad. He was a professor at IIT Madras (1980 to 2006), a visiting associate professor at Carnegie-Mellon University, Pittsburgh, USA (1977 to 1980), and a member of the faculty at the Indian Institute of Science (IISc), Bangalore (1966 to 1978). His research interests are in signal processing, speech, image processing, and neural networks. He is a Fellow of INAE, INSA, IASc, IEEE, ISCA, and APAS, and a Life Fellow of IIT Kharagpur. He has published over 400 papers and supervised over 80 theses at the Master's and PhD levels. He is also an adjunct faculty member at IIT Tirupati.
Emeritus Professor (Honorary),
Indian Institute of Technology (IIT), Madras, India.
Talk: Signal Processing Guided Machine Learning In Various Domains
Abstract: The buzzword for building applications of practical relevance today is “big data”. This has led to a separate field called “Data Science” being offered by various universities and institutes. The field of “data science” has grown to accommodate the variability in the underlying statistical structure that exists in natural signals. Both classical machine learning and deep learning rely on the availability of large amounts of a wide variety of data. Deep learning models, which are massive neural networks, ultimately learn the underlying structure of the data. While deep learning has revolutionized machine learning, in this talk we focus on the use of signal processing to preprocess or mine existing data, so that accurate data is presented to machine learning models. Domain-specific signal processing has the capability to identify events in a signal. The event itself may have varying statistical characteristics. Presenting detected events to machine learning enables faster convergence and a smaller data footprint. We draw examples from speech, music, and EEG signals to show that “signal processing and machine learning” must work together to build systems of relevance for a given domain.
Prof. (Dr.) Hema A. Murthy is Emeritus Professor (Hon) at IIT Madras. She was appointed to the ISCA Advisory Council in 2024 and also served as a board member of ISCA from 2017 to 2021. She has received several awards, including the IBM Faculty Award (2006), a Manthan Award Finalist position (top 74 out of ~450 projects, 2012) for Text-to-Speech Synthesis in Indian Languages, the Prof. Rais Ahmed Memorial Lecture Award (Acoustical Society of India, 2012), and, for Indian Language Text-to-Speech Synthesis Systems Integrated with Screen Readers, the GE Innovation Award and First Prize in the Research Expo at Shaastra 2013. Her research interests include speech processing, recognition, synthesis, computational brain research, and music processing. She is a Fellow of INAE, IEEE, and ISCA. She has published over 300 papers and supervised 20 Ph.D. theses and 48 Master's theses.
Professor
Department of Electrical Engineering, Indian Institute of Technology (IIT), Madras, India.
Talk 1: Recent Advances in ASR of Indian Languages
Abstract: In this talk, I will give an overview of the current work on ASR in Indian languages at SPRING LAB, IIT Madras. I will give an overview of the three broad architectures for ASR: encoder-decoder, CTC, and transducer-based approaches. This will be followed by details of our efforts to collect speech data and build ASR models in various Indian languages. All of our models and data are available in open source, and I will give a demo (https://asr.iitm.ac.in/demo/home) of the ASR systems as well as a speech-to-speech translation system built by pipelining our ASR and MT systems with a TTS.
Talk 2: Speech Foundation Models for ASR in Indian languages
Abstract: In this talk, I will give an overview of speech foundation models. While the motivation for these models comes from text language models, unlike text, the discretisation of the speech signal is not straightforward. I will start with contrastive predictive coding ideas, followed by some popular models like wav2vec2.0 and HuBERT. This will be followed by details of recent work from SPRING LAB, where we have proposed two speech foundation models, ccc-wav2vec2.0 and data2vec-aqc. These models have done exceedingly well in the SUPERB challenge and also in a study that performed a large-scale evaluation of speech foundation models (Yang et al., IEEE TASLP, vol. 32, pp. 2884-2899). We are particularly excited since these were built on just 960 hours of data, yet were competing with bigger models built on 60,000 or 94,000 hours. Motivated by the success on American English, we have pretrained ccc-wav2vec2.0 and data2vec-aqc models on 30,000 hours of Indian language data. These models, when fine-tuned for ASR tasks, give state-of-the-art performance for Indian languages. I will wrap up the talk with demos of systems developed at SPRING LAB.
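To give a flavour of the contrastive predictive coding idea behind wav2vec2.0-style pre-training, the sketch below computes an InfoNCE-style loss in which the model must pick the true target for a masked frame from among distractors; it is illustrative only and not the SPRING LAB implementation.

    import numpy as np

    def contrastive_loss(context, true_target, distractors, temperature=0.1):
        # context, true_target: 1-D vectors; distractors: 2-D array of negative targets
        def cos(a, b):
            return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        candidates = np.vstack([true_target, distractors])        # true target is row 0
        sims = np.array([cos(context, c) for c in candidates]) / temperature
        log_probs = sims - np.log(np.sum(np.exp(sims)))           # log-softmax over candidates
        return -log_probs[0]                                      # negative log-probability of the true target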
Dr. S. Umesh is a Professor in the Electrical Engineering Department at the Indian Institute of Technology Madras (IIT Madras). He received his Ph.D. from the University of Rhode Island, USA, in 1993, his M.E. (Hons.) from the Madras Institute of Technology in 1989, and his B.E. (Hons.) from the Birla Institute of Technology and Science, Pilani, in 1987. Before joining IIT Madras, Dr. Umesh was a faculty member at IIT Kanpur from 1996 to 2009. He has held several prestigious international research positions, including at RWTH Aachen, Cambridge University, AT&T Laboratories, and the City University of New York. He also completed post-doctoral fellowships at the City University of New York and the University of Rhode Island, USA. His contributions to academics have been recognized with prestigious awards such as the AICTE Career Award for Young Teachers (1997) and the Alexander von Humboldt Research Fellowship (2004). His research mainly focuses on automatic speech recognition, speaker normalization and adaptation, self-supervised learning, deep learning, machine learning, and speaker recognition and diarisation.
Associate Professor of Electrical Engineering, Indian Institute of Science, Bangalore
& Google Research, Bangalore, India
Talk: Demystifying the Black Box: Explainability and Trust in Modern AI
Abstract: As artificial intelligence systems in speech, text, and vision become more complex and opaque, ensuring their interpretability and trustworthiness is essential, especially when users only have black-box access. In this talk, I will detail two recent advancements from our work that tackle these challenges across vision, audio, and language tasks. First, I will introduce Distillation-Aided Explainability (DAX), a gradient-free framework that generates saliency-based explanations using a learnable mask generation network and a student distillation network. DAX outperforms existing methods across modalities on both objective and human-centric evaluation metrics. This part of the talk will be based on the work detailed in IEEE JSTSP 2024. Second, I will present our recent work on trust assessment of LLMs, FESTA (Functionally Equivalent Sampling for Trust Assessment), an unsupervised, black-box technique that estimates model uncertainty by probing input consistency and sensitivity through equivalent and complementary samples. Together, these methods show how we can peek inside the black box, using distillation and input sampling approximations, to build approaches that inspire confidence in and understanding of deep learning models and LLMs as they become ubiquitous in safety-critical domains.
Sriram Ganapathy is an Associate Professor in the Department of Electrical Engineering, Indian Institute of Science, Bangalore, where he heads the activities of the Learning and Extraction of Acoustic Patterns (LEAP) lab. He is also a visiting research scientist at Google Research India, Bangalore. His research interests include signal processing, machine learning methodologies for speech and speaker recognition, and auditory neuroscience. Prior to joining the Indian Institute of Science, he was a research staff member at the IBM Watson Research Center, Yorktown Heights. He received his Doctor of Philosophy from the Center for Language and Speech Processing, Johns Hopkins University. He obtained his Bachelor of Technology from the College of Engineering, Trivandrum, India, and his Master of Engineering from the Indian Institute of Science, Bangalore. He has also worked as a Research Assistant at the Idiap Research Institute, Switzerland. Over the past 15 years, he has published more than 120 peer-reviewed journal/conference papers in the areas of deep learning and speech/audio processing. Dr. Ganapathy currently serves as the IEEE SigPort Chief Editor and a member of the IEEE Education Board, and functions as a subject editor for the Elsevier Speech Communication journal. He is also a recipient of several awards, including the Department of Science and Technology (DST) Early Career Award in India, the Department of Atomic Energy (DAE) India Young Scientist Award, and the Verisk AI Faculty Award. He is a Senior Member of the IEEE Signal Processing Society and a member of the International Speech Communication Association (ISCA).
Professor, Indian Institute of Technology (IIT), Hyderabad
Talk: Phase Processing Of Speech Signals
Abstract: Phase information, long regarded as secondary to magnitude in speech signal processing, has emerged as a powerful cue for analyzing and interpreting speech. This talk highlights key contributions of phase-based methods, particularly those leveraging the Short-Time Fourier Transform (STFT), in uncovering fine temporal and spectral structures of speech. Techniques based on group delay and instantaneous frequency enable high-resolution representations that are sensitive to vocal tract dynamics and source characteristics. Modified group delay functions, product spectrum analysis, and phase modeling approaches have shown remarkable utility in applications such as formant estimation, voice activity detection, speaker and speech recognition, and glottal event analysis. Despite challenges like phase wrapping and windowing artifacts, phase processing continues to provide complementary and sometimes superior information compared to magnitude-based methods, underscoring its growing importance in modern speech technology.
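For readers unfamiliar with group delay analysis, the sketch below computes the classical group delay of a windowed speech frame using the identity tau(w) = (X_R*Y_R + X_I*Y_I) / |X(w)|^2, where Y is the Fourier transform of n*x[n]; the modified group delay function discussed in the talk adds smoothing and compression parameters not shown here.

    import numpy as np

    def group_delay(frame, n_fft=512, eps=1e-10):
        # frame: one windowed speech frame (1-D array)
        n = np.arange(len(frame))
        X = np.fft.rfft(frame, n_fft)                # spectrum of x[n]
        Y = np.fft.rfft(n * frame, n_fft)            # spectrum of n*x[n]
        return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)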
K. Sri Rama Murty received the B.Tech. degree in electronics and communications engineering from Jawaharlal Nehru Technological University (JNTU), Hyderabad, India, in 2002, and the Ph.D. degree from the Indian Institute of Technology (IIT) Madras, Chennai, India, in 2009. In 2009, he joined, as an Assistant Professor, the Department of Electrical Engineering, Indian Institute of Technology Hyderabad, Hyderabad, India, where he is currently an Associate Professor. His research interests include signal processing, speech analysis, recognition, synthesis, phase processing and modelling and machine learning.
Associate Professor, Department of Electrical Engineering, IIT Kanpur
Talk: Towards Multilingual Speech Tokenization
Abstract: Modern NLP tools, including LLMs, depend on tokens derived from text orthography, the conventional written form specific to a language. We present alternative ways to derive tokens directly from speech audio, bypassing orthography. The goal is universal multilingual tokenization that extracts language-independent features akin to IPA. We will discuss popular tokenization methods, such as speech-to-text ASR and self-supervised learning, and alternative approaches such as audio fingerprinting, wav2tok, and BEST-STD. We will introduce a pairwise training paradigm that circumvents the need for a written form of a language. Finally, we will present our language-agnostic tokenizer, tested across multiple Indian languages, on the word search task.
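One common recipe for deriving discrete speech tokens without orthography, given here only as a hedged illustration (it is not wav2tok or BEST-STD), is to cluster frame-level self-supervised features with k-means and collapse repeated cluster indices into token sequences.

    import numpy as np
    from sklearn.cluster import KMeans

    def train_tokenizer(feature_matrix, n_tokens=100):
        # feature_matrix: (n_frames, feat_dim) SSL features pooled over a multilingual corpus
        return KMeans(n_clusters=n_tokens, n_init=10, random_state=0).fit(feature_matrix)

    def tokenize(tokenizer, utterance_features):
        ids = tokenizer.predict(utterance_features)               # one cluster id per frame
        previous = np.r_[-1, ids[:-1]]
        return [int(t) for t, p in zip(ids, previous) if t != p]  # collapse consecutive repeats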
Vipul Arora (Member, IEEE) received the B.Tech. and Ph.D. degrees in electrical engineering from the Indian Institute of Technology (IIT) Kanpur, Kanpur, India, in 2009 and 2015, respectively. He was a Post-doctoral Researcher with the University of Oxford and a Research Scientist at Amazon Alexa, Boston, MA, USA. He is currently an Associate Professor with the Department of Electrical Engineering, IIT Kanpur. His research interests include machine learning, audio processing, machine learning for physics, and time series analysis.
Prof. (Dr.) Anil Kumar VUPPALA
Associate Professor, Speech Processing Lab, IIIT Hyderabad.
Talk: ASR and SLT in Indian Languages
Abstract: This talk focuses on advancements in Automatic Speech Recognition (ASR) in Indian languages and Spoken Language Translation (SLT). The research highlights the creation of the IIITH-CSTD corpus, a large-scale Telugu speech dataset collected through crowd-sourced strategies, and evaluates different ASR architectures on this corpus. The presentation also delves into SLT, outlining both cascaded and end-to-end models, and introduces "Shruthilipi Anuvaad," a dataset creation pipeline for low-resource Indic-to-Indic speech translation using weakly labeled data. Furthermore, it details the IIITH-BUT system for low-resource Bhojpuri to Hindi speech translation, discussing hyperparameter optimization, data augmentation, and cross-lingual transfer learning techniques.
Anil Kumar Vuppala received his B.Tech. in Electronics and Communications Engineering from JNTU, Hyderabad, India, in 2005, his M.Tech. in Electronics and Communications Engineering from NIT Kurukshetra in 2007, and his PhD in signal processing from IIT Kharagpur in 2012. From March 2012 to June 2019 he worked as an Assistant Professor at IIIT Hyderabad, and since July 2019 he has been working there as an Associate Professor. His research interests lie primarily in speech processing in mobile and practical environments. He has published over 100 articles in reputed publications. He is currently handling 2 sponsored projects and has completed 9 funded projects. He is guiding 5 full-time PhD students and 9 MS students, and has successfully guided 7 PhD students and 13 MS students. He has given more than 100 invited talks in various workshops and conferences.
Talk: On Quantization of Neural Models for Speech Tasks
Abstract: As deep learning models for speech tasks grow in size and complexity, reducing their computational and memory demands becomes critical for efficient deployment, especially on edge devices. Two key strategies to achieve this are model compression and quantization. While model compression focuses on reducing the structural complexity through methods like pruning or distillation, quantization tackles the numerical precision of model parameters, activations, and/or gradients, enabling models to operate with lower bit-widths (e.g., 8-bit instead of 32-bit). This talk will introduce the fundamentals of quantization and discuss why popular methods like post-training quantization (PTQ) and quantization-aware training (QAT) often fall short when applied to modern speech models that include complex components such as channel aggregation, squeeze-and-excitation or attention modules. I will present recent work that addresses these limitations, offering more robust quantization strategies tailored for state-of-the-art speech architectures. The session aims to provide beginner students with a clear understanding of the practical challenges and emerging solutions in making speech models lightweight without compromising accuracy.
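As background for the discussion, here is a minimal sketch of symmetric post-training quantization of a weight tensor to int8; real toolchains add per-channel scales, calibration data, and quantization-aware training, all omitted here, and the tensor shown is random for illustration.

    import numpy as np

    def quantize_int8(weights):
        scale = np.max(np.abs(weights)) / 127.0                   # one scale for the whole tensor
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale                       # approximate reconstruction

    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int8(w)
    print(np.max(np.abs(w - dequantize(q, s))))                   # worst-case error is about scale / 2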
Vinayak Abrol is an Assistant Professor at the Department of Computer Science and Engineering and is associated with the Infosys Centre for AI at IIIT Delhi, India. Prior to this, he held an Oxford-Emirates data science fellowship at the Mathematical Institute, University of Oxford, and an SNSF-funded postdoctoral position at the Idiap Research Institute, Switzerland. He received his Ph.D. from the School of Computing and Electrical Engineering, IIT Mandi, India, in 2018. He is a recipient of the TCS PhD fellowship, the JP Morgan & Chase faculty research award, the Google exploreCS award, and IIT Mandi's Young Achiever Award, among others. His research focuses on the design and analysis of numerical algorithms for information-inspired applications. His current work focuses on XAI methods for acoustic models and generative speech/audio language modelling.
Professor, Dhirubhai Ambani University (DAU)(formerly DA-IICT), Gandhinagar, India.
Talk: Multi-Lingual Audio DeepFake Detection Corpus
Abstract: Deepfakes are artificially generated fake media produced using deep learning (DL) methods. A recent study found that deepfakes are challenging to detect even for human listeners; however, machines can do a better job of detecting them. This talk presents the development of the recent Multi-Lingual Audio Deepfake Detection Corpus (MLADDC), built to boost Audio DeepFake Detection (ADD) research. Existing datasets for ADD suffer from several limitations; in particular, they are limited to one or two languages. The proposed dataset covers 20 languages, released in 4 tracks (6 Indian languages, 14 international languages, half-truth data in all 20 languages, and combined data). Moreover, the dataset has 400K files (1,125+ hours) of data, which makes it one of the largest datasets of its kind. Deepfakes in MLADDC have been produced using advanced DL methods, such as HiFi-GAN and BigVGAN. Another novelty of this corpus lies in its sub-dataset containing partial deepfakes (half-truths). We compared our dataset with various existing datasets using a cross-database method. For comparison, we also provide a baseline accuracy of 68.44% and an EER of 40.9% with MFCC features and a CNN classifier (on the 14-language track only), indicating the technological challenges associated with the ADD task on the proposed dataset. The talk will also discuss some of the open research challenges in ADD, especially in the multilingual context.
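For orientation, the sketch below shows the general shape of an MFCC + CNN detection baseline of the kind reported above; the exact feature settings and network architecture used for MLADDC are not specified here, so everything below is an assumption for illustration.

    import torch
    import torch.nn as nn
    import librosa

    def mfcc_features(wav_path, sr=16000, n_mfcc=40):
        y, _ = librosa.load(wav_path, sr=sr)
        m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)        # (n_mfcc, n_frames)
        return torch.tensor(m, dtype=torch.float32).unsqueeze(0)   # add a channel dimension

    class DeepfakeCNN(nn.Module):
        # Tiny 2-D CNN over the MFCC "image", classifying bona fide vs. deepfake
        def __init__(self):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((8, 8)),
            )
            self.head = nn.Linear(16 * 8 * 8, 2)
        def forward(self, x):                                      # x: (batch, 1, n_mfcc, n_frames)
            return self.head(self.conv(x).flatten(1))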
Hemant A. Patil received a Ph.D. degree from the Indian Institute of Technology (IIT) Kharagpur, India, in July 2006. Since February 2007, he has been a faculty member at DAU Gandhinagar, India, where he developed the Speech Research Lab, recognized as an ISCA Speech Lab (the only such lab in India). He has published/submitted around 350+ research publications in international conferences, journals, and book chapters. He visited the Department of ECE, University of Minnesota, Minneapolis, USA (May-July 2009) as a short-term scholar. He has been associated (as PI) with three MeitY-sponsored projects in ASR, TTS, and QbE-STD, and was co-PI for a DST-sponsored project on India Digital Heritage (IDH)-Hampi. His research interests include speech and speaker recognition, analysis of spoofing attacks, audio deepfake detection, TTS, and assistive speech technologies, such as infant cry and dysarthric speech classification and recognition. He received the DST Fast Track Award for Young Scientists for infant cry analysis. He has co-edited four books with Dr. Amy Neustein (EIC, IJST, Springer), titled Forensic Speaker Recognition (Springer, 2011), Signal and Acoustic Modeling for Speech and Communication Disorders (DE GRUYTER, 2018), Voice Technologies for Speech Reconstruction and Enhancement (DE GRUYTER, 2020), and Acoustic Analysis of Pathologies from Infant to Young Adulthood (DE GRUYTER, 2020). Recently, he was selected as an Associate Editor for IEEE Signal Processing Magazine (2021-2023). Prof. Patil has also served as a PRSG member for three MeitY-sponsored projects, namely, “Speech-to-Speech Translation & Performance Measurement Platform for Broadcast Speeches and Talks (e.g., Mann Ki Baat)”, “Indian Languages Speech Resources Development for Speech Applications”, and “Integration of 13 Indian Languages TTS Systems with Screen readers for Windows, Linux, and Android Platforms”.
Dr. Patil has taken a lead role in organizing 07 ISCA-supported events at DAU, such as summer/winter schools and 02 CEP workshops. He has supervised 08 doctoral and 57 M.Tech. theses (all in the speech processing area). Presently, he is mentoring 01 doctoral scholar and 05 M.Tech. students. He has co-supervised UG and Master's students jointly as part of the Samsung PRISM program at DAU. He offered a joint tutorial with Prof. Haizhou Li (IEEE Fellow and ISCA Fellow) during APSIPA ASC 2017 and INTERSPEECH 2018, and a joint tutorial with Prof. Hideki Kawahara (IEEE Fellow and ISCA Fellow) on the topic “Voice Conversion: Challenges and Opportunities” during APSIPA ASC 2018, Honolulu, USA. He spent his sabbatical leave at the Samsung R&D Institute, Bengaluru, from May 2019 to August 2019. He was selected as an APSIPA Distinguished Lecturer (DL) for 2018-2019 and has delivered 25+ APSIPA DLs in four countries, namely, India, Singapore, China, and Canada. Recently, he was selected as an ISCA Distinguished Lecturer (DL) for 2020-2022 and has delivered 28+ ISCA DLs in India, the USA, and Malaysia.