Generative AI has revolutionized data synthesis and augmentation, offering new possibilities for signal processing. This workshop, a satellite event of ICASSP 2025, focuses on leveraging generative AI, including GANs, VAEs, Transformer-based models, and diffusion models, as well as adaptation techniques such as zero- or few-shot learning, to create and enhance datasets for applications in speech, audio, music, and other data-sensitive multimodal signal processing domains. Participants will explore state-of-the-art AI techniques for synthesizing high-quality, diverse datasets that address data scarcity and mirror complex real-world scenarios. These datasets can be used to train models in a supervised fashion, and we also welcome contributions investigating their use in self-supervised or semi-supervised learning scenarios.

The workshop will feature keynote talks, paper presentations, and panel discussions led by experts from academia and industry. Key topics include AI methodologies for realistic data generation, integrating synthetic data into existing workflows, and ethical considerations, including privacy-related issues such as generative AI that preserves the statistical properties of real data without exposing sensitive information.

Technically sponsored by the IEEE SPS Audio and Acoustic Signal Processing Technical Committee, the Speech and Language Processing Technical Committee, and the Data Science Initiative as a Data Science and Learning Workshop, this workshop is also the successor to the inaugural workshop on “Synthetic Data’s Transformative Role in Foundational Speech Models,” an Interspeech 2024 satellite workshop. It aims to foster a collaborative environment, inspire innovation, and push the boundaries of signal processing, AI, and data science research and practice. Join us at ICASSP 2025 to explore the transformative potential of generative AI and contribute to the future of this exciting field.
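To make the core workflow concrete, here is a minimal, hypothetical sketch (in PyTorch) of the simplest form of generative data augmentation discussed above: sampling synthetic clips from a pretrained generative model and mixing them with a measured corpus before training a downstream model. The `generator` callable, the class count, and the clip lengths are illustrative placeholders, not components of any specific system presented at the workshop.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Placeholder "real" corpus: 128 one-second clips at 16 kHz with 10 event classes.
real_waveforms = torch.randn(128, 16000)
real_labels = torch.randint(0, 10, (128,))

def generate_synthetic(generator, n_per_class, n_classes=10, n_samples=16000):
    """Draw class-conditional synthetic clips from a generative model."""
    clips, labels = [], []
    for c in range(n_classes):
        clips.append(generator(c, n_per_class, n_samples))
        labels.append(torch.full((n_per_class,), c, dtype=torch.long))
    return torch.cat(clips), torch.cat(labels)

# Stand-in generator; in practice this would be the sampling routine of a
# trained GAN, VAE, diffusion, or autoregressive model.
fake_generator = lambda c, n, t: torch.randn(n, t)

syn_waveforms, syn_labels = generate_synthetic(fake_generator, n_per_class=32)

# Combine real and synthetic examples so the downstream model trains on both.
train_set = ConcatDataset([
    TensorDataset(real_waveforms, real_labels),
    TensorDataset(syn_waveforms, syn_labels),
])
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
```

How much synthetic data to mix in, and whether it is used in supervised, self-supervised, or semi-supervised training, are exactly the kinds of design questions the workshop invites contributions on.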
Generative AI Techniques: Leveraging GANs, VAEs, Transformer-based models, diffusion models, autoregressive models, and other generative methods for realistic data generation.
Data Augmentation and Enhancement: Strategies for integrating synthetic data into existing workflows to improve model performance.
Adaptation Techniques: Approaches such as zero- or few-shot learning in the context of generative AI.
Ethical Considerations and Privacy: Discussions on the ethical use of synthetic data, including privacy-preserving techniques.
Application Domains: Case studies in speech, audio, music, multimedia, and cross-domain applications.
Learning Paradigms: Exploring the use of synthetic data in supervised, self-supervised, and semi-supervised learning scenarios.
Title: Generative AI Music Beyond Text-to-Music
Abstract: Why is generative AI music interesting? How does it fundamentally differ from pre-recorded songs? And what are the key challenges in music generation research? In this keynote, I will explore these questions, advocate for pushing beyond text-to-music, review recent advancements in controllable and fast music generation, and conclude by arguing that we should focus on creating a new design language for music co-creation and editing. References for the discussion include Music ControlNet (IEEE TASLP 2024), DITTO (ICML Oral 2024), MusicHiFi (IEEE SPL 2024), DITTO2 (ISMIR 2024), and Presto (ICLR Spotlight 2025).
Bio: Nicholas J. Bryan is a senior research scientist and head of the Music AI team at Adobe Research. His research focuses on audio and music, generative AI, and signal processing. Nick received his PhD and MA from CCRMA, Stanford University, and an MS in Electrical Engineering, also from Stanford. He also holds a Bachelor of Music and a BS in Electrical Engineering from the University of Miami (FL), graduating summa cum laude with general and departmental honors. Nick has received two best paper awards, an AES Graduate Design Gold Award, and a best reviewer award. He was General Co-Chair of WASPAA 2023 and is a two-time elected member of the IEEE AASP TC, an IEEE Senior Member, and an Adobe Distinguished Inventor.
Title: Artificial Intelligence in the workflow of audio signal processing
Abstract: In the last few years, following the explosion of machine learning for addressing complex problems, artificial intelligence has become increasingly integrated into systems where sophisticated tasks are approached in a divide-and-conquer fashion, combining traditional signal processing algorithms with AI blocks. In this keynote, we will analyze two scenarios: the use of artificial intelligence for room acoustic extrapolation, and AI-generated audio for the reinforcement of training sets in the context of sound event detection. The keynote will focus on answering two fundamental questions related to AI-generated data augmentation: 1) what is the best way to integrate AI-generated data into the training workflow of event detection systems? and 2) which architectures are best suited to augment datasets in acoustic data generation?
Bio: Fabio Antonacci is an associate professor at Politecnico di Milano. His research interests include space-time processing of audio signals for both speaker and microphone arrays, and musical acoustics, in particular the development of innovative non-invasive measurement methodologies.
Title: Scaling Flow-Based Models for X-Conditional Audio Generation
Abstract: Generative modeling has seen significant advancements in recent years, with increasing interest and scaling in models and data across image, video, and audio domains. Beyond generation, these developments open new directions to apply generative approaches for data augmentation in real-world signal processing. This talk will focus on flow-based generative models for X-conditional audio generation, including text-to-speech (TTS), text-conditioned audio synthesis, large-scale pretraining, and video-to-audio generation. Key contributions from Voicebox, Audiobox, SpeechFlow, and Movie Gen Audio will be discussed, highlighting their impact on controllable and high-fidelity audio synthesis. Additionally, I will briefly touch on non-autoregressive approaches in state-of-the-art models and explore their potential for data augmentation in speech and audio processing. Finally, I will discuss ongoing challenges and future directions, including their role in advancing both generation and understanding of audio.
Bio: Apoorv Vyas is a researcher on the Audiobox team at FAIR, Meta, where he focuses on developing flow-based generative models for audio and speech. He earned his PhD from EPFL, Switzerland, under the supervision of Prof. Hervé Bourlard, specializing in data- and compute-efficient speech recognition for Transformer-based models. Before that, he completed his Bachelor's in Electrical Engineering at the Indian Institute of Technology Guwahati. Prior to his PhD, Apoorv worked at Intel Labs, where he conducted research on compressed sensing for power-efficient communications.
Location: Room MR 2.02
09:30-09:40: Opening Remarks
09:40-11:00: Oral session
09:40-10:00
Title: HARP: A Large-Scale Higher-Order Ambisonic Room Impulse Response Dataset
Authors: Shivam Saini, Juergen Peissig
* The presenter, Shivam Saini, will also present their submission to the room acoustics challenge.
Challenge Submission Title: TranslatIR: Transformer based approach for RIR interpolation
Authors: Shivam Saini, Miguel Perez, Jürgen Peissig
10:00-10:20
Title: Singing Voice Accompaniment Data Augmentation with Generative Models
Authors: Miguel Perez Fernandez, Holger Kirchhoff, Peter Grosche, Xavier Serra
10:20-10:40
Title: Mind the Prompt: Prompting Strategies in Audio Generations for Improving Sound Classification
Authors: Francesca Ronchini, Ho-Hsiang Wu, Wei-Cheng Lin, Fabio Antonacci
10:40-11:00
Title: 3D Gaussian Splatting with Normal Information for Mesh Extraction and Improved Rendering
Authors: Meenakshi Krishnan, Liam Fowl, Ramani Duraiswami
11:00-11:30: Morning break
11:30-12:30: Keynote (Fabio Antonacci)
12:30-13:00: Zero-Shot TTS and PSE challenge (Jaesung Bae)
13:00-14:00: Lunch
14:00-15:00: Keynote (Nicholas J. Bryan)
15:00-15:30: Room acoustics and SDE challenge (Jackie Lin)
15:30-16:00: Afternoon break
16:00-17:00: Keynote (Apoorv Vyas)
17:00-17:30: Challenge Submissions
Title: Data Augmentation Using Neural Acoustic Fields With Retrieval-Augmented Pre-training
Authors: Christopher Ick, Gordon Wichern, Yoshiki Masuyama, François Germain, Jonathan Le Roux
The organizers also invite submissions to the personalized speech enhancement (PSE) challenge, which is designed to improve speech denoising for a particular user by personalizing the learned model. We will evaluate speech enhancement performance on test-time speakers whose identities are known only through a very short enrollment utterance (~3 seconds); submitters are encouraged to expand this utterance via generative speech synthesis techniques and use the synthesized data to improve their speech enhancement models. A paper on the concept and the baseline system can be found here.
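As a rough, non-authoritative illustration of the intended pipeline (not the official baseline), the sketch below expands a short enrollment clip into pseudo-clean speech with a zero-shot synthesis model and fine-tunes a denoiser on synthetic noisy mixtures. `zero_shot_tts`, `denoiser`, and `noise_bank` are hypothetical stand-ins for components that participants would supply.

```python
import torch

def personalize(denoiser, enrollment_clip, texts, zero_shot_tts, noise_bank,
                steps=100, lr=1e-4, snr_db=5.0):
    """Fine-tune a denoiser on pseudo speech synthesized from one ~3 s clip."""
    optimizer = torch.optim.Adam(denoiser.parameters(), lr=lr)
    # Expand the short enrollment utterance into many clean pseudo-target clips.
    clean = torch.stack([zero_shot_tts(text, enrollment_clip) for text in texts])
    for _ in range(steps):
        # Pick random noise clips and scale them to the desired SNR.
        noise = noise_bank[torch.randint(len(noise_bank), (clean.size(0),))]
        scale = clean.norm(dim=-1, keepdim=True) / (
            noise.norm(dim=-1, keepdim=True) * 10 ** (snr_db / 20) + 1e-8)
        noisy = clean + scale * noise
        # Train the denoiser to recover the pseudo-clean speech.
        loss = torch.nn.functional.l1_loss(denoiser(noisy), clean)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return denoiser
```

The quality of the resulting personalization hinges on how faithfully the synthesized speech captures the target speaker, which is the core question the challenge probes.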
More details can be found on the challenge page here.
The room acoustics synthesis challenge is part of the generative data augmentation workshop at ICASSP 2025. The challenge defines a unique generative task designed to improve the quantity and diversity of a room impulse response dataset so that it can be used for a spatially sensitive downstream task: speaker distance estimation. The challenge recognizes the technical difficulty of measuring or simulating many rooms' acoustic characteristics precisely, and instead proposes generative data augmentation as an alternative that can potentially improve various downstream tasks.
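For intuition only, here is a hedged sketch of how generated RIRs could feed the speaker-distance-estimation downstream task: convolve dry speech with RIRs drawn from a generative model and use the requested distances as regression targets. `rir_generator` and `dry_speech` are hypothetical placeholders, not parts of the challenge's official data or baseline.

```python
import torch
import torch.nn.functional as F

def make_distance_training_batch(rir_generator, dry_speech, max_dist_m=5.0):
    """Create (reverberant speech, source distance) pairs from generated RIRs."""
    batch, n_samples = dry_speech.shape
    distances = torch.rand(batch) * max_dist_m        # regression targets in meters
    reverberant = []
    for speech, dist in zip(dry_speech, distances):
        rir = rir_generator(dist.item())              # 1-D impulse response tensor
        # Convolve the dry clip with the generated RIR (flip the kernel so that
        # conv1d's cross-correlation becomes a true convolution).
        wet = F.conv1d(speech.view(1, 1, -1),
                       rir.flip(0).view(1, 1, -1),
                       padding=rir.numel() - 1)
        reverberant.append(wet.view(-1)[:n_samples])  # trim back to input length
    return torch.stack(reverberant), distances
```

A distance estimator trained on such augmented pairs then serves as the downstream probe of whether the generated RIRs add useful diversity.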
More details can be found on the challenge page here.
We invite papers for the archival track; accepted papers will be published in the IEEE Xplore Digital Library.
Submission Deadline: November 8, 2024 (extended from November 1, 2024)
Acceptance Notification: December 18, 2024
Camera Ready Paper Deadline: January 18, 2025 (extended from January 13, 2025)
(The archival track paper submission is closed.)
We also invite paper contributions to our non-archival track, where authors can bring work-in-progress or already-published work. These papers won't be considered archival publications, but they will be posted on our website as part of the technical program. Authors are also invited to present at the workshop.
Submission Deadline: March 1, 2025
Acceptance Notification: March 8, 2025
Submission URL: https://cmt3.research.microsoft.com/icassp2025/
Choose “Satellite Workshop: Generative Data Augmentation for Real-World Signal Processing Applications” from the dropdown menu.
Choose "Non-Archival Track (Won't Be Published in IEEE Xplore)" as the primary subject area.
We will follow ICASSP 2025's paper guidelines, although the papers in this track won't be published on IEEE Xplore: https://2025.ieeeicassp.org/author-kit-instructions/
(The non-archival track paper submission is closed.)
Minje Kim
UIUC/Amazon
Minje Kim is an Associate Professor in the Dept. of Computer Science at the University of Illinois at Urbana-Champaign and a Visiting Academic at Amazon Lab126. Before that, he was an associate professor at Indiana University (2016-2023). He earned his Ph.D. in Computer Science at UIUC (2016) after working as a researcher at ETRI, a national lab in Korea (2006-2011). He is a recipient of various awards, including the NSF CAREER Award (2021), the IU Trustees Teaching Award (2021), the IEEE SPS Best Paper Award (2020), and Google's and Starkey's grants for outstanding student papers at ICASSP 2013 and 2014, respectively. He is an IEEE Senior Member and has served on the IEEE AASP TC as a member (2018-2023) and as its Vice Chair (2024). He serves on editorial boards as a Senior Area Editor for IEEE/ACM TASLP and IEEE SPL, as an Associate Editor for EURASIP JASMP, and as a Consulting Associate Editor for IEEE OJSP. He has organized various workshops and academic events, such as IEEE WASPAA 2023 (general chair) and HSCMA, an ICASSP 2024 satellite workshop (organizing chair). He has published various research papers as the PI of the NSF-funded personalized speech enhancement project (https://minjekim.com/research-projects/pse/), where data augmentation is the key technology for achieving the model personalization goals.
Dinesh Manocha
University of Maryland
Dinesh Manocha is the Paul Chrisman-Iribe Professor in Computer Science & Electrical and Computer Engineering and a Distinguished University Professor at the University of Maryland, College Park. His research interests include virtual environments, physically-based modeling, and robotics. His group has developed several packages for multi-agent simulation, robot planning, and physics-based modeling that are standard in the field and licensed to more than 60 commercial vendors. He has published more than 760 papers and supervised 50 PhD dissertations. He is an inventor on 17 patents, some of which are licensed to industry. He is a Fellow of AAAI, AAAS, ACM, IEEE, and NAI, a member of the ACM SIGGRAPH and IEEE VR Academies, and a Bézier Award recipient from the Solid Modeling Association. He received the Distinguished Alumni Award from IIT Delhi and the Distinguished Career in Computer Science Award from the Washington Academy of Sciences. He was a co-founder of Impulsonic, a developer of physics-based audio simulation technologies, which Valve Inc. acquired in November 2016. He is also a co-founder of Inception Robotics, Inc. He has worked in audio, speech processing, and vision for more than two decades and has published a large number of papers at ICASSP, Interspeech, CVPR, ICCV, ECCV, etc. He has served as a program and general chair for a large number of workshops and conferences organized at major conferences.
John Hershey
Google Research
John Hershey is a researcher at Google AI Perception in Cambridge, Massachusetts, where he leads a research team in the area of speech and audio machine perception. Prior to Google, he spent seven years leading the speech and audio research team at MERL (Mitsubishi Electric Research Labs) and five years at IBM's T. J. Watson Research Center in New York, where he led a team of researchers in noise-robust speech recognition. He also spent a year as a visiting researcher in the speech group at Microsoft Research in 2004, after obtaining his Ph.D. from UCSD. Over the years, he has contributed to more than 100 publications and over 30 patents in the areas of machine perception, speech and audio processing, audio-visual machine perception, speech recognition, and natural language understanding.
Trausti Kristjansson
Amazon Lab126
Trausti Kristjansson is a Senior Research Manager of the Audio ML Team at Amazon Lab126 in Sunnyvale, California. Trausti has wide technical experience, ranging from developing super-human machine learning algorithms to building mobile and web applications, and has experience managing teams at large organizations as well as leading lean startups. He has worked at the industry's most respected research labs (Microsoft Research, IBM Research, Google Research), founded full-stack startups, and currently leads the Audio Machine Learning team at Amazon Lab126 as a Senior Research Science Manager. Trausti worked in the Speech Recognition group at Google for six years, where he led research and development on audio quality and noise robustness and contributed substantially to the real-time ASR products that Google provides and that are used by millions of people every day. At IBM Research, Trausti started and led the team that built the “IBM Super Human Speech Separation ASR system” that was featured in Scientific American Magazine. Prior to IBM, Trausti worked at Microsoft Research, where he worked on 3D scene learning algorithms, interactive methods of turning unstructured data into structured data, as well as noise robustness and speech separation. Trausti has a broad background in Machine Learning and Signal Processing applied to Speech and Vision. He holds an adjunct professorship at the University of Reykjavik, Iceland. Trausti received his Ph.D. in Computer Science from the University of Waterloo in Canada, an M.Sc. in Electrical Engineering from the University of Illinois at Urbana-Champaign in the USA, and a Cand. Sci. (B.Sc.) from the University of Iceland.
Jaesung Bae
UIUC
Jaesung Bae is the Task Captain for the Zero-Shot TTS and PSE Challenge and a PhD student at UIUC.
Jackie Lin
UIUC
Jackie Lin is the Task Captain for the Room Acoustics and Speaker Distance Estimation Challenge and a PhD student at UIUC.
Sponsored by
Amazon Lab126
Speech and Language Processing Technical Committee (IEEE Signal Processing Society)
Audio and Acoustic Signal Processing Technical Committee (IEEE Signal Processing Society)
Data Science Initiative (IEEE Signal Processing Society)