Room Acoustics and Speaker Distance Estimation Challenge
Introduction
This challenge calls for RIR generation systems to augment RIR data for a downstream task, speaker distance estimation (SDE), as part of the Generative Data Augmentation workshop at ICASSP 2025. In this challenge, GenDARA (Generative Data Augmentation of Room Acoustics), participants are first tasked with building RIR generation systems to augment sparse RIR data: a handful of RIRs with labeled source-receiver positions is provided to participants, and their RIR generation system should generate RIRs at new source-receiver positions to supplement the initial handful of RIRs. Subsequently, SDE systems are to be trained with this augmented RIR dataset. Through this challenge, we aim to investigate how the quality of the augmented data produced by RIR generation systems affects SDE model performance.
Challenge tasks:
Task 1: Augmenting RIR Data with RIR Generation System
Task 2: Improving Speaker Distance Estimation Model with Augmented RIR Data
We provide baseline experiments using an open-source SDE model to encourage participation and benchmark advancements.
Participation is open to all. Participants are asked to complete both tasks rather than choosing one of the two, since our challenge aims to evaluate the effectiveness of generated RIRs for the downstream SDE problem.
The codebase and data can be found on the GenDARA GitHub repository. For technical details and experimental results, please refer to our paper.
Figure 1: Overview of the GenDARA 2025 Challenge tasks and workflow.
Challenge Details and Rules
Task 1: Augmenting RIR Data with RIR Generation System
In Task 1, we evaluate the participant's RIR generation system based on the RIRs it generates, although the main point of the challenge is to investigate its usefulness for the downstream task. The goal of the RIR generation system is to generate RIRs at new source and receiver locations from a given set of RIRs collected within a room. This augments sparsely collected RIR data, whose quantity tends to be limited, so that the augmented dataset can help improve the performance of downstream tasks such as speaker distance estimation, dereverberation, etc.
To evaluate the RIR generation system performance, participants are to generate the RIRs at specified source-receiver locations in 20 rooms.
We are most interested in generating RIRs from only a few (5) RIRs captured per room, which mimics a real-world user scenario. Thus, we encourage participants to use only the subset of single-channel RIRs in the provided enrollment data.
However, we recognize that there exist many scenarios where RIRs are synthesized from other types of real-world enrollment data, for example: a 3D scan of the room, images, higher-order Ambisonics RIRs (HOA-RIRs), binaural RIRs, etc.
To encourage participants who work on RIR generation systems that take these modalities as input, we also provide and allow the use of 3D models and HOA-RIRs from half of the rooms as additional enrollment data.
We will categorize submissions based on the subset of enrollment data used in the participant's RIR generation system.
Enrollment Data Description
We provide data on 20 different rooms. Rooms 1-10 are simulated using Treble Technologies' wave-based simulator. Rooms 11-20 are sampled from the GWA dataset [1], which is simulated with a hybrid wave-based and geometrical-acoustics simulator.
For Rooms 1-10, for each room we provide:
5 single-channel RIRs
5 8th-order HOA RIRs
labeled source + receiver positions
3D model of the room with furniture
For Rooms 11-20, for each room we provide:
5+ single-channel RIRs
labeled source + receiver positions
Additionally, we provide a control set of RIRs in Room_0 (enrollment_data/Room_0_data) so that participants can calibrate their systems on RIRs recorded in a real room as necessary. Room_0 is a physical room at Treble's offices with variable wall absorption and furniture layout. 20 single-channel RIRs were measured in Room_0, and their simulated counterparts (same source and receiver positions), generated with Treble's RIR simulator, are provided. A grid of virtual receivers was also simulated, and those simulated RIRs are provided:
20 measured single-channel RIRs
20 simulated single-channel & 8th-order HOA RIRs at measurement positions
405 simulated single-channel & 8th-order HOA RIRs at grid positions
labeled source + receiver positions
3D model of the room with furniture
Evaluation 1
As mentioned above, the participant's RIR generation system is evaluated by the quality of the generated RIRs. The room, source, and receiver positions of the requested submission RIRs are found here. The RIRs will be evaluated on their T60, DRR, and EDF similarity to the withheld reference RIRs. The RIR evaluation we will perform is demonstrated on Room_0 in this Jupyter notebook.
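For intuition about these acoustic metrics, the sketch below shows one common way to estimate T60 (via Schroeder backward integration with a T30 line fit) and DRR from a single-channel RIR. This is an illustrative implementation under our own assumptions (e.g., the -5 to -35 dB fitting range and a 2.5 ms direct-sound window), not the official evaluation code; the notebook linked above defines the metrics actually used.

```python
import numpy as np

def schroeder_edc_db(rir):
    """Energy decay curve via Schroeder backward integration, in dB."""
    energy = np.cumsum(rir[::-1] ** 2)[::-1]
    return 10.0 * np.log10(energy / energy[0] + 1e-12)

def estimate_t60(rir, fs):
    """T60 via T30 extrapolation: fit a line to the -5 to -35 dB span
    of the EDC and extrapolate its slope to a 60 dB decay."""
    edc = schroeder_edc_db(rir)
    t = np.arange(len(rir)) / fs
    mask = (edc <= -5.0) & (edc >= -35.0)
    slope, _ = np.polyfit(t[mask], edc[mask], 1)  # dB per second (negative)
    return -60.0 / slope

def estimate_drr_db(rir, fs, direct_ms=2.5):
    """Direct-to-reverberant ratio: energy in a short window around the
    strongest peak vs. everything after that window, in dB."""
    peak = int(np.argmax(np.abs(rir)))
    w = int(direct_ms * 1e-3 * fs)
    direct = np.sum(rir[max(0, peak - w):peak + w + 1] ** 2)
    reverb = np.sum(rir[peak + w + 1:] ** 2)
    return 10.0 * np.log10(direct / (reverb + 1e-12))
```

On a synthetic exponentially decaying noise RIR with a known decay rate, `estimate_t60` recovers the target reverberation time to within a few percent, which is a quick sanity check before running a generation system against the real metrics.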
Task 2: Improving Speaker Distance Estimation Model with Augmented RIR Data
In Task 2, we ask participants to improve a speaker distance estimation (SDE) model by fine-tuning a pre-trained baseline SDE model using their augmented RIR dataset generated from the sparse enrollment data in Task 1. Then we will evaluate the performance of their fine-tuned SDE model to measure the effectiveness of the augmented RIR dataset.
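As a concrete illustration of how generated RIRs can feed SDE fine-tuning, the sketch below convolves dry speech with a generated RIR and derives the distance label from the labeled source-receiver positions. The function name and the use of plain NumPy arrays are our own assumptions for illustration; the provided training script defines the actual data pipeline.

```python
import numpy as np

def make_training_pair(dry_speech, rir, src_pos, rcv_pos):
    """Convolve dry speech with a generated RIR to obtain reverberant
    speech, and attach the Euclidean source-receiver distance in meters
    as the regression label for SDE fine-tuning."""
    reverberant = np.convolve(dry_speech, rir, mode="full")[:len(dry_speech)]
    distance_m = float(np.linalg.norm(
        np.asarray(src_pos, float) - np.asarray(rcv_pos, float)))
    return reverberant, distance_m
```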
Participants are NOT allowed to use any RIR data that was not generated from their RIR generation systems during fine-tuning. Again, this challenge is to evaluate the quality of the RIR generation system based on the generated dataset's usefulness in improving SDE, so it would be uninformative if the SDE model were improved with other RIR data.
That being said, there is no limit on the number of generated RIRs that participants can add to their augmented dataset. The point is that participants may build whatever augmented RIR dataset best improves SDE performance, and can thus generate more RIRs than requested in the Task 1 evaluation.
For a fair comparison, participants are first requested to estimate speaker distance with the baseline SDE model architecture that we provide, fine-tuned on the participants' generated RIRs. This ensures that the evaluation and comparison are based on the quality of the synthesized data, not the model architecture.
We provide the pre-trained baseline SDE system, described below, that participants can fine-tune using the augmented RIR dataset. Additionally, the original training script is provided for participants to use to fine-tune the baseline or as a reference.
BONUS: We encourage participants to develop their own SDE systems with custom architecture. If participants choose this option, they must still complete the fine-tuning on the provided baseline SDE model.
Baseline SDE System
We retrained a state-of-the-art (SoTA) speaker distance estimation model [2] on the C4DM room impulse response dataset [3] and the VCTK speech dataset [4]. Hence, the baseline is optimized for the C4DM and VCTK combination without any knowledge of the challenge's target rooms; nevertheless, it provides a reasonable starting point. We release it as open source. The baseline SDE system code, checkpoint, and training script are found here.
Participants are to fine-tune the provided baseline SDE system.
Evaluation 2
The participant's fine-tuned SDE system must estimate the speaker distance for a test set of 480 reverberant speech clips. The provided baseline SDE system's estimates for the test audio are in this .csv file. Participants are asked to submit a .csv file containing their updated distance estimates in meters. To generate the submission .csv, run this Jupyter notebook after replacing the baseline checkpoint path with the participant's fine-tuned checkpoint path.
If the participant is estimating distances using their custom SDE model as described in Task 2 Bonus, they must submit an additional .csv file.
The submitted distance estimates will be evaluated on:
Absolute Distance Error [m]
Percentage Distance Error [%]
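One plausible reading of these two metrics, assuming both are computed as means over the test set (the organizers' exact definitions, e.g. mean vs. median, may differ):

```python
import numpy as np

def distance_errors(est_m, ref_m):
    """Absolute distance error [m] and percentage distance error [%],
    both averaged over the test set (an assumed definition)."""
    est = np.asarray(est_m, dtype=float)
    ref = np.asarray(ref_m, dtype=float)
    abs_err = np.mean(np.abs(est - ref))                # meters
    pct_err = 100.0 * np.mean(np.abs(est - ref) / ref)  # percent of true distance
    return abs_err, pct_err
```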
Submission Instructions
We use ICASSP 2025’s submission system on CMT.
Login as an “Author” at https://cmt3.research.microsoft.com/ICASSP2025/
Choose “+Create new submission…” menu on the top left
Choose the workshop “Satellite Workshop: Generative Data Augmentation for Real-World Signal Processing Applications”
Fill out the author form and choose “Challenge: Room Acoustics and Speaker Distance Estimation” as the primary subject area
Participants are required to submit a two-page report (details are found in the next section).
After submitting the report PDF, you will see in the author console that your submission has been created. In the rightmost column of your submission, you can upload the "supplementary material," which must contain all the zipped submission files.
As described in the challenge details, participants are expected to submit 102 wav files for Task 1 and a single .csv file with 480 distance estimates for Task 2.
Please follow the directory format below. Thank you.
Paper Guideline
Participants are asked to submit a summary of their system, at minimum two pages and not exceeding four, detailing the following information:
Technical details of the RIR generation systems they developed.
Training data used to develop the RIR generation systems, including any copyright and ethics-related issues or approvals
A description of the subset of enrollment data that was used to generate the augmented RIR dataset.
Technical details on the generated augmented RIR dataset.
Technical details on the fine-tuning protocol of the baseline SDE system.
(Optional) Any results on the third-party evaluation data they used to validate the SDE model
(Bonus) Description of the new SDE architecture they propose.
Important Dates
Dec. 23, 2024: Submission system open
March 12, 2025: Deadline to submit the participating system (a two-page summary and submission files)
Challenge results will be posted on the website in early April 2025 (before the beginning of the conference)
Cite Our Work
If you’d like to refer to the challenge or use it in your research, please cite our paper:
Jackie Lin, Georg Götz, Hermes Sampedro Llopis, Haukur Hafsteinsson, Steinar Guðjónsson, Daniel Gert Nielsen, Finnur Pind, Paris Smaragdis, Dinesh Manocha, John Hershey, Trausti Kristjansson, and Minje Kim, “Generative Data Augmentation Challenge: Synthesis of Room Acoustics for Speaker Distance Estimation,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW): Generative Data Augmentation for Real-World Signal Processing Applications (GenDA 2025), Hyderabad, India, Apr. 6-11, 2025. [PDF]
BibTeX:
@inproceedings{GenDA2025_RoomAcoustics,
title={Generative Data Augmentation Challenge: Synthesis of Room Acoustics for Speaker Distance Estimation},
author={Jackie Lin and Georg G\"otz and Hermes Sampedro Llopis and Haukur Hafsteinsson and Steinar Gu{\dh}j\'onsson and Daniel Gert Nielsen and Finnur Pind and Paris Smaragdis and Dinesh Manocha and John Hershey and Trausti Kristjansson and Minje Kim},
booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing Workshops (ICASSPW)},
year={2025}
}
References
[1] Z. Tang, R. Aralikatti, A. Ratnarajah, and D. Manocha, “GWA: A large geometric-wave acoustic dataset for audio processing,” in Special Interest Group on Computer Graphics and Interactive Techniques Conference Proceedings (SIGGRAPH '22), 2022. https://doi.org/10.1145/3528233.3530731
[2] M. Neri, A. Politis, D. Krause, M. Carli, and T. Virtanen, “Speaker distance estimation in enclosures from single-channel audio,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024.
[3] R. Stewart and M. Sandler, “Database of omnidirectional and B-format room impulse responses,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 165–168, IEEE, 2010.
[4] J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” Nov. 2019.
Contacts
Jackie Lin (jackiel4@illinois.edu)
Minje Kim (minje@illinois.edu)