Room Acoustics and Speaker Distance Estimation Challenge

Introduction

This challenge calls for room impulse response (RIR) generation systems to augment RIR data for the downstream task of speaker distance estimation (SDE), as part of the Generative Data Augmentation workshop at ICASSP 2025. In this challenge, GenDARA (Generative Data Augmentation of Room Acoustics), participants first build RIR generation systems to augment sparse RIR data: a handful of RIRs with labeled source-receiver positions is provided, and the RIR generation system should generate RIRs at new source-receiver positions to supplement this initial set. SDE systems are then trained with the augmented RIR dataset. Through this challenge, we aim to investigate how the quality of the data generated by RIR generation systems affects SDE model performance.

Challenge tasks:

Task 1: Augmenting RIR data with an RIR generation system
Task 2: Improving a speaker distance estimation model with the augmented RIR data

We provide baseline experiments using an open-source SDE model to encourage participation and benchmark advancements.

Participation is open to all. Because the challenge aims to evaluate the effectiveness of generated RIRs for the downstream SDE problem, participants are asked to take part in both tasks rather than choosing one of the two.

The codebase and data can be found on the GenDARA GitHub. For technical details and experimental results, please refer to our paper.

Figure 1: Overview of the GenDARA 2025 Challenge tasks and workflow.

Challenge Details and Rules

Task 1: Augmenting RIR Data with RIR Generation System

In Task 1, we evaluate the participant's RIR generation system based on the RIRs it generates, although the main point of the challenge is to investigate its usefulness for the downstream task. Given a set of RIRs collected within a room, the RIR generation system should generate RIRs at new source and receiver locations. The aim is to augment sparsely collected RIR data, whose quantity tends to be limited, and then use the augmented dataset to improve the performance of downstream tasks such as speaker distance estimation, dereverberation, etc. A minimal sketch of the task interface is given below.
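
To make the input/output interface concrete, here is a minimal sketch of a naive "generator" that simply returns the enrollment RIR whose source-receiver geometry is closest to the requested one. This is only an illustrative stand-in, not a suggested solution, and the data structure and field names are assumptions rather than the challenge's actual data format.

```python
# Naive nearest-neighbor stand-in for an RIR generation system (illustration only).
# 'enrollment' is an assumed list of dicts with keys 'rir' (np.ndarray),
# 'src' (xyz position), and 'rcv' (xyz position); the real data format may differ.
import numpy as np

def nearest_neighbor_rir(enrollment, src_pos, rcv_pos):
    """Return the enrollment RIR whose source/receiver geometry is closest."""
    src_pos, rcv_pos = np.asarray(src_pos), np.asarray(rcv_pos)
    # Distance in the joint source-receiver coordinate space.
    dists = [np.linalg.norm(np.asarray(e["src"]) - src_pos)
             + np.linalg.norm(np.asarray(e["rcv"]) - rcv_pos)
             for e in enrollment]
    return enrollment[int(np.argmin(dists))]["rir"]
```

A real submission would replace this lookup with an actual generative or interpolation model; the point here is only the mapping from (enrollment RIRs, new source position, new receiver position) to a generated RIR.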

Enrollment Data Description

We provide data on 20 different rooms. Rooms 1-10 are simulated using Treble Technologies' wave-based simulator. Rooms 11-20 are sampled from the GWA dataset [1], which was simulated with a hybrid wave-based and geometrical-acoustics simulator.

For each of Rooms 1-10, we provide:

For each of Rooms 11-20, we provide:

Additionally, we provide a control set of RIRs in Room_0 (enrollment_data/Room_0_data) so that participants can calibrate their systems on RIRs recorded in a real room as necessary. Room_0 is a physical room at Treble's offices with variable wall absorption and furniture layout. 20 single-channel RIRs were measured in Room_0, and their simulated counterparts (same source and receiver positions), generated with Treble's RIR simulator, are provided. A grid of virtual receivers was also simulated, and those simulated RIRs are provided as well:
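
As a calibration example, a participant might compare each measured Room_0 RIR with its simulated counterpart, for instance to estimate a broadband level offset between measurement and simulation. The sketch below assumes a hypothetical file layout with paired WAV files under enrollment_data/Room_0_data; the actual directory structure and file names may differ.

```python
# Rough calibration check (hypothetical file layout): compare the broadband energy
# of each measured Room_0 RIR with its simulated counterpart.
import glob
import numpy as np
import soundfile as sf

measured = sorted(glob.glob("enrollment_data/Room_0_data/measured/*.wav"))    # assumed path
simulated = sorted(glob.glob("enrollment_data/Room_0_data/simulated/*.wav"))  # assumed path

offsets_db = []
for m_path, s_path in zip(measured, simulated):
    m, _ = sf.read(m_path)
    s, _ = sf.read(s_path)
    # Energy ratio in dB between the measured and simulated RIR of the same pair.
    offsets_db.append(10 * np.log10(np.sum(m**2) / np.sum(s**2)))

print(f"Mean measured-vs-simulated level offset: {np.mean(offsets_db):.2f} dB")
```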

Evaluation 1

As mentioned above, the participant's RIR generation system is evaluated by the quality of the generated RIRs. The room, source, and receiver positions of the requested submission RIRs are found here. The generated RIRs will be evaluated on the similarity of their reverberation time (T60), direct-to-reverberant ratio (DRR), and energy decay function (EDF) to those of the withheld reference RIRs. The RIR evaluation we will perform is demonstrated on Room_0 in this Jupyter notebook; a rough sketch of these quantities is given below.
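
The official metric implementations are defined by the evaluation notebook above. Purely as a reference, the three quantities can be approximated from a single-channel RIR along the lines of the sketch below, which uses Schroeder backward integration for the EDF, a T20-style line fit for T60, and a 2.5 ms direct-sound window for DRR; the window length and fit range are assumptions and may differ from the official evaluation.

```python
# Approximate EDF, T60, and DRR from a single-channel RIR (sketch, not the official metrics).
import numpy as np

def edf_db(rir):
    """Schroeder energy decay function in dB, normalized to 0 dB at t = 0."""
    edc = np.cumsum(rir[::-1] ** 2)[::-1]  # backward-integrated energy
    return 10 * np.log10(edc / edc[0] + 1e-12)

def t60_from_edf(edf, fs):
    """T60 via a line fit on the -5 dB to -25 dB decay range, extrapolated to -60 dB."""
    i5 = np.argmax(edf <= -5.0)
    i25 = np.argmax(edf <= -25.0)
    t = np.arange(len(edf)) / fs
    slope, _ = np.polyfit(t[i5:i25], edf[i5:i25], 1)
    return -60.0 / slope

def drr_db(rir, fs, direct_ms=2.5):
    """Direct-to-reverberant ratio with a +/- 2.5 ms window around the direct-path peak."""
    peak = int(np.argmax(np.abs(rir)))
    win = int(direct_ms * 1e-3 * fs)
    direct = np.sum(rir[max(0, peak - win):peak + win] ** 2)
    reverb = np.sum(rir[peak + win:] ** 2)
    return 10 * np.log10(direct / (reverb + 1e-12))
```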

Task 2: Improving Speaker Distance Estimation Model with Augmented RIR Data

In Task 2, we ask participants to improve a speaker distance estimation (SDE) model by fine-tuning a pre-trained baseline SDE model on the augmented RIR dataset they generated from the sparse enrollment data in Task 1. We then evaluate the performance of the fine-tuned SDE model to measure the effectiveness of the augmented RIR dataset.
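
One plausible way to turn the augmented RIRs into SDE training material (not prescribed by the challenge) is to convolve dry speech with each generated RIR and label the result with the Euclidean source-receiver distance, as in the sketch below; the function and variable names are illustrative.

```python
# Build one (reverberant speech, distance) training pair from a dry utterance and a
# generated RIR with known source/receiver positions (illustrative sketch).
import numpy as np
from scipy.signal import fftconvolve

def make_training_example(dry_speech, rir, src_pos, rcv_pos):
    reverberant = fftconvolve(dry_speech, rir, mode="full")[: len(dry_speech)]
    reverberant /= np.max(np.abs(reverberant)) + 1e-12  # peak-normalize to avoid clipping
    distance_m = float(np.linalg.norm(np.asarray(src_pos) - np.asarray(rcv_pos)))
    return reverberant, distance_m
```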

Baseline SDE System

We retrain a state-of-the-art speaker distance estimation model [2] on the C4DM room impulse response dataset [3] and the VCTK speech dataset [4]. The baseline is therefore optimized for the C4DM and VCTK combination and has no knowledge of the target rooms of the challenge, yet it provides a reasonable starting point. We release it as open source; the baseline SDE system code, checkpoint, and training script are found here.

Participants are to fine-tune the provided baseline SDE system. 
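
In practice, the released training script and checkpoint should be used. Purely as an illustration of the fine-tuning recipe, a generic PyTorch loop might look like the sketch below; the stand-in network, random tensors, learning rate, and file names are placeholders, not the baseline's actual components.

```python
# Generic fine-tuning loop sketch with stand-in model and data (placeholders only;
# substitute the baseline SDE model, its checkpoint, and a loader over the
# augmented-RIR reverberant speech in a real run).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Flatten(), nn.Linear(16000, 1))  # stand-in for the baseline SDE model
waveforms = torch.randn(8, 1, 16000)                      # stand-in 1 s clips at 16 kHz
distances = torch.rand(8, 1) * 5.0                        # stand-in distance labels in meters
loader = DataLoader(TensorDataset(waveforms, distances), batch_size=4)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # small learning rate for fine-tuning
loss_fn = nn.L1Loss()                                      # mean absolute distance error

model.train()
for epoch in range(3):
    for audio, dist in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(audio), dist)
        loss.backward()
        optimizer.step()

torch.save(model.state_dict(), "finetuned_checkpoint.pt")  # illustrative output path
```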

Evaluation 2

The participant's fine-tuned SDE system must estimate the speaker distance for a test set of 480 reverberant speech clips. The provided baseline SDE system's estimates for the test audio are in this .csv file. Participants are asked to submit a .csv file containing their updated distance estimates in meters. To generate the submission .csv, run this Jupyter notebook after replacing the baseline checkpoint path with the participant's fine-tuned checkpoint path.
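
For illustration only, writing such a .csv could look like the snippet below; the two-column layout used here (file name, estimated distance in meters) is an assumption, and the provided baseline .csv and notebook define the authoritative format.

```python
# Write distance estimates to a submission .csv (assumed column layout; check the
# provided baseline .csv for the exact format).
import csv

def write_submission(estimates, out_path="submission.csv"):
    """estimates: list of (audio_file_name, distance_in_meters) tuples."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "distance_m"])  # assumed header
        for name, dist in estimates:
            writer.writerow([name, f"{dist:.3f}"])

write_submission([("test_0001.wav", 2.314), ("test_0002.wav", 4.087)])  # hypothetical entries
```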

If a participant estimates distances using their custom SDE model as described in the Task 2 Bonus, they must submit an additional .csv file.

The submitted distance estimates will be evaluated on

Submission Instructions

We use ICASSP 2025’s submission system on CMT.

Paper Guideline

Participants are asked to submit a two- to four-page summary of their system detailing the following information:

Important Dates

Cite Our Work

If you’d like to refer to the challenge or use it in your research, please cite our paper:

BibTeX:

@inproceedings{GenDA2025_RoomAcoustics,
  title={Generative Data Augmentation Challenge: Synthesis of Room Acoustics for Speaker Distance Estimation},
  author={Jackie Lin and Georg G\"otz and Hermes Sampedro Llopis and Haukur Hafsteinsson and Steinar Gu{\dh}j\'onsson and Daniel Gert Nielsen and Finnur Pind and Paris Smaragdis and Dinesh Manocha and John Hershey and Trausti Kristjansson and Minje Kim},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing Workshops (ICASSPW)},
  year={2025}
}

References

[1] Z. Tang, R. Aralikatti, A. Ratnarajah, and D. Manocha, “GWA: A large geometric-wave acoustic dataset for audio processing,” in Special Interest Group on Computer Graphics and Interactive Techniques Conference Proceedings (SIGGRAPH '22), 2022. https://doi.org/10.1145/3528233.3530731

[2] M. Neri, A. Politis, D. Krause, M. Carli, and T. Virtanen, “Speaker distance estimation in enclosures from single-channel audio,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024.

[3] R. Stewart and M. Sandler, “Database of omnidirectional and B-format room impulse responses,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168, 2010.

[4] J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” Nov. 2019.

Contacts