To the best of our knowledge, there are no existing benchmarks for evaluating LARCQ solutions. To fill this gap, we introduce two new benchmarks.
Clotho-LARCQ Benchmark
In this benchmark, we synthesize long audios by randomly concatenating five audios from the Clotho test split, yielding audios between 75 and 150 seconds. For each constituent audio, one of Clotho's five captions is selected, condensed, and transformed into a natural query with the Mixtral text LLM. The five resulting queries are then concatenated into a single complex query spanning multiple sound events. In total, we curate 1,000 long audios, whose queries are manually checked for correctness.
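The construction above can be sketched as follows. This is an illustrative reimplementation under our own assumptions, not the authors' pipeline: all function and field names are hypothetical, and the LLM-based query condensation step is assumed to have already produced one query per clip.

```python
import random

def build_larcq_example(clips, queries, n=5, min_len=75.0, max_len=150.0, rng=None):
    """Sample n clips whose total duration falls in [min_len, max_len] seconds,
    then concatenate both the audio and the per-clip queries.

    clips:   list of (duration_seconds, samples) pairs  (hypothetical format)
    queries: one condensed natural-language query per clip
    """
    rng = rng or random.Random()
    indices = list(range(len(clips)))
    while True:
        chosen = rng.sample(indices, n)
        total = sum(clips[i][0] for i in chosen)
        if min_len <= total <= max_len:
            break  # duration constraint satisfied
    # Concatenate the raw samples in the sampled order.
    long_audio = [s for i in chosen for s in clips[i][1]]
    # A single complex query spanning all concatenated sound events.
    complex_query = " ".join(queries[i] for i in chosen)
    return long_audio, complex_query, chosen
```

Rejection sampling on the total duration is one simple way to enforce the 75-150 second range; the paper does not specify how the constraint is met.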
SoundDescs-LARCQ Benchmark
While Clotho-LARCQ provides a semi-synthetic setting in which complex queries correspond to segments of long audios, we also introduce a more natural benchmark. The well-known SoundDescs dataset covers a wide range of audio durations, but some audios are minutes long yet carry only a three-word caption. To match the distribution of Clotho-LARCQ, we filter for audios between 75 and 150 seconds whose captions exceed 150 characters, treating these captions as complex queries. This results in 1,639 audio-query pairs, forming our SoundDescs-LARCQ benchmark.
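The filtering step can be expressed as a minimal sketch. The record layout (`duration` in seconds, `caption` as a string) is an assumption for illustration, not the actual SoundDescs schema:

```python
def filter_sounddescs(entries, min_dur=75.0, max_dur=150.0, min_caption_chars=150):
    """Keep audio-caption pairs whose duration lies in [min_dur, max_dur]
    and whose caption exceeds min_caption_chars characters.

    entries: list of dicts with 'duration' and 'caption' keys
             (hypothetical field names).
    """
    return [
        e for e in entries
        if min_dur <= e["duration"] <= max_dur
        and len(e["caption"]) > min_caption_chars
    ]
```

The caption-length threshold is what separates usable complex queries from the very short captions noted above.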