Audiocards

Left: We propose Audiocards, structured metadata that describes an audio file with attributes relevant to sound designers. We prompt an LLM with the available text metadata and audio descriptors to generate an audiocard, which can be used for text-based search and to train audio-language models.
Right: Audiocard generated by our Whisper-cards audio captioner from input audio without text metadata.

Abstract

Sound designers search for sounds in large sound effects libraries using aspects such as sound class or visual context. However, the metadata needed for such search is often missing or incomplete, and adding it requires significant manual effort. Existing approaches that automate this task, either by generating metadata (captioning) or by searching with learned embeddings (text-audio retrieval), are not trained on metadata with the structure and information pertinent to sound design. To this end, we propose audiocards: structured metadata grounded in acoustic attributes and sonic descriptors, generated by exploiting the world knowledge of LLMs. We show that training on audiocards improves downstream text-audio retrieval, descriptive captioning, and metadata generation on professional sound effects libraries. Audiocards also improve performance on general audio captioning and retrieval over the baseline single-sentence captioning approach. We release a curated dataset of sound effects audiocards to invite further research in audio language modeling for sound design.

Key contributions

1) We propose audiocards, structured descriptions of audio files with fields consisting of attributes relevant to sound design.

2) We show that we can generate audiocards from audio without accompanying metadata better than state-of-the-art large audio language models (LALMs).

3) We show that training on audiocards leads to significantly better captions for professional sound effects libraries compared to both a baseline approach and state-of-the-art LALMs.

4) Training on audiocards improves embedding-based text-audio retrieval when using human-annotated captions as queries on a professional sound effects library.
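As an illustration of how embedding-based text-audio retrieval (contribution 4) is typically scored, here is a minimal sketch that assumes recall@k over cosine similarity. This page does not state which metric or embedding model was used, so the function names and the pairing convention (query `i` belongs to audio file `i`) are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def recall_at_k(text_emb, audio_emb, k=10):
    """Fraction of text queries whose paired audio ranks in the top k.

    Assumption: text_emb[i] is the caption embedding for audio_emb[i].
    """
    hits = 0
    for i, query in enumerate(text_emb):
        scores = [cosine(query, audio) for audio in audio_emb]
        # Zero-based rank of the correct item: how many files score strictly higher.
        rank = sum(1 for s in scores if s > scores[i])
        if rank < k:
            hits += 1
    return hits / len(text_emb)
```

In practice the embeddings would come from the text and audio encoders under evaluation; the sketch only covers the ranking step.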

Dataset release: ASFx eval audiocards

ASFx eval is an evaluation subset of Adobe Audition Sound Effects. It consists of 500 audiocards generated from text metadata using Pixtral-12B-2409, each manually verified to describe its audio file accurately, without incoherence or hallucination issues.

To evaluate audio language models on tasks such as text-audio retrieval, audio captioning, and structured metadata generation, use the filename column in the audiocards CSV to pair each audiocard with its corresponding audio file from Adobe Audition Sound Effects.
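The pairing step could be sketched as follows. This is a minimal example under stated assumptions: the column header is literally `filename`, the audio files sit somewhere under a local directory, and base names are unique within it. The function name and the paths in the usage note are hypothetical, not part of the release.

```python
import csv
from pathlib import Path


def pair_audiocards(csv_path, audio_root, filename_column="filename"):
    """Match audiocard rows to audio files by base name.

    Returns (pairs, missing): pairs is a list of (audio_path, row_dict),
    missing lists the filenames that were not found under audio_root.
    """
    # Index every file under audio_root by its base name (assumed unique).
    index = {p.name: p for p in Path(audio_root).rglob("*") if p.is_file()}
    pairs, missing = [], []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            name = Path(row[filename_column]).name
            if name in index:
                pairs.append((index[name], row))
            else:
                missing.append(name)
    return pairs, missing
```

A call such as `pair_audiocards("audiocards.csv", "AuditionSoundEffects/")` (both paths hypothetical) returns the matched pairs along with any filenames that could not be located, which makes an incomplete local copy of the library easy to spot before running an evaluation.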

The bar chart shows the distribution of ASFx eval across Audition Sound Effects categories.