Problem Formulation
Our problem falls within the general domain of text-to-audio retrieval, with a focus on long audio retrieval with complex queries (LARCQ). Given a complex text query that specifies multiple events and a large datastore of long audio files, we propose an end-to-end pipeline that efficiently retrieves the audio file that best matches the text query.
End-to-end Pipeline
Our proposed pipeline for LARCQ consists of four steps: Steps 1 and 2 perform multi-modal retrieval, and Steps 3 and 4 perform ALM/LLM refining. In Steps 1 and 2, chunking is applied to both the text query and each audio file in the large audio datastore. After computing the cosine similarity scores and aggregating them into a final score for each audio file, we select the 5 audio files with the highest final scores. In Steps 3 and 4, we combine ALM captioning with text LLM/text classifier re-ranking, using designed prompts, to select the final result (``audio 98'' in this example) from the 5 audio files retrieved in Steps 1 and 2.
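To make the retrieval stage (Steps 1 and 2) concrete, the following is a minimal sketch of chunk-level cosine scoring followed by top-5 selection. It assumes that the query chunks and audio chunks have already been embedded into a shared space by some text/audio encoder, which is not shown. The aggregation rule used here (for each query chunk, take the best-matching audio chunk, then average over query chunks) is an illustrative assumption, since the text above does not specify how the final score is aggregated. The names `cosine`, `aggregate_score`, and `retrieve_top_k` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def aggregate_score(query_chunks: List[Vector], audio_chunks: List[Vector]) -> float:
    """Final score for one audio file.

    ASSUMPTION (not from the source): each query chunk is matched to its
    best-scoring audio chunk, and these maxima are averaged.
    """
    if not query_chunks or not audio_chunks:
        return 0.0
    best = [max(cosine(q, a) for a in audio_chunks) for q in query_chunks]
    return sum(best) / len(best)


def retrieve_top_k(
    query_chunks: List[Vector],
    datastore: Dict[str, List[Vector]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every audio file in the datastore and return the k highest-scoring ones."""
    scored = [
        (audio_id, aggregate_score(query_chunks, chunks))
        for audio_id, chunks in datastore.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With toy 2-dimensional embeddings, a query of two chunks `[[1.0, 0.0], [0.0, 1.0]]` ranks an audio file containing both matching chunks above one containing only the first. The candidates returned by `retrieve_top_k` would then be passed to the ALM/LLM refining stage (Steps 3 and 4).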