Dataset

Document Collection

Participants must use the IndicMARCO collection as the document set. IndicMARCO is a translated version of the MSMARCO dataset that covers 11 Indian languages. This year, participants must use the Gujarati, Hindi, Bengali, and Kannada language collections.

Participants can access the IndicMARCO collection from here.

Query Data

The Spoken Query dataset was created from the TREC DL 2019/2020 English queries. These queries were translated into Indian languages and recorded with Audacity by native male and female speakers proficient in English, Gujarati, Hindi, Bengali, and Kannada. The recordings are intended to reflect natural language usage and dialectal nuances, providing a foundation for developing and evaluating retrieval systems.

Spoken Query File Details:

Each language-specific folder contains separate subfolders for male and female speakers, each holding spoken query recordings. The files are named using the format id.wav, where id corresponds to the unique identifier of the query.

Example: For a query with id 123, the recording is named 123.wav in the corresponding speaker subfolder.
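The naming scheme above can be sketched in Python as follows. This is a minimal illustration, not part of the official release: the root directory and the folder names ("hindi", "female") are assumptions, so adjust them to match the names in the downloaded dataset.

```python
from pathlib import Path


def spoken_query_path(root, language, speaker, query_id):
    """Build the expected path <root>/<language>/<speaker>/<id>.wav.

    The folder names passed in are hypothetical; use the names that
    actually appear in the released dataset.
    """
    return Path(root) / language / speaker / f"{query_id}.wav"


def list_query_ids(root, language, speaker):
    """Return the query ids found in one speaker subfolder, sorted as strings."""
    folder = Path(root) / language / speaker
    return sorted(p.stem for p in folder.glob("*.wav"))


# Hypothetical layout: data/hindi/female/123.wav
path = spoken_query_path("data", "hindi", "female", 123)
print(path.name)
```

Because the file name carries only the query id, the language and speaker have to be recovered from the folder path rather than from the file itself.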

Participants can access the text and spoken queries from here.

Qrel File

Qrel files contain human judgments of how relevant documents are to queries. These files are essential for evaluating the performance of IR systems. Each line of a Qrel file has four whitespace-separated fields: Query, ITERATION, DOCUMENT#, and RELEVANCY. Query is the identifier of the search query. ITERATION is a placeholder that is usually ignored (it appears as Q0 in the sample below). DOCUMENT# is the unique identifier of the document. RELEVANCY indicates whether the document is judged relevant (1) or not relevant (0) to the query.

Sample Qrels File:

19335 Q0 7007720 0
19335 Q0 7122355 1
19335 Q0 712804 0
19335 Q0 712806 0
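As a minimal sketch of how the four-field format can be read, the Python snippet below parses the sample lines into a nested dictionary. The function name parse_qrels is illustrative and not part of any official tooling; the snippet assumes whitespace-separated fields and skips blank or malformed lines.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse qrel lines into {query_id: {doc_id: relevancy}}.

    Each line is expected to hold four whitespace-separated fields:
    Query, ITERATION, DOCUMENT#, RELEVANCY.
    """
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, doc_id, relevancy = parts
        qrels[query_id][doc_id] = int(relevancy)
    return dict(qrels)


sample = """
19335 Q0 7007720 0
19335 Q0 7122355 1
19335 Q0 712804 0
19335 Q0 712806 0
"""

qrels = parse_qrels(sample.splitlines())

# Documents judged relevant for query 19335
relevant = [doc for doc, rel in qrels["19335"].items() if rel > 0]
print(relevant)
```

Identifiers are kept as strings so that they match the ids in the query and document files exactly, without any numeric conversion.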

The qrel file released for the TREC DL 2019/2020 English queries is used for evaluation across all languages.

Participants can access the text and spoken query qrels from here.

Note: This year, we are releasing text and spoken queries along with their corresponding qrels. For the document corpus, participants are advised to use the IndicMARCO collection for Hindi, Gujarati, Bengali, and Kannada documents, and the original MSMARCO collection for English documents.

The decryption key for the FIRE spoken query dataset can be obtained by registering for the task.
