Dataset

In this inaugural year, we will use the FIRE corpus, a diverse dataset covering several languages used in India, including English, Hindi, Bengali, Gujarati, and Marathi. The corpus is sourced from established publications, including Anandabazar Patrika for Bengali, Gujarat Samachar for Gujarati, Indiatimes and Dainik Jagran for Hindi, and The Telegraph for English. It provides a foundation for developing and evaluating retrieval systems across a wide range of documents.

Participants can access the FIRE Collection from here.

FIRE Spoken Data

Dataset Description

The FIRE Spoken Query data was created from the FIRE dataset, with queries spoken by native speakers proficient in English, Gujarati, Hindi, and Bengali. As a result, the spoken queries reflect natural language use and dialectal variation in these languages.

Spoken Query File Details:

Each language folder contains the spoken queries recorded by the speakers. Each query file is named using the format language_id.wav, where language is the language code and id is the query identifier.

Example: for the query with id 123 spoken in Gujarati, the file name would be gu_123.wav.
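The naming convention can be checked programmatically. The following is a minimal sketch, not an official loader: the function name parse_query_filename and the regular expression are illustrative assumptions, and only the code gu (Gujarati) is confirmed by the example above. Codes for the other languages are assumed to follow the same two-letter pattern.

```python
import re
from pathlib import Path

# Assumed pattern: two-letter language code, underscore, numeric query id, ".wav".
# Only "gu" appears in the task description; other codes are an assumption.
QUERY_FILE_RE = re.compile(r"^(?P<lang>[a-z]{2})_(?P<qid>\d+)\.wav$")


def parse_query_filename(path):
    """Return (language_code, query_id) for a spoken-query file path."""
    name = Path(path).name  # drop any folder prefix
    match = QUERY_FILE_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected spoken-query file name: {name!r}")
    # The query id is kept as a string so it can be matched against Qrel lines.
    return match.group("lang"), match.group("qid")


lang, qid = parse_query_filename("Gujarati/gu_123.wav")
print(lang, qid)
```

Keeping the query id as a string avoids losing leading zeros, should any identifiers use them.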

Qrel File Details:

Qrel files contain human judgments of the relevance of documents to queries. These files are essential for evaluating the performance of IR systems. Each line of a Qrel file has four whitespace-separated fields: QUERY, ITERATION, DOCUMENT#, and RELEVANCY. QUERY is the identifier of the search query; ITERATION is a constant placeholder (typically zero, written as Q0 in the sample below) that is usually ignored; DOCUMENT# is the unique identifier of the document; and RELEVANCY indicates whether the document is judged relevant (1) or not relevant (0) to the query.

Sample Qrels File:

26 Q0 1050804_bengal_story_5072499.utf8 0

26 Q0 1050806_frontpage_story_5081177.utf8 1

26 Q0 1050815_bengal_story_5116647.utf8 1

26 Q0 1050822_opinion_story_5136585.utf8 1
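To illustrate the four-field layout, here is a minimal sketch of a Qrel reader. The function name parse_qrels and the nested-dictionary layout are assumptions chosen for illustration, not part of the task's tooling. It assumes whitespace-separated fields and skips blank lines.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map query id -> {document id -> relevance (0 or 1)}."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 4:
            raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
        # The ITERATION field (e.g. "Q0") is read but not used.
        query_id, _iteration, doc_id, relevance = parts
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


SAMPLE = """\
26 Q0 1050804_bengal_story_5072499.utf8 0
26 Q0 1050806_frontpage_story_5081177.utf8 1
26 Q0 1050815_bengal_story_5116647.utf8 1
26 Q0 1050822_opinion_story_5136585.utf8 1
"""

qrels = parse_qrels(SAMPLE.splitlines())
# Documents judged relevant for query 26
relevant = [doc for doc, rel in qrels["26"].items() if rel == 1]
print(len(relevant), "relevant documents for query 26")
```

For a real file, the same function can be called on an open file handle, since iterating over a file yields its lines.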

Note: This year, we are releasing only the Spoken Query and its Qrels. For the corpus, please use the FIRE Collection. For example, for Hindi, use the data from 2008, 2010, and 2011.

Spoken Query Train Data for Task 1

Spoken Query Test Data for Task 2

The decryption key for the FIRE spoken query dataset can be obtained by registering for the task.
