India is known for its linguistic diversity. The Constitution of India recognizes 22 languages under the Eighth Schedule: Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Sanskrit, Sindhi, Tamil, Telugu, Urdu, Bodo, Santhali, Maithili, and Dogri. Building a retrieval system that accepts spoken queries in any of these languages and locates relevant documents in a large knowledge base is a complex, multifaceted problem. To our knowledge, spoken-query retrieval remains relatively underexplored in information retrieval and natural language processing, and its multilingual variant, which involves under-resourced languages, even more so.
To advance research in this direction, we present the second iteration of this shared task at FIRE 2025. The task invites participants to develop and evaluate systems capable of accepting a spoken query as input and retrieving relevant information from a document collection. It aims to foster innovation in speech-based retrieval methods while promoting support for India's diverse linguistic landscape.
Overview of the Task
Participants are provided with text queries and corresponding spoken queries. They may either use the provided spoken queries or generate a new set of spoken queries from the text queries in different acoustic environments, and then complete the following two tasks:
Task 1: Spoken Query Ad-Hoc Retrieval Data - Monolingual Task
Participants are required to develop a Spoken Query Retrieval System that handles monolingual queries. In this task, the spoken queries and the corpus are in the same language, which makes retrieval comparatively straightforward. The system should accurately interpret spoken queries and retrieve relevant documents from a corpus in the same language. This year, the languages involved in this task are English, Gujarati, Hindi, Bengali, and Kannada.
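As a point of reference, one simple monolingual baseline transcribes the spoken query with an off-the-shelf ASR model and ranks documents with a lexical matcher. The sketch below is only illustrative: it assumes openai-whisper for transcription and rank_bm25 for retrieval, a toy Hindi corpus, and a hypothetical query_hi.wav file; participants are free to use entirely different components.

```python
# A minimal monolingual spoken-query retrieval sketch (assumed tools:
# openai-whisper for ASR and rank_bm25 for lexical retrieval; neither is
# mandated by the task).
import whisper
from rank_bm25 import BM25Okapi

# Hypothetical Hindi corpus and spoken query file.
corpus = [
    "ताज महल आगरा में स्थित है",
    "गंगा नदी हिमालय से निकलती है",
]

# 1. Transcribe the spoken query in the same language as the corpus.
asr_model = whisper.load_model("small")
query_text = asr_model.transcribe("query_hi.wav", language="hi")["text"]

# 2. Score documents with BM25 over whitespace-tokenized text.
tokenized_corpus = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
scores = bm25.get_scores(query_text.split())

# 3. Return documents sorted by score.
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print([corpus[i] for i in ranking])
```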
Task 2: Spoken Query Cross-Lingual Retrieval
Participants are required to develop a Spoken Query Retrieval System capable of handling cross-lingual queries. In this task, the spoken queries and the corpus are in different languages, adding complexity to the retrieval process. The system should accurately interpret spoken queries in one language and retrieve the most relevant documents from a corpus in another language. This year, the task will involve English, Gujarati, Hindi, Bengali, and Kannada. The language pairs for queries and corpus could be any combination of these languages, allowing participants to address various cross-lingual retrieval challenges.
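One possible cross-lingual baseline transcribes the query and embeds both the transcript and the documents into a shared multilingual vector space. The sketch below assumes a Hindi spoken query against an English corpus, using openai-whisper and a multilingual sentence-transformers encoder; all model names and file names are illustrative choices, not part of the task specification.

```python
# A minimal cross-lingual retrieval sketch: the query is spoken in Hindi,
# the corpus is in English. ASR model, encoder, and file names are
# assumptions made for illustration only.
import whisper
from sentence_transformers import SentenceTransformer, util

corpus = [
    "The Taj Mahal is located in Agra.",
    "The Ganges river originates in the Himalayas.",
]

# 1. Transcribe the Hindi spoken query.
query_text = whisper.load_model("small").transcribe(
    "query_hi.wav", language="hi"
)["text"]

# 2. Embed the query and the documents into a shared multilingual space.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
query_emb = encoder.encode(query_text, convert_to_tensor=True)
doc_embs = encoder.encode(corpus, convert_to_tensor=True)

# 3. Rank English documents by cosine similarity to the Hindi query.
scores = util.cos_sim(query_emb, doc_embs)[0]
ranking = scores.argsort(descending=True).tolist()
print([corpus[i] for i in ranking])
```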
Announcement
30 June 2025 - Data release for both tasks
11 Aug 2025 - Submission and Evaluation Details Release
Guidelines
Each team can have at most 4 participants.
A team can submit only one working note.
Each team is required to submit a detailed description of their algorithm(s) and the environment(s) used to generate the spoken queries (see the sketch after these guidelines for one way to simulate such environments).
Participants are allowed to use any external pretrained models.
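For teams that choose to generate their own spoken queries, one way to simulate a recording environment is to mix a clean recording (e.g., from TTS or a studio read) with background noise at a chosen signal-to-noise ratio. The sketch below assumes mono WAV files and hypothetical file names; any augmentation pipeline is acceptable as long as it is documented in the working note.

```python
# A minimal sketch of simulating a recording environment by mixing a clean
# spoken query with background noise at a target SNR. File names and the
# SNR value are hypothetical.
import numpy as np
import soundfile as sf

def mix_at_snr(clean_path, noise_path, out_path, snr_db=10.0):
    # Assumes mono WAV input; sf.read returns float samples in [-1, 1].
    clean, sr = sf.read(clean_path)
    noise, _ = sf.read(noise_path)
    # Loop or trim the noise to the length of the clean query.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    # Scale the noise so the mixture has the requested signal-to-noise ratio.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    mixture = np.clip(clean + scale * noise, -1.0, 1.0)
    sf.write(out_path, mixture, sr)

mix_at_snr("query_clean.wav", "street_noise.wav", "query_street_10db.wav")
```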