Spoken-Query Cross-Lingual Information Retrieval for the Indic Languages (SqCLIR)

at FIRE 2024

Introduction

India is known for its linguistic diversity. The Constitution of India recognizes 22 languages under the Eighth Schedule: Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Sanskrit, Sindhi, Tamil, Telugu, Urdu, Bodo, Santhali, Maithili, and Dogri. Building a retrieval system that accepts spoken queries in any of these languages and locates relevant documents in a large knowledge base is a complex, multifaceted problem. To our knowledge, spoken-query retrieval remains relatively underexplored in information retrieval and natural language processing, particularly in a multilingual setting that includes under-resourced languages.

To address this challenge, we offer a new shared task at FIRE 2024 for developing and evaluating retrieval systems that take a spoken query as input and search a document corpus for answers.

Overview of Task

Task 1: Spoken Query Ad-Hoc Retrieval Data - Monolingual Task

Participants are required to develop a Spoken Query Retrieval System that handles monolingual queries. This task involves both the spoken queries and the corpus being in the same language, making the retrieval process more straightforward. The system should accurately interpret spoken queries and retrieve relevant documents from a corpus in the same language. This year, the languages involved in this task are English, Gujarati, Hindi, and Bengali.
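One possible baseline for the monolingual setting is a cascade: transcribe the spoken query, then rank documents against the transcript. The sketch below is illustrative only and is not the organizers' reference system. The `transcribe` function is a hypothetical placeholder for whatever speech recognizer a team chooses, and the documents, file name, and BM25 parameters are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization; a real system would use language-specific tokenizers.
    return text.lower().split()


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return (doc_id, score) pairs sorted by descending BM25 score."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def transcribe(audio_path):
    # Hypothetical placeholder for an ASR model; returns a fixed transcript here.
    return "monsoon rainfall forecast"


# Toy corpus, assumed for illustration.
docs = {
    "d1": "monsoon rainfall forecast for western india",
    "d2": "cricket match results and scores",
    "d3": "heavy rainfall expected this week",
}
ranking = bm25_rank(transcribe("query_001.wav"), docs)
```

In this toy corpus, the document that shares all three transcript terms ranks first. Recognition errors in the transcript propagate directly into the ranking, which is one reason the spoken setting is harder than text retrieval.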

Task 2: Spoken Query Cross-Lingual Retrieval

Participants are required to develop a Spoken Query Retrieval System capable of handling cross-lingual queries. In this task, the spoken queries and the corpus are in different languages, adding complexity to the retrieval process. The system should accurately interpret spoken queries in one language and retrieve the most relevant documents from a corpus in another language. This year, the task will involve English, Hindi, and Bengali. The language pairs for queries and corpus could be any combination of these languages, allowing participants to address various cross-lingual retrieval challenges.
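One common way to bridge the language gap is to translate the query into the corpus language before retrieval. The sketch below assumes a Hindi transcript has already been produced by a speech recognizer and uses a tiny hypothetical word-level lexicon in place of a real translation model. The lexicon, documents, and overlap scoring are all assumptions for illustration, not part of the task definition.

```python
# Toy Hindi-to-English lexicon; hypothetical, for illustration only.
LEXICON = {"बारिश": "rain", "पूर्वानुमान": "forecast", "क्रिकेट": "cricket"}


def translate_query(query, lexicon):
    # Word-by-word lookup; out-of-vocabulary tokens are kept unchanged.
    return [lexicon.get(tok, tok) for tok in query.split()]


def overlap_rank(query_terms, docs):
    """Rank documents by the number of distinct query terms they contain."""
    scores = {
        doc_id: len(set(query_terms) & set(text.lower().split()))
        for doc_id, text in docs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy English corpus, assumed for illustration.
english_docs = {
    "e1": "cricket world cup schedule announced",
    "e2": "rain forecast for the coming week",
}
hindi_transcript = "बारिश पूर्वानुमान"  # assumed ASR output for a spoken Hindi query
xl_ranking = overlap_rank(translate_query(hindi_transcript, LEXICON), english_docs)
```

Alternatives include translating the documents instead of the query, or embedding queries and documents with a multilingual model. Since any external pretrained model is allowed, participants are free to choose among these designs.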

Announcement

22 Aug 2024 - Training data release for both tasks
08 Sep 2024 - Test spoken query release for both tasks

Guidelines

Each team can have at most 4 participants.
A team can submit up to 5 different runs for each language in Task 1 and up to 5 different runs for each language pair in Task 2, but only one working note.
Each team is required to submit a detailed description of their algorithm(s).
Participants are allowed to use any external pretrained models.
