Information Retrieval in Software Engineering

(IRSE)

@ FIRE 2025

12th-15th December, 2025

Task Descriptions

Task 1: Generative AI-based Software Metadata Classification

The training data of a binary classifier of code comment quality is to be augmented with generated code-comment pairs, with the aim of improving the classifier's accuracy.

Input: a) 9048 pairs of code and comments written in C, labeled as either Useful or Not Useful.

b) Additional code-comment pairs written in C, with Useful / Not Useful labels generated using any Large Language Model architecture.

Output: Classification models trained with and without the new set of code-comment pairs and their generated labels.
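
The with/without comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the tiny Naive Bayes classifier, the `((code, comment), label)` example layout, and the names `tokenize`, `NaiveBayes`, `as_text`, `accuracy`, and `compare` are all hypothetical and not prescribed by the task. Any classifier could take its place; the point is that both models are scored on the same held-out test set.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase identifier-like tokens; a deliberately crude feature extractor."""
    return re.findall(r"[a-z_]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (a stand-in classifier)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.total = len(labels)
        self.token_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.token_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.token_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        best_label, best_score = None, -math.inf
        for c in self.classes:
            # The extra +1 reserves one vocabulary slot for unseen tokens.
            denom = sum(self.token_counts[c].values()) + len(self.vocab) + 1
            score = math.log(self.class_counts[c] / self.total)
            for t in tokens:
                score += math.log((self.token_counts[c][t] + 1) / denom)
            if score > best_score:
                best_label, best_score = c, score
        return best_label


def as_text(pair):
    """Join a (code, comment) pair into one string for the classifier."""
    code, comment = pair
    return comment + " " + code


def accuracy(model, examples):
    """Fraction of ((code, comment), label) examples predicted correctly."""
    hits = sum(model.predict(as_text(pair)) == label for pair, label in examples)
    return hits / len(examples)


def compare(seed, generated, test):
    """Return (baseline accuracy, augmented accuracy) on the same test set."""
    def train(examples):
        texts = [as_text(pair) for pair, _ in examples]
        labels = [label for _, label in examples]
        return NaiveBayes().fit(texts, labels)

    return accuracy(train(seed), test), accuracy(train(seed + generated), test)
```

Calling `compare(seed, generated, test)` with the labeled seed pairs, the LLM-labeled pairs, and a held-out test split yields the two accuracies to report side by side.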

Task 2: Bring your own LLM

We will provide a parent LLM and a RAG (Retrieval-Augmented Generation) architecture specifically designed to generate metadata for C source code. The RAG system will include question-answer pairs and comments focused on understanding and interpreting C code snippets.
Participants will be required to scrape C code and associated comments from codebases and use this data to build a small language model from scratch. This small model can then be fine-tuned using responses generated by the larger parent LLM.
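
As a rough illustration of the scraping step, the sketch below pairs each comment in a C source string with the first line of code that follows it. It is a simplification under stated assumptions: the function names `extract_pairs` and `_clean_block` are hypothetical, trailing comments on a code line are ignored, and only one line of code is kept per comment. A real pipeline would more likely use a proper C parser.

```python
def _clean_block(lines):
    """Strip /* */ delimiters and leading '*' decoration from a block comment."""
    words = []
    for line in lines:
        line = line.strip()
        if line.startswith("/*"):
            line = line[2:]
        if line.endswith("*/"):
            line = line[:-2]
        words.append(line.strip().lstrip("*").strip())
    return " ".join(w for w in words if w)


def extract_pairs(source):
    """Return (comment, code_line) pairs from C source text."""
    pairs = []
    lines = source.splitlines()
    i = 0
    while i < len(lines):
        stripped = lines[i].strip()
        if stripped.startswith("//"):
            comment = stripped[2:].strip()
            i += 1
        elif stripped.startswith("/*"):
            block = []
            # Collect lines until the one that closes the block comment.
            while i < len(lines):
                block.append(lines[i])
                i += 1
                if "*/" in block[-1]:
                    break
            comment = _clean_block(block)
        else:
            i += 1
            continue
        # Skip blank lines between the comment and the code it describes.
        while i < len(lines) and not lines[i].strip():
            i += 1
        if i < len(lines) and comment:
            code = lines[i].strip()
            # A following comment is handled by the next loop iteration.
            if not code.startswith(("//", "/*")):
                pairs.append((comment, code))
                i += 1
    return pairs
```

For example, `extract_pairs(open(path).read())` could be run over every `.c` file in a cloned repository, where `path` is whatever file the participant has collected.
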
The generated comments for a set of 50 C programs will be evaluated based on their semantic similarity to human-written reference comments for the same code.
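
The exact similarity measure is not specified here. Purely as an assumption for local sanity checks, the sketch below scores each generated comment against its reference with bag-of-words cosine similarity, a lexical proxy that is much cruder than an embedding-based semantic measure. The names `bag_of_words`, `cosine_similarity`, and `mean_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens in a comment."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two token-count vectors, in [0, 1]."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty comment shares nothing with any reference
    return dot / (norm_a * norm_b)


def mean_similarity(generated, references):
    """Average pairwise similarity over aligned lists of comments."""
    if not generated or len(generated) != len(references):
        raise ValueError("need two non-empty lists of equal length")
    scores = [cosine_similarity(g, r) for g, r in zip(generated, references)]
    return sum(scores) / len(scores)
```

With the 50 evaluation programs, `generated` and `references` would each hold 50 comments in the same program order.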

Given a problem P and an LLM-generated solution S, participants must estimate a score that represents the likelihood that S is a correct solution to P.

Run submission format:

The input dataset can be found at this dataset link. The participants need to output a tab-separated (tsv) file, where each line is of the following format:

Evaluation Metric:

Each participant is scored by the nDCG of the ranking induced by the predicted scores. A good predictor ranks correct solutions above incorrect ones, so the more correct solutions appear near the top of a ranked list of 10 solutions, the more effective the predictor.
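
The metric above can be sketched as follows for a single problem. This is a minimal illustration, assuming binary relevance (1 for a correct solution, 0 for an incorrect one) and the common log2 rank discount; the organizers' exact gain and discount settings are not stated here, and the names `dcg` and `ndcg_at_k` are hypothetical.

```python
import math


def dcg(relevances, k=10):
    """Discounted cumulative gain over the top-k items of a ranked list."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(scores, labels, k=10):
    """nDCG@k of the ranking induced by sorting solutions by predicted score.

    scores: predicted likelihood of correctness, one per solution.
    labels: 1 if the solution is actually correct, else 0 (assumed binary).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    ideal_dcg = dcg(sorted(labels, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no correct solution exists, so no ranking can score
    return dcg(ranked, k) / ideal_dcg
```

A predictor that scores every correct solution above every incorrect one reaches 1.0; placing a correct solution lower in the list reduces the value through the rank discount.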
