Information Retrieval in Software Engineering

(IRSE)

@ FIRE 2024

12th-15th December, 2024

Task Descriptions

Task 1: Generative AI-based Software Metadata Classification

A binary code-comment quality classification model is to be augmented with generated code-comment pairs that can improve the accuracy of the model.

Input: a) 9048 pairs of code and comments written in C, each labeled as either Useful or Not Useful.

b) Code-comment pairs written in C, with labels of Useful / Not Useful generated using any Large Language Model architecture

Output: Classification models trained with and without the new set of code-comment pairs and generated labels
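To make the with/without-augmentation comparison concrete, here is a minimal sketch using a toy word-overlap classifier in pure Python. The data, field layout, and classifier are all hypothetical illustrations, not the task's required method; participants would substitute a real model and the actual labeled dataset.

```python
# Toy sketch: train a comment-quality classifier without and with
# LLM-generated pairs. All data below is hypothetical.
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(pairs):
    # pairs: list of (comment, label); build per-label word counts
    counts = {"Useful": Counter(), "Not Useful": Counter()}
    for comment, label in pairs:
        counts[label].update(tokenize(comment))
    return counts

def predict(model, comment):
    # score each label by token overlap with its word counts (naive)
    scores = {
        label: sum(cnt[t] for t in tokenize(comment))
        for label, cnt in model.items()
    }
    return max(scores, key=scores.get)

seed = [("checks for null pointer before free", "Useful"),
        ("todo", "Not Useful")]
generated = [("validates the input buffer length", "Useful"),
             ("fix this later", "Not Useful")]

baseline = train(seed)               # without generated pairs
augmented = train(seed + generated)  # with generated pairs

print(predict(augmented, "validates pointer length"))  # prints "Useful"
```

The same evaluation would then be run on a held-out set for both models to measure whether the generated pairs improve accuracy.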

Task 2: Code Quality Estimation

We will use the HumanEval dataset for this task. The input to the task is a focused problem description (e.g., reversing a linked list) and a generated code snippet that solves the problem. We provide a JSON-formatted file, where each line contains the following fields:

A visual illustration of a sample problem task and its solution extracted from this file is shown below:

The task of the participants is, given a problem P and an LLM-generated solution S, to estimate a score representing the likelihood that S is a correct solution to P.

Run submission format:

The input dataset can be found at this dataset link. Participants need to output a tab-separated values (TSV) file, where each line has the following format:

<Problem ID (as in the input file)> <Solution ID (a number from 1 to 10)> <Predicted Likelihood>
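The run file above can be produced with a few lines of Python. The problem IDs, solution IDs, and scores below are hypothetical placeholders; the score formatting (four decimal places) is an assumption, not a task requirement.

```python
# Sketch: write predictions in the required tab-separated run format.
# The prediction tuples below are hypothetical examples.
import csv
import io

predictions = [
    ("HumanEval/0", 1, 0.92),
    ("HumanEval/0", 2, 0.13),
]

buf = io.StringIO()  # in a real run, open the output file instead
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
for problem_id, solution_id, score in predictions:
    writer.writerow([problem_id, solution_id, f"{score:.4f}"])

print(buf.getvalue(), end="")
```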

Evaluation Metric:

The evaluation metric used to score each participant is the nDCG of the ranking induced by the predicted scores. A good predictor will prefer a correct solution over an incorrect one; that is, placing more correct solutions towards the top of the ranked list of 10 solutions indicates a more effective predictor.
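As a sketch of how this metric rewards rankings, here is a standard binary-relevance nDCG computation in pure Python. The score and relevance vectors are hypothetical; the task organizers' exact evaluation script may differ in details such as gain or discount conventions.

```python
# Sketch of nDCG with binary relevance (1 = correct solution).
# Predicted scores induce a ranking of the solutions for one problem.
import math

def ndcg(scores, relevance):
    # rank solutions by predicted score, descending
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    dcg = sum(relevance[i] / math.log2(rank + 2)
              for rank, i in enumerate(order))
    # ideal DCG: all correct solutions ranked first
    ideal = sum(rel / math.log2(rank + 2)
                for rank, rel in enumerate(sorted(relevance, reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0

scores = [0.9, 0.2, 0.8, 0.1]   # predicted likelihoods (hypothetical)
relevance = [1, 0, 1, 0]        # ground-truth correctness (hypothetical)
print(ndcg(scores, relevance))  # both correct solutions ranked first: 1.0
```

A ranking that pushes correct solutions down the list yields a lower nDCG, which is exactly the behavior the metric is meant to penalize.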