Information Retrieval in Software Engineering
(IRSE)
@ FIRE 2024
12th-15th December, 2024
Task Descriptions
Task 1: Generative AI based Software Metadata Classification
The goal is to augment a binary code-comment quality classification model with LLM-generated code-comment pairs that improve the model's accuracy.
Input: a) 9048 code-comment pairs written in C, each labeled as either Useful or Not Useful.
b) Additional code-comment pairs written in C, with Useful / Not Useful labels generated using any Large Language Model architecture.
Output: Classification models trained with and without the new set of code-comment pairs and generated labels.
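One possible way to realise this (a minimal sketch, not a prescribed pipeline) is to train the same classifier twice, once on the seed pairs alone and once on the seed pairs augmented with the LLM-labeled pairs, and compare the two runs. The file names, column names, and the TF-IDF + logistic regression model below are illustrative assumptions, not part of the task specification.

# Minimal sketch: compare a comment-quality classifier trained with and
# without LLM-labeled augmentation data. File and column names are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def evaluate(train_df, test_df):
    """Train on train_df and report accuracy / F1 on the held-out test_df."""
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    # Concatenate code and comment into one text field for the vectorizer.
    model.fit(train_df["code"] + " " + train_df["comment"], train_df["label"])
    pred = model.predict(test_df["code"] + " " + test_df["comment"])
    return (accuracy_score(test_df["label"], pred),
            f1_score(test_df["label"], pred, pos_label="Useful"))

# Hypothetical inputs: the 9048 labeled seed pairs and the LLM-labeled pairs.
seed = pd.read_csv("seed_pairs.csv")            # columns: code, comment, label
generated = pd.read_csv("generated_pairs.csv")  # columns: code, comment, label

train, test = train_test_split(seed, test_size=0.2,
                               stratify=seed["label"], random_state=42)

acc_base, f1_base = evaluate(train, test)
acc_aug, f1_aug = evaluate(pd.concat([train, generated], ignore_index=True), test)
print(f"baseline  acc={acc_base:.3f} f1={f1_base:.3f}")
print(f"augmented acc={acc_aug:.3f} f1={f1_aug:.3f}")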
Task 2: Code Quality Estimation
We will use the HumanEval dataset for this task. The input to the task is a focused problem description (e.g., reversing a linked list) and a set of generated code snippets intended to solve the problem. We provide a JSON Lines (JSONL) file, where each line is a JSON object with the following fields:
task_id: A unique id for the problem task.
prompt: The problem description together with an incomplete piece of code (e.g., a method signature, example outputs, etc.).
code_<num>: A solution generated by the Codestral LLM. Ten solutions (code_1 through code_10) are generated for each task, arranged in order of preference (as obtained from the LLM).
A visual illustration of a sample problem task and its solution extracted from this file is shown below:
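The sketch below shows how such a record can also be loaded and inspected programmatically; the file name humaneval_codestral.jsonl is an assumption and should be replaced with the actual dataset file.

# Sketch: load the provided JSON Lines file and inspect one problem task.
import json

with open("humaneval_codestral.jsonl", "r", encoding="utf-8") as f:  # assumed file name
    tasks = [json.loads(line) for line in f if line.strip()]

sample = tasks[0]
print("task_id:", sample["task_id"])
print("prompt:")
print(sample["prompt"])
# The 10 Codestral solutions are stored as code_1 ... code_10,
# in the LLM's order of preference.
for i in range(1, 11):
    print(f"--- code_{i} ---")
    print(sample[f"code_{i}"])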
Given a problem P and an LLM-generated solution S, the participants must estimate a score representing the likelihood that S is a correct solution to P.
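As a concrete, deliberately naive illustration of such a scorer (an assumed baseline, not the expected approach), the sketch below assigns a likelihood based only on whether the prompt plus the generated snippet is syntactically valid Python (HumanEval problems are Python). Participants are free to use any scoring method, e.g. LLM-based judging or test execution.

# Naive baseline sketch: score a solution by whether prompt + code parses as
# valid Python. A stronger system could use an LLM judge or execute tests.
import ast

def syntax_likelihood(prompt: str, code: str) -> float:
    """Return 1.0 if the concatenated program is syntactically valid, else 0.0."""
    try:
        # If code_<num> is a completion of the prompt, the prompt must be
        # prepended; if it is already a standalone solution, this is harmless.
        ast.parse(prompt + "\n" + code)
        return 1.0
    except SyntaxError:
        return 0.0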
Run submission format:
The input dataset can be found at this dataset link. The participants need to output a tab-separated values (TSV) file, where each line has the following format:
<Problem ID (as in the input file)> <Solution ID (a number from 1 to 10)> <Predicted likelihood>
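Continuing the sketch above (assumed file names and scorer), the run file can be produced as follows:

# Sketch: write the run file in the required tab-separated format.
# Assumes `tasks` and `syntax_likelihood` from the sketches above.
import csv

with open("run.tsv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out, delimiter="\t")
    for task in tasks:
        for sol_id in range(1, 11):
            score = syntax_likelihood(task["prompt"], task[f"code_{sol_id}"])
            writer.writerow([task["task_id"], sol_id, f"{score:.4f}"])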
Evaluation Metric:
Each run will be scored using nDCG (normalized Discounted Cumulative Gain) of the ranking induced by the predicted scores. A good predictor assigns higher scores to correct solutions than to incorrect ones, so the more correct solutions that appear towards the top of the ranked list of 10 solutions, the more effective the predictor.
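For reference, here is a minimal sketch of computing nDCG with binary relevance for one problem's list of 10 solutions. The correctness labels used here are purely illustrative; the real ones are held by the organisers.

# Sketch: nDCG with binary relevance for one ranked list of 10 solutions.
import math

def ndcg(predicted_scores, is_correct):
    """predicted_scores: 10 floats; is_correct: 10 binary relevance labels (0/1)."""
    # Rank solutions by predicted score, highest first.
    order = sorted(range(len(predicted_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    dcg = sum(is_correct[i] / math.log2(rank + 2) for rank, i in enumerate(order))
    # Ideal DCG: all correct solutions placed at the very top of the list.
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(sum(is_correct)))
    return dcg / ideal if ideal > 0 else 0.0

# Illustrative example: 3 of the 10 solutions are correct.
scores = [0.9, 0.1, 0.8, 0.3, 0.2, 0.7, 0.05, 0.4, 0.6, 0.15]
labels = [1, 0, 0, 1, 0, 1, 0, 0, 0, 0]
print(f"nDCG = {ndcg(scores, labels):.3f}")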