Task1: Generative AI based Software Metadata Classification
A binary classification model for code comment quality needs to be augmented with generated code and comment pairs that can improve the model's accuracy.
Input: a) 9048 pairs of code and comments written in C, each labeled as either Useful or Not Useful.
b) Code and comment pairs written in C, with Useful / Not Useful labels generated using any Large Language Model architecture.
Output: Classification models trained with and without the new set of code and comment pairs and generated labels.
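The with/without comparison can be reported as a simple accuracy difference on a shared held-out set. The following is a minimal sketch only: the function names and the four example labels are hypothetical, and the task description does not prescribe a particular evaluation script or classifier.

```python
def accuracy(gold, predicted):
    """Fraction of predictions that match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("empty evaluation set")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def compare_runs(gold, baseline_pred, augmented_pred):
    """Accuracy without and with the generated pairs, plus the change."""
    base = accuracy(gold, baseline_pred)
    aug = accuracy(gold, augmented_pred)
    return {"baseline": base, "augmented": aug, "delta": aug - base}


# Hypothetical labels for a four-item held-out set (illustrative only).
gold = ["Useful", "Not Useful", "Useful", "Not Useful"]
baseline_pred = ["Useful", "Useful", "Useful", "Not Useful"]
augmented_pred = ["Useful", "Not Useful", "Useful", "Not Useful"]
report = compare_runs(gold, baseline_pred, augmented_pred)
```

Both models should be scored on the same held-out pairs, so that any change in accuracy can be attributed to the added data rather than to a different test set.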
Task2: Bring your own LLM
We will provide a parent LLM and a RAG (Retrieval-Augmented Generation) architecture specifically designed to generate metadata for C source code. The RAG system will include question-answer pairs and comments focused on understanding and interpreting C code snippets.
Participants will be required to scrape C code and associated comments from codebases and use this data to build a small language model from scratch. This small model can then be fine-tuned using responses generated by the larger parent LLM.
The generated comments for a set of 50 C programs will be evaluated based on their semantic similarity to human-written reference comments for the same code.
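The description does not say which semantic similarity measure will be used. As a rough local sanity check, a generated comment can be compared to its reference with a bag-of-words cosine similarity. This is a lexical stand-in under that stated assumption, not the organizers' metric, and the helper names are hypothetical; an embedding-based measure would capture meaning more faithfully.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine_similarity(generated, reference):
    """Cosine similarity between the token-count vectors of two comments."""
    va, vb = Counter(tokenize(generated)), Counter(tokenize(reference))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one comment has no tokens
    return dot / (norm_a * norm_b)


def mean_similarity(pairs):
    """Average similarity over (generated, reference) comment pairs."""
    return sum(cosine_similarity(g, r) for g, r in pairs) / len(pairs)
```

Averaging over all 50 programs gives a single number that can be tracked while iterating on the small model.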
The task of the participants is to estimate, given a problem P and an LLM-generated solution S, a score that represents the likelihood that S is a correct solution to P.
Run submission format:
The input dataset can be found at this dataset link. Participants need to output a tab-separated (TSV) file in which each line has the following format:
<Problem ID (as in the input file)> <Solution ID (a number from 1 to 10)> <Predicted Likelihood>
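A run file in this format can be produced with Python's standard `csv` module. The sketch below is illustrative: the problem IDs and scores are made up, `write_run` is a hypothetical helper, and the four-decimal formatting of the likelihood is an assumption rather than a stated requirement.

```python
import csv
import io


def write_run(rows, handle):
    """Write (problem_id, solution_id, likelihood) rows as tab-separated lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for problem_id, solution_id, likelihood in rows:
        if not 1 <= solution_id <= 10:
            raise ValueError(f"solution id out of range: {solution_id}")
        writer.writerow([problem_id, solution_id, f"{likelihood:.4f}"])


# Hypothetical predictions for two solutions of one problem.
rows = [("P001", 1, 0.9312), ("P001", 2, 0.1204)]
buffer = io.StringIO()
write_run(rows, buffer)
run_text = buffer.getvalue()
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")` as `handle`.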
Evaluation Metric:
The evaluation metric used to score each participant is the nDCG (normalized Discounted Cumulative Gain) of the ranking induced by the predicted scores. A good predictor ranks correct solutions above incorrect ones, so the more correct solutions appear towards the top of the ranked list of 10 solutions, the more effective the predictor.
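For local validation, nDCG for one problem can be computed as follows. This is a sketch assuming binary relevance (1 for a correct solution, 0 otherwise) and the common log2 rank discount; the exact gain and discount variant used by the official scorer is not stated here, so absolute values may differ.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance labels given in rank order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(scores, correct):
    """nDCG of the ranking induced by `scores`.

    `correct[i]` is 1 if solution i is correct, else 0. Ties in `scores`
    keep their input order because Python's sort is stable.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [correct[i] for i in order]
    ideal_dcg = dcg(sorted(correct, reverse=True))
    if ideal_dcg == 0:
        return 0.0  # no correct solution for this problem
    return dcg(ranked) / ideal_dcg
```

A per-problem score of 1.0 means every correct solution was ranked above every incorrect one; averaging the per-problem values over all problems gives one overall figure for a run.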