This website provides the supplementary materials for the paper "Defects4C: Benchmarking C/C++ Faults to Assess LLM-Based Program Repairs". It extends the results presented in the paper with experiments that were omitted due to the page limit.
The website is organized as follows:
Overview & Datasets: This section gives a comprehensive overview of how the datasets were collected and provides public links to them. We tackle the identified challenges by introducing a new, high-quality C/C++ fault benchmark, referred to as Defects4C. Our approach extracts bug commits from widely used GitHub C/C++ repositories using a set of predefined bug-related keywords. This thorough and systematic approach has yielded 248 ordinary bugs (Defects4C_bug) and 102 CVE bugs (Defects4C_vul), together with corresponding unit tests that verify the presence of each fault. Additionally, we have categorized these bugs into four distinct groups, as classified by security experts, to support and enhance future research using our dataset.
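To illustrate the keyword-based extraction step, the sketch below filters commit messages against a small set of bug-related keywords. This is a minimal sketch: the keyword list and the `Commit` structure are illustrative placeholders, not the exact filters used to build Defects4C.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword list; the actual set of bug-related keywords
# used to build Defects4C is defined in the paper and repository.
BUG_KEYWORDS = ["fix", "bug", "defect", "error", "fault", "crash", "patch"]
KEYWORD_RE = re.compile(r"\b(" + "|".join(BUG_KEYWORDS) + r")\b", re.IGNORECASE)

@dataclass
class Commit:
    sha: str
    message: str

def is_bug_fix_candidate(commit: Commit) -> bool:
    """Flag commits whose message mentions a bug-related keyword."""
    return KEYWORD_RE.search(commit.message) is not None

commits = [
    Commit("a1b2c3", "Fix null-pointer dereference in parser"),
    Commit("d4e5f6", "Add documentation for the build system"),
]
candidates = [c for c in commits if is_bug_fix_candidate(c)]
print([c.sha for c in candidates])  # ['a1b2c3']
```

Keyword matching alone over-approximates (e.g., "fix typo in docs"), which is why the candidates are further validated by human annotation, as described below.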
Benchmark Collection: We have developed and publicly released an executable C/C++ defect benchmark, Defects4C, comprising 248+102 real-world C/C++ bugs sourced from GitHub projects. This section explains the procedure for collecting valid bug-fixing commits, powered by BigQuery and the CVE database.
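For reference, commit metadata of this kind can be queried from the public GitHub snapshot on BigQuery. The snippet below is a hedged sketch assuming the `bigquery-public-data.github_repos.commits` table and a configured `google-cloud-bigquery` client; it is not the exact query used to build Defects4C.

```python
# Requires: pip install google-cloud-bigquery, plus GCP credentials.
from google.cloud import bigquery

client = bigquery.Client()

# Assumption: the public GitHub dataset on BigQuery; the concrete
# keyword filter used for Defects4C may differ from this example.
sql = r"""
SELECT commit, subject
FROM `bigquery-public-data.github_repos.commits`
WHERE REGEXP_CONTAINS(LOWER(subject), r'\b(fix|bug|defect|fault)\b')
LIMIT 100
"""

for row in client.query(sql).result():
    print(row.commit, row.subject)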
Human Annotation: This section addresses a key accuracy challenge: candidate commits may pertain to general code changes rather than bug fixes. To validate these commits, we performed a rigorous human annotation process.
Evaluation on LLMs: We conduct the first empirical study focused on assessing the capability of LLM-based APR techniques in repairing C/C++ programs, with a comprehensive evaluation across a range of LLMs, including GPT-3.5 Turbo, GPT-4, CodeLlama, and WizardCoder. Our findings highlight significant gaps and limitations in current APR methods when fixing C/C++ bugs, especially in contrast to their performance on Java bugs. These results underscore the urgent need for further research and development of C/C++-specific repair techniques and benchmarks.
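To illustrate how an LLM can be queried for a candidate patch, the sketch below prompts a chat model with a buggy C function. It is a minimal, hedged example assuming the `openai` Python client and a GPT-3.5 Turbo style model; the buggy function is a made-up placeholder, and the prompts and post-processing used in our actual evaluation are more involved.

```python
# Requires: pip install openai, plus an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical buggy function for illustration; real Defects4C bugs
# come with failing unit tests that verify the fault.
buggy_function = """
int mid(int a, int b) {
    return (a + b) / 2;  /* may overflow for large a and b */
}
"""

prompt = (
    "The following C function is buggy. "
    "Return only the fixed function.\n" + buggy_function
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

In the benchmark, a generated patch counts as plausible only if the project's unit tests pass after it is applied, so a harness like the one above must be followed by compilation and test execution inside the bug's Docker environment.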
Abstract
Automated program repair (APR) plays a pivotal role in ensuring the quality and reliability of software. However, most existing APR research focuses on Java programs, primarily due to the well-established benchmark, Defects4J. Despite the significant prevalence of C/C++ vulnerabilities, the field lacks extensive research on the automated repair of such vulnerabilities, largely attributed to the absence of high-quality datasets in this domain.
To fill this critical gap, this paper introduces Defects4C, a high-quality benchmark for C/C++ faults. To assess the effectiveness of existing state-of-the-art APR techniques in repairing C/C++ faults, we conduct a comprehensive empirical study using Defects4C. Our findings provide valuable insights into the capabilities and limitations of existing APR approaches for C/C++ programs, underscoring the necessity for novel APR techniques and the significance of Defects4C. This dataset marks a significant advancement in the field, offering a robust and comprehensive C/C++ dataset that is instrumental for future research.
Defects4C Tutorial
Requirements
Python >= 3.9
Git >= 1.9
Docker
Steps to set up Defects4C:
1. Clone Defects4C:
git clone https://github.com/defects4c/defects4c
We have also uploaded the intermediate data to drive.google.com for reproducibility.
2. Initialize Defects4C:
3. Change the working environment into the Docker container
4. Install the system-level dependencies from inside the container
This is a one-time setup; it takes 10-15 minutes on a 128-core machine, and the environments for all bugs are initialized in this single step.
Using Defects4C
Command-line interface: Defects4C command
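A typical checkout-compile-test workflow driven from Python might look like the sketch below. The subcommand names and flags (`checkout`, `compile`, `test`, `--bug-id`, `--workdir`) are hypothetical placeholders patterned after Defects4J-style tooling; consult the repository's documentation for the actual interface.

```python
# Hedged sketch of driving a Defects4C-style CLI from Python.
# All subcommands and flags below are hypothetical placeholders.
import subprocess

def run(cmd: list[str]) -> None:
    """Echo and execute a command, failing fast on errors."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["defects4c", "checkout", "--bug-id", "project_1", "--workdir", "/tmp/project_1"])
run(["defects4c", "compile", "--workdir", "/tmp/project_1"])
run(["defects4c", "test", "--workdir", "/tmp/project_1"])
```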