Overview
Figure 1
Our approach, Defects4C_bug, extracts bug-fix commits from widely used GitHub C/C++ repositories using a set of predefined bug-related keywords. We then apply a unit test pair verification algorithm to identify the test cases that correspond to each bug fix. To ensure benchmark quality, we add a three-stage human annotation process carried out by three security experts. This process filters out false positives: commits whose messages contain bug-related keywords but whose code changes do not fix bugs or security issues. The result is 248 bugs, each with corresponding unit tests that verify the existence of the bug. The security experts also classified these bugs into four groups to support future research using our dataset.
To understand how effective existing APR techniques are at fixing C/C++ bugs, we conducted an empirical study using our Defects4C benchmark. The study focuses on LLM-based APR techniques and covers leading large language models such as GPT-3.5 Turbo, GPT-4, various iterations of CodeLlama, and WizardCoder. These models are tested in both single-round and conversation-based program repair scenarios. Our findings reveal a notable gap between the performance of state-of-the-art LLM-based APR on C/C++ bugs and its performance on the Defects4J benchmark. This gap underscores the need for APR techniques tailored specifically to C/C++ fault repair. Defects4C, with its high-quality and comprehensive dataset, is intended as a resource for future research on testing and repairing C/C++ programs.
Contents
Collect and validate content from BigQuery + GitHub
Unit Test Pair verification
Collect and validate content from BigQuery + GitHub
1. BigQuery + GH Archive
In particular, we collect raw commits from GH Archive, covering January 2015 to August 2023, for open-source, non-fork repositories that are written in C/C++ and released under a license permitting redistribution. For C and C++ respectively, we select the top 500 projects ranked by stars. In total, we obtain 38M commits from these projects.
Filter by license, open-source status, non-fork status, and language for the year 2017.
Use the Python script to cross-filter the commits from the results above, which yields 38M commits. Note: the table github_star collects the top 500 C/C++ projects with a star count above 200; see the crawler source code at this URL.
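The cross-filtering step can be sketched as follows. This is a minimal illustration rather than the project's actual script: the record fields `repo`, `sha`, and `stars` and the function name `cross_filter` are hypothetical, and the commit and star tables are assumed to be already loaded as lists of dictionaries.

```python
from typing import Dict, List


def cross_filter(commits: List[Dict], star_table: List[Dict],
                 min_stars: int = 200) -> List[Dict]:
    """Keep commits whose repository is listed in the star table
    with more than `min_stars` stars (field names are assumptions)."""
    allowed = {row["repo"] for row in star_table if row["stars"] > min_stars}
    return [commit for commit in commits if commit["repo"] in allowed]


if __name__ == "__main__":
    commits = [{"repo": "a/b", "sha": "1"}, {"repo": "c/d", "sha": "2"}]
    stars = [{"repo": "a/b", "stars": 500}, {"repo": "c/d", "stars": 100}]
    print(cross_filter(commits, stars))
```

Building the set of allowed repositories once keeps the filter linear in the number of commits, which matters at the scale of tens of millions of rows.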
2. Use the GitHub API to filter out invalid commits
Commits obtained through BigQuery queries may become invalid over time because of repository ownership transfers, archiving, and other reasons, so we filter these invalid commits out. After also removing duplicate commits, we obtain 9M commits.
The corresponding code is in the crawler source code at process_github_commit/run.sh.
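A rough sketch of this step is shown below; it is not the content of run.sh. It assumes that duplicates are identified by commit SHA and that validity is checked by requesting the commit from the GitHub REST API. The reachability check is passed in as a callable (`is_reachable`, a hypothetical name) so the logic can be exercised without network access; in practice it would issue an authenticated GET request and test for a successful response.

```python
from typing import Callable, Dict, List


def commit_api_url(repo: str, sha: str) -> str:
    """GitHub REST API URL for a single commit of `owner/name`."""
    return f"https://api.github.com/repos/{repo}/commits/{sha}"


def dedupe_and_validate(commits: List[Dict],
                        is_reachable: Callable[[str], bool]) -> List[Dict]:
    """Drop repeated SHAs, then keep only commits the checker can still reach."""
    seen = set()
    kept = []
    for commit in commits:
        sha = commit["sha"]
        if sha in seen:
            continue  # duplicate commit, already handled
        seen.add(sha)
        if is_reachable(commit_api_url(commit["repo"], sha)):
            kept.append(commit)
    return kept
```

Deduplicating before calling the checker avoids spending API requests on commits that would be discarded anyway.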
3. Keep only function-level modifications
Although keyword filtering helps us obtain potential bug commits, these commits may modify multiple files, which makes them difficult to analyze and unsuitable for APR evaluation. We therefore keep only commits that change the content of a single function. Furthermore, some of the selected commits lack an expert-written test suite for validation; we filter these out as well so that the dataset is runnable and verifiable. Finally, we obtain 76K commits.
In particular, we use tree-sitter to parse function definitions from the source code, using the node types "function_definition" and "function_declarator". We also tried clang to extract functions. However, clang needs the header files and other dependencies in order to parse a file, which is a problem for C++ in particular: when a member function is declared in xx.h and implemented in xx.cpp, clang cannot parse the function precisely.
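The single-function check can be illustrated with a small sketch. It assumes the function spans have already been extracted (for example from the tree-sitter nodes mentioned above) as `(name, start_line, end_line)` tuples with inclusive 1-based line numbers, and that the changed line numbers come from the commit diff. The names `functions_touched` and `is_single_function_change` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[str, int, int]  # (function name, first line, last line), inclusive


def functions_touched(spans: List[Span],
                      changed_lines: Iterable[int]) -> Tuple[Set[str], bool]:
    """Return the functions containing changed lines, plus a flag that is
    True when at least one changed line lies outside every function."""
    touched: Set[str] = set()
    outside = False
    for line in changed_lines:
        owner = next((name for name, start, end in spans
                      if start <= line <= end), None)
        if owner is None:
            outside = True
        else:
            touched.add(owner)
    return touched, outside


def is_single_function_change(spans: List[Span],
                              changed_lines: Iterable[int]) -> bool:
    """True when every changed line falls inside exactly one function."""
    touched, outside = functions_touched(spans, changed_lines)
    return len(touched) == 1 and not outside
```

For example, with spans `[("foo", 1, 10), ("bar", 12, 20)]`, a change to lines 3 and 4 qualifies, while a change to lines 3 and 15 touches two functions and is rejected.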
Unit Test Pair verification
Each of the 76K commits collected above has a test suite consisting of multiple test cases. To identify which test case in the suite is paired with the changed code, we design a unit test pair verification algorithm based on two observations:
For a bug fix, there typically exists a unit test in the same commit that passes on the corrected version of the code but fails on the buggy version.
If the changes occur only inside a function body, the fix most likely resolves the buggy statement itself and is less likely to be related to compiler or environment perturbations.
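The first observation can be sketched as a comparison of test outcomes on the two versions. This is a simplified illustration, not the full verification algorithm: it assumes the test suite has already been run on the buggy and fixed versions and that each run is summarized as a mapping from test name to "pass" or "fail". The function name `paired_tests` is hypothetical.

```python
from typing import Dict, List


def paired_tests(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Tests that fail on the buggy version and pass on the fixed version.

    A test that is missing from `before` is not treated as paired here;
    that choice is an assumption of this sketch.
    """
    return sorted(
        name
        for name, status in after.items()
        if status == "pass" and before.get(name) == "fail"
    )


if __name__ == "__main__":
    buggy = {"t1": "fail", "t2": "pass", "t3": "fail"}
    fixed = {"t1": "pass", "t2": "pass", "t3": "fail"}
    print(paired_tests(buggy, fixed))
```

Tests that pass on both versions or fail on both carry no signal about the fix, which is why only the fail-to-pass transition is kept.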
Docker and Compiler Configuration
Each project is provided with its own Dockerfile, which establishes a uniform execution environment. Both Docker configurations are built for Ubuntu 20.04 x86_64 and use either clang-16 or GCC-9 as the compiler. Specifically, the projects awslabs/aws-c-common, DynamoRIO/dynamorio, llvm/llvm-project, skypjack/entt, KhronosGroup/SPIRV-Tools, and facebook/rocksdb are compiled with GCC-9, while clang-16 is used for the other projects.
Compilation Flags and Dependency Management
Compilation flags are derived from the CI script or CMakeLists.txt in each project's GitHub repository. Compilation variables are kept identical between the before-commit and after-commit stages of a bug, which ensures reproducibility and stability. Dependencies are split into system-level and user-defined: the former are installed during the Docker image build phase, and the latter are installed during the project's initialization phase. Note that each bug has specific library requirements, including particular dependency versions or compilation flags; more detail is presented on our website.
Unit Test Reporting
All projects are built with CMake (version 2.6) and Ninja, and ctest is used to generate JUnit-style unit test reports. Test cases are extracted from these reports by navigating to any leaf node labeled "testcase". Test error messages are taken from the test report, while compilation errors are collected from the CMake error report. Projects that ship their own test framework, such as llvm/llvm-project, follow their respective test pipelines, e.g. by invoking llvm-lit. For the remaining projects, testing is run through the ctest CLI. The same timeout is applied to every bug in both unit test pair verification and patch verification.
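Extracting test cases from a JUnit-style report can be sketched with Python's standard XML parser. The sketch assumes the common JUnit layout in which a failed test carries a `failure` or `error` child element and a skipped test carries a `skipped` child; the exact layout of the reports produced in this pipeline may differ, and the function name `extract_testcases` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def extract_testcases(junit_xml: str) -> Dict[str, str]:
    """Map each <testcase> name to "pass", "fail", or "skip"."""
    root = ET.fromstring(junit_xml)
    results: Dict[str, str] = {}
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            status = "fail"
        elif case.find("skipped") is not None:
            status = "skip"
        else:
            status = "pass"
        results[case.get("name", "")] = status
    return results


if __name__ == "__main__":
    report = (
        '<testsuites><testsuite name="suite">'
        '<testcase name="a"/>'
        '<testcase name="b"><failure message="boom"/></testcase>'
        "</testsuite></testsuites>"
    )
    print(extract_testcases(report))
```

Running the same function on the reports for the buggy and fixed versions gives two mappings that can be compared test by test.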
This section introduces the Defects4C_bug pipeline. Defects4C_vul is obtained in the same way; please refer to the paper for more details about Defects4C_vul.