Overview
Among the 3,785 remaining commits, some are false positives that are not related to bugs. This discrepancy primarily arises from our reliance on keyword filtering. Furthermore, our unit test-based verification algorithm cannot infallibly distinguish between actual bug fixes and non-bug commits. To address these challenges and ensure accuracy, we performed a rigorous human annotation process to further validate these commits.
Contents
Filter out Policies
32.5% of defects in BugsCpp[13] may not be valid.
Human Annotation
Filter out Policies
Manually analysing the filtered commits, we find that many do not meet the criteria required for our dataset. For example, some commits only add features or modify the output format of the program instead of fixing bugs; some commit messages are very brief (e.g., containing only "fix bug") and do not logically correspond to the changed code; and some commits are reverted in subsequent software iterations and thus cannot be considered reliable bug fixes. Following our meticulous annotation, we finally identified 248 bugs in total to construct the dataset.
The left figure shows the percentage of commits excluded under each filter-out policy. Modify Output is the most common reason for excluding a commit, whereas Revert is the least common.
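The filter-out policies above were applied by human annotators, but the decision they encode can be sketched as a small data model. The following is a minimal illustration, not the annotation tooling used for the dataset: the `CommitAnnotation` struct, its field names, the `FilterReason` labels, and the order in which the checks are applied are all assumptions made for this example.

```cpp
// Hypothetical record of what an annotator observed about one commit.
struct CommitAnnotation {
    bool adds_feature;           // commit only introduces new functionality
    bool modifies_output;        // commit only changes output or string formatting
    bool message_uninformative;  // message is too brief to relate to the changed code
    bool later_reverted;         // commit was reverted in a later iteration
};

// Illustrative labels for the exclusion policies described in the text.
enum class FilterReason {
    Keep,
    FeatureAddition,
    ModifyOutput,
    UninformativeMessage,
    Revert
};

// Returns the first matching exclusion reason, or Keep if none applies.
// The priority order of the checks is an assumption of this sketch.
FilterReason filter_reason(const CommitAnnotation& a) {
    if (a.adds_feature) return FilterReason::FeatureAddition;
    if (a.modifies_output) return FilterReason::ModifyOutput;
    if (a.message_uninformative) return FilterReason::UninformativeMessage;
    if (a.later_reverted) return FilterReason::Revert;
    return FilterReason::Keep;
}
```

A commit for which every flag is false is kept as a candidate bug fix; any other commit is excluded and tagged with a single reason, which is what makes a per-policy percentage breakdown possible.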
The following defects from BugsCpp[13] are the ones we annotated as invalid.
Dataset quality concerns: Even BugsCpp[13], the most comprehensive real-world C/C++ dataset with 209 defects, exhibits significant quality issues. Our systematic analysis revealed 68 defects (32.5%) with validity concerns:
1 empty defect
18 defects lacking executable tests or missing buggy source files
16 defects with missing or multiple commit messages
3 cases involving feature additions rather than bug fixes
7 cases spanning multiple files (complicating localization)
23 cases with excessively large buggy hunks (≥100 lines), with the largest gap reaching 16K lines—far exceeding practical LLM context limits
More details are available in the following spreadsheet:
https://docs.google.com/spreadsheets/d/1snTk0-tt6UUK_Gy_cVkIIUyYmupdQ3lf3_OUqOTjiD4/edit?usp=sharing
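As a quick consistency check, the six category counts listed above can be summed and compared with the reported total. The sketch below uses only the numbers stated in the text; the constant and function names are illustrative.

```cpp
// Category counts copied from the list above.
constexpr int kEmpty = 1;
constexpr int kNoTestsOrSources = 18;
constexpr int kCommitMessageIssues = 16;
constexpr int kFeatureAdditions = 3;
constexpr int kMultiFile = 7;
constexpr int kLargeHunks = 23;

// Total number of defects in BugsCpp, as stated in the text.
constexpr int kTotalDefects = 209;

// Sum of all flagged categories.
constexpr int flagged_total() {
    return kEmpty + kNoTestsOrSources + kCommitMessageIssues +
           kFeatureAdditions + kMultiFile + kLargeHunks;
}

// Share of flagged defects, in percent.
constexpr double flagged_percent() {
    return 100.0 * flagged_total() / kTotalDefects;
}

// The categories add up to the 68 defects reported in the text.
static_assert(flagged_total() == 68, "category counts should sum to 68");
```

Here 68 out of 209 gives roughly 32.54%, which matches the 32.5% quoted above after rounding.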
Case study:
Case study of an incorrect BugsCpp label (Modify Output): the original commit is here. BugsCpp regards this patch as a bug fix, whereas we filter it out because it only formats or tidies a string.
Case study of an incorrect BugsCpp label (Feature Addition): the original commit is here. BugsCpp regards this patch as a bug fix, whereas we filter it out because it is a feature-level modification.
Case study of an incorrect BugsCpp label (Uninformative Commit Message): the original commit is here. Its commit message, "fix clang testsuite crash", is too short to give any indication of the underlying bug.
Human Annotation
In this section, we provide a detailed analysis of the manual annotation results for the bugs in \dataset.
Signature
A total of 75 bugs are categorized under Signature. Their fixes are confined exclusively to code elements within a single line, and they can be further subdivided into four subcategories based on their root causes, as follows.
Incorrect Function Usage. This bug category frequently entails the misuse of functions, encompassing both third-party library functions and internal methods within code objects. Remedying these bugs typically involves substituting the faulty function call with the correct one. Such corrections demand a comprehensive understanding of the overall software project, as well as a deep semantic grasp of the logic underlying the employed methods.
Fault Input Type. In statically typed languages, the accurate specification of variable types and return values is crucial. Bugs in this category frequently arise from incorrect variable type declarations in the code, resulting in unforeseen errors.
Incorrect Function Return Value. During our analysis, it was observed that a significant number of bugs stem from improper settings of return values in specific condition structures or function calls. Rectifying these bugs typically necessitates altering the return value to align with the correct code logic. This correction process demands not only an understanding of the code's context but also a comprehensive semantic comprehension of the pertinent functions or conditional logic.
Incorrect Variable Usage. These bugs resemble the Incorrect Function Usage bugs; however, they primarily involve the improper use of variables instead of functions. The erroneously used variable might appear on its own in a statement or as an argument of a function call. Consequently, compared to bugs in the first subcategory, these bugs are often more complex and challenging to rectify because they can occur in a wider variety of places.
Bugs categorized under Signature, while generally simpler to rectify, necessitate a substantial level of contextual understanding for accurate modification, particularly in selecting and using the appropriate functions or variables.
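To make the Signature category more concrete, the sketch below shows two invented single-line bugs and their fixes: one Incorrect Function Usage and one Incorrect Variable Usage. The functions, the `Rect` type, and the intended behaviors are assumptions made for illustration and are not taken from the dataset.

```cpp
#include <algorithm>
#include <cstddef>

// Incorrect Function Usage.
// Assumed intent: cap a requested size at a given limit.
std::size_t capped_size_buggy(std::size_t requested, std::size_t limit) {
    return std::max(requested, limit);  // bug: returns the larger value
}

std::size_t capped_size_fixed(std::size_t requested, std::size_t limit) {
    return std::min(requested, limit);  // fix: only the called function changes
}

// Incorrect Variable Usage.
// Assumed intent: compute the area of a rectangle.
struct Rect {
    int width;
    int height;
};

int area_buggy(const Rect& r) {
    return r.width * r.width;  // bug: width is used twice
}

int area_fixed(const Rect& r) {
    return r.width * r.height;  // fix: only one variable changes
}
```

In both cases the patch touches a single token on a single line, yet choosing the right replacement requires knowing what the surrounding code is supposed to compute.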
Sanitizer
This category encompasses bugs that are fixed exclusively within the conditional logic of a single LoC, accounting for a total of 20 bugs. The root cause of these bugs can be classified as Control Expression Error. The modifications required to fix these types of bugs are usually minimal. Errors of this kind lead to the generation of false positive results.
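A Control Expression Error of this kind can be as small as a single comparison operator. The following sketch is an invented example, not a bug from the dataset: an index check that uses `<=` where `<` is needed, so the condition wrongly accepts an index one past the end.

```cpp
#include <cstddef>

// Buggy check: accepts index == size, which is one past the last element.
bool index_valid_buggy(std::size_t index, std::size_t size) {
    return index <= size;
}

// Fixed check: only the comparison operator changes.
bool index_valid_fixed(std::size_t index, std::size_t size) {
    return index < size;
}
```

For a container of size 3, the buggy condition reports index 3 as valid, which is a false positive of the check, while the fixed condition rejects it.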
Memory Error
We categorize bugs that trigger faulty memory behaviors into a separate category. This is particularly relevant in memory-unsafe languages like C and C++, where numerous bugs associated with memory can lead to serious consequences. In the dataset, we have summarized three categories of memory bugs and classified them according to the CWE, as follows:
Null Pointer Dereference. These vulnerabilities correspond to CWE-476, which occurs when a pointer is used without properly checking whether its value is NULL, leading to program crashes or other undefined behaviors.
Uncontrolled Resource Consumption. These vulnerabilities correspond to CWE-400, which can lead to resource exhaustion, thereby impacting the system's performance or stability. Notably, 45.0% of the memory-related bugs fall into this category.
Memory Overflow. These bugs mainly relate to memory overflow vulnerabilities (e.g., CWE-122 and CWE-121). Such bugs often involve the leakage of sensitive memory information and pose serious security risks.
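The three memory-error categories can be illustrated with the kind of guard that a typical fix introduces. The sketch below is a set of invented examples rather than patches from the dataset; the function names and the `kMaxItems` cap are hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical upper bound on a caller-controlled request size.
constexpr std::size_t kMaxItems = 1024;

// Null Pointer Dereference (CWE-476): check the pointer before using it.
std::size_t safe_length(const char* text) {
    if (text == nullptr) return 0;
    return std::strlen(text);
}

// Uncontrolled Resource Consumption (CWE-400): reject oversized requests
// before allocating memory for them.
bool reserve_items(std::vector<int>& items, std::size_t requested) {
    if (requested > kMaxItems) return false;
    items.reserve(requested);
    return true;
}

// Memory Overflow (CWE-121 / CWE-122): copy only if the source string,
// including its terminating null byte, fits into the destination buffer.
bool copy_bounded(char* dst, std::size_t dst_size, const char* src) {
    if (dst == nullptr || src == nullptr || dst_size == 0) return false;
    const std::size_t len = std::strlen(src);
    if (len >= dst_size) return false;
    std::memcpy(dst, src, len + 1);
    return true;
}
```

Each guard is only a few lines long, but omitting it turns an unusual input (a null pointer, a huge count, an overlong string) into a crash, resource exhaustion, or an out-of-bounds write.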
Logic Organization
The 87 bug fixes that involve modifications across multiple LoC are frequently associated with the handling and organization of code logic.
Improper Condition Organization. There are 67 bugs classified into this subcategory, which can correspond to CWE-391. These bugs often involve improper wrapping of condition logic.
Wrong Function Call Sequence. The root cause of this bug category could align with CWE-691. Such bugs typically arise from incorrect code-calling logic. Consequently, their fixes involve relocating one or more complete code blocks to different locations, without altering the content within these blocks~\cite{logic2-example}.
Fixing bugs related to Logic Organization often requires a deep semantic understanding and the ability to logically analyze the function calls in the surrounding context, which makes this category well suited to evaluating the advanced repair capabilities of LLMs.
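A Wrong Function Call Sequence bug can be sketched with a small invented example in which the fix only moves a call without changing it. The `Journal` type and both driver functions are hypothetical and serve only to illustrate the pattern.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical journal that silently drops entries until it is started.
struct Journal {
    std::vector<std::string> entries;
    bool open = false;

    void start() { open = true; }

    bool record(const std::string& entry) {
        if (!open) return false;  // entries before start() are lost
        entries.push_back(entry);
        return true;
    }
};

// Buggy sequence: the first record() call runs before start().
std::size_t run_buggy() {
    Journal journal;
    journal.record("init");
    journal.start();
    journal.record("work");
    return journal.entries.size();
}

// Fixed sequence: the start() call is moved up; no statement is rewritten.
std::size_t run_fixed() {
    Journal journal;
    journal.start();
    journal.record("init");
    journal.record("work");
    return journal.entries.size();
}
```

Every individual statement is correct in both versions, so locating the fault requires reasoning about the order of calls rather than about any single line.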