Overview
Recent work has confirmed that LLMs have strong capabilities in repairing buggy code. To evaluate their performance on our collected C/C++ benchmark, we select several representative LLMs. The evaluation settings are categorized into single-round and conversation-based repair, each with its own evaluation metrics.
Contents
Conversation-based repair
Case study
Conversation-based repair
For the conversation-based repair experiment described in Section 6.2 of the paper, we delineate the details of our experimental settings. We introduce two hyperparameters, m and n, representing the maximum number of repair attempts and the maximum conversation length within each attempt; m and n are set to 10 and 3, respectively. One repair attempt therefore consists of three consecutive conversations. The aim is not only to resolve failures but also to improve performance iteratively by investigating the failure points that occur within the same attempt.
The following illustrates a conversation-based prompt, where the buggy function F_n and error message M_n represent the nth conversation within one attempt.
At the beginning of an attempt, F_0 and M_0 are extracted from our dataset and concatenated to construct the initial prompt. In the verification phase of each conversation (for example, the first one), the LLM outputs a patch, which is evaluated against the corresponding unit test cases to obtain an error message M_1. If the patch passes all test cases, it is considered plausible, the conversation stops, and the repair process ends. Otherwise the patch is invalid, and we update the prompt with the error message M_1 and the updated buggy function F_1 to build a new prompt for the next conversation. Following this rationale, after three conversation iterations the repair resets to the initial prompt and starts a new attempt, up to m attempts in total.
Case study for Conversation-based repair (Successful repair by both GPT-3.5 and Phind34B)
To describe this kind of bug, we take the buggy function lyjson-number, shown in Table 9, as an example. This bug belongs to the Sanitizer category; rectifying it requires amending the expression to the right of the operator < from exponent to (exponent - minus). Both Phind34B and GPT-3.5 understand the buggy semantics and successfully output plausible patches. Phind34B gives the same patch as the developers provide, while the patch generated by GPT-3.5, (exponent - 1), is semantically equivalent, because the variable minus is initialized to 1 at the beginning of the function and never modified afterwards. As a result, the two patches are semantically equivalent, and both allow the function to pass all the test cases.
Case study for Conversation-based repair (Successful repair by only GPT-3.5)
We select a bug from project aws-c-common, as illustrated in Table 3, which can only be repaired by GPT-3.5 but not Phind34B. This bug belongs to Signature: Fault Input Type; to correct it, the type of the first parameter of function s_base64_get_decoded_value should be modified from char to unsigned char. In this case, GPT-3.5 generates the same patch as the developers provide, but Phind34B fails to output a plausible patch. Moreover, Phind34B is unable to comprehend the root cause that triggers the bug even when we provide fault-localization information. Even with this hint, Phind34B still insists that the bug is triggered by other elements and modifies code snippets elsewhere.
Case study for Conversation-based repair (Failed repair by both GPT-3.5 and Phind34B)
Given the low successful repair rate of LLMs on the dataset, this kind of bug constitutes a substantial proportion of it. In this section, we select bugs that neither GPT-3.5 nor Phind34B can repair; we take a bug from project nng, shown in Table 12, as an example. The bug lies in function nni_chunk_insert, belonging to the category Memory Error: Uncontrolled Resource Consumption, in which the identifier ch->ch_ptr should be substituted by ch->ch_buf. Both GPT-3.5 and Phind34B make many attempts to repair the bug, but none of the patches works. The following patches are generated with a high frequency of occurrence: 1. The third parameter of the callee function memmove is replaced by ch->ch_len - len. 2. The third parameter of memmove is replaced by ch->ch_len - (ch->ch_ptr - ch->ch_buf). However, both are far from the correct patch provided by the developers. We do find an interesting patch that appears only once among all patches generated by GPT-3.5 with the temperature T set to 1: it adds an additional code line, ch->ch_ptr = ch->ch_buf;, after the call to memmove. Since the keyword ch->ch_buf appears and is assigned to ch->ch_ptr, we consider this a partially correct patch. We believe that if the maximum number of repair attempts were increased, this bug might be repaired, and more bugs for which GPT-3.5 and Phind34B generate partially correct patches would also be successfully repaired.