The Necessity of the Initial Retrieval Phase & Ablation Study (Reviewers B and C)
The inception of PatchFinder's two-phase design was driven by both *theoretical considerations* and *preliminary observations*.
Theoretically, we recognized that direct retrieval with a fine-tuned LLM over a vast yet extremely imbalanced dataset could diffuse the model's attention, reducing its effectiveness in accurately identifying relevant patches. When minimizing the training loss, the features of the minority class (i.e., security patches) are easily treated as noise and often ignored, so the minority class is far more likely to be misclassified than the majority class (i.e., non-patch commits). Counteracting this by modifying the loss function alone would still impose a substantial computational burden, since it requires fine-tuning the LLM on a dataset exceeding 20 million entries.
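To make the loss-modification point concrete, the standard remedy is to re-weight the minority class in the objective; a minimal PyTorch sketch (hypothetical names, assuming a binary patch/non-patch classification head) could look like:

```python
import torch
import torch.nn.functional as F

def weighted_patch_loss(logits: torch.Tensor, labels: torch.Tensor,
                        imbalance_ratio: float = 5000.0) -> torch.Tensor:
    """Binary cross-entropy that up-weights the minority (patch) class.

    labels: 1.0 for patch commits, 0.0 for non-patch commits.
    Without the weight, gradients from ~5000 non-patch commits per patch
    drown out the patch signal, so the model learns to predict 'non-patch'.
    """
    pos_weight = torch.tensor([imbalance_ratio], device=logits.device)
    return F.binary_cross_entropy_with_logits(logits, labels, pos_weight=pos_weight)
```

Even with such a re-weighted loss, every one of the ~21.8 million pairs must still pass through the LLM during fine-tuning, which is where the computational burden lies.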
Preliminary observations further validated these theoretical concerns. In early trials, we noted the LLM's limitations in direct retrieval from vast datasets, corroborating the need for a preliminary filtering phase to improve focus and efficiency. We summarize these initial findings, obtained before implementing PatchFinder, as follows:
- **Phase-2 Only**, isolated from the initial filtering, significantly underperforms on the complete dataset, with Recall@K peaking at only 10.42% at K=100. This highlights the LLM's limitations on vast, imbalanced datasets (a 1:5000 patch to non-patch ratio) and demonstrates the critical role of Phase-1 in refining the candidate set (from **1:5000** to **1:100**) so that the LLM can be applied effectively to this task.
- **Phase-1 Only** effectively condenses the dataset, ensuring that the LLM's analysis in Phase-2 is directed towards a more refined set of candidates. Its Recall@K reaches up to 80.42% at K=100 with an MRR of 0.4827, highlighting its crucial role in maintaining high coverage. (Recall@K and MRR are computed as in the sketch after this list.)
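For reference, a minimal sketch of how these two metrics are computed (hypothetical helper names; the input is the 1-based rank of the true patch in each CVE's ranked candidate list):

```python
def recall_at_k(true_patch_ranks: list[int], k: int = 100) -> float:
    """Fraction of CVEs whose true patch appears within the top-k candidates."""
    return sum(rank <= k for rank in true_patch_ranks) / len(true_patch_ranks)

def mrr(true_patch_ranks: list[int]) -> float:
    """Mean reciprocal rank of the true patch across all CVEs."""
    return sum(1.0 / rank for rank in true_patch_ranks) / len(true_patch_ranks)

# Example: three CVEs whose true patches ranked 1st, 4th, and 120th.
ranks = [1, 4, 120]
print(recall_at_k(ranks, k=100))  # 2/3 ≈ 0.667
print(mrr(ranks))                 # (1 + 1/4 + 1/120) / 3 ≈ 0.419
```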
This ablation study shows that the two-phase approach is vital for tackling the challenges mentioned above. We believe these quantitative results make the motivation for a two-phase design clearer, and we will include the updated ablation study in the final version.
We also considered the capability of popular LLMs, including GPT-4 and CodeLlama, for this task.
Regarding GPT-4, while its capabilities in Automatic Program Repair are notable, as shown in ChatRepair [1], applying it to our large-scale analysis (21,781,044 <CVE description, code commit> pairs in total) would incur considerable financial cost [2].
For a clearer picture, consider a quantitative estimate of the cost of using GPT-4 for our task. Each CVE requires processing a significant number of commits, averaging around 5,000. Given the token budget (128 tokens for the CVE description plus 512 tokens per commit), the cost is calculated as follows:
Cost per CVE = (128 tokens (CVE) + 512 tokens (commit)) × 5,000 commits × $0.03 / 1,000 tokens = 3,200,000 tokens × $0.00003 per token = $96
*Note:* this covers only the prompt tokens ($0.03/1K prompt tokens) under the 8K-context pricing; the sampled-token price ($0.06/1K sampled tokens) is not yet included.
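The same estimate as a runnable back-of-the-envelope script (using only the token budgets and 8K-context prompt price stated above):

```python
# Back-of-the-envelope GPT-4 prompt cost for ranking one CVE's commits.
CVE_TOKENS = 128                    # budget for the CVE description
COMMIT_TOKENS = 512                 # budget per candidate commit
COMMITS_PER_CVE = 5000              # average number of candidate commits per CVE
PROMPT_USD_PER_TOKEN = 0.03 / 1000  # 8K-context prompt pricing ($0.03/1K)

prompt_tokens = (CVE_TOKENS + COMMIT_TOKENS) * COMMITS_PER_CVE
cost_per_cve = prompt_tokens * PROMPT_USD_PER_TOKEN
print(f"{prompt_tokens:,} prompt tokens -> ${cost_per_cve:.2f} per CVE")
# 3,200,000 prompt tokens -> $96.00 per CVE (sampled tokens not included)
```

Over the full dataset (21,781,044 pairs, i.e., roughly 4,356 CVEs at ~5,000 commits each), the prompt cost alone would exceed $400,000.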
Considering these significant financial expenses, we believe GPT-4 is not a practical choice for this task. Moreover, the sheer volume of commits makes it infeasible for users to query ChatGPT manually. Together, these factors underscore the impracticality of employing ChatGPT for this task.
In our trials with CodeLlama, we encountered challenges with both efficiency and effectiveness. In a practical test involving 20 sampled CVEs, CodeLlama-7b-Instruct-hf required over two days (2 days, 5 hours, 27 minutes, and 30 seconds) to process 19,649 commits, covering fewer than 4 CVEs. We have also uploaded the results of prompting CodeLlama below. This inefficiency, coupled with suboptimal results, led us to reconsider its suitability for our task: locating the patch commit among roughly 5,000 candidate commits per CVE.
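For context, a generation loop of the kind used in such trials looks roughly like the following (an illustrative sketch, not our exact prompt; `score_commit` and the prompt wording are hypothetical):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "codellama/CodeLlama-7b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def score_commit(cve_desc: str, commit_msg: str, diff: str) -> str:
    """Ask the model whether one candidate commit is the patch for a CVE."""
    prompt = (
        "[INST] Given this CVE description:\n"
        f"{cve_desc}\n\n"
        "and this commit:\n"
        f"{commit_msg}\n{diff[:2000]}\n\n"
        "Answer yes or no: is this commit the security patch for the CVE? [/INST]"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```

One `generate` call per candidate commit means ~5,000 calls per CVE; at the observed throughput (19,649 commits in ~53.5 hours, i.e., roughly 10 seconds per commit), the multi-day runtime follows directly.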
Ultimately, we chose CodeReviewer for PatchFinder, aiming for a balance between effective patch tracing and practical considerations such as processing time and cost. This choice has allowed us to address the unique challenges of tracing security patches for CVEs both efficiently and effectively.
Notably, PatchFinder achieves a Recall@10 of 80.63% and an MRR of 0.7951, while the Manual Effort@10 required is reduced to 2.77, a 1.94× improvement over current leading methods. This shows that CodeReviewer (220M parameters) already ranks the commits well, so there is little to gain from a billion-parameter LLM like CodeLlama.
🎯 By leveraging PatchFinder, we are committed to enhancing cybersecurity by ensuring the integrity and reliability of these critical patches.
[1] Xia, Chunqiu Steven, and Lingming Zhang. "Keep the Conversation Going: Fixing 162 out of 337 Bugs for $0.42 Each Using ChatGPT." arXiv preprint arXiv:2304.00385 (2023).