Demystifying Memorization in LLM-based Program Repair via a General Hypothesis Testing Framework
Abstract
Large Language Models (LLMs) have achieved remarkable success in various applications, particularly in code-related tasks such as code generation and program repair, setting new performance benchmarks. However, the extensive use of large training corpora raises concerns about whether these achievements stem from genuine understanding or mere memorization of training data—a question often overlooked in current research. This paper aims to study the memorization issue within LLM-based program repair by investigating whether the correct patches generated by LLMs are the result of memorization. The key challenge lies in the absence of ground truth for confirming memorization, leading to various ad-hoc methods designed for its detection. To address this challenge, we first propose a general framework that formalizes memorization detection as a general hypothesis testing problem, where existing approaches can be unified by defining a low-probability event under the null hypothesis that the data is not memorized. The occurrence of such an event leads to the rejection of the null hypothesis, indicating potential memorization.
Based on this framework, we design two specific methods (i.e., low-probability events) to detect potential memorization: 1) basic ground-truth matching, and 2) reassessment after substantial code mutation. We investigate the memorization issue in LLM-based program repair using two datasets: Defects4J, a widely used benchmark that is likely included in the training data, and GitBug-Java, a new dataset that is unlikely to be part of the training data. Our findings reveal that a significant portion of correct patches exactly match the ground truths in Defects4J (e.g., 78.83% and 87.42% on GPT-3.5 and CodeLlama-7b, respectively). Moreover, even after significant modifications to the buggy code, where the original repairs should not be generated, a considerable percentage of bugs (e.g., 81.82% on GPT-3.5 and 88.24% on CodeLlama-7b) continue to be fixed exactly as in the original bug fixes, indicating a high likelihood of memorization. Furthermore, we evaluate existing memorization detection methods and demonstrate their ineffectiveness in this context (e.g., most AUROCs are below 0.5). The theoretical analysis under our hypothesis testing framework shows that their defined events may not meet the requirements for being low-probability. The study highlights the critical need for more robust and rigorous evaluations in LLM-based software engineering research, ensuring a clear distinction between true problem-solving capabilities and mere memorization.
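To make the framework concrete, the following is a minimal, illustrative sketch of the two detection events described above. The helper `normalize` and the decision rule are our own simplifications for exposition; the paper's actual matching and mutation procedures may differ.

```python
# Minimal sketch of the hypothesis-testing view of memorization detection.
# Assumption: `normalize` is a crude stand-in for whatever patch normalization
# is actually used; it only collapses whitespace.

def normalize(code: str) -> str:
    # Collapse whitespace so trivial formatting differences are ignored.
    return " ".join(code.split())

def event_exact_match(generated_patch: str, ground_truth_fix: str) -> bool:
    # Event 1 (basic ground-truth matching): the generated patch exactly
    # matches the developer's ground-truth fix.
    return normalize(generated_patch) == normalize(ground_truth_fix)

def event_fix_survives_mutation(patch_for_mutated_bug: str, ground_truth_fix: str) -> bool:
    # Event 2 (reassessment after substantial code mutation): although the
    # buggy code was mutated so that the original fix should no longer be
    # produced, the model still emits that original ground-truth fix verbatim.
    return normalize(patch_for_mutated_bug) == normalize(ground_truth_fix)

def flag_potential_memorization(event_occurred: bool) -> bool:
    # Under the null hypothesis "this bug-fix pair was not memorized", each
    # event above is assumed to be low-probability; observing it rejects the
    # null hypothesis and flags potential memorization.
    return event_occurred
```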
Workflow of Our Study
Figure 1. Overview of Our Study
Using ChatRepair and MemPrompt, we repaired bugs from the Defects4J and GitBug-Java datasets and collected the corresponding correct fixes, which are our primary focus. Our objective is to determine to what extent these correct fixes result from memorization. Since there is no ground truth for memorization itself, we adopt three distinct strategies to approximate its presence: 1) Comparison with the ground-truth fixes to assess direct matches. 2) Analysis of how the generated fixes change when the original buggy code is mutated. 3) Application and assessment of existing methods designed to detect memorization.
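As an illustration of the third strategy, the sketch below shows one way an existing memorization detector could be assessed. The function names and the labeling choice (using an approximate memorization label, e.g., the exact-match outcome from the first strategy) are our own assumptions, not the exact protocol of the study.

```python
# Illustrative sketch only: assess an existing memorization detector by AUROC.
# Assumptions: `detector_scores` are higher when the detector believes a fix is
# memorized; `approx_memorized_labels` is an approximate 0/1 label derived, for
# example, from exact ground-truth matching.

from sklearn.metrics import roc_auc_score

def assess_detector(detector_scores, approx_memorized_labels):
    # AUROC near 1.0 means the detector separates approximately memorized fixes
    # from the rest; values at or below 0.5 mean it does no better than chance.
    return roc_auc_score(approx_memorized_labels, detector_scores)

# Toy usage with made-up values:
# auroc = assess_detector([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0])
```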
Content
On this website, we provide further details, including the concrete bug IDs of the plausibly and correctly repaired bug cases and additional experimental results. Please see the details below: