Recent advances in code large language models (CodeLLMs) have made them indispensable tools in software engineering. However, their outputs occasionally include proprietary or sensitive code snippets, raising privacy and intellectual-property concerns. Conducting code training data detection (TDD) is therefore essential for ensuring ethical and compliant deployment. Recent detection methods, such as Min-K% and its variants, have shown promise on natural language data. However, their effectiveness on code remains largely unexplored, especially given code's structured grammar and fundamentally different criteria for judging similarity.
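As context for the class of methods evaluated here, Min-K% scores a sample by the average log-probability the model assigns to its k% least-likely tokens, since a sample seen during training tends to contain fewer surprising tokens. The following is a minimal sketch assuming a HuggingFace-style causal-LM interface; the function name `min_k_prob`, the placeholder model, and the value of k are illustrative rather than our experimental configuration.

```python
# Minimal sketch of a Min-K% Prob score, assuming a HuggingFace causal
# CodeLLM; the model name and k value below are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_prob(code, model, tokenizer, k=0.2):
    """Average log-probability of the k% least-likely tokens of `code`."""
    ids = tokenizer(code, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability the model assigns to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    n = max(1, int(k * token_lp.numel()))
    # Average over the k% tokens the model found least likely; a higher
    # score is taken as evidence that the sample was seen during training.
    return torch.topk(token_lp, n, largest=False).values.mean().item()

# Illustrative usage (model name is a placeholder, not our setup):
# model = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase-1b")
# tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderbase-1b")
# score = min_k_prob("def add(a, b):\n    return a + b\n", model, tokenizer)
```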
To address this gap, in this paper we systematically evaluate 7 state-of-the-art TDD methods on code data using 8 CodeLLMs. To enable this evaluation, we construct CodeSnitch, a comprehensive function-level source code benchmark comprising 9,000 code samples across three programming languages, each explicitly known to be either included in or excluded from CodeLLM training. Additionally, we propose targeted data mutation strategies to assess the robustness of existing TDD methods under three different settings. Both the mutation strategies and the experimental setups are grounded in the established Type-1 to Type-4 code clone detection criteria.
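To make the mutation design concrete, the sketch below illustrates the flavor of clone-inspired mutations for Python code (Type-1: comment and layout changes; Type-2: systematic identifier renaming). The operators and the helper names `type1_mutation`/`type2_mutation` are simplified illustrations, not the exact implementation used in CodeSnitch.

```python
# Sketch of clone-inspired mutations (Type-1: comments/layout,
# Type-2: identifier renaming); simplified for illustration.
import io
import re
import tokenize

def type1_mutation(src):
    """Drop comments and renormalize spacing (Type-1-style change)."""
    toks = tokenize.generate_tokens(io.StringIO(src).readline)
    kept = [(t.type, t.string) for t in toks if t.type != tokenize.COMMENT]
    return tokenize.untokenize(kept)

def type2_mutation(src, renames):
    """Systematically rename identifiers (Type-2-style change).

    A regex rename suffices for a sketch; a real operator would work on
    the AST to avoid touching string literals.
    """
    pattern = r"\b(" + "|".join(map(re.escape, renames)) + r")\b"
    return re.sub(pattern, lambda m: renames[m.group(1)], src)

# Example: re-query a detector on a renamed variant of a member sample.
mutated = type2_mutation("total = 0\nfor x in xs:\n    total += x\n",
                         {"total": "acc", "x": "item"})
```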
This study enhances our understanding of existing TDD methods and lays the groundwork for developing more advanced approaches.