This website provides the supplementary materials for the paper "Can We Detect ChatGPT-Generated Code? An Empirical Study". It extends the results presented in the paper with additional experiments that were omitted due to the page limit.
The website is organized as follows:
Overview & Datasets: In this section, we provide a comprehensive overview of how the datasets were collected, and we provide public links to the two large-scale datasets for easy access: CCD, which contains 1.08M code-related samples, and NLCD, which contains 1.16M natural-language samples.
RQ1&2 Effectiveness: In this section, we conduct comprehensive experiments on the NL and Code datasets to evaluate the effectiveness of six existing detectors in terms of both accuracy and balance metrics. Furthermore, we evaluate each detector on the code-generation dataset separately for six programming languages. The results demonstrate that most detectors are more effective at detecting AIGC in natural language than in code.
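As a concrete illustration of how detector outputs can be scored against both kinds of metrics, the sketch below computes accuracy and AUC with scikit-learn; the input format (one detector probability per sample) and the chosen metrics are assumptions for illustration, not our exact evaluation code.

```python
# Minimal scoring sketch (assumed detector-output format, not the paper's evaluation code).
from sklearn.metrics import accuracy_score, roc_auc_score

def score_detector(labels, probs, threshold=0.5):
    """labels: 1 = LLM-generated, 0 = human-written; probs: detector scores in [0, 1]."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return {
        "accuracy": accuracy_score(labels, preds),
        "auc": roc_auc_score(labels, probs),  # threshold-free and sensitive to class balance
    }

# Toy example:
print(score_detector([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4]))
```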
RQ3 Fine-tuning: In this section, we evaluate the effectiveness of fine-tuned binary classification on each individual dataset and on a composite dataset by comparing results with and without fine-tuning. The aim is to explore whether tuning on one dataset improves generalization to other internal or external domains. The results show that tuning on NL yields limited generalization, whereas tuning on code generalizes well across both internal and external domains.
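For reference, a minimal sketch of how such a binary classifier could be fine-tuned with HuggingFace Transformers is given below; the RoBERTa backbone, the toy in-memory data, and the hyperparameters are illustrative assumptions rather than our exact training setup.

```python
# Illustrative fine-tuning sketch (backbone, data, and hyperparameters are assumptions).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Hypothetical in-memory data: label 1 = LLM-generated, 0 = human-written.
train = Dataset.from_dict({
    "text": ["def add(a, b):\n    return a + b",
             "def add(x, y): return x+y  # quick helper"],
    "label": [1, 0],
})
train = train.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=512),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-detector",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train,
)
trainer.train()
```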
RQ4 Robustness: In this section, we use various mutation methods from nlpaug and code-augmentation repositories to build corrupted datasets and test each detector's robustness. The experimental results reveal that the fine-tuned models are generally more robust under our conservative mutation operators than the six selected detectors.
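As an illustration of the kind of conservative, word-level mutation applied to NL samples, the sketch below uses an nlpaug augmenter; the specific augmenter and its intensity are assumptions and do not reproduce our exact mutation operators.

```python
# Illustrative NL mutation sketch with nlpaug (augmenter choice and intensity are assumptions).
import nlpaug.augmenter.word as naw

# Swap a small fraction of adjacent words to lightly corrupt a sample.
aug = naw.RandomWordAug(action="swap", aug_p=0.1)

sample = "This function reads a CSV file and returns a list of dictionaries."
corrupted = aug.augment(sample)  # recent nlpaug versions return a list of strings
print(corrupted[0] if isinstance(corrupted, list) else corrupted)
```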
ChatGPT is an AIGC model that produces high-quality responses across various applications, including software development and maintenance. Numerous LLM detectors have been developed and evaluated on natural language data; however, their performance on code-related content generated by LLMs remains unexplored. To fill this gap, in this paper we present the first empirical study on evaluating existing AIGC detectors in the software domain.
We created a comprehensive dataset of 2.23M samples comprising code-related content produced by three LLMs, encompassing popular software activities such as Q&A (150K), code summarization (1M), and code generation (1.08M). We evaluated six AIGC detectors, consisting of three commercial and three open-source solutions, and assessed their performance on this dataset.
Additionally, we conducted a human study to understand human detection capabilities and compare them with the existing AIGC detectors. Our results indicate that AIGC detectors perform worse on code-related data than on natural language data. Fine-tuning can enhance detector performance, especially for content within the same domain, but generalization remains a challenge. The human study reveals that detection by humans is also quite challenging.
To conduct this study and answer these questions, we constructed two datasets, namely the Code-Related Content Dataset (CCD) and the Natural Language-Related Content Dataset (NLCD), by generating related content using LLMs in the domains of programming and natural language, respectively.
CCD consists of 1.08M samples across three code-related scenarios, i.e., text-to-code generation (1M), CONCODE (66K), and APPS code generation (8.7K).
NLCD, which contains 1.16M samples, was constructed by using ChatGPT to answer Q&A questions from Stack Overflow (150K) and to summarize code into documentation (Code2Doc, 1M).
Note that each sample in CCD and NLCD is a pair of human-written data and LLM-generated data. We explicitly keep the prompt, the ground-truth code, and the full Doc2Code/Code2Doc cycle for each sample, so that the datasets can easily foster future research on code detection, program synthesis, etc.
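To make the pairing concrete, a hypothetical record from CCD might look as follows; the field names below are illustrative only, and the released data files define the actual schema.

```python
# Hypothetical CCD record layout (field names are illustrative, not the released schema).
sample = {
    "prompt": "Write a Python function that returns the factorial of n.",
    "human_code": "def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)",
    "llm_code": "def factorial(n):\n"
                "    result = 1\n"
                "    for i in range(2, n + 1):\n"
                "        result *= i\n"
                "    return result",
    "llm": "ChatGPT",  # which LLM produced llm_code
    "code2doc_summary": "Computes the factorial of n iteratively.",  # from the Code2Doc step of the cycle
}
```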
Extensive experiments reveal that current AIGC detectors struggle to detect LLM-generated code compared to natural language. Although fine-tuning improves performance, the generalization capacity remains limited. A human study also suggests that humans encounter similar difficulties; on code data in particular, human judgment is close to blind guessing due to the complexity of code. Overall, the main contributions of our paper are summarized as follows:
• We conduct a comprehensive empirical study to evaluate the performance of 13 AIGC detectors, including seven open-source detectors and six commercial detectors, on detecting code-related content generated by LLMs. To the best of our knowledge, this is the first study that specifically evaluates the performance of different AIGC detectors on code-related content generated by LLMs such as ChatGPT, WizardCoder, and CodeLlama.
• We construct two large-scale datasets, namely CCD and NLCD, consisting of 1.08M code-related samples and 1.16M natural-language samples, respectively. We have made our code and data publicly available to facilitate future research.
• We conduct an extensive robustness study and a human study to investigate the difficulty of detecting LLM-generated content and to compare human performance with that of AIGC detectors.