This website provides the supplementary materials for the paper "Can We Detect ChatGPT-Generated Code? An Empirical Study". It extends the presented results with experiments not included in the paper due to the page limit.
The website is organized as follows:
Overview & Datasets: In this section, we provide a comprehensive overview of how the datasets were collected, together with public links to two extensive datasets for easy access: CCD, which contains 467.5K code-related samples, and NLCD, which contains 25K natural language-related samples.
RQ1 Effectiveness: In this section, we conduct comprehensive experiments on the NL and Code datasets to evaluate the effectiveness of six existing detectors in terms of both accuracy and balance metrics. Furthermore, we evaluate each detector on the code-generation dataset separately for six programming languages. The results demonstrate that most detectors are more effective at detecting AIGC in natural language than in code.
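As a point of reference, the snippet below sketches how detector outputs can be scored with accuracy and a balance-aware metric such as AUC. The labels, scores, and threshold are purely illustrative and are not taken from the paper's evaluation pipeline.

```python
# Illustrative scoring sketch: given detector scores for samples labeled as
# human-written (0) or ChatGPT-generated (1), compute accuracy and AUC,
# the latter being less sensitive to class imbalance than raw accuracy.
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]                    # ground-truth labels (hypothetical)
y_score = [0.1, 0.4, 0.8, 0.7, 0.3, 0.2]       # detector probability of "AI-generated"
y_pred = [int(s >= 0.5) for s in y_score]      # hard decision at a 0.5 threshold

print("Accuracy:", accuracy_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))
```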
RQ2 Fine-tuning: In this section, we evaluate the effectiveness of fine-tuned binary classification on each individual dataset and on a composite dataset by comparing results with and without tuning. The aim is to explore whether tuning on one dataset generalizes to other internal or external domains. The results show that tuning on NL has limited generalization capacity, whereas tuning on code generalizes across both internal and external domains.
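To illustrate what fine-tuning a binary classifier on human/ChatGPT pairs can look like, the following sketch fine-tunes a RoBERTa sequence classifier with Hugging Face Transformers. The model choice, toy data, and hyperparameters are assumptions for demonstration only, not the exact setup used in the paper.

```python
# Hedged fine-tuning sketch: RoBERTa binary classifier (human vs. ChatGPT).
# Texts and labels below are placeholders; the paper's actual model,
# hyperparameters, and data splits may differ.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["def add(a, b): return a + b", "def add(x, y):\n    return x + y"]
labels = [0, 1]  # 0 = human-written, 1 = ChatGPT-generated (hypothetical)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

class PairDataset(torch.utils.data.Dataset):
    def __init__(self, enc, labels):
        self.enc, self.labels = enc, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=PairDataset(enc, labels),
)
trainer.train()
```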
RQ3 Robustness: In this section, we use various mutation methods from the nlpaug and code augmentation repositories to build corrupted datasets and test each detector's robustness. The experimental results reveal that fine-tuned models are generally more robust under our conservative mutation operators than the six selected detectors. However, the models' robustness can degrade on datasets that were not included in the fine-tuning data.
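For example, a lightweight character-level mutation can be produced with nlpaug (one of the repositories mentioned above); the specific augmenter and perturbation rates here are illustrative rather than the exact conservative operators used in the study.

```python
# Illustrative nlpaug mutation: simulate mild keyboard typos in an NL sample.
# The study's actual mutation operators and settings may differ.
import nlpaug.augmenter.char as nac

aug = nac.KeyboardAug(aug_char_p=0.1, aug_word_p=0.1)  # small perturbation rates
text = "ChatGPT generates fluent explanations for most programming questions."
print(aug.augment(text))
```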
RQ4 Human Study: In this section, we conduct the human study from the paper using few-shot and zero-shot settings, i.e., with and without examples, on different distinguishing tasks. The results are consistent with the finding that humans are better at distinguishing natural language data than code data.
ChatGPT is an AIGC model that produces high-quality responses across various applications, including software development and maintenance. Numerous AIGC detectors have been developed and evaluated on natural language data. However, their performance on code-related content generated by ChatGPT remains unexplored. To fill this gap, in this paper, we present the first empirical study on evaluating existing AIGC detectors in the software domain.
We created a comprehensive dataset of 492.5K samples, including code-related content produced by ChatGPT that covers popular software activities such as Q&A (115K), code summarization (126K), and code generation (226.5K). We evaluated six AIGC detectors, three commercial and three open-source, on this dataset.
Additionally, we conducted a human study to understand human detection capabilities and compare them with those of the existing AIGC detectors. Our results indicate that AIGC detectors perform worse on code-related data than on natural language data. Fine-tuning can enhance detector performance, especially for content within the same domain, but generalization remains a challenge. The human evaluation reveals that detection by humans is quite challenging.
Unfortunately, the download link has been removed, and we are currently awaiting the re-submission of our paper.
To conduct this study and answer these questions, we constructed two datasets, namely the Code-Related Content Dataset (CCD) and the Natural Language-Related Content Dataset (NLCD), by generating related content using ChatGPT in the domains of programming and natural language, respectively.
CCD consists of 467.5K samples across three different code-related scenarios, i.e., Q&A from Stack Overflow (115K), code-to-text generation (126K), and text-to-code generation (226.5K).
NLCD, which contains 25K samples, was constructed by using ChatGPT to polish content from Wikipedia. Note that each sample in CCD and NLCD is a pair consisting of the human-generated data and the corresponding ChatGPT-generated data.
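A minimal sketch of how such paired samples could be loaded is shown below, assuming a JSON Lines layout with human and chatgpt fields; the file name and schema are assumptions, and the released datasets may use a different format.

```python
# Hedged loading sketch: read human/ChatGPT pairs from a JSON Lines file and
# flatten them into (text, label) examples for a detector. The file name and
# field names ("human", "chatgpt") are assumed, not the guaranteed schema.
import json

examples = []
with open("ccd_code_generation.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        pair = json.loads(line)
        examples.append((pair["human"], 0))    # 0 = human-written
        examples.append((pair["chatgpt"], 1))  # 1 = ChatGPT-generated
print(f"Loaded {len(examples)} labeled samples")
```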
Extensive experiments reveal that current AIGC detectors struggle to detect code-related data compared to natural language data. Although fine-tuning improves performance, the generalization capacity remains limited. A human study also suggests that humans encounter similar difficulties, particularly when dealing with code data, where distinguishing can amount to blind guessing due to its complexity. Overall, the main contributions of our paper are summarized as follows:
• We conducted a comprehensive empirical study to evaluate the performance of six AIGC detectors, including three open-source detectors and three commercial detectors, on detecting code-related content generated by ChatGPT. To the best of our knowledge, this is the first study that specifically evaluates the performance of different AIGC detectors on such content.
• We constructed two large-scale datasets, namely CCD and NLCD, consisting of 467.5K code-related samples and 25K natural language-related samples, respectively. We have made our code and data public to facilitate follow-up research.
• We conducted a human study to investigate the difficulty of detecting content generated by ChatGPT and compared the results with the performance of AIGC detectors.