Answers to RQ3: The mutation operators can effectively change the detection results for content generated by ChatGPT, but they may fail to change the results for human-written content. This may be because ChatGPT-generated content is more sensitive to mutations.
The study focuses on evaluating the robustness of the fine-tuned RoBERTa-QA detector on APPS-Code.
The initial seeds come from a RoBERTa-QA detector tuned on Composite-Code, which correctly identifies 3792 ChatGPT-generated code samples and 1442 human-written code samples in the APPS-GPT test set. Our mutation operators cover five common cases. Each operator modifies the code through the Python AST, and we ensure that the code produced by every mutation operation still passes the test.
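As an illustration only, AST-based mutation of this kind can be sketched as follows. The operator names VarRename and FuncAddLine are taken from the text, but their bodies here are assumptions: the source does not specify what each operator does, so we assume VarRename applies a given name mapping and FuncAddLine inserts a neutral assignment into each function body. The helpers `mutate` and `passes_test`, and the filler name `_unused`, are hypothetical. The sketch requires Python 3.9 or later for `ast.unparse`.

```python
import ast


class VarRename(ast.NodeTransformer):
    """Assumed behavior: rename identifiers according to a mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Rename variable reads and writes.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Rename function parameters as well.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


class FuncAddLine(ast.NodeTransformer):
    """Assumed behavior: add one neutral statement to each function body."""

    def visit_FunctionDef(self, node):
        self.generic_visit(node)  # also handle nested functions
        filler = ast.parse("_unused = 0").body[0]
        # Keep an existing docstring as the first statement.
        index = 1 if ast.get_docstring(node) is not None else 0
        node.body.insert(index, filler)
        return node


def mutate(source, transformers):
    """Apply the transformers in order and return the mutated source."""
    tree = ast.parse(source)
    for transformer in transformers:
        tree = transformer.visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def passes_test(source, func_name, cases):
    """Run the code and check each (args, expected) pair."""
    namespace = {}
    exec(source, namespace)  # only for trusted, illustrative code
    func = namespace[func_name]
    return all(func(*args) == expected for args, expected in cases)


original = """
def add(a, b):
    total = a + b
    return total
"""

# Mixing two operators, then keeping the mutant only if it is still correct.
mutated = mutate(original, [VarRename({"total": "result"}), FuncAddLine()])
still_correct = passes_test(mutated, "add", [((1, 2), 3), ((0, 0), 0)])
```

A mutant would be kept as a test input only when `still_correct` is true, which mirrors the requirement that every mutation preserves behavior on the tests.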
Table 7 shows that mutation operations effectively evade detection on ChatGPT-generated data, particularly FuncAddLine mutations, while VarRename is less effective. Mixing multiple mutations increases the probability of successful evasion.
Conversely, mutating human-written content has minimal impact on detector predictions, highlighting the greater sensitivity of ChatGPT-generated content to mutations.
We selected the correctly classified samples as seeds and performed mutations on each detector's dataset. We then evaluated the performance on each mutation set. The table reports Human-consistency and ChatGPT-consistency. The columns raw_total, raw_correct, and mutate_correct give the total number of samples in the original test set, the number of samples the detector classifies correctly, and the number of samples still classified correctly after mutation. $C_{con}$ and $H_{con}$ denote, for ChatGPT-generated and human-written samples respectively, the ratio of samples that are correctly detected both before and after mutation, indicating the consistency of the prediction.
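The consistency ratio described above can be sketched as follows, assuming it is computed as mutate_correct divided by raw_correct. The source does not spell out the formula, so this reading is an assumption. The function name `consistency` and the toy prediction lists are hypothetical and are not results from the study.

```python
def consistency(before_correct, after_correct):
    """Return (raw_total, raw_correct, mutate_correct, ratio).

    before_correct[i] is True if sample i was classified correctly before
    mutation; after_correct[i] is the same flag after mutation. Only samples
    that were correct before mutation count as seeds (assumed definition).
    """
    if len(before_correct) != len(after_correct):
        raise ValueError("both lists must cover the same samples")
    seeds = [i for i, ok in enumerate(before_correct) if ok]
    if not seeds:
        raise ValueError("no correctly classified seeds to mutate")
    mutate_correct = sum(1 for i in seeds if after_correct[i])
    return len(before_correct), len(seeds), mutate_correct, mutate_correct / len(seeds)


# Toy flags for five samples of one class (illustrative values only).
before = [True, True, True, False, True]
after = [True, False, True, False, True]
raw_total, raw_correct, mutate_correct, ratio = consistency(before, after)
```

Running the same function separately on the ChatGPT-generated and the human-written samples would give the two per-class ratios.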