Answers to RQ3: The fine-tuned models are generally robust under our conservative mutation operators. However, a model's robustness can degrade on datasets that were not included during its fine-tuning.
By analyzing the results in the $C_{con}$ and $H_{con}$ columns, we observe that, on natural language data, the models still perform well on human-written samples, but some ChatGPT-generated samples that were correctly predicted before mutation are now missed. On code data the pattern is reversed: the models still detect the mutated code correctly, but some human-written code that was correctly predicted before is now misclassified. Nevertheless, the overall performance of the models does not decrease substantially, in part because there are also cases where samples that were previously mispredicted are now predicted correctly.
We selected the correctly predicted samples as seeds and applied the mutation operators to each detector's dataset, then evaluated the performance of each detector on the resulting mutation sets. The table reports the Human-consistency and ChatGPT-consistency results. The columns raw_total, raw_correct, and mutate_correct give the total number of samples in the original test set, the number of samples the detector predicts correctly, and the number of those samples still predicted correctly after mutation, respectively. $C_{con}$ and $H_{con}$ denote, for ChatGPT-generated and human-written samples respectively, the ratio of seeds that remain correctly detected after mutation, indicating the consistency of the predictions.
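To make the metric concrete, the following Python sketch computes such a consistency ratio from the quantities named above. It is a minimal illustration under our own assumptions, not the paper's artifact; the `detector` and `mutate` callables are hypothetical placeholders for a fine-tuned detector and one conservative mutation operator.

```python
from typing import Callable, List, Tuple

def consistency(
    seeds: List[Tuple[str, int]],    # (sample, gold label) pairs the detector already got right
    detector: Callable[[str], int],  # hypothetical: returns 0 (human) or 1 (ChatGPT-generated)
    mutate: Callable[[str], str],    # hypothetical: applies one conservative mutation operator
) -> float:
    """Ratio of seed samples still classified correctly after mutation.

    len(seeds) corresponds to raw_correct (correct before mutation);
    mutate_correct counts the seeds that remain correct afterwards, so the
    returned value matches C_con or H_con when seeds are restricted to
    ChatGPT-generated or human-written samples, respectively.
    """
    raw_correct = len(seeds)
    mutate_correct = sum(detector(mutate(text)) == label for text, label in seeds)
    return mutate_correct / raw_correct if raw_correct else 0.0
```

Computing $C_{con}$ and $H_{con}$ separately then amounts to calling this function twice, once on the ChatGPT-generated seeds and once on the human-written seeds.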