RQ4: How well can humans distinguish content generated by ChatGPT? To answer this research question, we conducted an online survey with experienced software developers and observed their ability to distinguish machine-generated content.
Answers to RQ4: Similar to the findings for existing detectors, humans are better at distinguishing natural-language data than code data. Some commercial detectors even outperform humans in detecting ChatGPT-generated code.
In our study, we invited 50 experienced developers, each with at least five years of programming experience, to participate in an online survey. By March 2023, 27 of them had provided valid responses. We randomly divided the participants into two groups, namely, the “Example” and “No-example” groups, with 14 and 13 responses, respectively. The participants from the “Example” group were shown an example for each task, which consists of a pair of contents generated by humans and ChatGPT, with the ground truth clearly labeled.
The "No-example" group was not shown any examples, and the decision was made solely based on the anticipant's judgment, without any reference to cues.
The questionnaire consists of 50 questions covering 5 types of code-related data (i.e., Q&A-GPT, Code2Doc-GPT, APPS-GPT, CONCODE-GPT, and Doc2Code-GPT), with 10 questions per type. For each question, either a natural-language text block or a code snippet is shown to the participants, who choose one of three options: “Human”, “ChatGPT”, or “Unclear”.
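For illustration, the sketch below shows one way the questionnaire items could be represented; the field names and placeholder contents are our own assumptions and only mirror the structure described above (5 task types, 10 questions each, 3 answer options), not the actual survey instrument.

```python
from dataclasses import dataclass

# The five code-related data types covered by the questionnaire.
TASK_TYPES = ["Q&A-GPT", "Code2Doc-GPT", "APPS-GPT", "CONCODE-GPT", "Doc2Code-GPT"]

# Answer options offered to the participants for every question.
OPTIONS = ("Human", "ChatGPT", "Unclear")

@dataclass
class SurveyQuestion:
    task_type: str      # one of TASK_TYPES
    content: str        # the natural-language text block or code snippet shown
    ground_truth: str   # "Human" or "ChatGPT"

# 10 questions per task type -> 50 questions in total (contents are placeholders here).
questionnaire = [
    SurveyQuestion(task, f"<content {i} for {task}>", "Human" if i % 2 == 0 else "ChatGPT")
    for task in TASK_TYPES
    for i in range(10)
]
assert len(questionnaire) == 50
```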
Figure 2. Performance of human participants in the survey
The results collected from our online survey are presented in Figure 2, which shows the average accuracy of the “Example” and “No-example” groups on each type of code-related task. Overall, the “Example” group achieved a slightly better accuracy of 59.5%, versus 52.5% for the “No-example” group, across all tasks. Both groups did well on Q&A-GPT and Code2Doc-GPT, with accuracies close to or above 60% and as high as 77.9%. The contents presented to the participants in these two tasks contain only natural-language text and no code snippets. An explanation for the better performance could be that natural-language text reveals more hints in terms of the language patterns used, the tone, and the emotion conveyed. The survey respondents also ranked “use of repetitive/formulaic language patterns” as the most important factor making them believe a piece of content is generated by AI, ahead of other factors including “coherence and structure”, “tone and voice”, and “emotional appeal”. For tasks involving only code, the respondents did not perform as well: on average, both groups answered 47.1% of the questions correctly, which is comparable to random guessing.
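The per-group, per-task accuracies plotted in Figure 2 can be reproduced from the raw responses with a simple aggregation. The sketch below assumes responses are stored as (group, task, answer, ground truth) tuples and, as one possible scoring convention, counts an “Unclear” answer as incorrect; neither the storage format nor the treatment of “Unclear” is stated in the paper.

```python
from collections import defaultdict

def accuracy_by_group_and_task(responses):
    """responses: iterable of (group, task_type, answer, ground_truth) tuples.

    Returns {(group, task_type): accuracy}. An answer is correct only if it
    matches the ground truth, so "Unclear" counts as incorrect (an assumption
    about how the survey was scored).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, task, answer, truth in responses:
        total[(group, task)] += 1
        correct[(group, task)] += int(answer == truth)
    return {key: correct[key] / total[key] for key in total}

# Example with two hypothetical responses:
demo = [
    ("Example", "Q&A-GPT", "ChatGPT", "ChatGPT"),
    ("No-example", "Doc2Code-GPT", "Unclear", "Human"),
]
print(accuracy_by_group_and_task(demo))
# {('Example', 'Q&A-GPT'): 1.0, ('No-example', 'Doc2Code-GPT'): 0.0}
```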
We also compared the performance of humans with some of the best-performing detectors on the same set of questions. On an interesting note, we also tested the ability of ChatGPT to identify contents generated by itself or by humans, by asking it the survey questions. Table 8 summarizes the comparison results; accuracy is reported as the percentage of survey questions answered correctly. For natural-language contents, our human respondents performed comparably to most detectors (except Comp-NL, which was specifically fine-tuned on the same dataset). For code snippets, the human subjects struggled and fell behind AITextClassifier, one of the best commercial detectors. The performance of ChatGPT was not ideal in this experiment.
The ChatGPT results in Table 8 were obtained by asking ChatGPT to classify the content of each question five times. Our prompt was: "The text is {Detection_Text}. Could you please help me recognize whether the text was produced by a human or an AI generator? Please return one word, 'human' or 'AI', at the beginning and then explain why." Note that {Detection_Text} is replaced with the content of one of the 50 survey questions.
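To make the querying procedure concrete, below is a minimal sketch assuming the OpenAI chat completions API (pre-1.0 `openai` package) and the `gpt-3.5-turbo` model; the model name, API version, and the majority-vote aggregation over the five trials are our assumptions, since the paper only states that each piece of content was submitted five times.

```python
import openai  # assumes openai<1.0 and OPENAI_API_KEY set in the environment

PROMPT_TEMPLATE = (
    "The text is {detection_text}. Could you please help me recognize whether the "
    "text was produced by a human or an AI generator? Please return one word, "
    "'human' or 'AI', at the beginning and then explain why."
)

def classify_with_chatgpt(detection_text, trials=5, model="gpt-3.5-turbo"):
    """Ask ChatGPT `trials` times and return a label ('human' or 'AI').

    Majority voting over the repeated answers is an assumed aggregation rule.
    """
    votes = []
    for _ in range(trials):
        response = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user",
                       "content": PROMPT_TEMPLATE.format(detection_text=detection_text)}],
        )
        # Take the first word of the reply and normalize it to a vote.
        first_word = response["choices"][0]["message"]["content"].strip().split()[0].strip(".,:;'\"")
        votes.append("AI" if first_word.lower().startswith("ai") else "human")
    return max(set(votes), key=votes.count)
```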