RQ1: Impact of ChatGPT Settings: How do different prompt and temperature settings affect ChatGPT's performance in the code refinement task?
The table above displays the results of our evaluation of ChatGPT under different temperature and prompt settings.
Notably, the evaluation results indicate that setting the temperature to 0 achieves the best performance for each prompt. As the temperature increases, the performance of ChatGPT decreases significantly; in particular, a temperature of 2.0 yields the worst results.
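For reference, the sketch below illustrates how a single code-refinement request can be issued at a given temperature. It assumes the official openai Python package and the gpt-3.5-turbo model; the helper name query_chatgpt and the message structure are illustrative, not the exact harness used in our experiments.

```python
from openai import OpenAI  # assumes the official openai Python package (>= 1.0)

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def query_chatgpt(prompt: str, temperature: float) -> str:
    """Send one code-refinement prompt to ChatGPT at the given temperature (illustrative sketch)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; substitute the model used in the experiments
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,  # 0 is the most deterministic setting; 2.0 the most random
    )
    return response.choices[0].message.content
```

A temperature of 0 makes the sampling nearly deterministic, which is consistent with the more stable results observed at low temperatures.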
Comparing the effects of different prompts under stable temperature settings (0, 0.5, and 1.0), we observed that P2 and P5 achieved significantly better results than the others. This improvement can be attributed to the scenario description included in both P2 and P5, where we instructed ChatGPT to act as a developer and modify the code based on the code review. This indicates that scenario information can improve ChatGPT's understanding and performance. Furthermore, we noticed that P3 performed significantly worse than P4, despite both prompts containing more requirement information. Sometimes, P3 even performed worse than the simplest prompt, P1: for example, P1 achieved higher EM-trim scores than P3 in all three temperature settings, whereas P1 was generally worse than P4. This indicates that while providing additional requirement information can be helpful (as the comparison between P1 and P4 shows), overly detailed and complex information can harm performance (as with P3). A possible explanation is that detailed requirement information is harder for ChatGPT to interpret, leading to unstable results.
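To make the distinction concrete, the sketch below contrasts a minimal instruction with a scenario-style instruction in the spirit of P2 and P5. The wording is illustrative only and does not reproduce the verbatim prompts used in the study.

```python
# Illustrative prompt templates (not the exact wording of P1 or P2 in the study).
# {code} and {review} are placeholders for the submitted code and the reviewer comment.

P1_STYLE = (
    "Refine the following code according to the review comment.\n"
    "Review: {review}\nCode:\n{code}"
)

P2_STYLE = (
    "You are a developer. A reviewer has commented on your code during code review. "
    "Act as the developer and modify the code to address the review comment.\n"
    "Review: {review}\nCode:\n{code}"
)
```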
In the revision, we conducted supplementary experiments with a more fine-grained temperature interval of 0.1. Due to the limited budget on API calls, we restricted the fine-grained experiments to Prompt 2, which had demonstrated the best performance among the five prompts in the original paper. As in the coarse-grained setting (0.5 interval), we used the same 500 sampled data instances from RQ1 for each temperature setting. The overall results, presented in the table above, remained consistent with our observations from the coarse-grained setting.
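A compact sketch of this fine-grained sweep is given below. It reuses the query_chatgpt helper from the earlier sketch, and the instance list is a placeholder rather than the actual RQ1 data.

```python
# Fine-grained temperature sweep for Prompt 2 (illustrative).
sampled_instances: list[str] = []  # stands for the 500 pre-rendered Prompt 2 strings from RQ1

temperatures = [round(0.1 * i, 1) for i in range(21)]  # 0.0, 0.1, ..., 2.0
responses = {
    t: [query_chatgpt(prompt, temperature=t) for prompt in sampled_instances]
    for t in temperatures
}
```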
We also report the standard deviations of the EM-T and BLEU-T scores at different temperature settings. The table above displays the results, which indicate a noticeable increase in standard deviation as the temperature rises. In most cases, lower temperature settings produce not only better but also more stable results.
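The standard deviations can be computed per temperature over the repeated runs; a minimal sketch is shown below, assuming the per-run EM-T (or BLEU-T) scores are already collected in a dictionary. The numeric values are placeholders, not real results.

```python
from statistics import mean, stdev

# Maps each temperature to the list of per-run scores over the repetitions (placeholder values).
scores_by_temperature = {0.0: [31.2, 31.0, 31.4], 1.0: [27.5, 25.9, 29.1]}

for t, scores in sorted(scores_by_temperature.items()):
    print(f"temperature={t}: mean={mean(scores):.2f}, std={stdev(scores):.2f}")
```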
In the revision, we conducted additional experiments. Due to the budget limit on running the entire dataset, we randomly selected 1,000 data points from the training and validation sets. We then replicated the comparative experiments outlined in RQ1 on this new subset (i.e., 5 prompts and 10 repetitions, for a total of 50,000 requests to the ChatGPT API). The results, presented in the table above, align closely with the findings in Table 2 of the original paper. Overall, the EM and BLEU metrics demonstrate performance similar to that on the test data, reinforcing the consistent conclusions drawn regarding the impact of temperature and prompt settings in RQ1.
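The sampling and request budget can be sketched as follows; the pool variable and the fixed seed are illustrative assumptions, not details from the original setup.

```python
import random

# train_val_pool stands for the combined training and validation instances (placeholder here).
train_val_pool: list[dict] = []
random.seed(0)  # assumed seed for reproducibility; the original seed is not specified
subset = random.sample(train_val_pool, 1000) if len(train_val_pool) >= 1000 else train_val_pool

NUM_PROMPTS, NUM_REPETITIONS = 5, 10
total_requests = 1000 * NUM_PROMPTS * NUM_REPETITIONS  # = 50,000 ChatGPT API calls
```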
The detailed response results of ChatGPT are stored in the dataset and can be viewed in the file '500cases.jsonl'.
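Each line of that file is a standalone JSON record, so it can be inspected with a few lines of Python; the record fields are not assumed here.

```python
import json

# Read the per-case ChatGPT responses; each line of the JSONL file is one JSON record.
with open("500cases.jsonl", encoding="utf-8") as f:
    cases = [json.loads(line) for line in f if line.strip()]

print(len(cases))       # number of stored cases
print(cases[0].keys())  # inspect the available record fields
```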