Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study
Supplementary Materials
Code review is an essential activity for ensuring the quality and maintainability of software projects. However, it is a time-consuming and often error-prone task that can significantly impact the development process. Recently, ChatGPT, a cutting-edge language model, has demonstrated impressive performance in various natural language processing tasks, suggesting its potential to automate code review processes. However, it remains unclear how well ChatGPT performs on code review tasks. To fill this gap, in this paper we conduct the first empirical study of the capabilities of ChatGPT in code review, focusing on automated code refinement based on given code reviews. To conduct the study, we select the existing benchmark CodeReview and construct a new, high-quality code review dataset. We use CodeReviewer, a state-of-the-art code review tool, as a baseline for comparison with ChatGPT. Our results show that ChatGPT outperforms CodeReviewer in code refinement: ChatGPT achieves EM and BLEU scores of 22.78 and 76.44, respectively, while the state-of-the-art method achieves only 15.50 and 62.88 on the high-quality code review dataset. We further identify the root causes of the cases where ChatGPT underperforms and propose several strategies to mitigate these challenges. Our study provides insights into the potential of ChatGPT for automating the code review process and highlights promising research directions.
The main focus of this paper is to evaluate and understand the capabilities of ChatGPT in code refinement tasks. The figure shows an overview of this paper. To conduct our study, we collect existing benchmarks, including the CodeReview dataset, and state-of-the-art code refinement tools such as CodeReviewer, for comparison. However, given the risk that this dataset may have been used to train ChatGPT and CodeReviewer, we create a new code review dataset (named CodeReview-New) consisting of two parts: code reviews collected more recently from the same repositories as the CodeReview dataset (i.e., CodeReview-NewTime), and code reviews from repositories in programming languages not covered by the CodeReview dataset (i.e., CodeReview-NewLanguage). We next introduce the research questions we aim to investigate and their relationships.
RQ1 Impact of ChatGPT Settings: How do different prompt and temperature settings affect ChatGPT's performance in the code refinement task?
RQ2 Effectiveness of ChatGPT on Code Refinement: How does ChatGPT's performance compare to state-of-the-art methods?
RQ3 Strengths and Weaknesses of ChatGPT: In which cases does ChatGPT perform well, and in which does it fall short?
RQ4 Root Causes and Potential Mitigation Strategies for Underperforming Cases: What are the underlying causes for the underperformance of ChatGPT, and how can we mitigate these challenges?
RQ1 Impact of ChatGPT Settings: The configuration of prompts and temperature has a significant impact on ChatGPT's performance in code refinement tasks. Lower temperature settings tend to produce better and more stable results, and prompts with concise scenario descriptions tend to produce better results (a query sketch follows this list).
RQ2 Effectiveness of ChatGPT on Code Refinement: ChatGPT demonstrates better generalization capabilities than CodeReviewer and outperforms CodeReviewer on the newly collected data. However, its effectiveness is still limited, with EM-trim and BLEU-trim scores of only 22.78 and 76.55, respectively.
RQ3 Strengths and Weaknesses of ChatGPT: ChatGPT performs better on high-quality reviews with concrete suggestions, while its performance is worse on reviews with low relevance and low information. Furthermore, ChatGPT demonstrates the highest performance on code refactoring tasks, while its performance is lower on tasks that involve refining documentation and functionalities.
RQ4 Root Causes and Potential Mitigation Strategies for Underperforming Cases: The main root causes identified in our analysis are a lack of domain knowledge, unclear locations, and unclear changes. We identify two potential directions for mitigating these issues: improving the large language model (e.g., using GPT-4 instead of GPT-3.5) and improving the quality of reviews (e.g., providing clearer information).
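For reference, below is a minimal sketch of how ChatGPT can be queried under a given prompt and temperature setting, using the openai Python package (v0.x interface). The prompt wording and model name are illustrative assumptions, not the exact configuration used in the paper.

    import openai

    def refine_code(review: str, old_code: str, temperature: float = 0.0) -> str:
        # A hypothetical concise scenario-description prompt; the paper's
        # actual prompt wording may differ.
        prompt = (
            "As a developer, refine the following code according to the review.\n"
            f"Review: {review}\n"
            f"Code:\n{old_code}"
        )
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",  # assumed GPT-3.5 endpoint
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,  # lower values tend to give more stable results
        )
        return response.choices[0].message["content"]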
The folder contains the following seven items:
1. codereview.jsonl: the CodeReview dataset, together with the results of testing the CodeReviewer model and ChatGPT on it.
2. codereview_new.jsonl: the CodeReview-New dataset, together with the results of testing the CodeReviewer model and ChatGPT on it.
3. testset_500cases.jsonl: Experimental results for RQ1 conducted on the CodeReview test set (500 cases).
4. trainset_1000cases.jsonl: Experimental results for RQ1 conducted on the CodeReview trainset.
5. RQ3_RQ4_score.jsonl: Experimental results for RQ3 and RQ4.
6. RQ3_codereviewer_rootcause.jsonl: Root cause analysis for CodeReviewer errors in RQ3.
7. final_website: a folder containing the code for executing each RQ.
The field explanations for codereview.jsonl and codereview_new.jsonl are as follows:
old: the code snippet to be modified
review: the review suggestion
new: the modified code snippet
commit_url: the URL of the corresponding commit
gpt_answer: the complete answer provided by ChatGPT
gpt_code: the code part in ChatGPT's answer
model_code: the code prediction given by the CodeReviewer model
language: the programming language of the code
gpt_em: whether gpt_code exactly matches new (EM)
gpt_em_trim: whether the trimmed gpt_code exactly matches new (EM-Trim)
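As a usage sketch, records can be loaded line by line with the standard json module. The trimming below (per-line whitespace normalization) is only an approximation of the trimming behind the EM-Trim metric, whose exact procedure may differ:

    import json

    def trim(code: str) -> str:
        # Strip leading/trailing whitespace from each line; an approximation
        # of the trimming used for EM-Trim.
        return "\n".join(line.strip() for line in code.strip().splitlines())

    with open("codereview_new.jsonl", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            em_trim = trim(record["gpt_code"]) == trim(record["new"])
            print(record["commit_url"], record["language"], em_trim)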
The field explanations for the RQ1 cases are as follows:
new_answer_p_a_t: the answer provided by ChatGPT with the p-th prompt, the a-th attempt, and a temperature suffix t equal to ten times the temperature (20 for 2.0, 15 for 1.5, 10 for 1.0, 5 for 0.5, 0 for 0). For example, new_answer_1_2_5 is the answer with the first prompt, the second attempt, and a temperature of 0.5.
new_code_p_a_t: the code part of the answer provided by ChatGPT for the same p_a_t combination as described above.
Note that, due to the significant decline in the quality of the generated results at high temperatures, the runs were not repeated 10 times for temperatures 1.5 and 2.0.
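A minimal sketch (assuming the naming scheme above) of how to look up a specific answer; .get() is used because, per the note above, fields for repeated runs at temperatures 1.5 and 2.0 may be absent:

    import json

    def answer_field(prompt_id: int, attempt: int, temperature: float) -> str:
        # The temperature suffix is ten times the temperature value:
        # 0.5 -> "5", 1.0 -> "10", 1.5 -> "15", 2.0 -> "20".
        return f"new_answer_{prompt_id}_{attempt}_{int(temperature * 10)}"

    with open("testset_500cases.jsonl", encoding="utf-8") as f:
        record = json.loads(f.readline())
        # e.g., the answer for the first prompt, second attempt, temperature 0.5:
        print(record.get(answer_field(1, 2, 0.5)))  # field "new_answer_1_2_5"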