RQ2: Effectiveness of ChatGPT on Code Refinement: How does ChatGPT's performance compare to state-of-the-art methods?
The results indicate a considerable performance gap between ChatGPT and CodeReviewer on the CodeReview dataset in terms of both EM and BLEU scores: CodeReviewer achieves EM-trim and BLEU-trim scores of 32.55 and 83.50, respectively, while ChatGPT reaches only 19.47 and 75.12. On the new dataset, however, ChatGPT clearly surpasses CodeReviewer, whose EM-trim and BLEU-trim scores drop to 15.50 and 62.88, respectively. We conjecture that CodeReviewer's superior performance on CodeReview mainly stems from being pre-trained and fine-tuned on the same data distribution, whereas ChatGPT generalizes better. Comparing CodeReviewer's results on CodeReview-NewLanguage and CodeReview-NewTime, we found that it achieves higher EM and EM-trim values on CodeReview-NewTime than on CodeReview-NewLanguage, indicating poor generalization to languages unseen during training. This may also be because CodeReviewer's pre-training data does not cover these new languages.
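For concreteness, the sketch below shows how the EM/EM-trim and BLEU/BLEU-trim metrics can be computed per example. The trim normalization here (stripping whitespace from each line and dropping blank lines) is our reading of the metric, not a definition taken verbatim from the CodeReviewer paper, and corpus-level scores are averages over all prediction/ground-truth pairs.

```python
# Minimal sketch of EM / EM-trim and BLEU / BLEU-trim scoring; the "trim"
# normalization below is an assumption about the metric's definition.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def trim(code: str) -> str:
    """Strip leading/trailing whitespace from each line and drop blank lines."""
    return "\n".join(line.strip() for line in code.splitlines() if line.strip())

def exact_match(pred: str, gold: str, trimmed: bool = False) -> bool:
    if trimmed:
        pred, gold = trim(pred), trim(gold)
    return pred == gold

def bleu4(pred: str, gold: str, trimmed: bool = False) -> float:
    if trimmed:
        pred, gold = trim(pred), trim(gold)
    smooth = SmoothingFunction().method1  # avoid zero scores on short outputs
    return sentence_bleu([gold.split()], pred.split(), smoothing_function=smooth)
```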
In contrast, ChatGPT achieves stable results across the different datasets. In particular, the evaluation results suggest that ChatGPT performs better on CodeReview-New than on CodeReview due to the higher quality of the reviews in CodeReview-New.
CodeReviewer uses a T5 architecture comprising 12 Transformer encoder layers and 12 decoder layers, totaling 223 million parameters. The model is initialized with the weights of CodeT5 and then pre-trained with three objectives: Diff Tag Prediction, Denoising Objective, and Review Comment Generation. In this study, we employed the same pre-trained CodeReviewer model and fine-tuned it on the CodeReview(train) and CodeReview(valid) datasets.
Since the CodeReviewer authors released only the pre-trained model rather than a fine-tuned one, we followed the code accompanying their paper and fine-tuned our own model with the following training parameters: train_steps is 20000, train_batch_size is 24, learning_rate is 3e-4, and gradient_accu_steps is 1. We trained the model on an A5000 GPU for approximately 5 hours; the resulting model performed well on both the validation set and the test set (EM=33.3), slightly higher than the results reported in the original paper.
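To make this configuration concrete, the sketch below maps the hyper-parameters above onto an equivalent Hugging Face Transformers setup. It is not the authors' actual training script: the checkpoint name is an assumption (a public CodeReviewer checkpoint exists on the Hugging Face Hub), and loading of the CodeReview(train)/CodeReview(valid) examples (train_data / valid_data) is elided.

```python
# Sketch of an equivalent fine-tuning setup; not the CodeReviewer authors'
# own pipeline. Checkpoint name and dataset loading are assumptions.
from transformers import (AutoTokenizer, T5ForConditionalGeneration,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codereviewer")
model = T5ForConditionalGeneration.from_pretrained("microsoft/codereviewer")

# Hyper-parameters from the paragraph above, mapped to Transformers names.
args = Seq2SeqTrainingArguments(
    output_dir="codereviewer-refinement",
    max_steps=20_000,                  # train_steps
    per_device_train_batch_size=24,    # train_batch_size
    learning_rate=3e-4,
    gradient_accumulation_steps=1,     # gradient_accu_steps
    save_steps=2_000,
)

def tokenize(example):
    # Input: code diff plus review comment; target: the refined code.
    enc = tokenizer(example["source"], truncation=True, max_length=512)
    enc["labels"] = tokenizer(example["target"], truncation=True,
                              max_length=512)["input_ids"]
    return enc

# train_data / valid_data: lists of {"source": ..., "target": ...} dicts
# built from CodeReview(train) / CodeReview(valid); construction elided.
trainer = Seq2SeqTrainer(
    model=model, args=args,
    train_dataset=[tokenize(x) for x in train_data],
    eval_dataset=[tokenize(x) for x in valid_data],
)
trainer.train()
```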
We further conducted an in-depth analysis to understand the lower performance of CodeReviewer compared to ChatGPT on the new dataset. We identified 2,283 cases from the new dataset where ChatGPT provided a correct response while CodeReviewer did not, and randomly selected 150 of them for manual analysis. Through this analysis, we identified four main root causes (the number preceding each cause indicates how many of the 150 cases it accounts for):
(34) Inaccurate understanding of the review content. We observed that some code reviews contain unclear information, such as ambiguous location references or vaguely described changes, or require domain-specific knowledge, which makes them challenging for the CodeReviewer model to comprehend.
For example, in Figure I, CodeReviewer did not understand the intended changes: ``nit: indentation'' in the review refers to the need to fix the indentation level. CodeReviewer instead changed the first line from ``func'' to ``function'' and deleted the guard statement. In contrast, ChatGPT accurately understood the meaning of the review and made the necessary modifications while explaining the reasons for its changes.
(62) Over-deletion. The CodeReviewer model exhibits a tendency to delete code inaccurately. In 30 cases, it erroneously deleted correct code snippets that should have been preserved. In another 32 cases, it deleted a significant portion of the code requiring modification, resulting in excessive deletions.
For example, as shown in Figure II, CodeReviewer recognized that the review required deleting certain code, but because it could not accurately identify which code to remove, it deleted lines 3-5. In contrast, ChatGPT identified the specific code that needed to be deleted.
(10) Extra modification. In some cases, the CodeReviewer model introduces unnecessary modifications to code snippets that do not require any changes.
For example, as shown in Figure III, CodeReviewer correctly understood the requirement in the review and made the shouldApplyBlur variable private. However, it also made additional modifications, changing both the shouldApplyBlur and blurEffectView variables to constants. In contrast, ChatGPT fulfilled the requirements of the review without any additional modifications.
(44) Difficulty understanding the ground truth provided in the code block. Our analysis revealed that, in some cases, reviewers accurately suggest the intended change within a code block. However, CodeReviewer fails to recognize that the code within these blocks represents the ground truth, leading to incorrect modifications.
Figure IV shows an example where the code block contains the accurate reference for the suggested change. Ideally, CodeReviewer should make its modifications based on this reference; instead, it fails to comprehend this crucial aspect and removes relevant code. ChatGPT, on the other hand, demonstrates a better understanding of the relationship between the review and the code block, enabling it to provide more precise modifications.
In summary, the main root cause appears to be the difference in the two models' understanding ability. The CodeReviewer model struggles to comprehend unclear reviews and ground-truth code, and tends to make extra deletions or modifications. ChatGPT, in contrast, captures the underlying semantics more accurately, enabling more precise changes.
The detailed results are stored in our dataset and can be viewed in the 'codereview.jsonl' and 'codereview_new.jsonl' files.
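A minimal way to inspect these files is sketched below. Since we do not restate the per-record schema here, the sketch only prints the keys of the first record rather than assuming any field names.

```python
# Inspect the released result files; prints the schema of the first record
# instead of assuming field names.
import json

with open("codereview_new.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(sorted(record.keys()))  # discover the actual fields
        break
```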