RQ3: Strengths and Weaknesses of ChatGPT: In which cases does ChatGPT perform well or poorly?
Comment Relevance: measures the degree of relevance between a review comment and its corresponding code change in the test dataset, reflecting the quality of the data. Although heuristic rules have been applied to clean and filter out low-quality code reviews, it is still challenging to guarantee the quality of the entire test set. Therefore, we first study whether the review comments are relevant to the changes. Specifically, comment relevance is divided into three levels:
Level 1 (Not): There is no apparent relationship between the code change and the review comment.
Level 2 (Partial): The suggestions in the review comment are partially implemented in the code change, or the code change includes refinements that are not present in the comment's suggestions.
Level 3 (Perfect): The code changes strictly follow the review comment, and there is a clear correspondence between them. In other words, the suggestion of the review comment is fully implemented in the code change, and the code refinement is entirely contained within the review comment.
Comment Information: measures the sufficiency and clarity of the instructions contained in the comment regarding the code change, which reflects the difficulty for the contributor or a model to refine the code.
For example, a comment like ``There are spaces missing'' is more informative than ``This function name does not describe well what it does.'' We followed the definition of comment information from prior work and divided the comment information into three levels (a sketch encoding both rubrics follows the list below):
Level 1 (Vague Question): The review comment only gives a general direction for modification (e.g., ``we should maintain the consistency of variable naming'') without clear suggestions for changes.
Level 2 (Vague Suggestion): The review comment provides specific suggestions for modification (e.g., ``changing it with camel case style''), but does not directly specify the location of the code that should be modified.
Level 3 (Concrete Suggestion): The review comment includes explicit requests for adding or modifying code snippets (e.g., ``changing the variable name 'testfile' to 'testFile''') or explicitly identifies code snippets to be removed.
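For reference, the following is a minimal sketch of how the two rubrics could be encoded as ordinal labels for annotation tooling; the class and member names are our own illustration, not identifiers from the replication package.

```python
from enum import IntEnum

class RelevanceLevel(IntEnum):
    """Relevance of a review comment to its code change (Levels 1-3)."""
    NOT = 1      # no apparent relationship
    PARTIAL = 2  # suggestion partially implemented, or extra refinement
    PERFECT = 3  # comment and code change correspond exactly

class InformationLevel(IntEnum):
    """How actionable the instructions in a review comment are (Levels 1-3)."""
    VAGUE_QUESTION = 1       # general direction only, no clear suggestion
    VAGUE_SUGGESTION = 2     # concrete suggestion, location unspecified
    CONCRETE_SUGGESTION = 3  # explicit snippet or location to change
```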
Figure 3 illustrates the results of ChatGPT across the different comment relevance and information levels. The figure highlights that ChatGPT performs best when the comments are classified as perfect relevance, outperforming both the partial and not-relevant levels. In addition, ChatGPT performs best on reviews that contain concrete suggestions, while performing similarly on vague suggestions and vague questions. These results imply that data quality significantly impacts ChatGPT's performance: reviews with low relevance and low information do not provide enough context for ChatGPT to make accurate predictions.
Code Change Category is used to measure the intention of the code changes. We defined the categories based on our annotations. There are four major categories, namely the Documentation, Feature, Refactoring, and Documentation-and-Code categories (summarized in the sketch after this list).
Documentation Category represents code changes that only add, modify, or remove documentation. Modifications made to follow documentation conventions (Documentation-conventions) may also involve additions, modifications, or deletions, but we separate them out to ease analysis of the unique challenges they pose to the model's prediction of revised code.
Feature Category represents code changes in terms of functional logic, such as adding, modifying, or removing code.
Refactoring Category refers to non-functional code refactoring, including renaming code entities (Refactoring-rename), swapping two code snippets (Refactoring-swap), and updating code based on coding standards (Refactoring-conventions).
Documentation-and-Code Category represents code changes that include both documentation and code modifications.
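For clarity, the taxonomy can be summarized as a simple annotation schema. The sketch below reflects the category and subcategory names above, while the dictionary layout itself is our own illustration.

```python
# Code change taxonomy from the annotations above; the mapping layout is
# an illustrative assumption, not an artifact of the study.
CODE_CHANGE_CATEGORIES = {
    "Documentation": ["add", "modify", "remove", "Documentation-conventions"],
    "Feature": ["add", "modify", "remove"],
    "Refactoring": ["Refactoring-rename", "Refactoring-swap",
                    "Refactoring-conventions"],
    "Documentation-and-Code": ["mixed documentation and code edits"],
}
```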
The table summarizes the results across the different code change categories. It shows that ChatGPT performs best in the Refactoring category, with an EM-trim of 37.50% and a BLEU-trim of 83.58%, indicating that ChatGPT has a good understanding of how to perform code refactoring. In contrast, the Documentation-and-Code category is the weakest, with an EM-trim of 0% and a BLEU-trim of 64.09%, which highlights the difficulty of making simultaneous changes to code and documentation while maintaining consistency between them. Comparing the minor categories, ChatGPT is best at handling remove-type code changes, followed by the modify and add categories. Additionally, we observed that some of the predictions for modify- and add-type changes are actually correct but do not strictly match the ground-truth answers, which will be discussed in RQ4.
The results also suggest that ChatGPT is skilled at updating code according to conventions, with EM-trim values of 23.08% and 44.12% for Documentation-conventions and Refactoring-conventions samples, respectively, whereas the average EM-trim across the entire Documentation and Refactoring categories is lower, at 17.78% and 37.50%, respectively.
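Since EM-trim and BLEU-trim drive these comparisons, the following is a minimal sketch of how such trim-style metrics might be computed, assuming that trimming strips leading and trailing whitespace from each line and drops blank lines; the study's exact trimming rules and tokenization may differ.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def trim(code: str) -> str:
    """Strip each line and drop blank lines (assumed trimming rule)."""
    return "\n".join(line.strip() for line in code.splitlines() if line.strip())

def em_trim(prediction: str, reference: str) -> bool:
    """Exact match after trimming both sides."""
    return trim(prediction) == trim(reference)

def bleu_trim(prediction: str, reference: str) -> float:
    """BLEU-4 over whitespace tokens after trimming, smoothed for short code."""
    hypothesis = trim(prediction).split()
    reference_tokens = trim(reference).split()
    return sentence_bleu([reference_tokens], hypothesis,
                         smoothing_function=SmoothingFunction().method1)
```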
The table above contains 400 data cases, randomly selected from the Codereview and CodereviewNew datasets (200 cases each). In addition to metrics such as EM, EM_Trim, BLEU, and BLEU_Trim, the table also includes the manual classifications of relevance, information, and category. Relevance_user1 and Relevance_user2 represent the evaluations of the first and second experts on relevance, and Relevance represents the score merged after discussion between the two experts.
Due to limited table space, the old code and new code are not listed in the table, but they can be retrieved from the original data via the Original_id field. The original data for the codereview type is in codereview.jsonl, and the original data for the codereview_new type is in codereview_new.jsonl.
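As an illustration, the records can be joined back to the table as follows. Apart from Original_id and the two file names above, the field names here (``id'', ``type'', ``old_code'', ``new_code'') are assumptions about the released files, not documented identifiers.

```python
import json

def load_records(path: str) -> dict:
    """Index a .jsonl file by record id (field name 'id' is an assumption)."""
    with open(path, encoding="utf-8") as f:
        return {rec["id"]: rec
                for rec in (json.loads(line) for line in f if line.strip())}

codereview = load_records("codereview.jsonl")
codereview_new = load_records("codereview_new.jsonl")

def lookup_codes(row: dict) -> tuple:
    """Fetch the old and new code for a table row via its Original_id."""
    source = codereview if row["type"] == "codereview" else codereview_new
    record = source[row["Original_id"]]
    return record["old_code"], record["new_code"]  # assumed field names
```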