RQ4: Root Causes and Potential Mitigation Strategies for Underperforming Cases: What are the underlying causes for the underperformance of ChatGPT, and how can we mitigate these challenges?
The Inaccurate Measurement Category refers to false positive cases where the refinement predicted by ChatGPT is correct based on our manual inspection, but the measurement metrics, such as EM or EM-trim, are low due to strict matching. We identified four types of root causes in this category:
Insignificant Omission (IO), where ChatGPT omitted unmodified code segments but correctly returned the modified parts;
Unexpected Grammar Fix (UGF), where ChatGPT fixed grammar errors in the documentation, fixes that are absent from the ground truth revised code;
Code Style Difference (CSD), where the predicted code by ChatGPT is semantically identical to the ground truth revised code, with differences only in whitespace, line breaks, and other code style aspects that do not affect code semantics, and the review comment did not explicitly prohibit the change of code style; and
Reasonable Improvement (RI), where the predicted code by ChatGPT is highly reasonable, and even more suitable than the existing new code.
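To make the measurement issue concrete, the sketch below contrasts strict exact matching with a whitespace-normalized variant. It is a hypothetical illustration, not the study's actual metric implementation: the function names and the exact normalization rule are our assumptions. It shows how a Code Style Difference case fails strict EM even though the prediction is semantically identical to the ground truth.

```python
def exact_match(pred: str, gold: str) -> bool:
    """Strict EM: the prediction must equal the ground truth verbatim."""
    return pred == gold


def exact_match_trim(pred: str, gold: str) -> bool:
    """Whitespace-normalized match (assumed rule): drop blank lines and
    per-line leading/trailing whitespace before comparing."""
    def trim(code: str) -> str:
        return "\n".join(line.strip() for line in code.splitlines() if line.strip())
    return trim(pred) == trim(gold)


# A Code Style Difference case: only indentation and a trailing newline differ.
gold = "if (x > 0) {\n    run();\n}"
pred = "if (x > 0) {\n  run();\n}\n"
```

Here `exact_match(pred, gold)` is false despite the two snippets being semantically identical, while the normalized comparison accepts them, which is why such cases count as inaccurate measurement rather than model error.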
The Incorrect Prediction Category refers to true positive cases where ChatGPT produced incorrect answers compared to the ground truth revised code. We identified three types of root causes in this category.
Need Domain Knowledge (NDK) refers to cases where the review comment does not provide the repository-related domain knowledge necessary to complete the modification.
Unclear Location (UL) refers to cases where the review comment does not provide a specific location for the code to be modified.
Unclear Changes (UC) refers to cases where the review comment provides too little information for ChatGPT to determine the specific modifications needed, resulting in underperformance.
As presented in Table 5, 42 (20.39%) of the underperforming cases were caused by inaccurate EM measurement. For the remaining 164 (79.61%) cases where ChatGPT output incorrect answers, the majority, 107 (51.94%), were caused by the lack of domain knowledge required to complete the modification. Another 44 cases (21.36%) were due to unclear location information in the review comment, while 13 cases (6.31%) were caused by unclear instructions in the review comments.
Detailed information on all 206 erroneous cases is listed in the table below. Due to limited table space, the old code and new code are not shown in the table, but they can be retrieved from the original data via the Original_id. The original data for the codereview type is in codereview.jsonl, and that for the codereview_new type is in codereview_new.jsonl.
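For readers tracing a table entry back to the raw data, the following sketch loads one of the .jsonl files and looks up records by their Original_id. It assumes each line of the file is a single JSON object with a top-level Original_id key; the function names are ours, not part of the released artifact.

```python
import json


def load_jsonl(path: str) -> list[dict]:
    """Read a .jsonl file: one JSON object per line, blank lines skipped."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def find_by_original_id(records: list[dict], original_id) -> list[dict]:
    """Return every record whose Original_id matches (assumed key name)."""
    return [r for r in records if r.get("Original_id") == original_id]
```

For example, `find_by_original_id(load_jsonl("codereview.jsonl"), some_id)` would return the matching record(s), from which the old and new code can be read.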
Table 6 shows the results with different mitigation strategies. The rows UL and UC refer to the cases under Unclear Location and Unclear Changes, respectively. The results show that GPT-3.5, combined with the corresponding mitigation techniques, can resolve 30/44 (68.18%) of Unclear Location cases and 7/13 (53.85%) of Unclear Changes cases. Simply switching to GPT-4, without any mitigation technique, resolves nearly as many cases as GPT-3.5 with mitigation techniques. After applying the mitigation techniques, GPT-4 can resolve 42/44 (95.45%) of Unclear Location cases and 12/13 (92.31%) of Unclear Changes cases.