RQ1 Results

Research Question 1: Can ChatGPT refactor code well?

RQ1 Google Form

We created this Google Form to keep track of the results for RQ1. We filled it out once for each of the 40 files under each of the 8 quality attributes, for a total of 320 submissions (40 × 8).

Important Documents

PMD Results

PMD Violations Spreadsheet

This spreadsheet lists the 40 code segments we analyzed, with the total number of PMD violations for each segment and the specific violations in the adjacent column. If you scroll down past the table and to the right, a second table categorizes the PMD violations and gives the number of violations in each category. Next to that table, and again on the last sheet, is a pie chart of the PMD violations for each attribute. There is one sheet for each of the 8 quality attributes we are studying, plus an additional sheet that collects all the results as pie charts.

Comparing Changes For Each Attribute

Categories of the Changes Made for Each Quality Attribute

This spreadsheet compiles the changes that ChatGPT made to the code segments for each quality attribute. Its columns separate the changes made for all attributes, for some attributes, and for only one attribute.

Across all 40 code segments and all 8 quality attributes, ChatGPT successfully refactored every segment but one. However, the nature of the changes varied. In some cases, the modifications primarily involved renaming variables and reformatting the code to adhere to Java conventions. In other cases, the suggested changes extended to converting enhanced for loops into regular for loops and switching data structures. Nevertheless, ChatGPT consistently provided valuable and applicable suggestions, ranging from minor to more substantial alterations, for almost all code segments.

Summary of the Changes Made for Each Attribute

Performance-

After ChatGPT refactored the 40 code segments for quality and performance improvement, we noticed little to no change between the refactored and the original code. The few changes we did notice were: changing an ArrayList declaration to a List declaration; changing an if-statement to a while loop or, where the code already used a while loop, adding a case that had not been accounted for; and changing a for-each loop into a for loop. Lastly, where addLast() or addFirst() was used, ChatGPT would use add() instead. Declaring a variable as List instead of ArrayList changes only its declared type: the object behind it is still an ArrayList, so traversal is no faster. Changing an if-statement to a while loop changes how often the body runs rather than how fast it runs, so it is not a performance gain in itself. The other changes likewise have no meaningful effect on performance.
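As a rough illustration of the declaration and loop edits listed above, the following sketch places a "before" and an "after" version side by side. The class and method names are hypothetical and the code is not taken from the study's 40 files; it only shows the shape of the edits and assumes nothing about how the original segments were written.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical before/after pair, not one of the studied code segments.
class DeclarationAndLoopSketch {

    // "Before": concrete ArrayList declaration and enhanced for loops.
    static int sumBefore(int[] input) {
        ArrayList<Integer> values = new ArrayList<>();
        for (int v : input) {
            values.add(v);
        }
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // "After": variable declared as the List interface and indexed for loops.
    // The object created is still an ArrayList in both versions.
    static int sumAfter(int[] input) {
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < input.length; i++) {
            values.add(input[i]);
        }
        int sum = 0;
        for (int i = 0; i < values.size(); i++) {
            sum += values.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {3, 1, 4, 1, 5};
        System.out.println(sumBefore(data) + " " + sumAfter(data));
    }
}
```

Both methods compute the same sum, which is consistent with the observation that these edits left the behavior of the code essentially unchanged.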

Complexity-

After ChatGPT refactored the files for quality and complexity improvement, there was little to no change between the refactored and the original code. The minor changes we noticed were adjustments such as changing an ArrayList declaration to a List declaration and changing an Object type to List<Object>. Declaring the variable as List rather than ArrayList loosens the code's dependence on one concrete class, but it does not change how the list is traversed, so its effect on complexity is small. The other change has no effect on complexity.

Coupling-

When we asked ChatGPT to refactor the 40 files to improve quality and coupling, it made very few unique changes to the code segments, especially compared to the changes it made for the other quality attributes. In terms of coupling-specific refactoring, the only consistent change was removing unnecessary statements. Removing unnecessary statements not only improves code readability but also helps to reduce the amount of coupling between different parts of the code. Other modifications, which coupling shared with only one or two other attributes, were changing enhanced for loops to regular for loops and splitting calculations across multiple variables rather than doing everything in one expression.

Cohesion-

Asking ChatGPT to refactor the 40 files to improve quality and cohesion revealed only one cohesion-specific modification: adding a ternary operator. A ternary operator allows conditional statements to be written in a more readable and concise way, thereby increasing cohesion. Additionally, cohesion shared modifications with both design size and complexity. ChatGPT’s refactoring for cohesion and for design size both added access modifiers (private and/or public) to variables and methods. Refactoring for cohesion and for complexity both involved replacing the equals() method with the == operator.
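The two cohesion-related edits mentioned above can be sketched as follows. The names are hypothetical and the example is not drawn from the studied files. One caveat worth keeping in mind when reading the sketch: in Java, == on objects compares references while equals() compares contents, so swapping one for the other is only safe for primitives or when reference identity is really what is meant.

```java
// Hypothetical illustration of a ternary rewrite and of equals() versus ==.
class CohesionSketch {

    // if/else form of a simple conditional assignment.
    static String labelBefore(int score) {
        String label;
        if (score >= 50) {
            label = "pass";
        } else {
            label = "fail";
        }
        return label;
    }

    // The same logic written with a ternary operator.
    static String labelAfter(int score) {
        return score >= 50 ? "pass" : "fail";
    }

    // equals() compares the characters of the two strings.
    static boolean sameByEquals(String a, String b) {
        return a.equals(b);
    }

    // == compares object references, not contents.
    static boolean sameByReference(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String a = new String("abc");
        String b = new String("abc");
        System.out.println(labelAfter(72));
        System.out.println(sameByEquals(a, b) + " " + sameByReference(a, b));
    }
}
```

Two distinct String objects with the same characters are equal under equals() but not under ==, which is why this particular substitution deserves a manual check in any refactored segment.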

Design Size-

When we asked ChatGPT to refactor the code segments for quality and design size improvement, we observed only one unique change: using an array's built-in length instead of a separate counter variable. This modification eliminated an additional variable that would consume unnecessary memory, since the array's length gave the same result. Furthermore, design size and coupling shared the change of removing unused methods, while design size and cohesion shared the change of adding access modifiers to variables and methods.
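A minimal sketch of the counter-variable change described above, with hypothetical names. Note that Java arrays expose their size through the length field, whereas length() is the corresponding method on String; the sketch assumes the array case.

```java
// Hypothetical illustration: replacing a manual counter with the array's length.
class LengthSketch {

    // "Before": a separate counter variable is incremented once per element.
    static int countBefore(int[] items) {
        int counter = 0;
        for (int item : items) {
            counter++;
        }
        return counter;
    }

    // "After": the array already knows its own size.
    static int countAfter(int[] items) {
        return items.length;
    }

    public static void main(String[] args) {
        int[] items = {10, 20, 30};
        System.out.println(countBefore(items) + " " + countAfter(items));
    }
}
```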

Readability-

When we prompted ChatGPT to refactor the code segments for improved quality and readability, a single change stood out: importing the Collections class (java.util.Collections), which simplified the code segments by allowing repeated calls to Collections.max(). With this import, the code became more concise and easier to understand. Additionally, readability was one of three quality attributes for which ChatGPT added comments to the code segments.
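To show why a call to Collections.max() can read more clearly, the sketch below compares it with a hand-written maximum search. The hand-written version is an assumption made for contrast; the source does not say what code the call replaced, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of using Collections.max() in place of a manual loop.
class MaxSketch {

    // Assumed "before" form: explicit search for the largest element.
    static int maxBefore(List<Integer> values) {
        int max = values.get(0);
        for (int i = 1; i < values.size(); i++) {
            if (values.get(i) > max) {
                max = values.get(i);
            }
        }
        return max;
    }

    // "After": the library call states the intent directly.
    static int maxAfter(List<Integer> values) {
        return Collections.max(values);
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(2, 9, 4);
        System.out.println(maxBefore(values) + " " + maxAfter(values));
    }
}
```

Both versions assume a non-empty list; Collections.max() throws NoSuchElementException on an empty collection.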

Reusability-

After prompting ChatGPT to refactor the code segments specifically to improve quality and reusability, we identified four changes specific to the reusability quality attribute. These changes consisted of adding helper functions, so that the code did not have a single crowded method, and making variables final. There were also changes focused on loops: a change from a for loop to an enhanced for loop, which makes the code more readable and reduces the chance of bugs, and a change in the loop conditions, which made the code more concise. Reusability was also one of two quality attributes for which ChatGPT moved variables inside the for loop and changed the index of loops.
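Three of the reusability changes described above (a helper function, final variables, and an enhanced for loop) can be combined in one small sketch. The names and the averaging task are hypothetical and were chosen only to keep the example short.

```java
// Hypothetical illustration of helper extraction, final variables,
// and an enhanced for loop.
class ReusabilitySketch {

    // "Before": one method does both the summing and the averaging,
    // using an indexed loop.
    static double averageBefore(int[] scores) {
        int total = 0;
        for (int i = 0; i < scores.length; i++) {
            total += scores[i];
        }
        return (double) total / scores.length;
    }

    // "After", part 1: the summing logic becomes a reusable helper
    // with an enhanced for loop and final declarations.
    static int sum(final int[] scores) {
        int total = 0;
        for (final int score : scores) {
            total += score;
        }
        return total;
    }

    // "After", part 2: the original method delegates to the helper.
    static double averageAfter(final int[] scores) {
        final int total = sum(scores);
        return (double) total / scores.length;
    }

    public static void main(String[] args) {
        int[] scores = {2, 4, 6};
        System.out.println(averageBefore(scores) + " " + averageAfter(scores));
    }
}
```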

Understandability-

When we prompted ChatGPT to refactor the code specifically to improve quality and understandability, we identified one change that was unique to understandability: aligning variable declarations and assignments, which increased consistency. Additionally, understandability was one of three quality attributes for which ChatGPT consistently added comments and split calculations into multiple variables, rather than doing the entire calculation in one expression.

All 8-

It's also worth noting that all 8 quality attributes shared 8 common modifications. These included renaming classes, methods, and variables, formatting the code, adjusting import statements, and adhering to Java conventions. One surprising discovery was that all attributes involved changing data types from wrapper classes to primitive types. These 8 shared changes accounted for most of the changes ChatGPT made to each code segment.
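The wrapper-to-primitive change mentioned above might look like the following sketch, in which Integer is replaced by int. The names are hypothetical and the example is not taken from the studied files.

```java
// Hypothetical illustration of replacing wrapper types with primitives.
class PrimitiveSketch {

    // "Before": Integer objects are unboxed and re-boxed on every addition.
    static Integer totalBefore(final Integer[] counts) {
        Integer total = 0;
        for (final Integer count : counts) {
            total += count;
        }
        return total;
    }

    // "After": plain int values, with no boxing involved.
    static int totalAfter(final int[] counts) {
        int total = 0;
        for (final int count : counts) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalBefore(new Integer[] {1, 2, 3}));
        System.out.println(totalAfter(new int[] {1, 2, 3}));
    }
}
```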

PMD Violations Pie Charts

To evaluate the effectiveness of ChatGPT's refactoring techniques, we used the Programming Mistake Detector (PMD) to detect violations across various categories in each refactored code segment. These categories included Best Practices, Code Style, Design, Documentation, Error Prone, Performance, Multithreading, and Security. The violations identified within these categories have varying levels of severity, ranging from important through urgent and critical to blocker.

*All the violations that PMD identified in ChatGPT's refactored code, along with their respective categories, are listed in the table on the right.

*There were no Security violations in any of the 40 code segments, so that category does not appear in the pie charts.

Performance

There was a total of 505 violations which consisted of 6 blocker violations, 3 critical violations, 6 important violations, and 487 urgent violations. Code Style was the most common category at 58.6% of total violations. Within the Code Style category, LocalVariableCouldBeFinal and MethodArgumentCouldBeFinal were the most prominent violations at 127 and 83 instances respectively. The rest of the categories ranked from highest to lowest were Documentation at 18.6%, Best Practice at 8.7%, Design at 8.5%, Performance at 2.8%, Multithreading at 1.8%, and Error Prone at 1.0%.

Overall, for ChatGPT’s refactoring of code with quality and performance in mind, it is noteworthy that the PMD Performance category did not rank high even though ChatGPT was asked to refactor the code for improved performance. The most common Performance violation was AvoidInstantiatingObjectsInLoops, which was reported a total of 10 times.
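For readers unfamiliar with the two most frequent Code Style rules named above, the sketch below shows the kind of code that LocalVariableCouldBeFinal and MethodArgumentCouldBeFinal are meant to catch, next to a version that declares the same names final. The class is hypothetical and PMD was not run on it, so the comments describe what the rules are documented to flag rather than actual tool output.

```java
// Hypothetical illustration of the two most common Code Style findings.
class FinalSketch {

    // The parameters are never reassigned but are not declared final
    // (the pattern targeted by MethodArgumentCouldBeFinal), and the local
    // variable is assigned exactly once (the pattern targeted by
    // LocalVariableCouldBeFinal).
    static int costFlagged(int price, int quantity) {
        int subtotal = price * quantity;
        return subtotal;
    }

    // The same method with final added to the parameters and the local variable.
    static int costClean(final int price, final int quantity) {
        final int subtotal = price * quantity;
        return subtotal;
    }

    public static void main(String[] args) {
        System.out.println(costFlagged(4, 5) + " " + costClean(4, 5));
    }
}
```

The two methods behave identically; the rules concern declared intent, which may help explain why these findings appear in large numbers without indicating functional defects.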

Complexity

There was a total of 508 violations which consisted of 5 blocker violations, 4 critical violations, 10 important violations, and 489 urgent violations. Code Style was the most common category at 58.9% of total violations. Within the Code Style category, LocalVariableCouldBeFinal and MethodArgumentCouldBeFinal were the most prominent violations at 129 and 81 instances respectively. The rest of the categories ranked from highest to lowest were Documentation at 18.9%, Design at 8.7%, Best Practices at 8.1%, Performance at 2.8%, Multithreading at 1.8%, and Error Prone at 1.0%.

Overall, ChatGPT’s refactoring of code with quality and complexity in mind was also successful; however, its PMD results did not differ significantly from those of the other quality attributes.

Coupling

There was a total of 534 violations and 15 compiler errors. By severity, there were 15 blocker, 5 critical, 502 urgent, and 12 important violations. The most prevalent category was Code Style with 61% of the violations, followed by Documentation at 17.6%, Best Practices and Design tied at 8.2%, then Performance at 2.2%, Multithreading at 1.7%, and Error Prone at 0.9%. The two most common violations throughout all 40 files were LocalVariableCouldBeFinal with 131 appearances and MethodArgumentCouldBeFinal with 88 appearances, and both fell under the Code Style category.

Compared with the other quality attributes, the coupling files had the highest number of compiler errors and the third-highest number of total violations.

Cohesion

There was a total of 518 violations and 0 compiler errors. There were 6 critical violations, 497 urgent, and 15 important violations. These violations fit into 7 major categories with 57.7% of the violations falling under the Code Style category, 18.3% under Documentation, 10.6% under the Best Practices category, 8.3% under Design, 2.3% fell under Performance, 1.7% under Multithreading, and 1% under the Error Prone category. From the category with the most violations, Code Style, the two most common violations were LocalVariableCouldBeFinal and MethodArgumentCouldBeFinal at 125 and 85 violations respectively.

When comparing these PMD results against the other 7 quality attributes, it is worth noting that the cohesion files had the fewest compiler errors and were middle-of-the-road in terms of total violations.

Design Size

There was a total of 535 violations, including 11 compiler errors. We discovered 504 urgent violations, 6 critical violations, 14 important violations, and 11 blocker violations. The most prevalent violation category, comprising 58.9% of the total violations, was Code Style. Within the Code Style category, two of the most common violations were 127 instances of LocalVariableCouldBeFinal and 89 instances of MethodArgumentCouldBeFinal. The remaining PMD violation categories, ranked from highest to lowest, were Documentation with 17.8%, Best Practices with 10.5%, Design with 8.0%, Performance with 2.2%, Multithreading with 1.7%, and Error-Prone with 0.9%.

When compared to the PMD violation results for the other quality attributes, it is noteworthy that design size exhibited the second-highest number of violations and the third-highest number of compiler errors.

Readability

There was a total of 552 violations, including 12 compiler errors. There were 522 urgent violations, 5 critical violations, 13 important violations, and 12 blocker violations. The most prevalent violation category, accounting for 55.1% of the total violations, was Code Style. Within the Code Style category, there were notable occurrences of violations such as 124 instances of LocalVariableCouldBeFinal and 34 instances of ShortVariable. The remaining PMD violation categories, ranked from highest to lowest, were Documentation with 17.2%, Best Practices with 14.5%, Design with 8.0%, Performance with 2.7%, Multithreading with 1.6%, and Error-Prone with 0.9%.

Compared to our PMD violation results for the other quality attributes, readability had the greatest number of violations and the second-greatest number of compiler errors. Also, readability had the lowest percentage of Code Style violations.

Reusability

There were 500 total violations, ranging in severity. Overall, there were 17 important violations, 470 urgent violations, 5 critical violations, and 8 blocker violations. The most prevalent violation category was Code Style, which accounted for 64.6% of the violations. Within the Code Style category, there were 139 instances of LocalVariableCouldBeFinal violations and 84 instances of MethodArgumentCouldBeFinal violations. The remaining PMD violation categories, ranked from highest to lowest, were Documentation with 17.4%, Design with 8.4%, Best Practices with 4.4%, Performance with 2.4%, Multithreading with 1.8%, and Error Prone with 1.0%.

In comparison to the PMD violation results for the remaining quality attributes, reusability had the fewest violations and the highest percentage of Code Style violations.

Understandability

There was a total of 532 violations, ranging in severity. There were 516 urgent violations, 10 important violations, 5 critical violations, and 1 blocker violation. The most common violation category was Code Style, which accounted for 57.1% of violations. Within the Code Style category, there were 136 instances of LocalVariableCouldBeFinal violations and 88 instances of MethodArgumentCouldBeFinal violations. The remaining PMD violation categories, ranked from highest to lowest, were Documentation with 17.9%, Best Practices with 11.5%, Design with 8.3%, Performance with 2.6%, Multithreading with 1.7%, and Error Prone with 0.9%.

In relation to the other attributes, understandability had the fourth-greatest number of violations (532, behind readability at 552, design size at 535, and coupling at 534) and the second-lowest number of compiler errors. The PMD results for understandability were overall average.
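The cross-attribute comparisons in this section can be reproduced from the totals reported above. The following sketch sorts the eight attributes by their total violation counts; the numbers are the ones stated in each subsection, while the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Ranks the quality attributes by the total PMD violations reported above.
class ViolationRanking {

    // Totals as stated in each attribute's subsection.
    static Map<String, Integer> totals() {
        final Map<String, Integer> totals = new LinkedHashMap<>();
        totals.put("Performance", 505);
        totals.put("Complexity", 508);
        totals.put("Coupling", 534);
        totals.put("Cohesion", 518);
        totals.put("Design Size", 535);
        totals.put("Readability", 552);
        totals.put("Reusability", 500);
        totals.put("Understandability", 532);
        return totals;
    }

    // Attribute names ordered from most to fewest violations.
    static List<String> rankedByTotal() {
        final List<Map.Entry<String, Integer>> entries =
                new ArrayList<>(totals().entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        final List<String> names = new ArrayList<>();
        for (final Map.Entry<String, Integer> entry : entries) {
            names.add(entry.getKey());
        }
        return names;
    }

    public static void main(String[] args) {
        int rank = 1;
        for (final String name : rankedByTotal()) {
            System.out.println(rank++ + ". " + name + " (" + totals().get(name) + ")");
        }
    }
}
```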

PMD Violations Venn Diagrams

The following Venn diagrams illustrate the overlapping occurrences of specific violations within each severity category (blocker, critical, urgent, or important) across the eight quality attributes.

Blocker Violations

As shown in this diagram, there were a total of five different blocker violations. The violation pertaining to CyclomaticComplexity was exclusive to the performance attribute. The coupling, readability, and reusability attributes shared the violations of MethodNamingConventions. Additionally, the coupling, design size, readability, reusability, and understandability attributes all had violations of ClassNamingConventions. Lastly, the violations of FormalParameterNamingConventions and LocalVariableNamingConventions were common among the performance, complexity, coupling, design size, readability, and reusability attributes.
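The overlaps described above can also be expressed as a lookup from each violation to the attributes that exhibit it. The sketch below encodes the blocker violations exactly as listed in the preceding paragraph; cohesion is given an empty list because it is not named there. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Encodes the blocker violations per attribute and finds which attributes share one.
class BlockerOverlap {

    static Map<String, List<String>> blockersByAttribute() {
        final Map<String, List<String>> map = new LinkedHashMap<>();
        map.put("Performance", Arrays.asList("CyclomaticComplexity",
                "FormalParameterNamingConventions", "LocalVariableNamingConventions"));
        map.put("Complexity", Arrays.asList(
                "FormalParameterNamingConventions", "LocalVariableNamingConventions"));
        map.put("Coupling", Arrays.asList("MethodNamingConventions", "ClassNamingConventions",
                "FormalParameterNamingConventions", "LocalVariableNamingConventions"));
        map.put("Cohesion", new ArrayList<String>()); // not named in the paragraph above
        map.put("Design Size", Arrays.asList("ClassNamingConventions",
                "FormalParameterNamingConventions", "LocalVariableNamingConventions"));
        map.put("Readability", Arrays.asList("MethodNamingConventions", "ClassNamingConventions",
                "FormalParameterNamingConventions", "LocalVariableNamingConventions"));
        map.put("Reusability", Arrays.asList("MethodNamingConventions", "ClassNamingConventions",
                "FormalParameterNamingConventions", "LocalVariableNamingConventions"));
        map.put("Understandability", Arrays.asList("ClassNamingConventions"));
        return map;
    }

    // Attributes whose blocker list contains the given violation, in table order.
    static List<String> attributesWith(final String violation) {
        final List<String> result = new ArrayList<>();
        for (final Map.Entry<String, List<String>> entry : blockersByAttribute().entrySet()) {
            if (entry.getValue().contains(violation)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(attributesWith("MethodNamingConventions"));
        System.out.println(attributesWith("CyclomaticComplexity"));
    }
}
```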

Critical Violations

This figure shows that only two distinct violations were classified as critical. All the quality attributes, with the exception of performance, exhibited the violation known as SystemPrintLn. Additionally, all eight quality attributes were found to have a violation related to AvoidReassigningParameters.

Urgent Violations

This Venn diagram portrays a total of 34 distinct violations categorized as urgent. Among the eight quality attributes, six had urgent violations that were specific to them. The performance attribute stood out with the exclusive violations AvoidReassigningParameters and SystemPrintLn. The complexity attribute was unique in having the CommentSize violation, while the coupling attribute had the exclusive violation ImmutableField. The cohesion attribute had the exclusive violation NoPackage, and the readability attribute had the exclusive violation MutableStaticState. The reusability attribute, on the other hand, had the exclusive violations ForLoopCanBeForeach and ControlStatementBraces. Thus, the design size and understandability attributes did not exhibit any distinct urgent violations compared to the others.

The performance and complexity attributes shared the violation of ShortClassName. The complexity and reusability attributes had the violation UnnecessarySemicolon in common. Additionally, the performance, complexity, and coupling attributes shared the violations UnusedLocalVariable and UseDiamondOperator. Furthermore, the performance, complexity, cohesion, design size, readability, and understandability attributes were all affected by the DataClass violation.

With the exception of reusability, all quality attributes shared the violation UselessStringOf. Similarly, all quality attributes, excluding performance, shared the CyclomaticComplexity violation. Interestingly, all eight quality attributes had a total of 19 violations in common, including LooseCoupling, ReplaceVectorWithAList, AtLeastOneConstructor, CommentDefaultAccessModifier, ConfusingTernary, LocalVariableCouldBeFinal, LongVariable, MethodArgumentCouldBeFinal, OnlyOneReturn, ShortVariable, UseUnderscoreInNumericLiterals, CognitiveComplexity, UseUtilityClass, CommentRequired, AvoidLiteralsInIfCondition, CompareObjectsWithEquals, UseConcurrentHashMap, AvoidInstantiatingObjectsInLoops, and UseIndexOfChar.

Important Violations

This figure shows that there were four distinct important violations. The complexity, cohesion, and design size attributes had the OneDeclarationPerLine violation in common. In addition, the coupling, cohesion, design size, readability, and reusability attributes were all found to have the UnnecessaryImport violation. Lastly, all eight quality attributes shared the violations of ShortClassName and UseVarargs.

