The analysis of the ChatGPT conversations revealed that programmers took varying approaches to prompting ChatGPT for an answer. The objective of RQ2 was to categorize these prompts in order to study which prompt style was the most effective. Using the preliminary data gathered while researching RQ1, we revisited all merged instances and classified them under our prompt taxonomy. For each prompt category, we then calculated the average number of turns. We aimed to determine which prompts had the lowest average number of turns, focusing on the efficiency of ChatGPT conversations rather than only the extent to which they were effective.
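To make this calculation concrete, the following is a minimal sketch of how the per-category averages could be computed; the file name and column names (merged_instances.csv, prompt_category, num_turns) are illustrative assumptions, not the study's actual data schema or analysis scripts.

import pandas as pd

# Hypothetical input: one row per merged instance, with assumed columns
# "prompt_category" and "num_turns".
instances = pd.read_csv("merged_instances.csv")

# Average number of turns per prompt category, lowest (most efficient) first.
avg_turns = (
    instances.groupby("prompt_category")["num_turns"]
    .mean()
    .sort_values()
)
print(avg_turns)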
Distribution of conversation lengths
The box plot to the left shows the distribution of conversation lengths for each category type. The mean number of turns for 'File' is drastically greater than the means of the three other categories. The lengths of 'File' instances also range from 1 to 54, not counting outliers that are even longer. The reasons for this drastic difference are discussed in the Takeaways section. 'Pull Request' and 'Commit' instances have strikingly similar means and the same interquartile range; both have a couple of outliers exceeding this range.
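For reference, a box plot of this kind could be produced as sketched below; the file and column names ("instances.csv", "category", "num_turns") are assumptions for illustration only and are not taken from the study's materials.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical input: one row per instance, with assumed columns
# "category" (File, Issue, Pull Request, Commit) and "num_turns".
instances = pd.read_csv("instances.csv")

# One box per category, conversation length on the y-axis.
instances.boxplot(column="num_turns", by="category")
plt.suptitle("")  # remove pandas' automatic grouping title
plt.title("Distribution of conversation lengths")
plt.ylabel("Number of turns")
plt.show()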
Giving ChatGPT a direct piece of code to work with, or telling it to take on a specific role or behavior using the "From now on act as..." prompt, is the most efficient way to get a satisfactory response about refactoring with the fewest interactions. These two kinds of prompts are also the most commonly employed, suggesting that they succeed in producing acceptable outcomes with the least amount of involvement.
The bar graph to the left depicts the distribution of prompts across the instances in the 'File' category. Most instances belong to the 'Provide direct code' or 'Generate code and refactor' categories. The table displays the average number of turns for each of the prompt categories. For the more populated categories, the average number of turns is fairly even. The remaining categories, while they do have lower averages, include only one or two instances each, making it difficult to draw conclusions; these instances could be outliers, and their averages might not hold if there were more data. However, the prompt categories that included code all had lower averages than the ones without. The prompts with the highest averages were (VI) 'General question before task' and (VII) 'Generate code and refactor,' which may be due to the higher complexity and larger volume of content involved in these tasks. We observed that some conversations beginning with a more general question included turns the developer spent trying to better understand the content.
The 'File' category additionally had 2 unmerged, or 'No,' instances. These two started with the 'Provide direct code' and 'Give code and ask to restructure' prompts, but they seem to be outliers in this sample. The direct-code instance provided extremely lengthy code and documentation, taking up multiple turns at the beginning. This suggests that being somewhat selective can be more successful than giving ChatGPT every single piece of information about the project. For the second unmerged instance, the programmer believed ChatGPT failed due to the developer's own lack of knowledge of TypeScript. This suggests that ChatGPT should be used only as a supplement to the developer's own code, and that the developer still needs to understand and work with the content ChatGPT produces. This section also contained a fairly large share of 'Supplementary Info' instances, 19 of the 56 unique instances in the 'File' category. In the majority of these, the content produced in the conversations was clearly used in the project, which was typically a website, README, or article. The remaining instances contained conversations that did not necessarily produce concrete content, but it was reasonable to assume the developer used the conversation as supplementary information. Even though ChatGPT did not directly refactor code in these instances, it is important to note that ChatGPT was still successful in helping the developer in the realm of refactoring. In these situations, developers utilized similar prompt patterns, such as giving ChatGPT the role of an expert.
The figure on the left displays the distribution of prompts used in the 'Issue' category. Since only 2 instances were merged into the final code, the only 2 successful prompts displayed were 'Give code and ask to restructure' and 'General question before task,' and both of these instances required only 1 response from ChatGPT. Because the number of merged instances in this category is small, it would be inaccurate to draw conclusions about the usage of these prompts when describing an issue in a GitHub repository. When examining the instances that were not merged, however, one of the most common prompts used is Prompt I ('From now on act as...'). This suggests that, while it does not seem to achieve the desired results, developers are likely to prompt ChatGPT to act as software specializing in their specific issue before going on to describe the problem. A possible reason for the lack of merged results could be ChatGPT misunderstanding this prompt in certain contexts and failing to refactor the code while role-playing as the described software.
The corresponding figure shows the distribution of prompts in the 'Pull Request' category. The leading prompt within this category was 'Give code and ask to restructure,' with the next most frequent categories also showing that ChatGPT users prefer to paste the code directly into the program and ask for reassessment. The prompt with the highest turn average was 'Give code and ask to diagnose the problem,' suggesting that ChatGPT may take longer to give the desired response when tasked with finding errors in sections of code with missing context. While many of the instances within the 'Pull Request' category were merged into the final repository, the unmerged instances that included code often started with Prompt VI ('General question before task'). The lack of usage of the code resulting from these conversations suggests that developers who use ChatGPT like a search engine to resolve a general issue in their code are unlikely to get the results they want, especially if the code they are working on is not provided to ChatGPT.