We first conduct a correlation analysis to study the relationship between program length, measured in Lines of Code (LoC), and the performance of LLMs on different tasks. Table 1 shows the correlation coefficients between program LoC and model performance on all types of tasks.
Table 1. Correlation between LoC and model performance on different tasks.
Overall, most correlation coefficients are negative (marked in red), indicating that the performance of LLMs declines as the target program length increases in most cases. Nevertheless, a few coefficients are positive (marked in green). Combined with the analysis in Section 5.1 of the paper, these cases largely overlap with the tasks where LLMs underperform. For instance, CodeLlama suffers a serious decline on the Selection task, where its correlation coefficient is also positive. Further, all LLMs achieve low performance on the Infilling task, and most of their coefficients on this task are positive. It should be emphasized that LLMs correctly handle only a small, sometimes very limited, number of programs in these cases. The corresponding correlation coefficients are therefore more susceptible to occasional outliers and may not be statistically significant.
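To make the computation behind these coefficients concrete, the sketch below shows one way to obtain a coefficient and its p-value for a single model and task. The exact setup is not spelled out here, so the choice of Pearson correlation, the file name, and the column names (loc, pass_rate) are illustrative assumptions rather than the paper's actual pipeline.

```python
# Minimal sketch (assumptions: Pearson correlation; a CSV with one row per
# problem containing its LoC and the model's pass rate on that problem).
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical input file: columns "loc" (program length) and "pass_rate"
# (fraction of sampled solutions that pass the tests for that problem).
df = pd.read_csv("results_generation_codellama.csv")

r, p = pearsonr(df["loc"], df["pass_rate"])
print(f"correlation r = {r:.2f}, p-value = {p:.3f}")

# A large p-value signals that the coefficient may not be statistically
# significant, which is the concern raised above for tasks where the model
# solves only a handful of programs.
```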
We further study the relationship between program cyclomatic complexity (CC) and the performance of LLMs. Table 2 shows the correlation coefficients between program CC and model performance on all types of tasks.
Table 2. Correlation between CC and model performance on different tasks.
Similarly, most correlation coefficients are negative, indicating that the performance of LLMs declines as the complexity of the target program increases in most cases. The effect is especially pronounced on the Generation task, where all models show a stronger negative correlation (e.g., -0.17/-0.11 for Magicoder and -0.15/-0.14 for Deepseek-Coder). As observed earlier, most positive coefficients appear on the Infilling task, where all models demonstrate unsatisfactory performance. Among the evaluated models, GPT-4 yields the most positive coefficients across tasks, which possibly implies better resistance to programs with more complex structures.
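For completeness, the sketch below illustrates how cyclomatic complexity can be measured for a program before it is correlated with model performance in the same way as LoC above. The radon library and the per-function aggregation shown here are illustrative choices, not necessarily the tooling used to build Table 2.

```python
# Minimal sketch: measure cyclomatic complexity (CC) of a Python snippet
# with the radon library (an illustrative choice of tool).
from radon.complexity import cc_visit

source = '''
def classify(x):
    if x < 0:
        return "negative"
    elif x == 0:
        return "zero"
    else:
        return "positive"
'''

blocks = cc_visit(source)  # one entry per function, method, or class
for block in blocks:
    print(block.name, block.complexity)  # e.g., "classify 3"

# A single per-program CC (e.g., the maximum or the sum over its functions)
# can then be paired with model performance and fed into the same
# correlation computation sketched earlier.
```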