For this section, we conducted a total of six experiments: token-level and line-level code completion on the kNM-LM, JavaCorpus, and PY150 datasets. While the token-level results and part of the line-level results (specifically, the EM values) on the kNM-LM dataset have already been presented in the paper, we also display all of the results on the website for completeness.
Fig. 5 shows the token-level completion results of various methods across fine-tuned models at different epochs. It is worth noting that the results are obtained via the retrieval on the corresponding fine-tuned models. Specifically, the datastores are updated in accordance with different fine-tuned models. The results for epoch 0 are derived from the pre-trained model without fine-tuning.
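To make the datastore update concrete, below is a minimal sketch (our illustration, not the authors' implementation) of how a retrieval datastore can be rebuilt from a given checkpoint: each training position contributes a pair of the last hidden state and the next token, so every fine-tuned checkpoint yields its own datastore. The checkpoint path and function names are illustrative.

```python
# Minimal sketch (illustrative, not the authors' code): rebuild the retrieval
# datastore from a given checkpoint so that the keys match the model in use.
# Each training position t contributes (last hidden state at t, token at t+1).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def build_datastore(checkpoint, train_texts, device="cpu"):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device).eval()

    keys, values = [], []
    with torch.no_grad():
        for text in train_texts:
            ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
            hidden = model(ids, output_hidden_states=True).hidden_states[-1][0]
            keys.append(hidden[:-1].cpu())   # context vectors
            values.append(ids[0, 1:].cpu())  # the tokens they should predict
    return torch.cat(keys), torch.cat(values)

# One datastore per checkpoint, e.g. per fine-tuning epoch (path is hypothetical):
# keys_e3, values_e3 = build_datastore("path/to/codegpt-epoch3", training_files)
```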
Overall, we observe a progressive improvement in the performance of the original model as the number of fine-tuning epochs increases (see the blue lines). The results of the retrieval-based methods also exhibit an upward trend, showing their generalization capability across different models. However, as the model goes through more fine-tuning epochs, the gains brought by retrieval diminish, which can be attributed to the improved performance of the fine-tuned models themselves. Comparing FT2Ra to the baselines, it is clear that FT2Ra consistently outperforms the baselines on each fine-tuned model.
To assess how closely FT2Ra’s effect (simulating fine-tuning) aligns with real fine-tuning, we compare FT2Ra’s performance on the pre-trained model to that of the fine-tuned models. As indicated by the dotted line, FT2Ra, without fine-tuning the model, achieves performance similar to CodeGPT and UniXcoder fine-tuned for approximately 4 and 7 epochs, respectively. In contrast, the best baseline, kNM-LM, only reaches the performance level of a model fine-tuned for about one epoch. These results underscore the value of our theoretical analysis derived from the fine-tuning process.
Fig. 6 illustrates the results in terms of EM and ES for line-level completion. Compared to the token-level completion results in Fig. 5, the impact of the baseline methods is notably diminished in line-level completion, primarily because this task is more difficult. We observe that BM25 and ReACC yield similar results, likely because they adopt similar methods. On the other hand, the performance of kNN-LM and kNM-LM is very close to that of the fine-tuned models, indicating that they bring limited improvement over the respective models.
Conversely, our approach FT2Ra continues to demonstrate clear advantages over the other methods, owing to its precise token prediction. Notably, comparing FT2Ra at epoch 0 with the fine-tuned models shows that, even without fine-tuning, FT2Ra can outperform models fine-tuned for 10 epochs (our longest fine-tuning run).
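For reference, EM and ES here are the usual line-level completion metrics: exact match of the predicted line against the ground truth, and a character-level edit similarity. Below is a minimal sketch of how they can be computed; Python's difflib ratio is used as a stand-in for the Levenshtein-based similarity commonly used in CodeXGLUE-style evaluation scripts.

```python
# Sketch of the line-level metrics: EM (exact match) and ES (edit similarity).
# difflib's ratio is used here as a stand-in for the Levenshtein-based
# similarity used in CodeXGLUE-style evaluation scripts.
from difflib import SequenceMatcher

def exact_match(pred: str, target: str) -> float:
    return float(pred.strip() == target.strip())

def edit_similarity(pred: str, target: str) -> float:
    # 1.0 for identical lines, 0.0 for completely different ones.
    return SequenceMatcher(None, pred.strip(), target.strip()).ratio()

def evaluate_lines(preds, targets):
    n = len(targets)
    em = 100 * sum(exact_match(p, t) for p, t in zip(preds, targets)) / n
    es = 100 * sum(edit_similarity(p, t) for p, t in zip(preds, targets)) / n
    return em, es  # both reported as percentages, as in Fig. 6
```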
As the chart above shows, for both the CodeGPT and UniXcoder models, the model converges after 2-3 epochs of fine-tuning, at which point its accuracy reaches or approaches its maximum. Once the model has converged, all retrieval-augmented methods yield only small gains: FT2Ra improves accuracy by only about 0.1%, yet it remains the most effective of the three retrieval-augmented methods.
The chart above shows that, for line-level code completion, FT2Ra achieves a noticeable improvement on both the CodeGPT and UniXcoder models and performs better than the other four retrieval-augmented methods. For the UniXcoder model in particular, applying FT2Ra after one epoch of fine-tuning matches roughly the EM score of the ninth fine-tuning epoch and the ES score of about 4-5 epochs. Although the performance on the JavaCorpus dataset is not as good as that on the kNM-LM dataset, it still supports the earlier conclusion that the retrieval augmentation of FT2Ra can match the effect of multiple additional epochs of fine-tuning.
However, distinctions emerge when contrasting these outcomes with those on the earlier kNM-LM dataset: for both the token-level and line-level results, FT2Ra without fine-tuning consistently falls short of its fine-tuned counterparts. Several factors contribute to this. First, the JavaCorpus training set is significantly larger, containing 7.2 million tokens compared to the 2.4 million tokens of the kNM-LM dataset, i.e., roughly three times as large, so each fine-tuning epoch has a more pronounced effect. In addition, JavaCorpus aligns more closely with the pre-training data of CodeGPT and UniXcoder: the initial token-level accuracies of CodeGPT and UniXcoder are 64% and 65%, respectively, significantly surpassing the 55% and 53% observed on the kNM-LM dataset. With a few fine-tuning epochs, accuracy quickly exceeds 77%, which can be interpreted as the model reaching convergence. For converged models, the results show that the incremental benefit offered by FT2Ra becomes less pronounced.
The chart above shows that, after one epoch of fine-tuning, the token-level accuracies of the CodeGPT and UniXcoder models increase from 53% and 60%, respectively, to about 75%, and eventually converge to around 78%. Similar to the results on the JavaCorpus dataset, FT2Ra is slightly better than the other two retrieval-augmented methods, kNN-LM and kNM-LM, but the improvement is not substantial. From the first fine-tuning epoch onward, applying FT2Ra to the model at epoch N roughly matches the performance of the model fine-tuned for N+1 epochs.
Examining the line-level results in the chart above, both BM25 and ReACC show notable improvements, in stark contrast to the observations on the kNM-LM and JavaCorpus datasets. This disparity may be inherent to the characteristics of the PY150 dataset, or it may stem from the ReACC retrieval method being specifically tailored to the Python language. Nonetheless, setting BM25 and ReACC aside, FT2Ra still outperforms the other retrieval-augmented techniques (kNN-LM and kNM-LM), consistent with the earlier observations on the kNM-LM and JavaCorpus datasets.
Similar to the results on the JavaCorpus dataset, after just one epoch of fine-tuning, the improvements from the output-based retrieval-augmented methods (kNN-LM, kNM-LM, and FT2Ra) are not significant for either the token-level or the line-level results. The diminished impact can be attributed to the sheer size of the PY150 training set, which contains 51.8 million tokens, in stark contrast to the 2.4 million tokens of the kNM-LM dataset, i.e., roughly twenty-one times larger. Such a voluminous dataset induces a more rapid convergence of the model, rendering the improvements from FT2Ra relatively modest.
Through the fine-tuning comparison experiments across multiple epochs, we found that when the model has not converged (as on the kNM-LM dataset), FT2Ra achieves very significant gains. In the token-level experiments, a model that has not been fine-tuned but uses FT2Ra reaches accuracy comparable to the original model fine-tuned for 4-7 epochs. In the line-level experiments, the improvements in EM and ES reach the level of the original model fine-tuned for 10 epochs.
However, if the training set is large, the model converges quickly (as on the JavaCorpus and PY150 datasets). Under such circumstances, FT2Ra still yields improvements, but they are notably more modest: applying FT2Ra to a model after the N-th epoch of fine-tuning approximately matches the results of the (N+1)-th epoch.
Delving deeper into the reasons, during training the last-hidden-state layer is also trained thoroughly until it fully converges. At that point, the retrieved results lose their diversity and thus lose the ability to rectify the original model. From another perspective, FT2Ra is essentially a form of approximate training: if the model has already converged, the effect of further training is significantly reduced, which is intuitively understandable.
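To illustrate this intuition numerically, one can view the retrieval correction in FT2Ra-style methods as a step proportional to the gap between the neighbor-derived token distribution and the model's own prediction. The sketch below is our simplified illustration, not the paper's exact update rule: once the retrieved neighbors merely echo a converged model's prediction, the correction all but vanishes.

```python
# Simplified illustration (not the exact FT2Ra update): the correction applied
# to the logits is proportional to (neighbor distribution - model distribution).
# When the model has converged and the retrieved neighbors simply echo its
# prediction, the two distributions coincide and the correction is near zero.
import torch

def retrieval_correction(logits, neighbor_token_ids, eta=1.0):
    p_model = torch.softmax(logits, dim=-1)
    counts = torch.bincount(neighbor_token_ids, minlength=logits.numel()).float()
    q_neighbors = counts / counts.sum()
    return eta * (q_neighbors - p_model)

logits = torch.tensor([4.0, 0.5, 0.1, 0.0, 0.0])   # model strongly prefers token 0
diverse = torch.tensor([0, 1, 1, 2, 3])            # neighbors disagree with the model
agreeing = torch.tensor([0, 0, 0, 0, 0])           # neighbors echo the model
print(retrieval_correction(logits, diverse).abs().sum())   # large correction (~1.5)
print(retrieval_correction(logits, agreeing).abs().sum())  # tiny correction (~0.16)
```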
We present the aforementioned data in the form of tables, with each set of results organized into distinct tabs for clearer representation of the numerical changes. The tabs are named according to the dataset and the type of code completion. Specifically:
For the kNM-LM dataset results:
kNM-LM dataset token: Results for token-level code completion.
kNM-LM dataset line: Results for line-level code completion.
For the JavaCorpus dataset results:
JavaCorpus token: Results for token-level code completion.
JavaCorpus line: Results for line-level code completion.
For the PY150 dataset results:
PY150 token: Results for token-level code completion.
PY150 line: Results for line-level code completion.
This tabbed structure ensures that users can easily navigate and compare the outcomes across different datasets and completion levels.
In this section, we present the experimental results from the 'Effectiveness of the multiple iteration strategy' section of the paper in the form of tables, specifically the detailed data behind Fig. 7, which allows for a clearer view of the data variations. The tables include results on the Rest., Eureka, JavaCorpus, and PY150 datasets, each placed in one of four tabs.
To evaluate the effect of the multiple-iteration strategy incorporated into FT2Ra, we configured FT2Ra with varying numbers of retrieval iterations (i.e., 𝐸 in Algo. 1) ranging from 1 to 10. We also consider the impact of the parameter 𝜂𝑙𝑜𝑔𝑖𝑡𝑠, which is configured with four values: 0.5, 1, 2, and 4. Evaluations were carried out with configurations similar to RQ3, i.e., token-level completion on pre-trained models.
The results are presented in Fig. 7. It is clear that FT2Ra’s performance benefits from multiple retrieval rounds, a feature unique among the retrieval-based baselines: as the number of retrieval rounds increases, FT2Ra’s performance gradually improves, and it tends to stabilize after approximately 4 retrieval iterations. With respect to the learning rate 𝜂𝑙𝑜𝑔𝑖𝑡𝑠, FT2Ra’s performance is highly sensitive to this parameter. In general, larger values of 𝜂𝑙𝑜𝑔𝑖𝑡𝑠 enable FT2Ra to converge faster, whereas smaller ones necessitate multiple iterations. For instance, with 𝜂𝑙𝑜𝑔𝑖𝑡𝑠 set to 0.5, convergence tends to be achieved after about 10 iterations, whereas a setting of 4 reaches its best performance after just one iteration. Yet we also observed that excessively high learning rates can hamper FT2Ra’s performance: settings of 𝜂𝑙𝑜𝑔𝑖𝑡𝑠 at 2 and 4 typically yield results inferior to those achieved with 0.5 and 1, and the configuration with 𝜂𝑙𝑜𝑔𝑖𝑡𝑠 set to 4 performs worst. The learning rate in FT2Ra thus behaves much like the learning rate in real training.
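To make the iteration loop and the role of 𝜂𝑙𝑜𝑔𝑖𝑡𝑠 more concrete, below is a hedged sketch of how the multi-round retrieval in Algo. 1 can be read: in each of the 𝐸 rounds, neighbors are retrieved for the current query and the logits receive a correction scaled by 𝜂𝑙𝑜𝑔𝑖𝑡𝑠. The brute-force nearest-neighbor lookup and the simplified update rule are our own illustration, not the paper's implementation.

```python
# Hedged sketch of the multi-iteration strategy (cf. Algo. 1): E retrieval
# rounds, each applying a correction to the logits scaled by eta_logits.
# The brute-force L2 lookup and the update rule are simplified illustrations.
import torch

def ft2ra_iterative(logits, query, keys, values, eta_logits=1.0, E=4, k=16):
    # keys: (N, hidden) context vectors; values: (N,) next-token ids.
    for _ in range(E):
        dists = torch.cdist(query.unsqueeze(0), keys).squeeze(0)
        neighbor_tokens = values[dists.topk(k, largest=False).indices]
        counts = torch.bincount(neighbor_tokens, minlength=logits.numel()).float()
        q = counts / counts.sum()
        p = torch.softmax(logits, dim=-1)
        logits = logits + eta_logits * (q - p)   # one simulated update step
    return logits

# A larger eta_logits moves the logits further per round (fewer rounds needed),
# while a smaller one needs more rounds, mirroring the trends in Fig. 7:
# adjusted = ft2ra_iterative(base_logits, hidden_state, keys, values,
#                            eta_logits=0.5, E=10)
```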