In the paper, we only illustrate the results on Java250-S. Here we present the other results to investigate the effectiveness of existing retraining-based OOD mitigation methods in enhancing the generalization ability of code models.
Distribution Type: EDGE-AST; Dataset: Python75-S
Distribution Type: EDGE-CFG; Dataset: Python75-S
Distribution Type: EDGE-Dataflow; Dataset: Python75-S
Distribution Type: EDGE-PDG; Dataset: Python75-S
Distribution Type: EDGE-Reftype; Dataset: Python75-S
Distribution Type: EDGE-AST; Dataset: Python800
Distribution Type: EDGE-CFG; Dataset: Python800
Distribution Type: EDGE-Dataflow; Dataset: Python800
Distribution Type: EDGE-PDG; Dataset: Python800
Distribution Type: EDGE-Reftype; Dataset: Python800
Distribution Type: NODE-AST; Dataset: Python75-S
Distribution Type: NODE-CFG; Dataset: Python75-S
Distribution Type: NODE-Dataflow; Dataset: Python75-S
Distribution Type: NODE-PDG; Dataset: Python75-S
Distribution Type: NODE-Reftype; Dataset: Python75-S
Distribution Type: NODE-AST; Dataset: Python800
Distribution Type: NODE-CFG; Dataset: Python800
Distribution Type: NODE-Dataflow; Dataset: Python800
Distribution Type: NODE-PDG; Dataset: Python800
Distribution Type: NODE-Reftype; Dataset: Python800
Here, we present the AUROC scores of ODIN and Mahalanobis under different parameter settings. Recall that ODIN has two parameters, the perturbation magnitude and the temperature, whereas Mahalanobis has only the perturbation magnitude.
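To make the two ODIN parameters concrete, the following is a minimal sketch of an ODIN-style score. It is not the implementation used for the reported results: the linear classifier (`weights`, `bias`) is a hypothetical stand-in for a code model, and the perturbation is applied to a continuous feature vector `x` (for token-based code models this would presumably be an embedding, which is an assumption here). The function names are illustrative only.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def linear_logits(weights, bias, x):
    """Hypothetical stand-in model: logits = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def sign(value):
    return (value > 0) - (value < 0)

def odin_score(weights, bias, x, magnitude=0.0014, temperature=1000.0):
    """ODIN-style confidence: max softmax probability after a small input
    perturbation that increases the predicted class's log-probability."""
    probs = softmax(linear_logits(weights, bias, x), temperature)
    pred = max(range(len(probs)), key=probs.__getitem__)
    # Analytic gradient of log p_pred w.r.t. x for the linear stand-in model.
    grad = [
        (weights[pred][j] - sum(p * row[j] for p, row in zip(probs, weights))) / temperature
        for j in range(len(x))
    ]
    # Step of size `magnitude` along the sign of the gradient.
    perturbed = [v + magnitude * sign(g) for v, g in zip(x, grad)]
    return max(softmax(linear_logits(weights, bias, perturbed), temperature))
```

Samples whose score falls below a chosen threshold would be flagged as OOD. With `magnitude=0.0` and `temperature=1.0` the function reduces to the plain maximum softmax probability.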
The table below lists the AUROC scores measured by ODIN with different perturbation magnitudes. We also conducted experiments using different temperatures (1, 10, 100, 1000) and obtained the same results, so they are not listed here. In general, changing the perturbation magnitude does not affect the conclusion. However, a different magnitude may result in a very different AUROC score. For example, on Python75, the score on the token distribution shift is 77.39 with a magnitude of 0.2 but 71.12 with another magnitude listed in Table 2.
Table 2. AUROC results by the ODIN detector with different perturbation magnitudes. The highlighted column (magnitude=0.0014) is reported in the main paper. Min: minimum. Max: maximum. Std: standard deviation.
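As a rough illustration of how an AUROC score and the Min/Max/Std columns can be derived, here is a small sketch. It assumes higher detector scores mean "more in-distribution", and it uses the population standard deviation; whether the tables use the population or the sample formula is not stated, so that choice is an assumption. The function names are hypothetical.

```python
from statistics import pstdev

def auroc(id_scores, ood_scores):
    """Probability that a random in-distribution sample scores higher than
    a random OOD sample, with ties counted as one half."""
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

def summarize(auroc_values):
    """Min, max, and (population) standard deviation over AUROC scores
    obtained with different perturbation magnitudes."""
    values = list(auroc_values)
    return {"min": min(values), "max": max(values), "std": pstdev(values)}

# The two Python75 token-shift scores quoted in the text above.
summary = summarize([77.39, 71.12])
```

`auroc` returns a value in [0, 1]; multiplying by 100 gives the percentage scale used in the tables.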
The table below lists the AUROC scores measured by the Mahalanobis detector with different perturbation magnitudes. Compared to Table 2, the results vary more than those of the ODIN detector, with the standard deviation ranging from 0.77 to 4.35. For example, on Python75, the AUROC scores are 78.28 and 69.20 when using 0.001 and 0.005, respectively, as the magnitude. However, in most cases, we can still draw the conclusion that the task difference introduces the greatest distribution shift to the dataset.
Table 3. AUROC results by the Mahalanobis detector with different perturbation magnitudes. The highlighted column (magnitude=0.0014) is reported in the main paper. DNN: CNN (sequence). Min: minimum. Max: maximum. Std: standard deviation.
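For reference, a Mahalanobis-style confidence score can be sketched as follows. This is an illustrative simplification rather than the code behind Table 3: it assumes class-conditional Gaussians with a shared (tied) covariance fitted on model features, and it omits the input perturbation step that the magnitude parameter controls. All function names and the toy feature layout are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equally sized feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def tied_covariance(features_by_class, means):
    """Covariance shared by all classes, pooled around each class mean."""
    dim = len(next(iter(means.values())))
    cov = [[0.0] * dim for _ in range(dim)]
    count = 0
    for label, rows in features_by_class.items():
        for row in rows:
            diff = [v - m for v, m in zip(row, means[label])]
            for i in range(dim):
                for j in range(dim):
                    cov[i][j] += diff[i] * diff[j]
            count += 1
    return [[c / count for c in row] for row in cov]

def solve(matrix, rhs):
    """Solve matrix @ y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (aug[r][n] - tail) / aug[r][r]
    return y

def mahalanobis_score(feature, means, cov):
    """Negative squared Mahalanobis distance to the closest class mean
    (higher means more in-distribution)."""
    best = float("-inf")
    for mu in means.values():
        diff = [v - m for v, m in zip(feature, mu)]
        dist = sum(d * y for d, y in zip(diff, solve(cov, diff)))
        best = max(best, -dist)
    return best
```

A feature that lies far from every class mean receives a strongly negative score and would be ranked as more likely OOD. The sketch assumes the pooled covariance is invertible; in practice a small ridge term is often added to the diagonal.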