CCTEST identifies numerous defects when used to test (commercial) code completion systems, regardless of the threshold used to decide outliers. We recommend T = 9 as a presumably suitable threshold in practice, since it yields the highest TP rate.
The average distribution across all code completion systems
We further measure the true positive (TP) rate of the outliers found under different thresholds T. For each <T, model> pair, we randomly sample 100 cases from the two datasets, resulting in a total of 4,000 (5×8×100) cases. The first two authors manually check each case to decide whether the outlier is a TP or a false positive (FP).
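The sampling and TP-rate bookkeeping described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names `sample_cases` and `tp_rate_by_threshold`, the dictionary layout keyed by `(threshold, model)`, and the fixed random seed are all assumptions made for the example. The manual TP/FP verdicts are assumed to be supplied as booleans after human inspection.

```python
import random
from collections import defaultdict


def sample_cases(outliers, per_pair=100, seed=0):
    """Randomly sample up to `per_pair` outlier cases for each (threshold, model) pair.

    `outliers` is a hypothetical mapping: (threshold, model) -> list of case ids.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = {}
    for pair, cases in outliers.items():
        k = min(per_pair, len(cases))  # a pair may have fewer outliers than requested
        sampled[pair] = rng.sample(cases, k)
    return sampled


def tp_rate_by_threshold(labels):
    """Aggregate manual verdicts into one TP rate per threshold.

    `labels` is a hypothetical mapping: (threshold, model) -> list of bools,
    where True means the inspected outlier was judged a true positive.
    """
    tp = defaultdict(int)
    total = defaultdict(int)
    for (threshold, _model), verdicts in labels.items():
        tp[threshold] += sum(verdicts)
        total[threshold] += len(verdicts)
    # Skip thresholds with no inspected cases to avoid division by zero.
    return {t: tp[t] / total[t] for t in total if total[t] > 0}
```

With 5 thresholds, 8 models, and at least 100 outliers per pair, `sample_cases` yields 40 pairs and 4,000 cases in total, matching the count stated above. The threshold with the highest TP rate can then be read off with `max(rates, key=rates.get)`.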