Deep Learning Code Similarities

RQ5: Can trained DL-based models be reused on different, previously unseen projects?

One of the major drawbacks of DL-based models is their long training time compared with other techniques. This cost could be amortized if the models could be reused across different projects belonging to different domains. The major factor that hinders the reusability of such models is the possible variability in the vocabulary of new, unseen projects.

In this research question we show that an AST model trained on a given project can be successfully reused to detect similar code fragments in a different project.
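The vocabulary variability mentioned above can be made concrete by measuring how many tokens of a new project were never seen in the training project. The following is a minimal sketch under stated assumptions, not the tooling used in the study: the regex-based `tokenize` function and the `oov_rate` helper are hypothetical names introduced here for illustration, and a real pipeline would use a language-aware lexer or parser.

```python
import re


def tokenize(source: str) -> list[str]:
    """Split source text into identifier-like words and single symbols.

    Assumption: a crude regex tokenizer stands in for a real lexer.
    """
    return re.findall(r"[A-Za-z_]\w*|\S", source)


def oov_rate(train_sources: list[str], new_sources: list[str]) -> float:
    """Fraction of tokens in new_sources that are absent from the
    vocabulary built from train_sources (out-of-vocabulary rate)."""
    vocab = {tok for src in train_sources for tok in tokenize(src)}
    new_tokens = [tok for src in new_sources for tok in tokenize(src)]
    if not new_tokens:
        return 0.0
    unseen = sum(1 for tok in new_tokens if tok not in vocab)
    return unseen / len(new_tokens)


if __name__ == "__main__":
    train = ["int a = b;"]
    unseen_project = ["int c = b;"]
    # Only the identifier "c" is missing from the training vocabulary.
    print(oov_rate(train, unseen_project))
```

A low rate suggests that a model trained on the first project has seen most of the symbols it will meet in the second one; a high rate signals the reusability problem described above.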

Reused AST Model

The following zip files contain all the candidates at method and class level extracted using a reused AST model (i.e., a model trained on the Lucene project and applied to all the other projects). These candidates have been compared with the original list of candidates available in RQ1.

Download Candidates Reused AST Model
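A comparison between the reused-model candidates and the original RQ1 candidates could be sketched as a set overlap. This is an illustration only: the layout of the zip files is not described on this page, so the sketch assumes each candidate is a pair of fragment identifiers, and `normalize` and `compare_candidates` are hypothetical helper names.

```python
def normalize(pairs):
    """Treat (a, b) and (b, a) as the same candidate pair."""
    return {tuple(sorted(pair)) for pair in pairs}


def compare_candidates(original, reused):
    """Summarize the overlap between two candidate lists.

    Assumption: each candidate is a pair of fragment identifiers.
    """
    orig, reu = normalize(original), normalize(reused)
    shared = orig & reu
    union = orig | reu
    return {
        "shared": len(shared),
        "only_original": len(orig - reu),
        "only_reused": len(reu - orig),
        # Jaccard similarity; two empty lists count as identical.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


if __name__ == "__main__":
    original = [("A.foo", "B.bar"), ("C.baz", "D.qux")]
    reused = [("B.bar", "A.foo"), ("E.x", "F.y")]
    print(compare_candidates(original, reused))
```

Sorting each pair before comparison keeps the result independent of the order in which the two fragments are listed.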

Reused CloneDetector Model

The following zip file contains the training and test sets of the reused CloneDetector model. In particular, the training set consists only of the manually validated instances from one project (hibernate), while the test set contains all the instances of the remaining 9 projects in the Projects dataset.

Download Dataset
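The cross-project split described above (train on one project, test on all the others) can be sketched as follows. The record layout is an assumption made for illustration: each instance is taken to be a dictionary with a hypothetical `project` field, which may not match the format of the downloadable dataset.

```python
def split_by_project(instances, train_project):
    """Put all instances of train_project in the training set and
    every other instance in the test set.

    Assumption: each instance is a dict with a "project" key.
    """
    train = [inst for inst in instances if inst["project"] == train_project]
    test = [inst for inst in instances if inst["project"] != train_project]
    return train, test


if __name__ == "__main__":
    # Toy records; the "label" field is hypothetical as well.
    instances = [
        {"project": "hibernate", "label": 1},
        {"project": "hibernate", "label": 0},
        {"project": "lucene", "label": 1},
    ]
    train, test = split_by_project(instances, "hibernate")
    print(len(train), len(test))
```

Splitting on the project name rather than at random ensures that no instance from a test project leaks into training, which is what makes the evaluation a test of reuse on unseen projects.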
