The datasets can be downloaded here.
The datasets include the two benchmarks listed below.
BigCloneBench [1] is a benchmark widely used to evaluate code clone detection; 98% of its clones are Type-III/Type-IV clones. It is mined from IJaDataset2.0 and confirmed by three experts. For comparison, we use the version of BigCloneBench used in Wei and Li [2], which contains 9,134 code fragments.
Modified-BigCloneBench is an improved dataset derived from BigCloneBench. It is more appropriate for validating and comparing approaches to detecting Type-III/Type-IV clones.
[1] Jeffrey Svajlenko and Chanchal K. Roy. 2015. Evaluating Clone Detection Tools with BigCloneBench. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 131–140.
[2] Huihui Wei and Ming Li. 2017. Supervised Deep Features for Software Functional Clone Detection by Exploiting Lexical and Syntactical Information in Source Code. In IJCAI. 3034–30