Datasets

In our Experimental Design we use two datasets to validate our approach. Here we list the binaries, source code, and origin of each dataset.

Projects

This dataset comprises ten compiled Java projects extracted from the Qualitas.class Corpus. We rely on this dataset because it is publicly available and its projects are already compiled. This (i) avoids any potential problems or inconsistencies in compiling the projects, and (ii) ensures the reproducibility of the study. The ten projects were selected to obtain a dataset that is diverse in terms of application domain and code size. The following table reports statistics of the dataset:

Projects

Libraries

This dataset comprises 46 different Apache Commons libraries from the Apache Commons Project Distributions. We selected all the Apache Commons libraries for which we were able to identify both the binaries and the source code of the latest available release. We downloaded the compressed files for binaries and source code. In the binaries archive we located the jar file that represents the library and extracted all the .class files from it. The compressed source code files were simply decompressed. The following table reports statistics of the dataset:

Libraries
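
The extraction step described for the Libraries dataset (pulling all .class files out of a library's jar) can be illustrated with a minimal sketch. This is not the script used in the study; the function name extract_class_files and the paths in the usage note are hypothetical, and the sketch only relies on the fact that a jar file is a zip archive.

```python
import zipfile
from pathlib import Path


def extract_class_files(jar_path, out_dir):
    """Extract every .class entry of a jar into out_dir.

    Hypothetical helper for illustration only. Returns the sorted
    list of extracted entry names (paths inside the jar).
    """
    out_dir = Path(out_dir)
    extracted = []
    # A jar is a zip archive, so the standard zipfile module can read it.
    with zipfile.ZipFile(jar_path) as jar:
        for info in jar.infolist():
            # Skip directories and non-bytecode entries (manifest, resources).
            if info.is_dir() or not info.filename.endswith(".class"):
                continue
            # extract() recreates the package directory structure under out_dir.
            jar.extract(info, out_dir)
            extracted.append(info.filename)
    return sorted(extracted)
```

For example, calling extract_class_files("some-library.jar", "classes") with a placeholder jar name would write the bytecode files under classes/ while preserving the package directories, and leave the manifest and other resources behind.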