We conducted the vulnerability data collection in four steps as follows:
1. Java programs collection: We first searched for open-source Java programs with disclosed CVEs and corresponding patch commits in advisory sources such as NVD, Debian, and Red Hat Bugzilla, initially obtaining a list of 680 programs.
2. Version range extraction and method-level locating: We used the V-SZZ algorithm to extract the range of program versions affected by each CVE. In parallel, we used Universal Ctags to locate method-level information in both the vulnerable and fixed versions, which is needed to label vulnerabilities at method granularity.
3. Program packaging: Since the tools under evaluation accept different types of input (e.g., source code and binaries), we excluded the programs that could not be packaged. We finally obtained 165 packageable open-source programs.
4. Cross-validating: To ensure the accuracy of our benchmark, we engaged three security experts from our anonymous industry partner in this process. These experts, with over 7 years of Java programming experience and over 5 years of vulnerability knowledge, verified the vulnerability locations identified by our automated process and cross-validated each other's work. This process involved:
4.1 Tool results validation: Each expert independently verified the vulnerability locations identified by our automated process.
4.2 Code context understanding: If discrepancies were found, each expert deeply analyzed the code where the vulnerability was reported, examining the functionality of the specific code segment, its role within the whole program, and its interactions with other components.
4.3 Vulnerability and patch review: Each expert thoroughly reviewed the details provided in the vulnerability and patch information obtained from sources such as NVD, Debian, and Red Hat Bugzilla.
4.4 Independent cross-validation: After each expert independently reviewed the identified vulnerabilities and patches, they cross-validated each other's results.
4.5 Consensus building: If disagreements arose during the independent cross-validation, majority voting was used to make the decision. In cases where the votes were evenly split, a discussion was held to resolve the conflict.
4.6 Final voting and labeling: Once a consensus was reached through discussion, a final vote was taken to confirm the decision. The vulnerability was then labeled with detailed information such as its location, the software versions affected, and the specific methods where the vulnerabilities and patches were located.
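The voting rule in steps 4.5 and 4.6 can be sketched as follows. This is only an illustration, not the procedure's actual tooling: the function name `majority_label` and the string labels are hypothetical, and a return value of `None` stands for "votes evenly split, resolve by discussion".

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the most-voted label, or None when the top vote is tied.

    Hypothetical helper: a None result means the experts need to
    discuss the case before taking a final vote.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common(2)
    top_label, top_count = ranked[0]
    # A second label with the same count means the vote is evenly split.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None
    return top_label


# Two of three experts agree, so their label wins.
decision = majority_label(["vulnerable", "vulnerable", "not_vulnerable"])
```

With three experts, a tie can only occur when all three give different answers or when one expert abstains, which is where the discussion step applies.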
As shown in the sheets below, the real-world benchmark includes 165 unique CVEs, which are mapped and grouped into 37 CWE weaknesses and 8 CWE Classes in the CWE-1000 View, with a total of 768 vulnerable methods ("vul_method_cnt") and 891 corresponding fixed methods ("fix_method_cnt").
To the best of our knowledge, this is the largest real-world Java vulnerability benchmark.
As shown in the figure below, to further demonstrate how representative the "real-world" dataset is, we also report the popularity of these programs in GitHub stars (⋆); the programs averaged 3,108 stars as of August 2022.
The Java CVE Benchmark data is now available and contains the vulnerable and patched versions of each program, as follows:
Note: Each CSV file within the zip files is named by joining three fields with underscores: "Program Name", "CVE-ID", and "Version".
To illustrate, consider the file named "active-directory-plugin_CVE-2017-2649_active-directory-plugin-active-directory-2.2.csv". This name signifies that the CSV file contains information about the CVE-2017-2649 vulnerability, specifically in relation to the active-directory-plugin. The affected version of the program is represented as "active-directory-plugin-active-directory-2.2".
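A minimal sketch of splitting such a file name back into its three fields is shown below. It assumes the fields are separated by underscores, as in the example above, and anchors on the CVE-ID so that underscores inside a program name or version do not break the split. The names `LabelFileName` and `parse_label_filename` are hypothetical and not part of the benchmark.

```python
import re
from typing import NamedTuple


class LabelFileName(NamedTuple):
    program: str
    cve_id: str
    version: str


# <program>_<CVE-ID>_<version>.csv, anchored on the CVE-ID in the middle.
_NAME_PATTERN = re.compile(
    r"^(?P<program>.+?)_(?P<cve_id>CVE-\d{4}-\d{4,})_(?P<version>.+)\.csv$"
)


def parse_label_filename(name: str) -> LabelFileName:
    """Split a benchmark CSV file name into program, CVE-ID, and version."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return LabelFileName(match["program"], match["cve_id"], match["version"])


parsed = parse_label_filename(
    "active-directory-plugin_CVE-2017-2649_"
    "active-directory-plugin-active-directory-2.2.csv"
)
```

For the example file, `parsed.program` is "active-directory-plugin", `parsed.cve_id` is "CVE-2017-2649", and `parsed.version` is "active-directory-plugin-active-directory-2.2".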
CSV File Content: Each CSV file contains label information for a specific CVE. This includes the Vulnerability Path (consisting of vulnerable files, line ranges, and method names) and the corresponding Fixed Path (featuring the fixed files, line ranges, and method names).
For example, with CVE-2017-2649, one of the vulnerable files is 'src/main/java/hudson/plugins/active_directory/ActiveDirectorySecurityRealm.java', with a line range of 208-210. The corresponding method is 'ActiveDirectorySecurityRealm'. The format for the Fixed Path information follows a similar structure.
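One way to hold a single path entry in memory is sketched below. The exact CSV column layout is not described here, so the `PathEntry` fields and the helper names are assumptions made for illustration; only the "start-end" line-range form is taken from the example above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PathEntry:
    """One labeled location: a file, an inclusive line range, and a method."""
    file_path: str
    start_line: int
    end_line: int
    method: str


def parse_line_range(text: str) -> tuple[int, int]:
    """Parse a range such as '208-210'; a single number means one line."""
    start_text, sep, end_text = text.strip().partition("-")
    start = int(start_text)
    end = int(end_text) if sep else start
    if start < 1 or end < start:
        raise ValueError(f"invalid line range: {text!r}")
    return start, end


def make_entry(file_path: str, line_range: str, method: str) -> PathEntry:
    """Build a PathEntry from the three pieces of label information."""
    start, end = parse_line_range(line_range)
    return PathEntry(file_path, start, end, method)


entry = make_entry(
    "src/main/java/hudson/plugins/active_directory/ActiveDirectorySecurityRealm.java",
    "208-210",
    "ActiveDirectorySecurityRealm",
)
```

The same structure could be reused for Fixed Path entries, since they carry the same three pieces of information.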