Static application security testing (SAST) plays a significant role in the software development life cycle (SDLC). However, it is challenging to comprehensively evaluate the effectiveness of SAST tools and determine which ones are better at detecting vulnerabilities.
To provide guidance on tool development, improvement, evaluation, and selection for developers, researchers, and potential users, we evaluated seven representative Java SAST tools selected from 161 candidates according to our selection criteria. We collected two types of benchmark datasets: a synthetic dataset (i.e., the OWASP Benchmark) and a real-world vulnerability benchmark. The real-world benchmark comprises 165 open-source Java programs with 165 unique CVEs, covering 37 unique vulnerability types (CWE Weaknesses) that belong to 8 CWE Classes in the CWE-1000 view. Using the Common Weakness Enumeration (CWE) as a common reference, we mapped both the detection rules of these tools and the CVEs in our benchmark datasets to CWE entries, which allowed us to automatically compare the effectiveness of each tool. We then evaluated the selected tools against the OWASP Benchmark and the real-world benchmark. Given their poor effectiveness on the real-world benchmark, we further dissected the composition of their false negatives. Moreover, we assessed the consistency between the vulnerabilities each tool actually detected and those its detection rules claim to cover. Finally, we analyzed the runtime performance of the seven tools on 1,049 representative open-source Java programs.
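To illustrate the idea behind the CWE-based comparison (not the paper's actual evaluation scripts), the following minimal Python sketch shows how tool findings and benchmark CVEs, once both mapped to CWE IDs, can be matched automatically; the names (BenchmarkCase, Finding, CWE_PARENTS) and the tiny CWE-1000 excerpt are hypothetical.

```python
# Illustrative sketch of CWE-based matching between tool findings and benchmark CVEs.
# All identifiers and data here are hypothetical, for demonstration only.
from dataclasses import dataclass

# Hypothetical excerpt of the CWE-1000 hierarchy: child CWE -> parent CWE.
CWE_PARENTS = {"CWE-89": "CWE-943", "CWE-943": "CWE-74"}

@dataclass(frozen=True)
class BenchmarkCase:
    program: str   # vulnerable project/version in the benchmark
    cve: str       # e.g. "CVE-2020-0001"
    cwe: str       # CWE weakness the CVE is mapped to, e.g. "CWE-89"

@dataclass(frozen=True)
class Finding:
    program: str
    cwe: str       # CWE the tool's detection rule is mapped to

def cwe_matches(reported: str, expected: str) -> bool:
    """True if the reported CWE equals the expected one or an ancestor of it
    in the (partial) CWE-1000 hierarchy above."""
    node = expected
    while node is not None:
        if node == reported:
            return True
        node = CWE_PARENTS.get(node)
    return False

def detection_rate(cases: list[BenchmarkCase], findings: list[Finding]) -> float:
    """Fraction of benchmark CVEs for which the tool reports a finding in the
    same program with a matching CWE (counted as a true positive)."""
    detected = sum(
        any(f.program == c.program and cwe_matches(f.cwe, c.cwe) for f in findings)
        for c in cases
    )
    return detected / len(cases) if cases else 0.0

# Usage with made-up data: a parent-level CWE match counts as a detection.
cases = [BenchmarkCase("lib-a", "CVE-2020-0001", "CWE-89")]
findings = [Finding("lib-a", "CWE-943")]
print(detection_rate(cases, findings))  # 1.0
```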