A Comprehensive Study on Quality Assurance Tools for Java
Accepted by ISSTA-2023
Introduction
Quality assurance (QA) tools are receiving increasing attention and are widely used by developers. Given the wide range of QA solutions, how to evaluate QA tools remains an open question. Most existing research is limited in the following ways: (i) it compares tools without analyzing their scanning rules; (ii) it reaches conflicting conclusions about tool effectiveness due to differences in study methodology and benchmark datasets; (iii) it does not separately analyze the role of the reported warnings; and (iv) there is no large-scale study of time performance. To address these problems, in this paper we systematically select 6 free or open-source tools from a list of 148 existing Java QA tools for a comprehensive study. To evaluate the tools along multiple dimensions, we first map their scanning rules to CWE and analyze the coverage and granularity of these rules. We then conduct an experiment on 5 benchmarks comprising 1,425 bugs to investigate the effectiveness of the tools. Furthermore, we take substantial effort to investigate the effectiveness of the reported warnings by comparing them with the labeled real bugs and examining their role in bug detection. Finally, we assess the tools' time performance on 1,049 projects. The findings of our comprehensive study can help developers improve their tools and provide users with suggestions for selecting QA tools.
Overview
We conducted a benchmark experiment with 6 QA tools on 1,425 bugs from 5 benchmarks and a large-scale experiment with these tools on 1,049 projects. Our comprehensive study is the largest study on QA tools to date (i.e., 6 × (1,425 × 2 + 1,049) = 23,394 scanning tasks). We spent over 4 months preparing and executing these scanning tasks.
We spent 7.5 person-months mapping the 1,813 scanning rules of the 6 tools to CWE, as well as the 311 bugs detected by these tools. To the best of our knowledge, this is the first work that constructs a connection between references, rules, tools, and datasets.
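To illustrate the form of this mapping, the Java sketch below shows how individual scanning rules could be associated with CWE entries. The rule names and CWE IDs are illustrative examples only; they do not reproduce the actual mapping constructed in the study.

import java.util.List;
import java.util.Map;

// Illustrative sketch of a rule-to-CWE mapping. The entries below are
// examples only and do not reproduce the mapping produced in the study.
public class RuleCweMapping {

    // Each scanning rule of a tool is mapped to one or more CWE entries.
    static final Map<String, List<Integer>> RULE_TO_CWE = Map.of(
            "SpotBugs:NP_NULL_ON_SOME_PATH", List.of(476),  // NULL Pointer Dereference
            "PMD:CloseResource", List.of(404),              // Improper Resource Shutdown or Release
            "SonarQube:S2095", List.of(459, 772)            // Incomplete Cleanup / Missing Release of Resource
    );

    public static void main(String[] args) {
        RULE_TO_CWE.forEach((rule, cwes) ->
                System.out.println(rule + " -> CWE " + cwes));
    }
}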
We evaluated the selected QA tools along multiple dimensions, including the coverage and granularity of their scanning rules, the effectiveness of the tools, the effectiveness of their warnings, and their time performance. Our evaluation provides a systematic and comprehensive comparison of QA tools, from the scanning foundation to the scanning results and the scanning cost.
Research Questions
RQ1: To what extent do the scanning rules cover different bugs, and how does the granularity of the scanning rules vary across tools?
RQ2: To what extent can QA tools detect bugs from a diversity of benchmarks?
RQ3: How effective are the warnings reported by the QA tools?
RQ4: What is the time performance of the QA tools?
The Combination of Tools
We combined the detection results of all six tools. The combination of SonarQube and PMD detects 213 bugs in total. With three tools, Error Prone, PMD, and SonarQube together detect 269 bugs. Adding SpotBugs as the fourth tool raises the total to 302 bugs, and adding Infer as the fifth tool contributes 8 additional detections.
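As a minimal sketch, the detection count of a tool combination can be computed as the size of the union of the bug IDs each tool detects. The per-tool sets below are placeholders, not the study's actual detection results.

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the combination analysis: the size of the union of the bug IDs
// detected by the selected tools is the number of bugs that combination finds.
// The per-tool sets are placeholders, not the study's actual results.
public class ToolCombination {

    static final Map<String, Set<Integer>> DETECTED_BUGS = Map.of(
            "SonarQube", Set.of(1, 2, 3),
            "PMD", Set.of(2, 4),
            "Error Prone", Set.of(5),
            "SpotBugs", Set.of(1, 6),
            "Infer", Set.of(7)
    );

    // Count the bugs detected by any tool in the given combination.
    static int combinedDetections(List<String> tools) {
        Set<Integer> union = new HashSet<>();
        for (String tool : tools) {
            union.addAll(DETECTED_BUGS.get(tool));
        }
        return union.size();
    }

    public static void main(String[] args) {
        System.out.println(combinedDetections(List.of("SonarQube", "PMD")));
        System.out.println(combinedDetections(List.of("SonarQube", "PMD", "Error Prone", "SpotBugs")));
    }
}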
We present the Venn diagram of the tool detection results below.
Study Data