Experimental Results
This section presents additional experimental results and our analysis.
We selected five existing tools for our experiments: UAutomizer, CPAchecker, 2LS, AProVE, and T2. Among them, UAutomizer won first place in the termination category from SV-COMP 2017 to SV-COMP 2021. CPAchecker and 2LS won silver and bronze, respectively, in the termination category of SV-COMP 2020 and SV-COMP 2021. The performance of AProVE has been demonstrated in the annual international Termination Competition, and AProVE placed in the top three from SV-COMP 2017 to SV-COMP 2019. T2 is powerful and has been shown to be more successful than Julia and TNT. Therefore, the five selected tools are representative tools for termination analysis. The tool versions used in our experiments are: UAutomizer (0.2.2, the version provided for SV-COMP 2022), CPAchecker (2.0), 2LS (0.9.5), AProVE (the version provided for SV-COMP 2022), and T2 (2016 version). Note that all five tools were run with a timeout of 900 seconds.
We evaluate our benchmark on state-of-the-art termination analysis tools; however, these tools cannot directly handle some of the special variable types and functions found in real-world projects. Therefore, we first replace these special variable types with supported types and rewrite the corresponding functions into forms that the termination analysis tools accept, e.g., file-type variables are replaced by arrays. We also find that none of the five tools can handle C++ benchmarks, although all of them can handle C benchmarks; consequently, the 14 recursion benchmarks involving C++ features (e.g., classes, inheritance) cannot be accepted by any of the five tools. In total, we test 100 C-program benchmarks, of which 90 are loops and the remaining 10 are recursions.
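As an illustration of such a rewrite, the following sketch (our own, with hypothetical names; the actual rewritten benchmarks may differ) models a FILE*-based read loop as reads from a nondeterministically initialized array, using the standard SV-COMP nondeterminism stub:

```c
/* Original loop (not supported: FILE* and I/O functions):
     while (fgets(buf, sizeof(buf), fp) != NULL) { ... }
   Rewritten form: model the file content as an array of
   nondeterministic values, with 0 acting as EOF. */
extern int __VERIFIER_nondet_int(void);

int main(void) {
    int file[100];                        /* models the file content */
    for (int i = 0; i < 100; i++)
        file[i] = __VERIFIER_nondet_int();
    int pos = 0;
    while (pos < 100 && file[pos] != 0)   /* models "read until EOF" */
        pos++;                            /* terminates: pos is bounded by 100 */
    return 0;
}
```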
Table 1 shows the detailed results of the five tools on our constructed benchmark. The first and ninth columns list the leaf categories of non-termination bugs in our classification, and the second and tenth columns give the number of bugs in each category. The termination results of the five tools are shown starting from the third and eleventh columns. A green background means the tool handled the benchmark correctly; a red background means the tool handled it incorrectly; a yellow background means the tool could not handle it and output ``UNKNOWN''.
The figure on the left shows the comparison of the termination analysis tools' results on the existing benchmarks and on ours. Table 2 shows the statistical results of the five state-of-the-art tools on our benchmarks and on the existing benchmarks.
As shown in Table 1, the ability of these five tools to reach correct verdicts drops compared with their performance on existing benchmarks. For example, UAutomizer solves only 47 (47%) of our programs correctly, whereas it handled 71.47% (1581/2212) of the programs in SV-COMP 2021 correctly.
Moreover, all five tools tend to answer ``UNKNOWN'', which indicates that our benchmark is more complex than existing benchmarks. The drop in accuracy and the tendency to answer ``UNKNOWN'' stem from two main reasons:
First, during benchmark extraction we set the preconditions of these programs to be non-deterministic, which over-approximates their real preconditions and makes termination analysis more complex.
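For instance, in the following sketch (our own minimal example, not taken from the benchmark), the loop always terminates in the original project because the caller guarantees step > 0; once step is made nondeterministic, the analyzer must also consider step <= 0, under which the loop diverges:

```c
extern int __VERIFIER_nondet_int(void);

int main(void) {
    int i    = __VERIFIER_nondet_int();
    int n    = __VERIFIER_nondet_int();
    /* In the original project the caller guarantees step > 0, so the
       loop always terminates; the extracted benchmark loses this
       precondition and must consider step <= 0 as well. */
    int step = __VERIFIER_nondet_int();
    while (i < n)
        i += step;    /* diverges for step == 0 (and for step < 0) */
    return 0;
}
```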
Second, benchmarks extracted from real-world projects may contain complex computations (e.g., bit calculation, pointer manipulation), various data structures (e.g., arrays, linked lists), and data types (e.g., size_t).
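The following loop (a classic bit-manipulation idiom, shown here as our own illustration of the difficulty) terminates because each iteration clears the lowest set bit of x, so the number of set bits strictly decreases; proving this requires bit-level reasoning that linear ranking functions over x do not capture:

```c
extern unsigned int __VERIFIER_nondet_uint(void);

int main(void) {
    unsigned int x = __VERIFIER_nondet_uint();
    /* x & (x - 1) clears the lowest set bit of x, so the number of
       set bits strictly decreases each iteration and the loop
       terminates -- but only bit-level reasoning shows this. */
    while (x != 0)
        x = x & (x - 1);
    return 0;
}
```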
In addition, we noticed that these tools fail to work on real-world projects directly and perform worse on our real-world benchmark than on existing well-established benchmarks. Therefore, existing termination analysis tools should be improved in terms of their scalability and applicability to real-world projects.
We further analyze the unhandled bugs to explore the weaknesses of the existing tools. Table 3 shows the results of the tools on benchmarks involving special features. The first column lists the special features that significantly affect the results of the existing tools. The second column gives the tools' results on these programs (UN means ``UNKNOWN''; W means ``Wrong'').
(1) Every special feature in the first column can make the termination analysis tools answer ``UNKNOWN''. However, these features are commonly used in real-world projects, which indicates that the state-of-the-art tools are still inefficient and immature when analyzing the termination of real-world projects.
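For example, even a plain linked-list traversal (our own sketch below, not a benchmark program) typically yields ``UNKNOWN'': the loop terminates only if the list is acyclic, a shape property that most termination analyzers cannot establish:

```c
#include <stddef.h>

struct node { struct node *next; };

/* Terminates only when the list reachable from head is acyclic;
   establishing that requires shape analysis, so termination tools
   typically report UNKNOWN here. */
void traverse(struct node *head) {
    struct node *p = head;
    while (p != NULL)
        p = p->next;
}
```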
(2) Some features can affect the soundness of these tools (i.e., cause wrong answers), which deserves more attention. This is mainly because programming errors caused by overflow, type handling, and bit calculation, which change the actual execution of programs, are not well modeled.
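A classic instance of this effect (our own example, not from the benchmark) is shown below: a tool that models the loop counter as an unbounded mathematical integer concludes that the loop terminates, yet under C semantics the unsigned wraparound makes it run forever:

```c
int main(void) {
    /* i has type unsigned char, so i <= 255 always holds and the
       increment wraps 255 back to 0: the loop never terminates.
       A tool modeling i as an unbounded integer wrongly reports
       termination, i.e., an unsound verdict. */
    for (unsigned char i = 0; i <= 255; i++) {
        /* loop body */
    }
    return 0;
}
```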
(3) General programming errors, such as overflows, implicit casts, and incorrect variable types, deserve more attention in future termination analysis research. Supporting complex language features and data structures would also help to improve the practical performance of termination analysis tools.
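One recurring pattern of the incorrect-variable-type error is sketched below (our own hypothetical example): a countdown loop whose index is declared with an unsigned type, so the exit condition i >= 0 is a tautology and the loop never terminates:

```c
#include <stddef.h>

void scan_backwards(const int *a, size_t n) {
    /* Bug: size_t is unsigned, so i >= 0 is always true; when i
       reaches 0, i-- wraps around to SIZE_MAX and the loop never
       terminates (besides reading out of bounds). */
    for (size_t i = n - 1; i >= 0; i--)
        (void)a[i];
}
```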