This page investigates several example programs and their corresponding specifications to analyze the strengths and weaknesses of SpecGen. The detailed results of SpecGen, Houdini, and Daikon for all test cases can be accessed here.
SpecGen shows an impressive ability to handle programs containing strings and arrays, benefiting from the code comprehension ability of LLMs to sort out the relations between the individual elements and to generate specifications targeted at the likely functionality of the input program. We first investigate an example program, TwoSum, that is uniquely handled by SpecGen, to illustrate the workflow of SpecGen in more detail. The program searches a given integer array for the indices of two distinct elements whose sum is exactly the target value, and is implemented with two nested loops. The specifications generated by SpecGen for this program are shown in the figure below. We can observe that the post-conditions on line 5 and the loop invariants on line 9 contain two nested \forall expressions due to the nested loop structure, which is too complicated for Houdini and Daikon to cover. In this test case, the results of both Houdini and Daikon fail to pass verification. Due to flaws in its workflow, Houdini produces a tremendously large set of unfiltered candidates that cannot be presented on this page. We have uploaded the results of Houdini and Daikon on this test case to our artifact.
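To make the shape of these specifications concrete, the following is a minimal, illustrative sketch of a TwoSum implementation annotated with nested-\forall JML specifications of the kind described above. The variable names mirror those discussed in the next paragraph, but the code and clauses are our own reconstruction, not the verbatim SpecGen output from the figure.

```java
// Illustrative sketch only, not the verbatim SpecGen output.
class TwoSum {
    //@ requires nums != null && nums.length >= 2;
    //@ ensures \result == null ||
    //@         (0 <= \result[0] && \result[0] < \result[1] && \result[1] < nums.length
    //@          && nums[\result[0]] + nums[\result[1]] == target);
    static int[] twoSum(int[] nums, int target) {
        int n = nums.length;
        //@ loop_invariant 0 <= i && i <= n;
        //@ // No pair with first index below i sums to target (nested \forall):
        //@ loop_invariant (\forall int k; 0 <= k && k < i;
        //@                   (\forall int j; k + 1 <= j && j < n;
        //@                      nums[k] + nums[j] != target));
        for (int i = 0; i < n; i++) {
            //@ loop_invariant i + 1 <= j && j <= n;
            //@ // Partners already tried for the current i do not sum to target:
            //@ loop_invariant (\forall int k; i < k && k < j;
            //@                   nums[i] + nums[k] != target);
            for (int j = i + 1; j < n; j++) {
                if (nums[i] + nums[j] == target) {
                    return new int[] { i, j };
                }
            }
        }
        return null; // no such pair exists
    }
}
```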
In comparison, benefiting from the code comprehension ability of the LLM, our approach successfully generates appropriate template specifications with the correct nested \forall structure. However, it is difficult for the LLM to accurately capture the subtle comparison relations between variables within the loop invariants. For instance, the expression k + 1 <= j && j < n inside the loop invariant on line 9 can easily be mistakenly written as k <= j && j < n, and i < k && k < j on line 14 can be incorrectly written as i <= k && k < j. Either mistake results in a verification failure. These errors are so fine-grained that even human experts encountering them have to stop and think for a while before sorting out the correct relations. Through its mutation-based generation process, SpecGen can repair such errors by mutating the comparison operators and eventually produce verifiable results.
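As a rough illustration of this repair step, the sketch below enumerates single-operator mutants of a specification string. The class name, the regex-based approach, and the operator set are hypothetical simplifications of ours; SpecGen's actual mutation process may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of comparison-operator mutation; SpecGen's actual
// mutation process may differ in detail.
final class ComparisonMutator {

    // Order matters: two-character operators must be tried before "<" and ">".
    private static final Pattern CMP = Pattern.compile("<=|>=|==|!=|<|>");
    private static final String[] OPS = { "<", "<=", ">", ">=", "==", "!=" };

    // Enumerate every specification obtained by replacing exactly one
    // comparison operator, e.g. "k <= j && j < n" yields "k < j && j < n".
    static List<String> mutants(String spec) {
        List<String> result = new ArrayList<>();
        Matcher m = CMP.matcher(spec);
        while (m.find()) {
            for (String op : OPS) {
                if (!op.equals(m.group())) {
                    result.add(spec.substring(0, m.start()) + op + spec.substring(m.end()));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Each mutant would then be re-submitted to the verifier until one passes.
        for (String mutant : mutants("k <= j && j < n")) {
            System.out.println(mutant);
        }
    }
}
```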
We further take a few examples to illustrate the strengths of our method and the weaknesses of Houdini and Daikon. The example program Absolute, shown in the figure below, calculates the absolute value of a given integer. The specifications generated by Houdini and Daikon are shown in the figure below. It is worth noting that Houdini generated more specifications, but most of the results, such as \result > -1, are merely weaker or equivalent duplicates of \result >= 0. This is because Houdini has no logic to suppress weaker and equivalent specifications, whereas Daikon does. Therefore, the effective result of both Houdini and Daikon is \result >= num, which is correct but not enough to fully describe the behavior of the function. Consequently, the specifications generated by Houdini and Daikon received average ratings of 3.50 and 3.36, respectively. The specifications generated by SpecGen and the manually written ground truth are also presented in the figure below. We can observe that the specifications generated by SpecGen articulate the constraints satisfied by the return value depending on the conditions satisfied by the given integer num, similar to what the ground-truth specifications do. Consequently, the result of SpecGen received an average rating of 4.85, which is close to the 5.00 rating of the ground truth.
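For reference, a minimal sketch of the contrast discussed above might look as follows; the method signature and exact clauses are assumptions, not the verbatim content of the figure.

```java
// Illustrative sketch, not the verbatim specifications from the figure.
class Absolute {
    // Effective Houdini/Daikon result (correct but incomplete):
    //   ensures \result >= num;
    // A behavior-complete specification in the style of SpecGen and the ground truth:
    //@ ensures num >= 0 ==> \result == num;
    //@ ensures num < 0 ==> \result == -num;
    static int absolute(int num) {
        return num >= 0 ? num : -num;
    }
}
```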
The next example, isPalindrome, is a program that decides whether a given string is a palindrome. As in the TwoSum case, the specifications generated by Daikon (which can be accessed here) failed to pass verification. As for Houdini, for the method isPalindrome as a whole, it generated some trivial specifications about the nullness of the input string. For the loop within the method, Houdini generated some simple loop invariants about numerical constraints on the variable n, such as n >= 0. Such specifications are clearly too trivial to articulate the real behavior of the program. In comparison, the specifications generated by SpecGen are much stronger. For the method isPalindrome, SpecGen succeeds in describing the definition of a palindrome string with a \forall expression, fully specifying the function of this method as a whole. For the loop within the method, SpecGen successfully describes, with another \forall expression, that the part of the input string that has already been checked satisfies the definition of a palindrome string. These specifications are equivalent to the ground truth, which is also shown in the figure below.
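A minimal sketch of such specifications, assuming a straightforward loop-based implementation, is shown below; the exact clauses generated by SpecGen appear in the figure and may differ.

```java
// Illustrative sketch, not the verbatim SpecGen output.
class Palindrome {
    //@ requires s != null;
    //@ // The method returns true iff the whole string satisfies the palindrome definition:
    //@ ensures \result <==> (\forall int k; 0 <= k && k < s.length();
    //@                         s.charAt(k) == s.charAt(s.length() - 1 - k));
    static boolean isPalindrome(String s) {
        int n = s.length();
        //@ loop_invariant 0 <= i && i <= n / 2;
        //@ // The part of the string checked so far satisfies the palindrome definition:
        //@ loop_invariant (\forall int k; 0 <= k && k < i;
        //@                   s.charAt(k) == s.charAt(n - 1 - k));
        for (int i = 0; i < n / 2; i++) {
            if (s.charAt(i) != s.charAt(n - 1 - i)) {
                return false;
            }
        }
        return true;
    }
}
```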
We illustrate the limitations of LLMs with the following example, ComputeArea, which computes the total area covered by two rectangles. Each rectangle is determined by the coordinates of its top-right and bottom-left vertices. Although the program is loop-free, it involves a large number of variables and many non-linear operations, resulting in an excessively long post-condition on line 6. LLMs can hardly sort out the proper specification due to their limited reasoning ability. Consequently, SpecGen cannot generate the correct specifications for this program in our experiment. It is not surprising that both Houdini and Daikon also failed to generate specifications in this test case. Houdini crashes due to the massive number of candidates generated, which is caused by the large number of variables. Daikon generates no specifications because its templates cannot cover specifications this complicated.
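To give a sense of the complexity involved, here is a sketch of the program and the kind of non-linear post-condition it requires. The signature follows the common LeetCode formulation of this problem, with (A, B) and (E, F) as bottom-left corners and (C, D) and (G, H) as top-right corners; it is an assumption of ours, not the exact code from the figure.

```java
// Illustrative sketch; names and clauses are assumed, not taken from the figure.
class ComputeArea {
    //@ requires A <= C && B <= D && E <= G && F <= H;
    //@ // Union area = area of each rectangle minus the overlapping area (non-linear):
    //@ ensures \result == (C - A) * (D - B) + (G - E) * (H - F)
    //@         - Math.max(0, Math.min(C, G) - Math.max(A, E))
    //@           * Math.max(0, Math.min(D, H) - Math.max(B, F));
    static int computeArea(int A, int B, int C, int D,
                           int E, int F, int G, int H) {
        int overlapWidth  = Math.max(0, Math.min(C, G) - Math.max(A, E));
        int overlapHeight = Math.max(0, Math.min(D, H) - Math.max(B, F));
        return (C - A) * (D - B) + (G - E) * (H - F) - overlapWidth * overlapHeight;
    }
}
```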
With the last example, ExSymExe28, we illustrate a type of program that cannot be effectively handled by SpecGen but can be handled by non-LLM approaches. The program is collected from the "jpf-regression" directory of the java category of the SV-COMP benchmark. Programs in this directory feature several if-else branches, some of which are "fake" branches that have no actual effect on the output. For example, in ExSymExe28 shown below, there are two if-else branches, with the second being a "fake" branch that affects neither variable values nor the return value. From our experiments, we speculate that LLMs are frequently confused by these fake branches, consequently generating incorrect specifications. As shown below, the result of SpecGen indicates that the LLM took the second branch into account when it should not have. In comparison, traditional approaches like Daikon handled the program successfully, since they cannot be confused by such tricks and the actual variable relations involved are not complicated.
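The following hypothetical sketch shows the branch structure described above (it is not the verbatim ExSymExe28 source from SV-COMP): the first branch determines the result, while the second is a "fake" branch that an LLM may wrongly fold into the specification.

```java
// Hypothetical sketch of the branch structure, not the actual ExSymExe28 source.
class FakeBranch {
    //@ ensures x > 0 ==> \result == y;
    //@ ensures x <= 0 ==> \result == -y;
    static int foo(int x, int y) {
        int result;
        if (x > 0) {       // real branch: determines the result
            result = y;
        } else {
            result = -y;
        }
        if (y > 10) {      // "fake" branch: changes no variable and
                           // has no effect on the return value
        }
        // A confused LLM may instead generate clauses conditioned on y > 10, e.g.
        //   ensures y > 10 ==> \result == y;
        // which does not reflect the program's actual behavior.
        return result;
    }
}
```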