In this section, we provide several example tasks and corresponding model answers to give an intuitive illustration of how the evaluation is conducted.
We will start with the example program, FizzBuzz, and its corresponding selection task. The program is shown in Fig. 1, where the JML specifications for this program are highlighted.
Fig. 1. Example program FizzBuzz, with one specification masked for the selection task.
It can be observed that one piece of the specification is masked with a placeholder (>>>INFILL<<<), which marks where the selected candidate should be inserted. The LLMs are required to choose, from the following four candidates, the most appropriate one to fill in the placeholder:
A. //@ ensures (n % 3 != 0 && n % 5 == 0) <==> \result == 5;
B. //@ ensures (n % 3 == 0 && n % 5 == 0) <==> \result == 5;
C. //@ ensures \result == 5;
D. //@ ensures (n % 3 != 0 && n % 5 != 0) <==> \result == 3;
We first analyze which candidate is correct. By inspecting the program, we can see that fizzBuzz contains two distinct if-else branches. The first if statement is triggered when the input variable n is divisible by 3, and the second when n is divisible by 5. Together they restrict the return value of fizzBuzz to four distinct values, 0, 3, 5, and 8, depending on whether n is divisible by 3 and by 5.
If n is not divisible by either 3 or 5, the return value will be 0.
If n is divisible by 3, but not by 5, the return value will be 3.
If n is divisible by 5, but not by 3, the return value will be 5.
If n is divisible by both 3 and 5, the return value will be 8.
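Since Fig. 1 is not reproduced here, the four cases above can be sketched as the following Java implementation with its unmasked JML contract; the exact method body and clause ordering in the figure may differ. The clause for the "divisible by 5 but not by 3" case is the one masked in the selection task.

```java
// A sketch of the FizzBuzz program, reconstructed from the four cases above;
// the exact body shown in Fig. 1 may differ.
public class FizzBuzz {
    //@ ensures (n % 3 != 0 && n % 5 != 0) <==> \result == 0;
    //@ ensures (n % 3 == 0 && n % 5 != 0) <==> \result == 3;
    //@ ensures (n % 3 != 0 && n % 5 == 0) <==> \result == 5;
    //@ ensures (n % 3 == 0 && n % 5 == 0) <==> \result == 8;
    public static int fizzBuzz(int n) {
        int result = 0;
        if (n % 3 == 0) {
            result += 3; // contribution for divisibility by 3
        }
        if (n % 5 == 0) {
            result += 5; // contribution for divisibility by 5
        }
        return result; // 0, 3, 5, or 8
    }
}
```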
This description fully articulates the program's behavior. The JML specifications on lines 2, 3, 4, and 5 are simply formal statements of the four cases above, and the missing specification on line 4 should describe the case where n is divisible by 5 but not by 3. Naturally, the most appropriate answer is A. Candidate B is refuted because divisibility by both 3 and 5 leads to a return value of 8, not 5. Candidate D is refuted for a similar reason: when n is divisible by neither 3 nor 5, the return value is 0, not 3. Candidate C is refuted because the return value is not 5 for all possible inputs n.
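These refutations can also be checked mechanically. The following sketch uses a fizzBuzz body reconstructed from the behavior described above (not the exact code of Fig. 1) and evaluates each candidate's biconditional over a range of inputs; only candidate A survives.

```java
public class CandidateCheck {

    // Same observable behavior as the FizzBuzz program described above:
    // return value is 0, 3, 5, or 8 depending on divisibility by 3 and 5.
    static int fizzBuzz(int n) {
        int result = 0;
        if (n % 3 == 0) result += 3;
        if (n % 5 == 0) result += 5;
        return result;
    }

    // Returns true iff the given candidate's postcondition holds
    // for every n in [1, bound].
    static boolean holds(char candidate, int bound) {
        for (int n = 1; n <= bound; n++) {
            int r = fizzBuzz(n);
            boolean ok;
            switch (candidate) {
                case 'A': ok = ((n % 3 != 0 && n % 5 == 0) == (r == 5)); break;
                case 'B': ok = ((n % 3 == 0 && n % 5 == 0) == (r == 5)); break;
                case 'C': ok = (r == 5); break;
                case 'D': ok = ((n % 3 != 0 && n % 5 != 0) == (r == 3)); break;
                default:  throw new IllegalArgumentException("unknown candidate");
            }
            if (!ok) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        for (char c : new char[] {'A', 'B', 'C', 'D'}) {
            System.out.println(c + ": " + holds(c, 1000));
        }
        // Only candidate A holds on every tested input:
        // B fails at n = 15, C at n = 1, D at n = 1.
    }
}
```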
In this task, GPT-4 selected A, Llama selected B, GPT-3.5 selected C, and Deepseek-Coder along with Magicoder selected D. CodeLlama did not select any of the options but instead stated its own specifications, which does not conform to the task requirements. GPT-4 was therefore the only subject LLM that completed this task with the correct answer.
We continue to investigate the infilling task for the example program, FizzBuzz. The program is shown in Fig. 2, where the JML specifications for this program are highlighted.
Fig. 2. Example program FizzBuzz, with one specification partially masked for the infilling task.
It can be observed that the specification on line 4 is partially masked with a placeholder (>>>INFILL<<<), which marks the subexpression that the subject LLMs are required to fill in. As analyzed above, the specification on line 4 describes the case where the input variable n is divisible by 5 but not by 3, in which case the return value of fizzBuzz is 5. Consequently, the infilled subexpression should be \result == 5.
In this task, most subject LLMs worked out the correct answer and filled in the placeholder with the correct subexpression, except CodeLlama and Llama. CodeLlama answered n%15==0, indicating that it reasoned incorrectly about the program's behavior. Llama answered n/2, which is completely irrelevant to the program's semantics.
Finally, we investigate the generation task for the example program FizzBuzz, in its semantic-preserving perturbed versions. Specifically, we study the results generated by GPT-3.5 for three different versions of this program, as shown in Fig. 3, where the generated JML specifications are highlighted.
Fig. 3. Three different versions for program FizzBuzz, including the specifications generated by GPT-3.5.
The original version and its generated specifications are shown in Fig. 3(a), where GPT-3.5 successfully reasoned about all possible situations for the variable n and the corresponding return value of the function. However, on the Def-use Break perturbed version, GPT-3.5 failed to specify the overall functionality of the program: it kept articulating specific constraints on the return value instead of determining the return value from the input variable n, so no correct specifications were generated. A similar phenomenon can be found in the Name Shuffle perturbed version, where GPT-3.5 failed to differentiate cases based on input values once the input variable's identifier was shuffled from n to res. Both of these semantic-preserving perturbations disrupt variable identifiers, which proves to be particularly influential among the perturbations adopted in our experiments. Since identifiers carry semantic information, perturbations that disrupt them can have a greater impact by limiting a channel through which models learn program semantics.
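As an illustration of the Name Shuffle perturbation (the actual perturbed version in Fig. 3 may differ), the identifiers of the input parameter and the local result variable can be swapped, so that the parameter is now named res while the program's semantics are unchanged:

```java
// Illustrative sketch of a Name Shuffle perturbation: identifiers are
// shuffled so the input parameter is named res, yet the function computes
// exactly the same values as the original fizzBuzz.
public class FizzBuzzNameShuffle {
    public static int fizzBuzz(int res) {
        int n = 0; // roles of n and res are swapped relative to the original
        if (res % 3 == 0) {
            n += 3;
        }
        if (res % 5 == 0) {
            n += 5;
        }
        return n; // still 0, 3, 5, or 8
    }
}
```

Because the misleading names sever the usual link between identifiers and roles, a model that leans on identifier semantics rather than data flow is prone to the case-differentiation failures observed above.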