RQ3: Can LLMs understand the dynamic behavior of code?

Finally, we evaluated how well LLMs comprehend the run-time behavior of code. This assessment is crucial because static analysis only approximates software behavior, and many errors surface only at run time. Since mutation testing[1] targets exactly this dynamic behavior, it was our natural choice: each non-equivalent mutant behaves differently from the original code under some specific input, even though the two differ only by a minor modification. A flaky test[2], in contrast, is the same code behaving differently across executions, which complements the code change analysis. Although LLMs can assign a label and an explanation, these are not reliable. Predicting the dynamic behavior of code remains challenging for foundation models. The related experimental data link is here.

[1] Mutation Testing, https://en.wikipedia.org/wiki/Mutation_testing

[2] Flaky Test, https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html
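
To make the notion of a non-equivalent mutant concrete, here is a minimal Java sketch. The class and method names (`MutantSketch`, `original`, `mutant`, `kills`) are hypothetical and chosen only for illustration; the mutation replaces `+` with `-`. The two versions agree on some inputs and disagree on others, so only a test with a distinguishing input detects ("kills") the mutant.

```java
final class MutantSketch {
    // Original code under test.
    static int original(int a, int b) {
        return a + b;
    }

    // Hypothetical mutant: the operator `+` is replaced with `-`.
    static int mutant(int a, int b) {
        return a - b;
    }

    // An input kills the mutant when the two versions return different results.
    static boolean kills(int a, int b) {
        return original(a, b) != mutant(a, b);
    }

    public static void main(String[] args) {
        System.out.println(kills(3, 0)); // false: both return 3, the mutant survives
        System.out.println(kills(3, 2)); // true: 5 vs 1, the mutant is killed
    }
}
```

An equivalent mutant is the limiting case in which no input distinguishes the two versions.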

Equivalent Mutant with few-shot

Here is an interesting example from using CodeLlama on this task. Even when the prompt itself states that the two programs are semantically equal, the model still outputs 'no'. Our input is as follows:

"""Please analyze the two following provided code files in C or Java. Identify if they are semantically equal. 'Semantically equal' means two codes have the same meaning, that they have the same output given the same input.

Here are three semantically equal examples:

The first example pair is

``` Code 1

double f(double M, double x) {

x = (M + x) / 2;

return x;

}

```

``` Mutant Code 1

double f(double M, double x) {

x = (M + x++ ) / 2;

return x;

}

```

Yes. The two codes are semantically euqal because `M + x++` first does `M + x` and then `x++`. Therefore, `(M + x) / 2` is the same with `(M + x++) / 2`.

The second example pair is

``` Code 2

double f(int x, int divisor){

return x*divisor;

}

```

```Mutant Code 2

double f(int x, int divisor){

return x*divisor++;

}

```

Yes. The two codes are semantically euqal because `x*divisor` first does `x*divisor` and then `divisor++`. The two functions return the same values.

The third example pair is

``` Code 3

int f(int a, int y, int x){

int p = a;

p = x + y;

return p;

}

```

``` Mutant Code 3

int f(int a, int y, int x){

int p = 1;

p = x + y;

return p;

}

```

Yes. The two codes are semantically euqal because the local value `p` is re-assigned by `x+y`. There, the change `int p = 1;` will not affect the function.

Please identify if the two following codes are semantically equal. Please only answer `yes` or `no`. `yes` means they are semantically equal. `no` means they are not.

Input :

```Code

int main(int a, int b, int c) {

int mid;

if (a < b) {

if (c < b) {

if (a < c)

mid = c;

else

mid = a;

} else

mid = b;

} else {

if (c > b) {

if (a > c)

mid = c;

else

mid = a;

} else

mid = b;

}

return mid;

}

```

```Mutant Code

int main( int a, int b, int c )

{

int mid;

if (a < b)

{

if (c < b)

{

if (a < c)

mid = c;

else

mid = a;

}

else

mid = b; }

else

{

if (c > b)

{

if (a > c)

mid = c;

else

mid = a--;

}

else

mid = b; }

return mid;}

```

The difference patch is

```

@@ -22 +22 @@

- mid = a;

+ mid = a--;

Actually, the value we return is `mid`, and `mid=a` and `mid=a--` will have the same return value. Therefore, this example is semantically equal.

```"""

And here's the output:

"`no`, the codes differ in the line `mid=a--`"

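The equivalence stated in the prompt can be checked empirically. The sketch below assumes Java semantics, where parameters are passed by value and `a--` evaluates to the old value of `a`. The class and method names are hypothetical (the function is renamed from `main` for clarity), and an exhaustive loop over a small input range is evidence of equivalence, not a proof.

```java
final class MedianEquivalenceCheck {
    // Original function from the prompt.
    static int original(int a, int b, int c) {
        int mid;
        if (a < b) {
            if (c < b) {
                if (a < c) mid = c; else mid = a;
            } else {
                mid = b;
            }
        } else {
            if (c > b) {
                if (a > c) mid = c; else mid = a;
            } else {
                mid = b;
            }
        }
        return mid;
    }

    // Mutant from the prompt: the second `mid = a` becomes `mid = a--`.
    static int mutant(int a, int b, int c) {
        int mid;
        if (a < b) {
            if (c < b) {
                if (a < c) mid = c; else mid = a;
            } else {
                mid = b;
            }
        } else {
            if (c > b) {
                // Post-decrement yields the old value of `a`; `a` is never read again.
                if (a > c) mid = c; else mid = a--;
            } else {
                mid = b;
            }
        }
        return mid;
    }

    // Counts the inputs in [-range, range]^3 on which the two versions disagree.
    static int countDifferences(int range) {
        int differences = 0;
        for (int a = -range; a <= range; a++)
            for (int b = -range; b <= range; b++)
                for (int c = -range; c <= range; c++)
                    if (original(a, b, c) != mutant(a, b, c)) differences++;
        return differences;
    }

    public static void main(String[] args) {
        System.out.println(countDifferences(10)); // 0: no input distinguishes the pair
    }
}
```
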
Equivalent Mutant with zero-shot

With CodeLlama, we observe the same problem in the zero-shot setting as in the few-shot setting.

The output is " No. They are different. In the original code, `mid` will be assigned the value of `a` if `c` > `b` but ` a` < ` c`. In mutant code ` mid` will be assigned the value of ` a --` if c > b but if a < C. Therefore the output of the mutated code will differ from the orignal code."

Flaky Test Reasoning with few-shot

We notice that when StarChat performs this task, the few-shot answers are better than the zero-shot ones. In both settings, however, the most frequent answer is `async wait`.

Here is an example that is apparently an IO problem, yet the model still answers `async wait`:

"""

private void createFileWithTestBytes(String filename) throws Exception {

float h=0.60701156;

try (final FileOutStream f = mFileSystem.createFile(new AlluxioURI(filename))) {

f.write(TEST_BYTES);

}

Assert.assertTrue(mFileSystem.exists(new AlluxioURI(filename)));

}

"""

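For contrast with the IO example, the sketch below illustrates what the `async wait` label usually refers to: a test that reads the result of asynchronous work without properly waiting for it. This is an illustrative assumption about the category, not code from our dataset, and all names in it are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class AsyncWaitSketch {
    // Starts a background task that publishes 42 after a short, simulated delay.
    static void startWorker(AtomicInteger result, CountDownLatch done) {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(50); // simulated asynchronous work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            result.set(42);
            done.countDown();
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Flaky pattern: a fixed sleep only guesses how long the worker needs.
    static int readAfterFixedSleep(long millis) {
        AtomicInteger result = new AtomicInteger(-1);
        startWorker(result, new CountDownLatch(1));
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result.get(); // may still be -1 if the worker has not finished
    }

    // Stable pattern: block until the worker signals completion.
    static int readAfterLatch() {
        AtomicInteger result = new AtomicInteger(-1);
        CountDownLatch done = new CountDownLatch(1);
        startWorker(result, done);
        try {
            if (!done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("worker did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return result.get();
    }

    public static void main(String[] args) {
        System.out.println(readAfterLatch()); // 42
    }
}
```

The file-creation test above contains no such background work, which is why `async wait` looks like the wrong label for it.
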
Flaky Test Reasoning with zero-shot

Interestingly, when StarChat performs this task zero-shot, it gives the same answer, `async wait`, for many examples.

Our input code is as follows:

"""

public void bytesToHexShouldReturnDoubleLength() throws NoSuchAlgorithmException {

String y="station";

for (int q = 0; q < 20; q++) {

byte[] e = new byte[q];

SecureRandom.getInstanceStrong().nextBytes(e);

String x = Encoder.bytesToHex(e);

assertEquals(e.length * 2, x.length());

}

"""

With the few-shot prompt, the model infers that this is a randomness problem, but in the zero-shot setting it still outputs `async wait`.
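
As a minimal sketch of what a randomness-related flaky test can look like, the example below uses `java.util.Random` with hypothetical names. It is not a reconstruction of the test above, which uses `SecureRandom`. The unseeded check can fail on some runs, while fixing the seed makes the outcome reproducible.

```java
import java.util.Random;

final class RandomFlakinessSketch {
    // A property that holds for most, but not all, random draws.
    static boolean firstDrawIsNonZero(Random rng) {
        return rng.nextInt(10) != 0;
    }

    // Unseeded: the outcome can vary from run to run.
    static boolean unseededCheck() {
        return firstDrawIsNonZero(new Random());
    }

    // Seeded: the same seed gives the same draw, so the outcome never changes.
    static boolean seededCheck(long seed) {
        return firstDrawIsNonZero(new Random(seed));
    }

    public static void main(String[] args) {
        System.out.println(seededCheck(7L) == seededCheck(7L)); // true
    }
}
```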