On this page we describe our experiments for an envisioned use case called commit-aware mutation testing. More precisely, we study the effectiveness of inferred delta specs for testing program behaviors affected by committed changes.
We start by running the mutation analysis on the expressible true-positive cases previously identified in RQ1, for which some delta-added and preserved assertions were inferred (since the objective is to compare the effectiveness of these two sets).
In total, we consider 11 commits for which DeltaSpec effectively produced a non-empty delta specification, spanning 12 classes, and generate a total of 3716 mutants (with PiTest), 1045 of which are commit-relevant. Note that the generated suites used for mutation analysis can kill 911 out of the 1045 relevant mutants, leading to a commit-relevant mutation score (rMS) of 87.2%.
In particular, we use the publicly available dataset provided by Ojdanic et al. to determine which mutants are commit-relevant. For the bug-introducing and bug-fixing commits that are not part of this dataset, we approximate the mutants' labels by taking the mutants located in the modified methods, which form a probable sample of commit-relevant mutants. You can find the labels for each mutant in the folder subjects/commit-relevant-mutants.
We use the mutants' labels to assess the effectiveness of DeltaSpec in testing the change, that is, the ability of the inferred commit-relevant specifications to kill commit-relevant mutants.
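As a reference, rMS is simply the fraction of commit-relevant mutants killed by the generated suite. A minimal sketch of the computation (the function and the example sets are illustrative, not part of the replication package):

```python
def relevant_mutation_score(killed, commit_relevant):
    """rMS: fraction of commit-relevant mutants killed by the test suite."""
    if not commit_relevant:
        return 0.0
    return len(killed & commit_relevant) / len(commit_relevant)

# With the totals reported above: 911 of the 1045 commit-relevant mutants killed.
relevant = set(range(1045))
killed = set(range(911))
print(f"rMS = {100 * relevant_mutation_score(killed, relevant):.1f}%")  # rMS = 87.2%
```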
If you want to re-run the mutation analysis, you can replace the invocation of run.sh with only-mutation-score.sh in the scripts run-collections-rq1.sh, run-lang-rq1.sh, and run-math-rq1.sh.
This script will set up and perform mutation testing on the cases where a non-empty delta spec was computed. At the end, the following files will be produced for each subject in the mutation folder:
added-inferred-on-mutants.csv and removed-inferred-on-mutants.csv (delta specifications detecting mutants, within the folder mutation/subject-id)
preserved-inferred-on-mutants.csv (non delta specifications detecting mutants, within the folder mutation/subject-id)
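As a sketch of how these per-subject CSVs could be aggregated, assuming each file has one row per mutant with a boolean `killed` column (the column name and layout are our assumption, not a documented format):

```python
import csv
from pathlib import Path

def count_detected(csv_path):
    """Count mutants flagged as detected in a *-inferred-on-mutants.csv file.
    The 'killed' column name is an assumed layout, not a documented one."""
    with open(csv_path, newline="") as f:
        return sum(1 for row in csv.DictReader(f) if row.get("killed") == "true")

# Aggregate over all subjects under the mutation folder.
for subject in sorted(Path("mutation").glob("*")):
    for name in ("added-inferred-on-mutants.csv",
                 "removed-inferred-on-mutants.csv",
                 "preserved-inferred-on-mutants.csv"):
        csv_file = subject / name
        if csv_file.exists():
            print(subject.name, name, count_detected(csv_file))
```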
The following links contain the mutation analysis already computed:
In this experiment, we study the ability of commit-relevant specifications (delta specs) and preserved specifications (non delta specs) to identify artificially seeded faults relevant to the commit change. To do so, we take the specs produced by DeltaSpec in the executions above, and perform a controlled experiment as follows: we simulate a scenario where a tester selects the same number of assertions from the pools of delta-added and preserved assertions, adds them to a test suite, and runs mutation testing to determine which mutants are killed. Then, we compute the commit-relevant mutation score (rMS) for each set of selected assertions, to study whether delta-added or preserved assertions are more effective for testing the change. We consider different selection sizes (from 1 to 20) that are reasonable for manual analysis, and repeat the experiment 100 times (for each selection size) to avoid coincidental results. To perform this experiment, you can run the following commands:
$ ./run-collections-samples-rq2.sh
$ ./run-lang-samples-rq2.sh
$ ./run-math-samples-rq2.sh
$ ./run-confusion-matrices.sh
$ ./py_scripts/run-plots.sh
The first three scripts will perform the 100 simulations for the collections, lang, and math subjects, respectively, determining which mutants are killed. The run-confusion-matrices.sh script will merge the execution results and produce a summary in the statistics folder, computing the rMS of each simulation. Finally, the run-plots.sh script will produce plots showing the results. For this experiment, the plot summarizing the results is the following:
ADDED vs PRESERVED.
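The per-size sampling performed by these scripts can be sketched as follows (the data structures are illustrative toy inputs; the actual scripts operate on the CSV outputs described above):

```python
import random

def average_rms(assertion_kills, relevant, size, repetitions=100, seed=0):
    """Average rMS over `repetitions` random selections of `size` assertions.
    `assertion_kills` maps each assertion to the set of mutants it kills."""
    rng = random.Random(seed)
    pool = list(assertion_kills)
    scores = []
    for _ in range(repetitions):
        chosen = rng.sample(pool, min(size, len(pool)))
        killed = set().union(*(assertion_kills[a] for a in chosen))
        scores.append(len(killed & relevant) / len(relevant))
    return sum(scores) / repetitions

# Toy pools: compare delta-added vs preserved assertions at a given selection size.
delta_added = {"a1": {1, 2}, "a2": {2, 3}, "a3": {4}}
preserved = {"p1": {1}, "p2": set()}
relevant = {1, 2, 3, 4, 5}
print(average_rms(delta_added, relevant, size=2))
print(average_rms(preserved, relevant, size=2))
```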
We can observe that when only one assertion is selected, the delta-added specification obtains, on average, 30% rMS while the preserved one obtains 2%. That is, the rMS obtained by the delta specification is, on average, 15 times higher when only 1 assertion is selected. When 5 assertions are selected, delta-added obtains, on average, 49% rMS while preserved obtains 12%, i.e., it is 4.08 times more effective. While delta-added obtains, on average, 56% rMS when 10 assertions are selected, the preserved specification obtains 18%, i.e., a 3.11 times improvement. Finally, when 20 assertions are selected, the improvement of delta-added over preserved is 3.26x, since delta-added obtains 62% rMS and preserved just 19%. Overall, when selecting fewer than 20 assertions, delta-added obtains 57.5% rMS, a 3.5 times higher rMS than the preserved assertions (which obtain 16.5% rMS). The differences between the rMS obtained by the two sets of assertions are statistically significant.
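The improvement factors quoted above follow directly from the reported rMS averages; as a quick arithmetic check:

```python
# rMS averages (%) reported above: selection size -> (delta-added, preserved)
averages = {1: (30, 2), 5: (49, 12), 10: (56, 18), 20: (62, 19)}
for size, (added, preserved) in averages.items():
    print(f"{size:>2} assertions: {added / preserved:.2f}x")
# Overall averages across selection sizes:
print(f"overall: {57.5 / 16.5:.1f}x")  # overall: 3.5x
```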
To sum up, in this experiment we found that:
Commit-relevant specifications are, on average, 3.5 times more effective in finding commit-relevant mutants than specifications preserved by the change.
In our last experiment we perform a simulation where we draw assertions to be analyzed by developers, in order to measure the developer's cost to achieve a given rMS. Essentially, we measure how many assertions are needed (the developer's cost) to reach the same rMS (effectiveness) when assertions are taken from the pool of delta-added assertions, compared to when they are taken from the entire set of valid assertions of the post-commit version. More precisely, we run the experiment as follows: we simulate a scenario where the tester selects assertions to kill mutants until a given rMS is achieved. Basically, we study and compare the number of assertions to select from the set of delta-added assertions and from the entire set of valid post-commit assertions, in order to reach the same effectiveness (same rMS). Again, we repeat the experiment 100 times. To perform this experiment, you can execute the following commands:
$ ./run-collections-testing-effort-rq3.sh
$ ./run-lang-testing-effort-rq3.sh
$ ./run-math-testing-effort-rq3.sh
$ ./py_scripts_per_commit/run-effort-statistics.sh
The first three scripts will run the simulation for the collections, lang, and math subjects, respectively, while the script run-effort-statistics.sh will summarize the execution results and then produce a plot showing the results:
This plot shows the number of assertions selected from each pool to achieve the same rMS. On average, to reach the same rMS we need to select 32 and 55 assertions from the pools of delta-added and post-commit spec assertions, respectively. This suggests that, if we guide the testing process by all the post-commit valid assertions rather than by the commit-relevant ones, the effort (number of selected assertions) increases by almost 72%, without gaining effectiveness (equivalently, selecting assertions from the commit-relevant pool rather than from the whole set of valid post-commit assertions reduces the effort to 58.2% of the original, a 41.8% reduction, for the same rMS). These results suggest that:
Selecting commit-relevant assertions can help reduce the effort to 58.2% (a 41.8% reduction) while preserving rMS,
compared to selecting assertions from the pool of valid post-commit assertions.
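The effort figures above can be reproduced from the two reported averages alone (32 and 55 assertions); a quick check, not part of the scripts:

```python
delta_effort = 32        # avg. assertions selected from the delta-added pool
post_commit_effort = 55  # avg. assertions selected from the full post-commit pool

extra = (post_commit_effort - delta_effort) / delta_effort
ratio = delta_effort / post_commit_effort
print(f"extra effort without delta specs: {extra:.0%}")  # 72%
print(f"relative effort with delta specs: {ratio:.1%}")  # 58.2%
```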