The detailed statistics of this RQ are listed in the table below. The table shows the performance of each evaluated method on all test cases. The metric for LLM-based methods is the success probability, whereas it is a single state of success/failure for Houdini and Daikon. We marked success probabilities in different intervals with different colors: green (80%~100%), yellow (50%~80%), orange (1%~50%), and red (failed).
In summary, SpecGen significantly outperforms other baseline methods in terms of verifiable specification generation. This is due to the code comprehension ability of LLMs, and the error-correcting ability of mutation-based specification generation process.