AutoVerus Code: [GitHub Repo]
Verus-Bench Dataset: [Benchmark]
[2024/10/24] Our code and benchmark are released! Feel free to check them out via the links above!
[2024/9/19] Code and benchmark are under an internal review process and will be released soon.
AutoVerus consists of a network of LLM agents that are crafted and orchestrated to mimic human experts' three phases of proof construction: preliminary proof generation, proof refinement guided by generic tips, and proof debugging guided by verification errors.
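The three phases above can be sketched as a simple orchestration loop. This is an illustrative sketch only, with the LLM agents and the Verus verifier stubbed out; the function names (`generate`, `refine`, `debug`) and the control flow are assumptions for exposition, not AutoVerus's actual API:

```rust
// Stub standing in for the Verus verifier: returns None on success,
// or an error message used to guide the debugging phase.
struct Verifier;
impl Verifier {
    fn check(&self, proof: &str) -> Option<String> {
        // Toy criterion for the sketch: "verified" once an invariant is present.
        if proof.contains("invariant") { None } else { Some("missing loop invariant".into()) }
    }
}

// Phase 1: preliminary proof generation (stubbed LLM call).
fn generate(task: &str) -> String {
    format!("proof attempt for `{task}`")
}

// Phase 2: proof refinement guided by generic tips (stubbed LLM call).
fn refine(proof: &str) -> String {
    format!("{proof}, with invariant added")
}

// Phase 3: proof debugging guided by verification errors (stubbed LLM call).
fn debug(proof: &str, error: &str) -> String {
    format!("{proof} // repaired: {error}")
}

// Orchestrate the three phases, iterating on verifier feedback
// until the proof verifies or the round budget is exhausted.
fn prove(task: &str, verifier: &Verifier, max_rounds: usize) -> Option<String> {
    let mut proof = generate(task);                    // phase 1
    proof = refine(&proof);                            // phase 2
    for _ in 0..max_rounds {                           // phase 3 loop
        match verifier.check(&proof) {
            None => return Some(proof),                // verified
            Some(err) => proof = debug(&proof, &err),  // retry with error feedback
        }
    }
    None
}
```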
To thoroughly evaluate AutoVerus and to facilitate future research in AI-for-Verus, we have curated a benchmark suite of 150 non-trivial proof tasks, each a small Rust program associated with a Verus specification. This is the first benchmark suite designed for Verus proof generation, and we refer to it as Verus-Bench. These 150 proof tasks are mainly translated into Rust/Verus from other benchmark suites (Diffy, MBPP, and CloverBench).
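To give a flavor of such a proof task, here is an illustrative example in the style described above (not an actual Verus-Bench entry): the specification (`ensures` clause) is given, and the proof-generation goal is to supply annotations such as the loop invariants so that the program verifies. It requires the Verus toolchain to check, and depending on solver settings the nonlinear invariant may need extra proof hints:

```
use vstd::prelude::*;

verus! {

// Compute 0 + 1 + ... + n; the ensures clause is the specification,
// and the loop invariants are the kind of proof annotations that
// a proof-generation tool must synthesize.
fn sum_to(n: u32) -> (s: u64)
    ensures s == n as u64 * (n as u64 + 1) / 2,
{
    let mut s: u64 = 0;
    let mut i: u32 = 0;
    while i < n
        invariant
            i <= n,
            s == i as u64 * (i as u64 + 1) / 2,
    {
        i = i + 1;
        s = s + i as u64;
    }
    s
}

}
```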
AutoVerus can automatically generate correct proofs for more than 90% of the benchmark tasks, with more than half of them solved in under 30 seconds or fewer than 3 LLM calls.
Chenyuan Yang, cy54@illinois.edu
Xuheng Li, xuheng@cs.columbia.edu
Md Rakib Hossain Misu, mdrh@uci.edu
Shan Lu, shanlu@microsoft.com