To explore the potential of LLMs in reducing the manual effort required for result verification, we conducted experiments in two scenarios: confirming homologous function search results and verifying third-party library vulnerability search results.
Baselines: We selected three advanced reasoning models for evaluation: OpenAI's o3-mini, DeepSeek's DeepSeek-R1, and Anthropic's claude-3.7-sonnet-thinking.
Prompt Strategies: Each model was tested under three prompting strategies, zero-shot, few-shot, and chain-of-thought (CoT), which serve as our baselines.
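The exact prompt wording is not reproduced here; the sketch below only illustrates, under assumed phrasing, how the three strategies could be assembled for a single verification query. The task text, the few-shot example, and the function names are placeholders, not the prompts used in our experiments.

```python
# Hypothetical sketch of the three prompting strategies for a verification query.
# All prompt text and the few-shot example below are illustrative placeholders.

TASK = (
    "You are given two functions. Decide whether they are homologous "
    "(compiled from the same source function). Answer 'yes' or 'no'."
)

FEW_SHOT_EXAMPLES = [
    # (query snippet, candidate snippet, label) -- placeholder example only
    ("int add(int a, int b) { return a + b; }",
     "mov eax, edi\nadd eax, esi\nret",
     "yes"),
]

def build_prompt(strategy: str, query_code: str, candidate_code: str) -> str:
    """Assemble a prompt under one of the baseline strategies: zero-shot, few-shot, or cot."""
    parts = [TASK]
    if strategy == "few-shot":
        for q, c, label in FEW_SHOT_EXAMPLES:
            parts.append(f"Example:\nQuery:\n{q}\nCandidate:\n{c}\nAnswer: {label}")
    if strategy == "cot":
        parts.append("Think step by step before giving the final answer.")
    parts.append(f"Query function:\n{query_code}\n\nCandidate function:\n{candidate_code}")
    return "\n\n".join(parts)
```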
Homologous Function Verification
Setup: We sampled 54 queries from RQ3 and collected the top-10 retrievals produced by DeJina for each query, yielding a dataset of 540 candidate function pairs, each consisting of one query function and one retrieved candidate. Among these, 347 pairs are positive (homologous) and 193 are negative. For each pair, we provided the LLM with the decompiled and disassembled code of both the query and the candidate function and asked it to determine whether the two functions are homologous.
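A minimal sketch of how one such pair could be packaged into a single LLM input is shown below; the field names, section headers, and final question are assumptions for illustration, not the exact format used in the experiment.

```python
from dataclasses import dataclass

@dataclass
class FunctionPair:
    """One candidate pair: a query function and a retrieved candidate (hypothetical schema)."""
    query_decompiled: str
    query_disassembled: str
    candidate_decompiled: str
    candidate_disassembled: str
    is_homologous: bool  # ground-truth label (347 positive, 193 negative in our dataset)

def format_pair_input(pair: FunctionPair) -> str:
    """Concatenate both representations of both functions into a single LLM input."""
    return (
        "## Query function (decompiled)\n" + pair.query_decompiled + "\n\n"
        "## Query function (disassembled)\n" + pair.query_disassembled + "\n\n"
        "## Candidate function (decompiled)\n" + pair.candidate_decompiled + "\n\n"
        "## Candidate function (disassembled)\n" + pair.candidate_disassembled + "\n\n"
        "Are these two functions homologous? Answer 'yes' or 'no'."
    )
```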
Results
LLMs achieve up to a 97.8% F1 score on the homologous function confirmation task, showing strong potential for automating homologous function verification.
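The reported F1 score is the standard harmonic mean of precision and recall computed over the 540 per-pair verdicts, treating a "yes" answer as a positive prediction. A minimal sketch of that computation, under those assumptions, is given below.

```python
def f1_score(predictions: list[bool], labels: list[bool]) -> float:
    """Standard binary F1 over the LLM's yes/no verdicts against the ground-truth labels."""
    tp = sum(p and l for p, l in zip(predictions, labels))          # true positives
    fp = sum(p and not l for p, l in zip(predictions, labels))      # false positives
    fn = sum(not p and l for p, l in zip(predictions, labels))      # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```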
Vulnerable Function Verification
Setup: We similarly selected the top-10 search results for 54 vulnerability queries from RQ3, yielding 540 function pairs, of which 272 are vulnerable (positive) and 268 are non-vulnerable (negative). For each pair, we provided the LLM with the source code of the vulnerable function, the corresponding patch, and the decompiled and disassembled code of the candidate function, and asked it to determine whether the candidate contains the same vulnerability as the query.
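Compared with the homology task, the input here additionally includes the vulnerable source function and its patch, so the model can check whether the candidate matches the pre-patch behavior. The sketch below is a hypothetical input format under that assumption; the section headers and question wording are not the exact prompts used.

```python
def format_vuln_input(vuln_source: str, patch_diff: str,
                      candidate_decompiled: str, candidate_disassembled: str) -> str:
    """Combine the known-vulnerable source, its patch, and the candidate's binary-level code."""
    return (
        "## Vulnerable function (source code)\n" + vuln_source + "\n\n"
        "## Patch (diff)\n" + patch_diff + "\n\n"
        "## Candidate function (decompiled)\n" + candidate_decompiled + "\n\n"
        "## Candidate function (disassembled)\n" + candidate_disassembled + "\n\n"
        "Does the candidate function contain the same vulnerability as the query, "
        "i.e. does it correspond to the unpatched version? Answer 'yes' or 'no'."
    )
```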
Results
LLMs achieve up to an 86.7% F1 score on the vulnerability confirmation task, showing strong potential for automating vulnerability verification.