How effective are current tools in identifying affected versions of vulnerabilities?
Since ours is the first study to evaluate matching-based methods, we formulate RQ1 to assess both types of methods and explore their differences in the task of identifying vulnerability-affected versions. For a more comprehensive evaluation, we construct two evaluation dimensions: one at the vulnerability level and the other at the version level.
For the vulnerability level, a prediction is considered correct (TP) only if it exactly matches the ground truth (no missing or extra versions). We also define a no-miss case as one whose prediction contains all ground-truth versions, regardless of any extra ones. This distinction reflects practical trade-offs: missing affected versions may introduce security risks, while over-reporting primarily increases maintenance overhead.
For the version level, each version is evaluated independently to capture partial correctness and better reflect a tool's generalization ability.
The evaluation metrics are as follows:
Accuracy = TP/Total
No-Missing-Ratio = No-Miss/Total
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
F1 = 2*(Precision*Recall)/(Precision+Recall)
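The two evaluation levels can be sketched as follows. This is an illustrative implementation, not the study's actual evaluation code; the data layout (mappings from a CVE id to its set of affected versions) is an assumption for the example.

```python
def vulnerability_level(predictions, ground_truth):
    """Exact-match accuracy and no-missing ratio over all vulnerabilities."""
    total = len(ground_truth)
    exact = sum(1 for cve, gt in ground_truth.items()
                if predictions.get(cve, set()) == gt)       # TP: exact match
    no_miss = sum(1 for cve, gt in ground_truth.items()
                  if gt <= predictions.get(cve, set()))     # all GT versions present
    return {"Accuracy": exact / total, "No-Missing-Ratio": no_miss / total}

def version_level(predictions, ground_truth):
    """Precision/Recall/F1 where each version counts independently."""
    tp = fp = fn = 0
    for cve, gt in ground_truth.items():
        pred = predictions.get(cve, set())
        tp += len(pred & gt)   # correctly reported versions
        fp += len(pred - gt)   # over-reported versions
        fn += len(gt - pred)   # missed versions
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"Precision": precision, "Recall": recall, "F1": f1}
```

Note how a tool can score well at the version level while failing the stricter vulnerability-level exact-match criterion: a single extra version forfeits the TP but costs only one FP.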
We found:
Existing tools remain limited in accurately inferring affected versions. The best tool achieves under 50.0% accuracy at the vulnerability level, with over 30.0% of vulnerabilities missing at least one affected version.
Existing tools cannot precisely identify the full set of affected versions of vulnerabilities (with no version omissions or false positives). However, they achieve higher F1 scores when approximating the affected versions.
While tracing-based tools outperform matching-based ones on average, several matching-based tools achieve comparable F1 scores at the version level and often outperform in precision.
What are the primary causes of FPs and FNs produced by existing tools?
To understand the root causes of FPs and FNs in vulnerability-affected version identification, we adopt a mixed-method approach combining qualitative analysis and quantitative validation. We first systematically examine the key technical strategies used in tracing-based and matching-based methods, analyzing potential limitations at each stage. To validate these observations, we randomly sample 100 vulnerabilities from our dataset and evaluate identification results under representative strategies.
How does identification performance vary across different patch types?
While RQ1 and RQ2 assess tool effectiveness from overall and stage-specific perspectives, they leave open the question of how structural properties of vulnerability patches influence performance. Thus, we analyze tool robustness across three patch-level dimensions, chosen for their direct alignment with the internal assumptions commonly made by existing tracing- and matching-based methods. The insights also support the design and evaluation of ensemble configurations in RQ4.
Type of Code Changes. As observed in RQ2, most tools rely heavily on deleted lines for tracing. We therefore classify patches into Add-only, Del-only, and Mixed types to evaluate the sensitivity of this structural dependency.
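The classification above can be sketched directly from a unified diff. This is a simplified illustration that assumes plain `git diff` text and ignores edge cases such as file renames and binary patches.

```python
def classify_patch(diff_text):
    """Classify a unified-diff patch as Add-only, Del-only, or Mixed."""
    has_add = has_del = False
    for line in diff_text.splitlines():
        # Skip file headers, which also start with '+++' / '---'.
        if line.startswith("+++") or line.startswith("---"):
            continue
        if line.startswith("+"):
            has_add = True
        elif line.startswith("-"):
            has_del = True
    if has_add and has_del:
        return "Mixed"
    if has_add:
        return "Add-only"
    if has_del:
        return "Del-only"
    return "Empty"
```

Add-only patches are the stress case for deletion-anchored tracing: with no deleted lines to follow backward through history, such tools lose their primary signal.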
Scope of Modifications. Tools typically treat all modifications uniformly, regardless of how code changes are localized. However, real-world patches may be confined to a single function, span multiple functions within a file, or modify code across files. This dimension helps assess each tool's resilience to dispersed or concentrated changes.
Cross-Branch Context. In multi-branch development workflows, the same vulnerability may be patched differently across branches. While V-SZZ incorporates cross-branch information when identifying affected versions, the general impact of single-branch versus multi-branch patches on tool performance remains underexplored.
Based on the above design, we evaluate tools along these three patch dimensions to understand their structural robustness and uncover potential blind spots.
Can combining existing tools improve the overall effectiveness?
Based on earlier findings, this RQ investigates whether combining tool components or outputs can yield performance improvements over standalone tools. We explore this from two angles: modular recomposition of tracing-based tools and ensemble strategies spanning tracing- and matching-based tools.
Our evaluation is organized into two phases:
Phase-1: Modular Recomposition of Tracing-based Tools. We focus on tracing-based tools because their modular workflows are amenable to recombination, whereas the core stages of matching-based tools are often tightly coupled and less separable. Based on the stage-wise decomposition in RQ2, we identify four stages: (S1) statement selection, (S2) impact range inference, (S3) commit tracing, and (S4) cross-branch patch reuse. Prior results show that LLM-based methods in S1 and patch propagation in S4 consistently outperform alternatives. Fixing these two stages, we systematically explore combinations of 2 alternatives in S2 and 4 in S3, yielding 7 hybrid configurations (2 × 4 − 1 = 7). The best variant (LLM4SZZ+) serves as a representative for Phase-2.
Phase-2: Cross-Tool Combination. This phase investigates whether outputs from diverse tools—spanning both tracing- and matching-based paradigms—can be effectively integrated. We select ten high-performing tools (F1-score ≥ 70%): VCCFinder, V-SZZ, Lifetime, SEM-SZZ, LLM4SZZ, ReDebug, Movery, FIRE, V1SCAN, and the newly derived LLM4SZZ+. We evaluate three ensemble strategies as follows:
Inclusion Strategy. We evaluate the cumulative effect of integrating tools by taking the union of outputs over tool subsets of sizes 2 to 10.
Voting Strategy. To assess consensus-based robustness, we evaluate all combinations of 3, 5, 7, and 9 tools, marking a version as affected if the majority agrees.
Best-in-Dimension Strategy. We also select the best performing tool in each of four key dimensions identified in RQ3 (e.g., patch modeling, commit tracing) and aggregate their outputs, leveraging their complementary strengths.
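The first two strategies can be sketched as set operations over per-tool outputs. The helper names and the tool-output layout below are illustrative assumptions, not the study's implementation.

```python
from itertools import combinations

def union_ensemble(tool_outputs):
    """Inclusion strategy: a version is affected if any tool reports it."""
    merged = set()
    for versions in tool_outputs.values():
        merged |= versions
    return merged

def voting_ensemble(tool_outputs):
    """Voting strategy: a version is affected if a strict majority agrees."""
    threshold = len(tool_outputs) // 2 + 1
    counts = {}
    for versions in tool_outputs.values():
        for v in versions:
            counts[v] = counts.get(v, 0) + 1
    return {v for v, c in counts.items() if c >= threshold}

def tool_subsets(tools, size):
    """Enumerate every subset of a given size (e.g., 3, 5, 7, or 9 voters)."""
    return list(combinations(sorted(tools), size))
```

Union ensembles trade precision for recall (any tool's false positive survives), while majority voting does the opposite, which is why both extremes are worth measuring.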
Detailed configurations of the Inclusion, Voting, and Best-in-Dimension strategies are described in the paper.