Source Code: https://anonymous.4open.science/r/IntelliRadar-C320
Collected Malicious Package Intelligence: https://intelliradar.netlify.app/
MetaReview Comment B: The results on all 50K pages (and not 500), names of malicious packages found by the LLM that are in the regex and are in the file filtered by the regex (so when you already know them)
Thank you for this important clarification request. For full transparency, we have made the complete extraction results from all 50K pages available. The detailed results of each extraction step applied to the original webpage text - regex extraction, dictionary filtering, and LLM Chain-of-Thought extraction - can be viewed at: https://anonymous.4open.science/r/IntelliRadar-C320/Dataset/Result/analysis_data/
This dataset provides comprehensive visibility into what malicious package names were identified by each step, allowing for complete verification of our approach and results.
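For readers who want to inspect these stages without running the full pipeline, below is a minimal illustrative sketch of how the regex-extraction and dictionary-filtering steps can be combined before the LLM stage. The regex pattern, the toy dictionary, and the function names are placeholders rather than our exact implementation.

import re

# Hypothetical pattern for candidate package-name tokens (letters, digits, '.', '_', '-').
CANDIDATE_RE = re.compile(r"\b[a-z0-9][a-z0-9._-]{1,63}\b")

def regex_extract(page_text: str) -> set:
    """Collect candidate package-name tokens from raw webpage text."""
    return set(CANDIDATE_RE.findall(page_text.lower()))

def dictionary_filter(candidates: set, known_packages: set) -> set:
    """Keep only candidates that appear in a registry dictionary (e.g., a dump of
    NPM/PyPI names); the surviving names are what the LLM stage then reasons about."""
    return {name for name in candidates if name in known_packages}

# Toy example: the page mentions both a legitimate package and a typosquat.
page = "The typosquat package reqeusts mimics requests and exfiltrates tokens."
registry = {"requests", "reqeusts", "numpy"}  # placeholder dictionary
print(dictionary_filter(regex_extract(page), registry))  # {'requests', 'reqeusts'}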
MetaReview Comment C: Include the clarification on the novelty and contributions of IntelliRadar with respect to CTIKG from the response letter in the introduction and related work sections
We thank the reviewer for pointing this out. We have clarified the novelty and contributions of IntelliRadar in the introduction, RQ1, and related work sections. IntelliRadar is a systematic end-to-end framework that identifies intelligence sources, sets up automated monitoring, and retrieves malicious package intelligence. Our work goes beyond traditional NER tasks in several key aspects: (1) We need to distinguish between malicious and benign package names within the same webpage, as pages often contain both legitimate packages and their typosquatting variants; (2) We must align package names with their corresponding versions and attack methods; (3) We require domain-specific understanding to identify which entities represent actual threats versus legitimate software mentions. Traditional approaches like CTIKG and SecBERT focus on syntactic entity and relation extraction without semantic understanding, and cannot fully address these domain-specific challenges that require contextual comprehension.
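To illustrate challenge (1) concretely, consider a page that mentions both "requests" and the typosquat "requesst". A purely syntactic extractor returns both names; a simple similarity heuristic (sketched below with an illustrative threshold and toy allowlist, and not part of IntelliRadar's actual pipeline) can at best flag the variant as suspicious, while deciding which name is actually malicious still requires contextual reasoning.

from difflib import SequenceMatcher

POPULAR = {"requests", "numpy", "lodash"}  # toy allowlist of well-known packages

def is_typosquat_candidate(name: str, threshold: float = 0.85) -> bool:
    """Flag names that are very similar to, but not identical with, a popular package."""
    return any(
        name != known and SequenceMatcher(None, name, known).ratio() >= threshold
        for known in POPULAR
    )

# Both names appear on the same page; only the variant gets flagged for further analysis.
for pkg in ("requests", "requesst"):
    print(pkg, "->", "suspicious" if is_typosquat_candidate(pkg) else "not flagged")

Such surface heuristics cannot tell whether a flagged name is discussed as a threat or as a legitimate mention, which is why IntelliRadar relies on LLM-based contextual analysis instead.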
To demonstrate this distinction, we have added experimental comparisons with traditional NER methods including CTIKG and SecBERT in the revised paper. For fair comparison, we used GPT-4o with the same configuration as IntelliRadar for CTIKG experiments. As shown in Table 2, the results reveal significant performance gaps: CTIKG achieved only 20.22% F1-score due to its lack of explicit entity definitions and specific understanding of malicious package characteristics. SecBERT performed poorly with 0.20% F1-score, as it cannot accurately identify complete package name entities and incorrectly identified numerous irrelevant text spans as package names, resulting in extremely high false positives (13,340). In contrast, IntelliRadar achieved 94.87% F1-score by incorporating domain-specific prompting strategies and explicit malicious package definitions, demonstrating the necessity of our specialized approach for this task.
The complete experimental code and detailed results for these comparative experiments are available at:
https://anonymous.4open.science/r/IntelliRadar-C320/Codes/Experiment/NER/CTIKG/extracted_packages/
MetaReview Comment D: Clarify that the NER accuracy is only about package names and versions, and do additional manual analysis to evaluate the NER accuracy of other types of entities like Method of Attacks and Attack Vectors
Thank you for raising this point. In our original evaluation, we measured accuracy only for package names and versions (the most critical information for downstream alerts), because entities such as Method of Attack, Attack Vector, and Indicators of Compromise are largely free-form text that is difficult to evaluate automatically and requires manual analysis of all LLM-generated results. To address this concern, we manually evaluated the accuracy of the other entities extracted by IntelliRadar. The results show that, although somewhat less accurate than package names and versions, IntelliRadar still achieves reasonable accuracy: for instance, the F1-scores of Date of Discovery, Repository URLs, and Discoverer are 88.89%, 63.64%, and 87.32%, respectively.
The performance results reveal distinct patterns aligned with entity complexity. Explicit entities such as Date of Discovery (88.89% F1), Discoverer (87.32% F1), and IOC (87.10% F1) achieve high performance due to their standardized formats and surface-level presentation in security reports. In contrast, entities requiring contextual inference, namely Method of Attack (76.92% F1) and Attack Vector (73.47% F1), show lower performance because they demand semantic understanding of cybersecurity concepts rather than pattern recognition. The weaker performance on Repository URLs (63.64% F1) is particularly notable: its low precision (53.85%) is primarily caused by systematic misclassification of IOCs as repository URLs due to their structural similarity in domains, paths, and hash-like patterns.
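For reference, all per-entity scores above use the standard precision/recall/F1 definitions. The short helper below is purely illustrative; it also shows how the reported Repository URL precision (53.85%) and F1 (63.64%) together imply a recall of roughly 77.8%.

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: F1 = 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)

def recall_from_f1(precision: float, f1_score: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1."""
    return precision * f1_score / (2 * precision - f1_score)

p, f = 0.5385, 0.6364                 # reported Repository URL precision and F1
r = recall_from_f1(p, f)
print(f"implied recall ~ {r:.3f}")    # ~0.778
print(f"check: F1 = {f1(p, r):.4f}")  # ~0.6364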
Result: https://anonymous.4open.science/r/IntelliRadar-C320/Codes/Experiment/localllm
MetaReview Comment E: In RQ1, explicitly state that Snyk, OSV, and GitHub Advisory are already included as the information sources of IntelliRadar, and the results in RQ1 mainly show how other information sources have contributed to the collection of vulnerabilities reported online.
Thank you for your constructive suggestion regarding RQ1. We have addressed this concern in our revision by clearly stating in Section 4.2 (Experimental Setup) that Snyk, OSV, and GitHub Advisory are included among our 24 intelligence sources to ensure comprehensive coverage. Section 4.4 (RQ2: Completeness) explains our rationale for including these existing databases: they contain malicious packages identified through internal detection tools and manual research that are not disclosed elsewhere. Additionally, in Section 4.4, we have clarified both the total scale of IntelliRadar's data and the intelligence collected specifically from unstructured web content (i.e., sources beyond the OSV, Snyk, and GitHub Advisory databases). The latter includes 17,759 NPM package names and 11,248 PyPI package names identified through LLM-based analysis, of which 2,262 NPM package names and 2,566 PyPI package names are not present in any existing structured database.
To ensure full transparency and reproducibility, we have made the complete IntelliRadar dataset publicly available through:
Interactive Database: https://intelliradar.netlify.app/
All Unstructured Data: https://anonymous.4open.science/r/IntelliRadar-C320/Dataset/Json
MetaReview Comment F: Include the new experiment on vulnerabilities reported after the release date of GPT-4.
Thank you for this suggestion. GPT-4o demonstrates superior performance due to its stronger contextual reasoning capabilities and better handling of complex texts containing mixed benign and malicious package information, consistent with existing analyses (https://www.vellum.ai/blog/llama-3-3-70b-vs-gpt-4o).
We have extended the evaluation by testing the LLMs on 50 randomly selected intelligence webpages published in 2025, i.e., well after the release date of GPT-4. The results indicate that GPT-4o still outperforms the other LLMs, although LLaMa3.1-70B and LLaMa3.3-70B perform considerably better on these recent cases, especially in recall. We manually inspected these webpages and found that this is because the 50 new intelligence webpages contain less content. The detailed results are as follows:
Randomly Selected 50 Analysis Samples:
https://drive.google.com/file/d/1OfUVJ0RpuOgUNZ4Pro9QawmfUU3aSxYD/view?usp=sharing
Result:
https://anonymous.4open.science/r/IntelliRadar-C320/Codes/Experiment/recent_webpage
MetaReview Comment F: Please provide the process for the iterative refinement of the prompt. In your artifacts, provide the evolution of your prompts and how the results improved little by little. This will provide insight to the researchers in the relevant field on how to improve prompts incrementally.
Thank you for your constructive suggestion. We systematically refined our prompt design through six iterations to tackle the complex malicious package entity extraction task. The first version used basic single-step extraction requiring only JSON output without structured guidance. The second version added structured requirements with explicit field specifications. The third version incorporated examples of both incorrect and correct outputs to guide model behavior through contrastive cases. However, these early versions (V1-V3) all employed a one-step approach requiring only a single LLM call, which led to incomplete extraction and poor entity relationship alignment.
Starting from V4, we restructured the task into three sequential steps: (1) entity extraction, (2) relationship analysis, and (3) information verification. This modular design embodies the core principle of complex task decomposition, allowing each step to focus on specific aspects and significantly improving accuracy and completeness. V4 incorporated Chain-of-Thought reasoning by embedding reasoning chains in prompts to enhance the model's logical analysis capabilities. V5 introduced few-shot examples providing concrete input-output pattern demonstrations, leveraging examples as effective conditions for in-context learning. Finally, V6 combined CoT reasoning with few-shot learning, narrowing the capability gap in complex tasks through the integration of complementary prompting techniques.
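To make this decomposition concrete, the sketch below outlines the three sequential LLM calls. The prompt wording, the call_llm helper, and the assumption of well-formed JSON output are simplified placeholders; the exact prompts for V1-V6 are linked below.

import json

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to the underlying model (e.g., GPT-4o);
    replace with your own client code."""
    raise NotImplementedError

def extract_intelligence(page_text: str) -> dict:
    # Step 1: entity extraction -- package names, versions, IOCs, and other entities.
    entities = call_llm(
        "Step 1 (entity extraction): list every malicious package name, version, "
        "and other security entity mentioned in the text below, as JSON.\n\n" + page_text
    )
    # Step 2: relationship analysis -- align each package with its version and attack method.
    relations = call_llm(
        "Step 2 (relationship analysis): given these entities, pair each malicious "
        "package with its version and method of attack, as JSON.\n\n" + entities
    )
    # Step 3: information verification -- drop benign packages and unsupported claims.
    verified = call_llm(
        "Step 3 (verification): re-read the original text and keep only entries that "
        "are truly malicious and supported by the text, as JSON.\n\n"
        + relations + "\n\nOriginal text:\n" + page_text
    )
    # The sketch assumes the final answer is valid JSON.
    return json.loads(verified)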
The experimental results (Table) validate our iterative approach. The early versions (V1-V3) performed poorly due to simple single-query approaches lacking clear definitions and explanations of malicious packages, with F1-scores below 8%. Starting from V4, we introduced explicit entity definitions for each of the nine entity types and restructured the task into a multi-step approach with task decomposition, significantly improving performance. V4 (CoT only) and V5 (few-shot only) each demonstrated specific strengths: CoT achieved high recall (91.73%) and an F1-score of 93.10% through systematic reasoning, while few-shot maintained high precision (95.74%) but lower recall (59.89%), resulting in an F1-score of 73.68% through pattern demonstration. Our final method, V6, achieved superior performance across all metrics (97.91% precision, 92.01% recall, and 94.87% F1-score) by combining both approaches. Most importantly, V6 dramatically reduced false positives to 14 (compared to 38 for CoT-only and 19 for few-shot-only) and maintained low false negatives (57), demonstrating that integrating CoT reasoning with few-shot examples creates synergistic effects that address the limitations of each individual technique. All prompt versions are available on our project website for reproducibility.
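As a quick arithmetic sanity check on the V6 numbers above (the true-positive count used here is implied by the reported precision and recall rather than stated explicitly):

fp, fn = 14, 57     # reported V6 false positives and false negatives
tp = 656            # implied by precision 97.91% and recall 92.01%

precision = tp / (tp + fp)                            # ~0.9791
recall = tp / (tp + fn)                               # ~0.9201
f1 = 2 * precision * recall / (precision + recall)    # ~0.9487
print(f"P={precision:.4f}  R={recall:.4f}  F1={f1:.4f}")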
Prompt: https://drive.google.com/file/d/1CIHRoFuwC-Tydns6JTkMxGpFscTXm73E/view?usp=sharing
Result: https://anonymous.4open.science/r/IntelliRadar-C320/Codes/Experiment/ablcation_prompts/results
Reviewer A Comment 8: The work is logically reasonable, and the description is sufficient to replicate the research. However, I checked the website: https://sites.google.com/view/intelliradar/home?authuser=0 and I could not find the source code nor the complete sources of the 36 thousands of malicious packages.
Thank you for your constructive comment. We have made our code and data publicly available at the following locations:
Source Code: https://anonymous.4open.science/r/IntelliRadar-C320
Interactive Database: https://intelliradar.netlify.app/
These resources provide the complete source code and all 36,000 malicious packages for full replication of our research.
MetaReview Comment N: Please provide a case study where their proposed approach finds packages that are malicious but not yet reported in Snyk or OSVDB.
We provide a case study of a malicious PyPI package named pycryptdome, which has not yet been reported by OSVDB, Snyk, or the GitHub Advisory Database (see Table 4 and details on our website). This delay likely arises because OSV primarily monitors structured vulnerability data, whereas Snyk relies on manual expert inspection of scattered webpages, resulting in gaps or reporting delays.
Source Code: https://drive.google.com/file/d/1vBM1R8b8fFTfE4Ba9sOAW4bgoOFWVtYg/view?usp=sharing
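For readers who wish to reproduce this check, the snippet below queries the public OSV API (https://api.osv.dev) for the case-study package. It is an illustrative convenience script rather than part of IntelliRadar, and it assumes the requests library is installed.

import requests

def osv_has_advisory(package: str, ecosystem: str = "PyPI") -> bool:
    """Return True if OSV currently lists any advisory for the given package."""
    resp = requests.post(
        "https://api.osv.dev/v1/query",
        json={"package": {"name": package, "ecosystem": ecosystem}},
        timeout=30,
    )
    resp.raise_for_status()
    return bool(resp.json().get("vulns"))

# Case-study package from our response; an empty result means no OSV advisory yet.
print("pycryptdome in OSV:", osv_has_advisory("pycryptdome"))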
Reviewer C Comment 8: Which is the exact number of webpages not reporting malicious package names in the sample of N=28,593 pages?
Among the 28,593 filtered webpages, 11,173 (39.1%) contained no malicious package information, which makes chance-level accuracy an uninformative baseline for our extraction task. To validate that IntelliRadar correctly handles such pages, we randomly sampled 200 webpages for which our LLM extracted no malicious packages and manually confirmed that none of them actually contained malicious package information. This accuracy is achieved through our three-stage Chain-of-Thought approach, in which each stage (entity extraction, relationship analysis, and verification) includes validation mechanisms to determine whether a webpage contains malicious package information. Combined with comprehensive source filtering and cross-source validation, our results demonstrate superior coverage and timeliness compared to existing databases. While our current system already significantly enhances downstream security, future work will further expand intelligence sources and improve extraction accuracy.
Randomly Selected 200 Analysis Samples:
https://drive.google.com/file/d/18jL1drT0CoXA6i7M-Yc3QIwdAyEnGPBf/view?usp=sharing
Reviewer C Comment 10: There is no dataset whatsoever of malicious packages. There is a dataset of names of malicious packages (which is anyhow not available in the repository).
Our dataset contains comprehensive malicious package intelligence, including not only package names but also their corresponding detailed entities and metadata. We have curated a collection of identified malicious packages with complete attribution information. Here we provide 3,000 randomly selected malicious packages, and the complete dataset will be made publicly available in the future.
Randomly Selected 3,000 source code files:
password : malware1011
https://drive.google.com/file/d/1ETTUn0ptW2K565tejn37javohp5nC7Y7/view?usp=sharing