Source Code: https://anonymous.4open.science/r/IntelliRadar-C320
Collected Malicious Package Intelligence: https://intelliradar.netlify.app/
Question1: What is the exact number of webpages not reporting malicious package names in the sample of N=28,593 pages?
Among the 28,593 filtered webpages, 11,173 (39.1%) contained no malicious package information, so a chance-level accuracy baseline is not meaningful for our extraction task. To validate the system's precision on such pages, we randomly sampled 200 webpages for which our LLM extracted no malicious packages and manually verified that none of them actually contained malicious package information. This high accuracy is achieved through our three-stage Chain-of-Thought approach, in which each stage (entity extraction, relationship analysis, and verification) includes validation mechanisms to accurately determine whether a webpage contains malicious package information. Combined with comprehensive source filtering and cross-source validation, our results demonstrate superior coverage and timeliness compared to existing databases. While the current system already significantly enhances downstream security, future work will further expand intelligence sources and improve extraction accuracy.
Randomly Selected 200 Analysis Samples:
https://drive.google.com/file/d/1PoA-lz1NDv9I6Y7hXz4EzGkQ3rtZT-DM/view?usp=drive_link
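The sampling step itself is straightforward; a minimal sketch is shown below, where the file names and record fields are placeholders rather than our exact scripts.

```python
# Minimal sketch (assumed file layout): draw 200 webpages for which the LLM
# extracted no malicious packages, so they can be handed to human annotators.
import json
import random

random.seed(42)  # fixed seed so the drawn sample can be reproduced

with open("extraction_results.json") as f:   # hypothetical per-webpage extraction output
    results = json.load(f)

no_package_pages = [r for r in results if not r.get("packages")]
sample = random.sample(no_package_pages, 200)

with open("manual_review_sample.json", "w") as f:
    json.dump(sample, f, indent=2)
```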
Question2: Why does GPT-4o perform much better than other LLMs? Does it still perform well on more recently published security reports?
GPT-4o demonstrates superior performance due to its stronger contextual reasoning capabilities and better handling of complex texts containing mixed benign and malicious package information, consistent with existing analyses (https://www.vellum.ai/blog/llama-3-3-70b-vs-gpt-4o).
We have extended the evaluation to 50 randomly selected intelligence webpages published in 2025. The results indicate that GPT-4o still outperforms the other LLMs, while LLaMA3.1-70B and LLaMA3.3-70B perform much better than before on these latest cases, especially in recall. We manually inspected these webpages and found that this is because the 50 new intelligence pages contain less content per page. The detailed results are as follows:
Randomly Selected 50 Analysis Samples:
https://drive.google.com/file/d/1ZfYC2T0mmy0rVQPDEY4EAcaKwO2ERpOE/view?usp=drive_link
Question3: Accuracy of other entities
Thanks for raising this. We initially evaluated only the accuracy of package names and versions (the most critical information for downstream alerts), because many other entities, such as Methods of Attack, Attack Vectors, and Indicators of Compromise, are free-text and difficult to evaluate automatically; they require manual analysis of all results generated by the LLMs. To address this concern, we manually evaluated the accuracy of the other entities extracted by IntelliRadar. The results show that, although somewhat less accurate than package names and versions, IntelliRadar still achieves reasonable accuracy. For instance, the F1-scores for Date of Discovery, Repository URLs, and Discoverers are 88.89%, 63.64%, and 87.32%, respectively.
The performance results reveal distinct patterns aligned with entity complexity. Explicit entities such as Date of Discovery (88.89% F1), Discoverer (87.32% F1), and IOC (87.10% F1) achieve high performance due to their standardized formats and surface-level presentation in security reports. In contrast, entities requiring contextual inference, namely Method of Attack (76.92% F1) and Attack Vector (73.47% F1), show lower performance because they demand semantic understanding of cybersecurity concepts rather than pattern recognition. The weaker performance on Repository URLs (63.64% F1) is particularly notable: its low precision (53.85%) is primarily caused by systematic misclassification of IOCs as repository URLs due to their structural similarity in domains, paths, and hash-like patterns.
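For clarity, the sketch below shows how such per-entity scores can be computed from the manual annotations, treating each entity type as a set comparison between annotated and extracted values; the example record is illustrative only and mirrors the IOC-versus-repository-URL confusion described above.

```python
# Per-entity precision / recall / F1 over sets of annotated vs. extracted values.
def prf1(gold: set, pred: set) -> tuple[float, float, float]:
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy annotation for one webpage and one entity type (Repository URL):
gold = {"https://github.com/example/project"}
pred = {"https://github.com/example/project",
        "http://203.0.113.7/payload.sh"}   # an IOC wrongly extracted as a repository URL

p, r, f = prf1(gold, pred)
print(f"Repository URL: P={p:.2%} R={r:.2%} F1={f:.2%}")  # P=50.00% R=100.00% F1=66.67%
```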
Question4: Source code of malicious packages
This dataset contains comprehensive malicious package intelligence, including not only package names but also their corresponding detailed entities and metadata. We have curated a collection of identified malicious packages with complete attribution information. Here we present 3,000 randomly selected malicious packages, with the complete dataset to be made publicly available in the future.
Randomly Selected 3,000 source code files:
Password: malware1011
https://drive.google.com/file/d/1-SLPTHCsg2WmKYEOIV6yB1CrzpBBqlkt/view?usp=sharing
Question5: Example of E definitions
We provide a concrete example using the malicious package "@fixedwidthtable/fixedwidthtable" to illustrate our entity extraction methodology.
Entity Definition Framework: Our system extracts a comprehensive Entity Information Set E = {N, V, F, R, M, D, I, A, C, T} for each identified malicious package.
Case Study: "@fixedwidthtable/fixedwidthtable"
Question6: Case study of a delayed malicious package?
We provide a case study of a malicious PyPI package named pycryptdome, which has not yet been reported by OSV, Snyk, or the GitHub Advisory Database (see Table 4 and details on our website). This delay likely arises because OSV primarily monitors structured vulnerability data, whereas Snyk relies on manual expert inspection of scattered webpages, resulting in gaps or reporting delays.
Source Code:
Question7: Details of entity extraction, relation extraction, and verification?
Stage 1: Entity Extraction Details
Answer: Our entity extraction process uses a structured prompt design with the following methodology:
Input Components:
Original text content
Pre-extracted potential malicious package names
Task-specific extraction prompts
Process Framework: We provide GPT-4o with five key components (a minimal prompt sketch follows this list):
Task Description: Clear specification for analyzing malicious package intelligence
Entity Definitions: Target entities including package name, version, discovery date, repository URL, attack method, discoverer, affected systems, attack vector, and IOCs
Attention Guidance: Pre-filtered potential package names from stage 2 to focus LLM attention
Pattern Recognition: Emphasis on malicious package naming patterns (typosquatting, misspellings)
Few-shot Learning: Representative examples for extraction guidance
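A minimal sketch of how these five components can be assembled into a single extraction prompt is shown below; the wording is abbreviated and the helper names are assumptions, not our exact prompt.

```python
# Simplified Stage 1 prompt assembly: the five components are concatenated into
# one instruction string sent to GPT-4o together with the webpage text.
ENTITY_DEFINITIONS = (
    "Extract: package name, version, discovery date, repository URL, "
    "attack method, discoverer, affected systems, attack vector, and IOCs."
)
FEW_SHOT_EXAMPLE = "Example report: ...\nExample output (JSON): ..."  # abbreviated placeholder

def build_extraction_prompt(page_text: str, candidate_names: list[str]) -> str:
    return "\n\n".join([
        # 1. Task description
        "You are analyzing a webpage for malicious package intelligence.",
        # 2. Entity definitions
        ENTITY_DEFINITIONS,
        # 3. Attention guidance: pre-extracted candidate package names
        "Pay particular attention to these candidate package names: " + ", ".join(candidate_names),
        # 4. Pattern recognition hints
        "Watch for malicious naming patterns such as typosquatting and misspellings of popular packages.",
        # 5. Few-shot example
        FEW_SHOT_EXAMPLE,
        # Webpage content to analyze
        "TEXT:\n" + page_text,
    ])
```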
Stage 2: Relation Extraction Methodology
Answer: The relationship analysis stage adopts a package-centric organization approach (an aggregation sketch follows the list below):
Input: Stage 1 extracted entities + original text + relationship analysis prompts
Process:
Semantic Analysis: LLM analyzes relationships between extracted entities to determine package association
Information Aggregation: Consolidates scattered entity information by package name to form complete intelligence records
Validation: Verifies entity relationships (version-package correspondence, timestamp consistency)
Output Generation: Creates structured JSON objects containing all relevant information for each specific package
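As an illustration of the aggregation and output steps, the sketch below groups Stage 1 entities by package name and emits one JSON record per package; the field names and sample values are assumptions, and the relationship checks the LLM performs (e.g., version-package correspondence) are reduced to a comment here.

```python
# Simplified view of Stage 2 output generation: Stage 1 entities are grouped by
# package name into one structured JSON record per package.
import json
from collections import defaultdict

def aggregate_by_package(entities: list[dict]) -> str:
    records = defaultdict(lambda: defaultdict(list))
    for e in entities:  # e.g. {"package": "pycryptdome", "type": "version", "value": "1.0.2"}
        records[e["package"]][e["type"]].append(e["value"])
    # IntelliRadar performs this consolidation, plus consistency checks such as
    # version-package correspondence, with the LLM; here it is a plain loop.
    return json.dumps(
        [{"package_name": name, **fields} for name, fields in records.items()],
        indent=2,
    )

print(aggregate_by_package([
    {"package": "pycryptdome", "type": "version", "value": "1.0.2"},   # placeholder values
    {"package": "pycryptdome", "type": "ioc", "value": "203.0.113.7"},
]))
```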
Stage 3: Verification Process and Experimental Details
Answer: We implemented an iterative prompt optimization methodology:
Experimental Setup:
Controlled experiments on webpages with confirmed malicious package intelligence
Systematic comparison of LLM extraction results against manually annotated ground truth
Identification and analysis of missed entities and incorrect extractions
Optimization Process: Based on experimental findings, we implemented two major enhancements:
Structured Output Formats: Added case studies and examples for extraction guidance
Rule Enhancement: Updated IMPORTANT notes and EXTRACTION RULES sections targeting identified error patterns
Final Implementation: The optimized prompt components directly incorporate lessons learned from this iterative refinement process, ensuring improved accuracy and consistency.
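To make the Stage 3 comparison step concrete, the sketch below tallies, per entity type, the entities the LLM missed and those it extracted incorrectly relative to the annotated ground truth; the record format and values are assumptions, and in our workflow such error examples informed the updated IMPORTANT notes and EXTRACTION RULES.

```python
# Sketch of the Stage 3 error analysis: per entity type, count ground-truth
# entities the LLM missed and extra entities it extracted incorrectly.
from collections import Counter

def diff_extractions(gold: dict[str, set], pred: dict[str, set]) -> tuple[Counter, Counter]:
    missed, incorrect = Counter(), Counter()
    for entity_type in gold.keys() | pred.keys():
        g = gold.get(entity_type, set())
        p = pred.get(entity_type, set())
        missed[entity_type] += len(g - p)      # in ground truth but not extracted
        incorrect[entity_type] += len(p - g)   # extracted but not in ground truth
    return missed, incorrect

gold = {"package_name": {"pycryptdome"}, "version": {"1.0.2"}}             # toy annotations
pred = {"package_name": {"pycryptdome"}, "repository_url": {"203.0.113.7"}}
missed, incorrect = diff_extractions(gold, pred)
print(missed["version"], incorrect["repository_url"])  # 1 1: one missed version, one spurious repo URL
```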