Source Code: https://anonymous.4open.science/r/IntelliRadar-C320
Collected Malicious Package Intelligence: https://intelliradar.netlify.app/
Abstract: Malicious packages in public registries pose a significant threat to software supply chain (SSC) security in modern software engineering. Software composition analysis (SCA) tools are used to detect and report these threats and protect downstream users. However, existing SCA tools rely heavily on databases of known malicious components, and databases such as OSV and Snyk often suffer from delayed updates and incomplete information. This inadequacy is particularly evident in their limited coverage of unstructured intelligence sources such as social media platforms and developer forums, where emerging threats are often first reported. Consequently, the lifecycle of malicious packages is extended, posing serious threats to downstream users.
To address this, we developed a novel and comprehensive approach to construct IntelliRadar, a platform for the collection, processing, and extraction of malicious component intelligence. Specifically, by exhaustively searching and snowballing the public sources of malicious package intelligence, and by combining large language models (LLMs) with domain-specialized Least-to-Most prompts, IntelliRadar ensures coverage, timeliness, and accurate information extraction for malicious package intelligence. As a result, we constructed a comprehensive malicious package database containing 34,313 malicious NPM and PyPI packages. Our evaluation shows that IntelliRadar achieves high performance (97.91% precision) on malicious package intelligence extraction. Compared to existing databases, IntelliRadar identifies 7,542 more malicious packages than OSV and 12,684 more than Snyk. Moreover, 76.6% of NPM components and 70.3% of PyPI components in IntelliRadar were collected earlier than in Snyk's database. IntelliRadar is also cost-efficient, costing $0.003 per piece of malicious package intelligence and only $7 per month for continuous monitoring. Finally, using IntelliRadar, we identified 1,981 malicious packages in downstream package manager mirror registries and received confirmation for them.
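To illustrate the Least-to-Most prompting idea mentioned in the abstract, the following is a minimal sketch, not the prompts IntelliRadar actually uses. The sub-questions in STAGES, the function name least_to_most, and the ask_llm callable (a stand-in for any LLM client) are all assumptions made for this example. The sketch shows only the core pattern: ask easier sub-questions first and feed their answers into the prompt for the harder ones.

```python
from typing import Callable, List

# Hypothetical sub-questions, ordered from easiest to hardest.
# These are illustrative only and are not taken from IntelliRadar.
STAGES = [
    "Does the report describe malicious packages? Answer yes or no.",
    "Which package registry (e.g. NPM or PyPI) does each package belong to?",
    "List every malicious package name, with versions if the report states them.",
]


def least_to_most(
    report: str,
    ask_llm: Callable[[str], str],
    stages: List[str] = STAGES,
) -> List[str]:
    """Ask each sub-question in order, passing earlier answers to later prompts."""
    answers: List[str] = []
    for question in stages:
        # zip() stops at the shorter list, so only already-answered
        # questions are included as solved context.
        solved = "\n".join(
            f"Q: {q}\nA: {a}" for q, a in zip(stages, answers)
        )
        prompt = f"Report:\n{report}\n\n{solved}\n\nQ: {question}\nA:"
        answers.append(ask_llm(prompt).strip())
    return answers
```

Passing the LLM as a plain callable keeps the sketch independent of any particular provider and makes it easy to substitute a stub during testing.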
Workflow of IntelliRadar
To confirm the shortage of malicious package intelligence platforms, we surveyed existing popular platforms and investigated their limitations in delivering timely intelligence. Specifically, we manually investigated the platforms that are the major providers of malicious package intelligence. The list comprises 1) the top-10 intelligence sources that contain the most malicious package information, found by keyword search on Google, and 2) the well-known databases that contain structured information on malicious packages.
Table 1 presents the detailed characteristics of each platform. We find that 1) although most platforms publish blogs or posts to reveal newly identified malicious packages, few of them process this intelligence into structured databases. Moreover, 2) although platforms such as Snyk, OSV, and GitHub Advisory offer some form of structured database of malicious software packages, they still face limitations. For instance, these databases all suffer from low coverage of malicious packages, a lack of comprehensive information (e.g., specific malicious package versions and types), incomplete public availability, and late updates for newly identified malicious packages.
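The coverage and timeliness limitations above can be checked with a simple comparison between two databases. The sketch below is one plausible way to do it, under assumptions: each database is represented as a mapping from a (registry, package name) key to the date the package was first recorded, and the "earlier" share is computed over packages present in both databases. The source does not specify the exact denominator, and the function name compare_databases is hypothetical.

```python
from datetime import date
from typing import Dict, Set, Tuple

Key = Tuple[str, str]  # (registry, package name), an assumed record key


def compare_databases(
    ours: Dict[Key, date],
    theirs: Dict[Key, date],
) -> Tuple[Set[Key], float]:
    """Return packages found only in `ours` and the share recorded earlier."""
    # Coverage gap: packages we have that the other database lacks.
    only_ours = set(ours) - set(theirs)
    # Timeliness: among shared packages, how many we recorded first.
    shared = set(ours) & set(theirs)
    earlier = [k for k in shared if ours[k] < theirs[k]]
    share_earlier = len(earlier) / len(shared) if shared else 0.0
    return only_ours, share_earlier
```

For example, with toy data in which one of two shared packages was recorded earlier, the function returns the single unshared package and a share of 0.5.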
Contribution:
We proposed IntelliRadar, a comprehensive LLM-based SSC intelligence analysis platform for the complete and timely collection of malicious package intelligence, achieving an F1-score of 94.87%.
We constructed a comprehensive and human-validated dataset containing intelligence on 34,313 malicious packages, establishing the largest known database for PyPI and NPM package managers to date, which is publicly accessible through our website.
Our approach demonstrates excellent cost-efficiency: IntelliRadar requires only $7 per month to monitor all relevant web pages and only $0.003 to identify each piece of intelligence.
We reported intelligence on 1,981 malicious packages to downstream mirror maintainers, significantly contributing to the security of the open-source ecosystem.
Intelligence Source Distribution: IntelliRadar - Sources (google.com)
PyPI Malicious Packages Intelligence: IntelliRadar - PyPI (google.com)
NPM Malicious Packages Intelligence: IntelliRadar - NPM (google.com)
Maven Malicious Packages Intelligence: IntelliRadar - Maven (google.com)
Other Malicious Packages Intelligence: IntelliRadar - Other (google.com)