Data Source
To gain a comprehensive understanding of the current state and development trends of malicious code in the PyPI ecosystem, we first conducted an in-depth study of the relevant literature. By analyzing these publications in detail, we systematically investigated the sources and construction methods of malicious code datasets. We found multiple real-world sources of malicious code that can serve as the basis for building rich and diverse datasets. Ultimately, we grouped these sources into seven categories, which together cover malicious code in several forms, including code snippets, complete projects, and single files. This classification helps build a dataset of higher quality and diversity that better reflects the realistic distribution and attributes of malicious code. Notably, the classification applies not only to building a Python malicious code dataset but also to datasets for other programming languages. In addition, we evaluated the dataset's quality in detail to ensure the accuracy and reliability of the data. We therefore expect this dataset to provide strong support for related research and applications.
Paper Database
Published papers often release open-source malicious code datasets. Other researchers frequently cite these datasets for validation, which makes them a source of usable data. For example, the Backstabbers-Knife-Collection dataset contains 250 PyPI malware packages, and SourceFinder exposes 2300 malware repositories collected from GitHub.
Code Hosting Websites
Open-source code hosting websites contain many malicious code repositories. Some of these repositories are actively maintained by their developers, so the malicious code they contain is of high quality. Common hosting platforms include GitHub and Gitee.
Official Software Repository
To make development more efficient, programming languages provide third-party repositories containing many reusable software packages. Attackers upload packages embedded with malicious code to these platforms to trick developers into downloading them. Third-party repositories therefore contain many malware packages and are a source of malicious code datasets. Typical third-party repositories include PyPI, npm, and RubyGems.
IT Technology Websites
IT technology websites can also spread malicious code. Open-source exchange communities serve as platforms where developers exchange programming knowledge, but studies have revealed that some developers share code designed to achieve malicious behaviors. This issue is particularly prominent in hacker communities and underground forums, where some hackers share the source code of malware used in real-world attacks.
Open Database
Open databases store large numbers of malicious code source files. Some malware databases, such as "VX-Underground", contain high-quality source code. "Exploit Databases" likewise contain a large amount of exploit code, a type of malicious code that an attacker writes to exploit an existing vulnerability and achieve malicious behavior.
Attack Tools
Network penetration tools can create shellcode, malicious code that is executed by exploiting software vulnerabilities. Shellcode is frequently used to escalate privileges and establish remote backdoors. Cobalt Strike is a framework-based penetration tool that can build reverse-connection shells and Trojan files, and attackers widely use it to generate malicious code.
Others
This category covers sources that fall outside the six categories above, such as malicious code written by researchers themselves.
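As an illustration of how the seven source categories and the three code forms described in this section could be attached to dataset entries, the following is a minimal sketch of a labeling schema. It is not the schema used to build the dataset: the names Source, Form, Sample, and count_by_source, as well as the example entries, are hypothetical and introduced only for this sketch.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    """The seven source categories described in this section."""
    PAPER_DATABASE = "paper database"
    CODE_HOSTING_WEBSITE = "code hosting website"
    OFFICIAL_SOFTWARE_REPOSITORY = "official software repository"
    IT_TECHNOLOGY_WEBSITE = "IT technology website"
    OPEN_DATABASE = "open database"
    ATTACK_TOOL = "attack tool"
    OTHER = "other"


class Form(Enum):
    """The forms of malicious code mentioned in this section."""
    SNIPPET = "code snippet"
    PROJECT = "complete project"
    SINGLE_FILE = "single file"


@dataclass(frozen=True)
class Sample:
    """One dataset entry; all field names are hypothetical."""
    name: str
    source: Source
    form: Form
    language: str = "python"


def count_by_source(samples):
    """Return the number of samples per source category, including zeros."""
    counts = Counter(s.source for s in samples)
    return {src: counts.get(src, 0) for src in Source}


# Placeholder entries for illustration only, not real dataset records.
samples = [
    Sample("example-package-a", Source.OFFICIAL_SOFTWARE_REPOSITORY, Form.PROJECT),
    Sample("example-snippet-b", Source.IT_TECHNOLOGY_WEBSITE, Form.SNIPPET),
    Sample("example-script-c", Source.CODE_HOSTING_WEBSITE, Form.SINGLE_FILE),
    Sample("example-package-d", Source.OFFICIAL_SOFTWARE_REPOSITORY, Form.PROJECT),
]
distribution = count_by_source(samples)
```

Recording the source and form of every sample in this way makes it straightforward to check how evenly a dataset covers the categories, and the language field reflects the observation that the same classification can be applied to languages other than Python.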