FCGHUNTER - Dataset

Supplementary Details on Dataset Collection

The samples come from two platforms, VirusShare and AndroZoo. Both are widely recognized in Android research and have been used in numerous published studies [1, 2].

Malware Collection: Malware samples were collected from VirusShare (https://virusshare.com/) between 2018 and 2023. For each year, 1000 malware samples were randomly selected, for a total of 6000 malware samples across 6 years. Anyone who wishes to download the malware themselves must first apply for a VirusShare account.

Benign Collection: Benign samples were downloaded from AndroZoo (https://androzoo.uni.lu/) for the same time frame (2018 to 2023). As with the malware collection, 1000 benign samples were selected for each year, for a total of 6000 benign samples over 6 years.

We then processed the collected dataset as follows:

1. Preprocessing: During collection, we extracted Function Call Graphs (FCGs) and computed graph embeddings for all collected samples. If a sample failed at any stage of this process, for example by producing an invalid FCG (e.g., an empty graph) or a None feature after graph embedding, we removed it so that only valid samples were used in our experiments.

2. Training and Test Split: To train stable target models, we split the dataset from each year in an 80:20 ratio for training and testing. This results in 9600 samples for training (4800 malware and 4800 benign samples evenly distributed across 6 years) and 2400 samples for testing (1200 malware and 1200 benign samples).

3. Attack Samples Preparation: For adversarial testing, we additionally selected 120 malware samples from VirusShare, evenly spread across the 6 years (20 malware samples per year). This ensures that the attack samples cover a wide range of time periods, allowing for robust adversarial evaluation across different time spans. We used the target models to verify that these samples are indeed malware, and we ensured that they were excluded from the training set.
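
The preprocessing filter and the per-year 80:20 split described above can be sketched as follows. This is a minimal illustration on synthetic records, not the pipeline actually used for the dataset: the record fields (`sha256`, `year`, `label`, `num_nodes`, `feature`), the helper names `is_valid` and `split_by_year`, the random seed, and the choice to split each (year, label) group separately are all assumptions made for the example.

```python
import random
from collections import defaultdict


def is_valid(sample):
    """Keep a sample only if its FCG is non-empty and its embedding exists.

    The field names are hypothetical; they stand in for whatever the
    extraction step produces.
    """
    return sample["num_nodes"] > 0 and sample["feature"] is not None


def split_by_year(samples, train_ratio=0.8, seed=0):
    """Split every (year, label) group on its own, so each year and each
    class keeps the same train/test ratio."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s["year"], s["label"])].append(s)

    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    train, test = [], []
    for key in sorted(groups):
        group = list(groups[key])
        rng.shuffle(group)
        cut = round(len(group) * train_ratio)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Synthetic stand-in for the collection: 1000 samples per class per year.
samples = [
    {
        "sha256": f"{label}-{year}-{i}",  # placeholder, not a real hash
        "year": year,
        "label": label,
        "num_nodes": 10,
        "feature": [0.0],
    }
    for year in range(2018, 2024)
    for label in ("malware", "benign")
    for i in range(1000)
]

valid = [s for s in samples if is_valid(s)]
train, test = split_by_year(valid)
print(len(train), len(test))  # 9600 2400
```

With 1000 samples per class per year, each group contributes 800 training and 200 test samples, which reproduces the totals stated above (9600 and 2400).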

[1] Allix, Kevin, et al. "AndroZoo: Collecting Millions of Android Apps for the Research Community." Proceedings of the 13th International Conference on Mining Software Repositories. 2016.

[2] "Researches based on VirusShare", https://virusshare.com/research.

Introduction to the Published Dataset

For security reasons, we have not made the original APK files publicly available: they would be flagged and blocked by Google’s platform, and they are also extremely large. Instead, we provide as much information about the dataset as possible: the SHA256 hashes and extracted features of the training and test sets, and the extracted FCGs of the attack samples. The SHA256 hashes serve as unique identifiers for retrieving the APKs, the features can be used directly for further research, and the extracted FCGs serve as input for our project.

1. SHA256: covers the training set, test set, and attack samples. Each entry is provided in a text file in the format: sha256, year.

https://drive.google.com/drive/folders/1kcZe3PYCcd-w64VlqASjKjISEngPMvn0?usp=drive_link

2. Feature Files (npz files):

https://drive.google.com/file/d/1AjNfQw7Z2Vom8KPpqfO6nHKimE44pEc4/view?usp=drive_link

3. Attack Samples (gexf files):

https://drive.google.com/file/d/1OWIWVVjifCv3iByRP4IBx-Sn8d49oB_4/view?usp=drive_link
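
As a starting point for working with these files, the sketch below parses a hash list in the `sha256, year` format and counts the nodes and edges of a GEXF graph using only the Python standard library. It rests on assumptions not confirmed by this page: that the hash files have no header row, and that the FCG files are standard GEXF XML with `node` and `edge` elements. The helper names `parse_hash_list` and `gexf_size`, the placeholder hash, and the method labels in the example graph are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_hash_list(text):
    """Parse lines of the form 'sha256, year' into (sha256, year) tuples.

    Assumes one entry per line and no header row; blank lines are skipped.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        sha256, year = (part.strip() for part in line.split(","))
        entries.append((sha256.lower(), int(year)))
    return entries


def gexf_size(xml_text):
    """Return (node_count, edge_count) for a GEXF document.

    Tags are matched by local name so the function does not depend on
    which GEXF namespace version the file declares.
    """
    root = ET.fromstring(xml_text)
    local_names = [el.tag.rsplit("}", 1)[-1] for el in root.iter()]
    return local_names.count("node"), local_names.count("edge")


# Placeholder entry: 64 hex characters, not a real sample hash.
hash_text = "A" * 64 + ", 2018\n\n" + "b" * 64 + ", 2023\n"
print(parse_hash_list(hash_text)[0][1])  # 2018

# Tiny hand-written graph with made-up method labels.
example_gexf = """<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="directed">
    <nodes>
      <node id="0" label="Lcom/example/Main;->onCreate"/>
      <node id="1" label="Landroid/util/Log;->d"/>
    </nodes>
    <edges>
      <edge id="0" source="0" target="1"/>
    </edges>
  </graph>
</gexf>"""
print(gexf_size(example_gexf))  # (2, 1)
```

For the downloaded files, the same functions can be applied to the file contents read with `open(path, encoding="utf-8").read()`. The array names stored in the npz feature file are not documented here, so they would need to be inspected after loading.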

