Experimental settings and part of the evaluation results of the replication study. The full evaluation results can be found in our online supplementary material. The second column (i.e., sample dates) covers 11 years of app samples, ranging from 2010 to 2020. A black cell indicates that the apps in the corresponding time frame (all 3,000 apps from the malware set, or 3,000 apps randomly selected from the goodware set) are used for training.
Key Findings: The results showed that introducing temporal inconsistency, where malware samples are older and benign samples are newer (Variant 4), significantly inflated the detection performance of all three approaches. Drebin achieved the highest accuracy of 99.27% under Variant 4. In contrast, the temporally consistent settings (Baseline, Variants 1 and 2) yielded lower performance.
This finding indicates that the high performance reported by many ML-based Android malware detection studies may be attributed to temporal inconsistencies between malware and benign samples in their datasets, rather than the models truly learning to distinguish malware from benign apps based on relevant features. The temporal differences appear to have a significant impact on the classifiers' performance.
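The shortcut described above can be illustrated with a small synthetic simulation. This is a sketch, not the paper's actual pipeline: the two features (an API added in newer Android versions and one removed/deprecated from them), their occurrence probabilities, and the one-rule classifier are all hypothetical stand-ins for what a trained model would learn from a temporally biased dataset.

```python
import random

random.seed(0)

# Hypothetical era-dependent artifact features: their presence depends
# only on when the app was built, not on whether it is malicious.
def make_app(era):
    added_api = 1 if random.random() < (0.95 if era == "new" else 0.05) else 0
    removed_api = 1 if random.random() < (0.05 if era == "new" else 0.95) else 0
    return (added_api, removed_api)

def make_dataset(mal_era, ben_era, n=1000):
    X = [make_app(mal_era) for _ in range(n)] + [make_app(ben_era) for _ in range(n)]
    y = [1] * n + [0] * n  # 1 = malware, 0 = benign
    return X, y

# One-rule classifier standing in for a trained model: flag an app as
# malware if it uses a removed API but not a newly added one -- the
# time-specific shortcut the replication study observed.
def predict(app):
    added_api, removed_api = app
    return 1 if removed_api > added_api else 0

def accuracy(X, y):
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)

# Variant-4-style split: malware old, benign new -> the artifact predicts class.
X_bias, y_bias = make_dataset(mal_era="old", ben_era="new")
acc_bias = accuracy(X_bias, y_bias)

# Temporally consistent split: both classes from the same era.
X_fair, y_fair = make_dataset(mal_era="new", ben_era="new")
acc_fair = accuracy(X_fair, y_fair)

print(f"temporally inconsistent split: {acc_bias:.2f}")  # well above chance
print(f"temporally consistent split:   {acc_fair:.2f}")  # near chance (0.5)
```

Under the biased split the era-only rule scores well above chance; under the consistent split the same features carry no class signal and accuracy falls to roughly 50%, mirroring the gap between Variant 4 and the temporally consistent settings.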
Key Findings: The top-ranked features highlighted by explainable machine learning approaches for ML-based malware detection may not always capture the difference between malicious and benign behaviors; they may simply be time-specific features that exist only in either historical or recent apps. For example, under temporal inconsistency where malware samples are older and benign samples are newer, the models treat features added in newer Android versions as top features for identifying benign apps, and removed/deprecated features as top features for identifying malware.
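To sketch why era-dependent features can outrank genuinely behavioral ones in an explanation, the simulation below ranks features by a crude difference-of-means score as a stand-in for feature importance. The feature names (SEND_SMS as a behavioral signal, added_api / removed_api as era artifacts) and all probabilities are hypothetical, not taken from the paper's data.

```python
import random

random.seed(1)

# Hypothetical features: SEND_SMS carries a genuine (but moderate)
# malicious signal; added_api / removed_api depend only on the era.
def make_app(is_malware, era):
    return {
        "SEND_SMS":    1 if random.random() < (0.6 if is_malware else 0.2) else 0,
        "added_api":   1 if random.random() < (0.95 if era == "new" else 0.05) else 0,
        "removed_api": 1 if random.random() < (0.05 if era == "new" else 0.95) else 0,
    }

n = 1000
malware = [make_app(True, era="old") for _ in range(n)]   # older samples
benign  = [make_app(False, era="new") for _ in range(n)]  # newer samples

# Difference-of-means score as a crude proxy for feature importance:
# positive pushes toward "malware", negative toward "benign".
def score(feature):
    mal_rate = sum(app[feature] for app in malware) / n
    ben_rate = sum(app[feature] for app in benign) / n
    return mal_rate - ben_rate

scores = {f: score(f) for f in ("SEND_SMS", "added_api", "removed_api")}
ranking = sorted(scores, key=lambda f: abs(scores[f]), reverse=True)
print(ranking)  # the era-dependent features outrank the behavioral one
```

Because the era gap between classes (roughly 0.9 in occurrence rate) dwarfs the behavioral gap (roughly 0.4), the two time-specific features dominate the ranking, which is the pattern the explanations in the study exhibit.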
Key Findings: When the testing samples come from time periods distinct from those of the training data, the ML models still distinguish malware from benign apps based on the temporal differences learned from the training data, resulting in extremely poor performance. For example, when Drebin is trained on data where malware is older and benign is newer, but tested on data where malware is newer and benign is older, its accuracy drops to only 14%. The explanations show that the model still treats samples with more added features as benign and samples with more removed features as malware, even though the temporal distribution has flipped in the testing data.
Key Findings: All three ML-based malware detection approaches (Drebin, XMal, Fan et al.) provide highly accurate predictions based on temporal differences when temporal inconsistency exists in the training data. Under Variant 4, where malware samples are older and benign samples are newer, all approaches achieve over 98% accuracy and F1 score. The explanation results show that the models capture the time differences between malware and benign apps: malware is likely to include removed/deprecated features, while benign apps are likely to include newly added features.
If the temporal relationship between malware and benign samples changes in the testing data relative to the training data, the ML models trained under temporal bias perform extremely poorly; for example, Drebin trained on Variant 4 but tested on Variant 3 achieves only 14% accuracy, and its explanations show it still treats added features as benign indicators and removed features as malware indicators despite the flipped temporal distribution. This further demonstrates that the models distinguish malware from benign apps based on learned time-specific features rather than actual malicious or benign behavior.