Prior studies on machine learning (ML) based Android malware detection often report very high accuracy, with some exceeding 99% detection performance. However, it is unclear why these models perform so well and whether the high accuracy truly reflects models that have learned to distinguish malware from benign apps based on relevant features.
This paper investigates the underlying reasons for the high performance of ML-based Android malware detectors using explainable AI techniques. The goal is to understand whether the models actually learn to identify malware from features indicative of malicious behavior, or whether they rely on other, irrelevant factors to make predictions.
By analyzing feature importance in the ML models through explainable AI, this study seeks to determine the reliability and realistic performance of ML-based Android malware detectors. The findings can inform the design of more robust experimental setups for evaluating such models.
To investigate the underlying reasons behind the high performance of ML-based Android malware detectors, we take the following approach:
Replicate three well-known explainable ML-based Android malware detection approaches:
Drebin with linear Support Vector Machine (SVM)
XMal with attention-based multi-layer perceptron (MLP)
Fan et al. with model-agnostic explainability techniques such as LIME
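To make the replication step concrete, the sketch below shows a Drebin-style setup: binary feature-presence vectors fed to a linear SVM, with each prediction explained by the per-feature contributions w_k * x_k, in the spirit of Drebin's built-in explanations. The feature names and toy data are illustrative assumptions, not the actual datasets or feature sets used in this study.

```python
# Drebin-style sketch: binary presence features + linear SVM, explained by
# signed per-feature contributions to the decision score. Feature names and
# samples below are illustrative placeholders, not the real Drebin data.
import numpy as np
from sklearn.svm import LinearSVC

feature_names = ["SEND_SMS", "INTERNET", "READ_CONTACTS", "api_getDeviceId"]

# Toy binary feature vectors: rows = apps, columns = features above.
X = np.array([
    [1, 1, 1, 1],   # malware-like
    [1, 1, 0, 1],   # malware-like
    [0, 1, 0, 0],   # benign-like
    [0, 1, 1, 0],   # benign-like
])
y = np.array([1, 1, 0, 0])  # 1 = malware, 0 = benign

clf = LinearSVC(C=10.0).fit(X, y)

def explain(sample):
    """Rank the features present in `sample` by |w_k * x_k|, the magnitude
    of their signed contribution to the SVM decision score."""
    contrib = clf.coef_[0] * sample
    order = np.argsort(-np.abs(contrib))
    return [(feature_names[i], float(contrib[i])) for i in order if sample[i]]

print(explain(X[0]))
```

The same per-sample contribution lists are what a model-agnostic explainer such as LIME would approximate for a non-linear model.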
Evaluate these approaches under different experimental settings with varying temporal consistency between malware and benign samples:
Baseline: Temporally consistent malware and benign samples
Variant 1: Latest 3 years of temporally consistent samples
Variant 2: Earliest 3 years of temporally consistent samples
Variant 3: Malware from the latest 3 years, benign samples from the earliest 3 years
Variant 4: Malware from the earliest 3 years, benign samples from the latest 3 years
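The settings above can all be built from timestamped samples with one filtering helper. A minimal sketch, assuming each app record carries a label and a (dex-date-style) year; the record fields and year range are placeholders, not the actual corpus:

```python
# Sketch of constructing the temporally (in)consistent experimental settings.
# App fields and the 2014-2019 year range are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class App:
    sha256: str
    year: int
    label: str  # "malware" or "benign"

def split(apps, mal_years, ben_years):
    """Keep malware whose year is in mal_years and benign apps in ben_years."""
    return [a for a in apps
            if (a.label == "malware" and a.year in mal_years)
            or (a.label == "benign" and a.year in ben_years)]

apps = [App(f"m{y}", y, "malware") for y in range(2014, 2020)] + \
       [App(f"b{y}", y, "benign") for y in range(2014, 2020)]

years = sorted({a.year for a in apps})
early, late = set(years[:3]), set(years[-3:])

baseline = split(apps, set(years), set(years))  # temporally consistent
variant1 = split(apps, late, late)              # latest 3 years, both classes
variant2 = split(apps, early, early)            # earliest 3 years, both classes
variant3 = split(apps, late, early)             # latest malware, earliest benign
variant4 = split(apps, early, late)             # earliest malware, latest benign
```

Variants 3 and 4 deliberately introduce temporal inconsistency, so any detector that keys on dataset age rather than maliciousness should perform suspiciously well on them.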
Analyze the explanations generated by the explainable ML approaches to understand:
What features contribute most to the predictions
Whether models identify malware based on malicious behavior or on temporal differences between the classes
Impact of temporal inconsistency on feature importance
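One way to carry out this analysis is to aggregate the per-sample explanations into a global feature ranking, then inspect whether the top-ranked features are behavioral (e.g. SMS abuse) or temporal artifacts (e.g. an API that only exists in newer SDK versions). A minimal sketch; the feature names and weights are hypothetical:

```python
# Aggregate per-sample (feature, weight) explanations into a global ranking
# by mean absolute importance. Feature names below are hypothetical examples
# of a behavioral feature vs. a temporal-artifact feature.
from collections import defaultdict

def aggregate(explanations):
    """Rank features by their average absolute weight across samples."""
    totals, counts = defaultdict(float), defaultdict(int)
    for expl in explanations:              # expl: list of (feature, weight)
        for feat, w in expl:
            totals[feat] += abs(w)
            counts[feat] += 1
    avg = {f: totals[f] / counts[f] for f in totals}
    return sorted(avg, key=avg.get, reverse=True)

expls = [
    [("SEND_SMS", 0.9), ("targetSdk>=26", 0.4)],
    [("SEND_SMS", 0.7), ("targetSdk>=26", 0.6)],
]
print(aggregate(expls))
```

Comparing this ranking between the temporally consistent and inconsistent settings reveals whether temporal features displace behavioral ones.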
Assess the sensitivity of the impact of temporal inconsistency by varying:
Malware to benign ratio in training data
Degree of temporal inconsistency between malware and benign samples
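This sensitivity sweep can be sketched as a grid over the two knobs, assuming per-year pools of samples; the year range, ratio values, and the `train_and_eval` hook are hypothetical placeholders for the real pipeline:

```python
# Sketch of the sensitivity analysis: sweep the malware:benign ratio and the
# temporal gap (in years) between the classes, re-training at each setting.
# Years, ratios, and the training hook are illustrative assumptions.
import random

def make_setting(mal_by_year, ben_by_year, base_year, gap, ratio):
    """Benign apps from base_year; malware from base_year + gap (the degree
    of temporal inconsistency); malware downsampled to the class ratio."""
    ben = list(ben_by_year[base_year])
    mal = list(mal_by_year[base_year + gap])
    k = min(len(mal), int(ratio * len(ben)))
    return random.sample(mal, k), ben

# Toy per-year pools standing in for real, timestamped APK corpora.
mal_by_year = {y: [f"m{y}_{i}" for i in range(10)] for y in range(2014, 2020)}
ben_by_year = {y: [f"b{y}_{i}" for i in range(10)] for y in range(2014, 2020)}

for gap in (0, 1, 2, 3):           # 0 = temporally consistent
    for ratio in (0.2, 0.5, 1.0):  # malware : benign
        mal, ben = make_setting(mal_by_year, ben_by_year, 2015, gap, ratio)
        # the real pipeline would call train_and_eval(mal, ben) here
```

If detection performance degrades as the gap shrinks to zero, that is evidence the models were exploiting temporal differences rather than malicious behavior.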