Mobile malware attempts to evade detection during app analysis by mimicking security-sensitive behaviors (e.g., sending SMS messages) of benign apps that provide similar functionality, and by suppressing its payload to reduce the chance of being observed (e.g., executing the payload only at night). Since current approaches focus their analysis on the types of security-sensitive resources being accessed (e.g., network), these evasive techniques make differentiating between malicious and benign apps a difficult task during app analysis. We propose that the malicious and benign behaviors within apps can be differentiated based on the contexts that trigger security-sensitive behaviors, i.e., the events and conditions that cause the security-sensitive behaviors to occur. In this work, we introduce AppContext, a static program analysis-based approach that extracts the context of security-sensitive behaviors to assist app analysis with differentiating between malicious and benign behaviors. We implement a prototype of AppContext and analyze 202 malicious apps from various malware datasets, and 633 benign apps from the Google Play Store. AppContext correctly identifies 192 malicious apps with 87.7% precision and 95% recall. Our evaluation results suggest that the maliciousness of a security-sensitive behavior is more closely related to the intentions of the behavior (reflected via contexts) than to the types of the security-sensitive resources that the behavior accesses.
AppContext: Differentiating Malicious and Benign Mobile App Behavior using Contexts
To appear in Proc. of 37th International Conference on Software Engineering (ICSE 2015), Florence, Italy, May 2015. (acceptance rate: 18.5%, 84 out of 452)
AppContext is an approach that extracts the values of elements in our context definition, focusing on informing app reviewers or expert users of imperceptible contexts of permission uses. AppContext defines a concise abstraction of contexts of permission uses and performs four-step static analysis on the byte code of apps to extract the contexts.
First, AppContext constructs a call graph from an app's binary and performs static analysis to locate its security-sensitive behaviors. Next, AppContext identifies activation events by the entry points of the computed call graphs, and converts the call graphs into an extended call graph (ECG) by using inter-component communication (ICC) information. Then, AppContext constructs an inter-procedure control flow graph (ICFG) for each call path to security-sensitive method calls in the ECG and traverses each ICFG to find sets of conditional statements. Next, AppContext generates the complete contexts after extracting the context factors via information flow analysis. Finally, AppContext classifies the security-sensitive behaviors by using the features of the extracted context.
Our definition of context includes two important characteristics that determine the invocations of security-sensitive method calls: activation events and context factors.
UI events are user actions interacting with app-specific
System events represent the events generated by a smartphone's software/hardware.
UISystem events are general user interactions on interfaces of systems or devices that can change the lifecycle of an Android component.
As shown in Figure 1, the context factors are environmental attributes (i.e., data from the external environment) that control or affect the invocations of security-sensitive method calls (in this case, SmsManager.sendTextMessage()).
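The context definition above can be pictured as a small record type. The following is only an illustrative sketch, not AppContext's actual data model: the class and field names, the event strings, and the choice of `java.util.Calendar.get` as an environment source are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class EventKind(Enum):
    # The three kinds of activation events named in the text.
    UI = "UI event"
    SYSTEM = "System event"
    UI_SYSTEM = "UISystem event"


@dataclass(frozen=True)
class Context:
    """Hypothetical record for one context of a security-sensitive call."""
    sensitive_method: str                     # the call being guarded
    activation_event: str                     # what triggers the call path
    event_kind: EventKind
    context_factors: frozenset = frozenset()  # environment sources that affect the call


# Same sensitive method, two different contexts (both invented for illustration).
benign_use = Context(
    sensitive_method="SmsManager.sendTextMessage",
    activation_event="user clicks the Send button",
    event_kind=EventKind.UI,
)
night_use = Context(
    sensitive_method="SmsManager.sendTextMessage",
    activation_event="signal strength changes",
    event_kind=EventKind.SYSTEM,
    context_factors=frozenset({"java.util.Calendar.get"}),
)
```

The two records access the same resource but differ in their activation event and context factors, which is the distinction the context definition is meant to capture.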
AppContext leverages Soot as its underlying static analysis framework. AppContext uses Dexpler, which is part of the Soot framework, to convert Dalvik bytecode into the Jimple intermediate representation from which Soot constructs its call graph. AppContext also leverages FlowDroid, a static taint analysis tool based on Soot, to provide precise modeling of the Android component life cycles and callback methods. We also modified FlowDroid's information flow analysis to compute the information flow from environment-property methods to conditional statements.
Security-sensitive method identification
To extract the contexts of permission uses, AppContext uses the permission mappings provided by PScout as input and performs the analysis. Since AppContext relies on PScout's mappings, the soundness and completeness of the mappings may affect the number of false positives and false negatives produced by AppContext.
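A minimal sketch of the lookup step, assuming the mappings have already been loaded into a dictionary keyed by method name. The two entries are a tiny illustrative excerpt, the call-site representation is hypothetical, and PScout's actual file format is not reproduced here.

```python
# Illustrative excerpt only; a real mapping is far larger.
PERMISSION_MAP = {
    "android.telephony.SmsManager.sendTextMessage": "android.permission.SEND_SMS",
    "android.telephony.TelephonyManager.getDeviceId": "android.permission.READ_PHONE_STATE",
}


def find_sensitive_calls(call_sites, permission_map=PERMISSION_MAP):
    """Return (caller, callee, permission) for every call site whose callee
    appears in the permission mapping.

    call_sites: iterable of (caller, callee) pairs (hypothetical format).
    """
    return [
        (caller, callee, permission_map[callee])
        for caller, callee in call_sites
        if callee in permission_map
    ]


call_sites = [
    ("com.example.Main.onClick", "android.telephony.SmsManager.sendTextMessage"),
    ("com.example.Main.onClick", "android.util.Log.d"),
    # A permission-protected API missing from the mapping would be skipped here.
    ("com.example.Sync.run", "com.example.hidden.Api.call"),
]
sensitive = find_sensitive_calls(call_sites)
```

Because the filter is a plain membership test, any protected method absent from the mapping is silently missed, which is how incompleteness in the mappings turns into false negatives.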
The limitations of our implementation lie in two aspects. First, AppContext inherits the limitations of static analysis. Since static analysis extracts what can happen rather than what does happen, AppContext may produce false positives. We also incorporate a few heuristics into the static analysis to infer invoking relationships that are not included in the call graph of a traditional analysis. Doing so introduces imprecision, as some of the invocation behaviors are not deterministic (e.g., Handler.sendMessage). Second, AppContext uses the lists of method signatures provided by PScout to identify security-sensitive methods. As some of the lists are incomplete, AppContext may not be able to identify some of the hidden APIs in the Android platform. Although AppContext is built on top of existing static analysis tools (i.e., FlowDroid and Soot) and uses the permission mappings from previous work (i.e., PScout) as input for its analysis, the AppContext approach is independent of the underlying static analysis frameworks and mappings. Therefore, the precision of the current implementation of AppContext can be improved as analysis frameworks and permission mappings improve.
To extract activation events, AppContext chains all inter-component communications (ICCs) within the app into an extended call graph (ECG) (Figure 2), whose entry points identify the activation events.
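The idea of chaining components through ICC edges can be sketched with a plain adjacency map. This is a simplified illustration under assumed inputs (edge lists of method names, all invented here), not the Soot/FlowDroid-based implementation.

```python
from collections import defaultdict, deque


def build_ecg(call_edges, icc_edges):
    """Merge ordinary call edges and ICC edges into one adjacency map."""
    graph = defaultdict(set)
    for src, dst in list(call_edges) + list(icc_edges):
        graph[src].add(dst)
    return graph


def activation_entry_points(graph, entry_points, target):
    """Return the entry points from which `target` is reachable."""
    # Walk the graph backwards from the sensitive call.
    reverse = defaultdict(set)
    for src, dsts in graph.items():
        for dst in dsts:
            reverse[dst].add(src)
    seen = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for pred in reverse[node]:
            if pred not in seen:
                seen.add(pred)
                queue.append(pred)
    return sorted(e for e in entry_points if e in seen)


# Hypothetical app: an Activity starts a Service, and the Service sends an SMS.
call_edges = [
    ("MainActivity.onClick", "MainActivity.launch"),
    ("SmsService.onStartCommand", "SmsManager.sendTextMessage"),
    ("BootReceiver.onReceive", "BootReceiver.log"),
]
icc_edges = [("MainActivity.launch", "SmsService.onStartCommand")]
entries = ["MainActivity.onClick", "BootReceiver.onReceive"]

without_icc = activation_entry_points(build_ecg(call_edges, []), entries, "SmsManager.sendTextMessage")
with_icc = activation_entry_points(build_ecg(call_edges, icc_edges), entries, "SmsManager.sendTextMessage")
```

In this toy example the click handler is linked to the SMS call only once the ICC edge is added, which is why the per-component call graphs have to be chained before activation events are inferred.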
To compute context factors, AppContext combines the control flows of all components from entry points of the activation events to the security-sensitive method call in an inter-procedure control flow graph (ICFG) (Figure 3), and leverages information flow analysis to identify the environmental attributes that affect the control flows.
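As a rough illustration of the second half of this step, the sketch below walks definitions backwards from the variables tested in conditional statements until it reaches an environment-property method. It is a toy stand-in for the modified FlowDroid information flow analysis: the set of environment methods, the variable names, and the `definitions` map are all assumptions for this example.

```python
# Placeholder set; which methods AppContext treats as environment-property
# methods is not listed in this text.
ENVIRONMENT_METHODS = {
    "java.util.Calendar.get",
    "android.os.PowerManager.isScreenOn",
}


def context_factors(guard_variables, definitions, sources=ENVIRONMENT_METHODS):
    """Return the environment methods whose values flow into the guards.

    guard_variables: variables tested by conditionals on the path.
    definitions: maps a variable to the variables/methods it is computed from.
    """
    factors, seen = set(), set()
    worklist = list(guard_variables)
    while worklist:
        name = worklist.pop()
        if name in seen:
            continue
        seen.add(name)
        if name in sources:
            factors.add(name)
            continue
        worklist.extend(definitions.get(name, ()))
    return factors


# Hypothetical path: `if (isNight && enabled) sendTextMessage(...)`.
definitions = {
    "isNight": ["hour"],
    "hour": ["java.util.Calendar.get"],
    "enabled": ["userPref"],
    "userPref": ["SharedPreferences.getBoolean"],
}
factors = context_factors(["isNight", "enabled"], definitions)
```

Only the time-of-day source is reported as a context factor in this example; the preference lookup is ignored because it is not in the assumed set of environment methods.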
Figure 3. Inter-procedure control flow graph
Our subject apps include 846 Android apps in total (633 benign apps, 202 malicious apps, and 11 open-source apps).
For malicious apps, we randomly selected 130 malicious apps from the Genome malware dataset, 30 malicious apps from the VirusShare dataset, and 50 malicious apps from the Contagio dataset. We also selected 17 malicious apps identified by VirusTotal that were posted on Google Play in 2013, but were later removed. These malicious apps cover the majority of existing Android malware families from 2011 to 2014, which are rapidly evolving to circumvent detection by various mobile security software.
For benign apps, we downloaded the top 500 apps from all of the categories from Google Play in January 2013. From each category, we randomly selected 20 apps that were under 5 MB and 20 apps with no size restriction. We chose apps smaller than 5 MB because FlowDroid runs out of memory on large apps. We also excluded the apps identified as malware by VirusTotal and the apps that caused FlowDroid to throw exceptions or time out. Our final malware dataset contains 202 malicious apps, and the final benign dataset contains 633 apps.
For open-source apps, we randomly selected 15 apps from FDroid. Among these 15 apps, we excluded 4 apps that do not have security-sensitive behaviors. Our open-source dataset contains 11 apps. We apply AppContext to extract contexts from the subject apps.
AppContext ran on a desktop with a 3.4 GHz Intel Core i7 processor and 8 GB of memory. For the 846 subject apps, AppContext takes on average 647 seconds to finish. We set the timeout of AppContext to 80 minutes, and AppContext exceeded the timeout limit for 162 apps (already excluded from the subjects).
RQ1: How effective is AppContext in identifying malware? How does AppContext compare to the approach without using context information in terms of malware identification effectiveness?
We use the labeled behaviors (i.e., method calls) both as training and test data in a ten-fold cross validation, which is the standard approach for evaluating machine-learning techniques. It works by randomly dividing all data into 10 equally sized buckets, training the classifier on nine of the buckets, and then classifying the remaining bucket for testing. The process is repeated 10 times, with each of the 10 buckets used exactly once as the testing data. We report the average precision and recall in Table III and Table IV.
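The procedure can be sketched in a few lines of plain Python. This is a generic illustration of ten-fold cross validation, not the evaluation harness used here: the `train` callable is a placeholder for whatever classifier is trained, and the buckets differ in size by at most one when the data does not divide evenly.

```python
import random


def ten_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k buckets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def precision_recall(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cross_validate(samples, labels, train, k=10, seed=0):
    """Average precision and recall over k folds.

    train(train_samples, train_labels) must return a function that maps one
    sample to a predicted label (placeholder interface for this sketch).
    """
    folds = ten_fold_indices(len(samples), k, seed)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        model = train([samples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        predictions = [model(samples[i]) for i in test_idx]
        scores.append(precision_recall([labels[i] for i in test_idx], predictions))
    avg_precision = sum(p for p, _ in scores) / k
    avg_recall = sum(r for _, r in scores) / k
    return avg_precision, avg_recall
```

Each index lands in exactly one bucket, so every labeled behavior is used for testing once and for training in the other nine rounds.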
RQ2: How do activation events and context factors in our context definition contribute to the malware identification?
To answer RQ2, we compute the results of AppContext in different cases that leverage only a subset of the features listed in Table I. We apply the same supervised learning approach used in RQ1 with the following feature sets: activation events (the row of Activation Events), context factors (the row of Context Factors), behavior information and activation events (the row of B. & E.), behavior information and context factors (the row of B. & F.), and activation events and context factors (the row of E. & F.). We compare the results of the analysis in Table III and Table IV.
RQ3: How accurate is our static analysis in inferring contexts?
To evaluate the effectiveness of the collected contexts, we dynamically verify the security-sensitive method being invoked by triggering the activation events and configuring context factors based on the contexts. The execution path that is triggered by the activation events may differ based on the different values of the context factors. In this evaluation, we use only open-source apps as the subjects. The main reason is that these apps come with source code, which can be used to easily infer the correct values of context factors where activation events could invoke the permission uses. AppContext is applied on 11 open-source apps to generate contexts and the analysis time is logged.
EMPIRICAL STUDY (on effectiveness of contexts in manual inspection)
We selected 250 malicious apps from the Genome malware set. These malicious apps cover the majority of existing Android malware families in 2011, which are rapidly evolving to circumvent detection by various mobile security software. As described in Section 6.3, we exclude malicious apps with no corresponding app descriptions. For market apps, we started by downloading the top 500 apps from each of the 33 categories on the Google Play Store in January 2013. We then randomly selected 50 applications from each category, reselecting an app if it was greater than 5 MB in size. We limited our analysis to applications smaller than 5 MB because the analysis runs out of memory on large applications. Our final analysis dataset also excluded several apps that caused FlowDroid to throw exceptions. Our final market dataset contains 726 applications.
Study 1. Identifying Suspicious Requested Permissions
We evaluate the effectiveness of AppContext in identifying suspicious requested permissions using two metrics: (1) the percentage of additional suspicious permissions identified based on contexts extracted by AppContext in comparison to suspicious permissions identified based on requested permissions; (2) the percentage of permissions classified correctly based on contexts extracted by AppContext. For evaluation purposes, two authors of this paper follow a standard procedure that mimics the process of app reviewing to perform two rounds of manual inspection. For each of the requested permissions in malicious apps and market apps, we manually inspect and classify the requested permission as suspicious or benign. In the first round, we classify requested permissions based on permission lists. In the second round, we refer to contexts when classifying requested permissions. The classification results are individually verified by the two authors serving as reviewers, and another author eliminates the inconsistencies between the results by communicating with the two reviewers.
Permission-based classification. In the first round of manual inspection, we classify the requested permissions of all of the apps without using the contexts. We check the app's permission list against the app description, app name, app category, and the common functionalities of apps of the same kind. If none of the functionalities is expected to leverage a permission, we classify the requested permission as suspicious.
Context-based classification. In the second round of manual inspection, we examine the contexts of the permission uses to identify suspicious uses. As a permission could be used in different contexts, each requested permission may have multiple permission uses. A requested permission is marked as benign only when all of its permission uses are benign. A permission use is classified based on whether the app's functionalities can potentially justify the timeframe of using the permission. The timeframe in which an app could use a permission is determined by the activation events and execution settings of the permission use. For each permission use, if none of the functionalities can justify the use of the permission at the point in time when the activation events occur, the permission use is marked as suspicious. For each permission use whose execution setting is continued, if none of the functionalities can justify the app using the permission after users exit the app, the permission use is marked as suspicious.
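The aggregation rule in this paragraph (a requested permission is benign only when every one of its uses is benign) can be stated compactly. The sketch below is illustrative only: the per-use labels still come from manual inspection, and the function name and the `"unused"` label for permissions with no uses in the code are choices made for this example.

```python
def classify_requested_permission(use_labels):
    """Combine per-use labels ('benign' or 'suspicious') into one label.

    Returns 'unused' when the permission has no uses in the code, so that
    such permissions can be handled separately rather than counted as benign.
    """
    if not use_labels:
        return "unused"
    if all(label == "benign" for label in use_labels):
        return "benign"
    return "suspicious"


labels = {
    "INTERNET": classify_requested_permission(["benign", "suspicious"]),
    "READ_CONTACTS": classify_requested_permission(["benign", "benign"]),
    "CAMERA": classify_requested_permission([]),
}
```

A single suspicious use is enough to flag the whole requested permission, which makes the rule conservative from a reviewer's point of view.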
Table 3 compares the results of permission-based classification and context-based classification. It is worth noting that a few of the requested permissions were never used in the code. Overprivilege is common in Android apps. We identify overprivilege by comparing the sets of permissions in the two classifications. We exclude requested permissions in permission-based classification if these permissions have no corresponding permission uses in context-based classification.
Validity of context-based classification. We evaluate the validity of our classification results to confirm the reasonableness of our classification practices and the quality of the collected contexts. We do so by verifying whether the results conform to the app reviewing decisions on the current Google Play Store. In March 2014, we crawled Google Play for all 42 apps classified as suspicious and 20 randomly selected apps in the benign set. For each app, if the app has already been removed from the store, we mark the app as suspicious. For each permission that was marked as suspicious in the context-based classification, if the permission is no longer requested by the app, we mark the permission as suspicious. We also use android-market-api to download all the available comments of the apps. We use keyword matching to highlight the comments mentioning that the app is suspicious, and we manually confirm those comments. If in the end over half of the comments mention that the app is suspicious, we mark the app as suspicious. Table 4 shows our evaluation results.
Study 2. Context Patterns
We study the patterns of contexts to examine the effectiveness of context information in assisting automated detection techniques to differentiate suspicious permission uses and benign permission uses.
The study is based on the result of context-based classification, which divides the permission uses into two sets (the set of benign contexts and the set of suspicious contexts). In this study, we define a context pattern as the pattern of contexts that can differentiate the benign and suspicious uses of a permission. For each set, if a context occurs more than 10 times for a permission and more than 80% of the permission uses with this context are in one of the sets of the context-based classification, we mark the context as a context pattern. As the characteristics of the market apps and malicious apps could be different, we also differentiate the permission uses of market apps and malicious apps in each set, and mark the frequent contexts and context patterns respectively.
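The two thresholds in this rule translate directly into a counting pass. The sketch below assumes each permission use is available as a `(permission, context, label)` triple; that input format and the function name are assumptions for illustration.

```python
from collections import Counter


def context_patterns(uses, min_count=10, min_share=0.8):
    """Map (permission, context) to the label it predicts.

    A context qualifies when it occurs more than `min_count` times for the
    permission and more than `min_share` of those uses carry the same label.
    """
    totals = Counter()
    by_label = Counter()
    for permission, context, label in uses:
        totals[(permission, context)] += 1
        by_label[(permission, context, label)] += 1

    patterns = {}
    for (permission, context, label), count in by_label.items():
        total = totals[(permission, context)]
        if total > min_count and count / total > min_share:
            patterns[(permission, context)] = label
    return patterns


# Invented data: one context that qualifies and two that do not.
uses = (
    [("INTERNET", "on USB disconnect", "suspicious")] * 11   # 11 uses, 100% suspicious
    + [("INTERNET", "on button click", "benign")] * 10       # only 10 uses: too few
    + [("INTERNET", "on app start", "benign")] * 16          # 16 of 20 = exactly 80%
    + [("INTERNET", "on app start", "suspicious")] * 4
)
patterns = context_patterns(uses)
```

Both comparisons are strict, so a context with exactly 10 occurrences or exactly an 80% share is not marked as a pattern.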
We present an example use case that synthesizes natural language sentences from AppContext's output.
Natural Language Descriptions of Contexts.
We synthesize natural language sentences from the permission contexts extracted by AppContext to generate human-readable text to assist app reviewers. We synthesize a natural language sentence for each permission context extracted by AppContext using template-based generation. Each sentence describes the characteristics of (1) the activation event; (2) the behavior performed by the permission invocation; (3) whether or not the behavior occurs in the background; and (4) the name of the permission required to perform the behavior. Sentences are generated using the following template (variables begin with "$"):
When $EVENT, the application may $BEHAVIOR $BACKGROUND (using $PERMISSION).
where $EVENT is a description of the activation event; $BEHAVIOR is a description of a behavior performed; $BACKGROUND indicates whether the behavior occurs in the background or foreground; and $PERMISSION is the name of the permission needed to perform the behavior.
We manually wrote descriptions for events and behaviors that were encountered by AppContext by reading the API documentation. The synthesized sentences are both syntactically and semantically correct because the values of each variable naturally adhere to a specific syntactic structure.
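Because the template already uses `$`-prefixed variables, it maps naturally onto Python's `string.Template`. The sketch below is one possible way to fill it, not the generator used for the study; the `describe` helper and the "in the foreground" wording are assumptions.

```python
from string import Template

SENTENCE = Template(
    "When $EVENT, the application may $BEHAVIOR $BACKGROUND (using $PERMISSION)."
)


def describe(event, behavior, in_background, permission):
    """Fill the sentence template for one permission context."""
    return SENTENCE.substitute(
        EVENT=event,
        BEHAVIOR=behavior,
        # The foreground phrasing is a guess; only background sentences
        # appear in the example output.
        BACKGROUND="in the background" if in_background else "in the foreground",
        PERMISSION=permission,
    )


sentence = describe(
    event="the device is disconnected from a USB cable",
    behavior="retrieve the device ID",
    in_background=True,
    permission="android.permission.READ_PHONE_STATE permission",
)
```

`substitute` raises `KeyError` if a variable is left unfilled, which helps catch events or behaviors that have no hand-written description yet.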
Study 3. Synthesized Context Description for Permission Contexts
We also conduct a two-step study on the context descriptions synthesized from 10 malicious apps, with two expert users of mobile apps (two authors of this paper). First, these two expert users are given a form (form 1) containing the names, the app descriptions, and the permissions of the selected apps. These two expert users then mark each permission claimed by the apps as reasonable or not and provide their installation decisions for each app. Second, these two expert users are given another form (form 2) that contains the same information but with the synthesized context descriptions for each permission. These expert users then mark the permissions and provide installation decisions again.
Synthesized Context Description of an Example App:
App Name: 万阅公寓 (Bookworm Apartment)
App Description: 想得到关于2012的最新最爆的内部消息吗？ 想深入探究未解之谜的奥妙吗？ 想体验最 惊悚最灵异的事件吗？ 想了解校园的最新动态吗？ 想成为一名人见人爱的笑话大师吗？ 想一边 欣赏着俊男靓女一边体验着 最新潮的服饰吗？ 想每天学一道美食下班做给你的他/她吗？ 想了解最新最火辣的影讯影评吗？想做一名 让上司信任让同事喜欢的职场达人吗？ 想知道如何讨好你的星座恋人吗？ 史上最有内涵！最深入！最灵异！最博学！ 最多囧文的阅读软件《万阅公寓》上市啦
(Do you want the latest internal explosion news about the 2012? Do you want to delve into the mystery of mysteries? Do you want to experience the most supernatural thriller event? Do you want to know the latest developments on campus? Do you want to be a cute joke master? Do you want experience the latest wave of clothing while enjoying the looks of beautiful faces? Do you want to learn to give your special him / her a gourmet every day? Do you want to know the latest hottest movie information? Do you want to become a person that boss trust and colleagues like? Do you want to know how to please your constellation lover? The most in-depth! The supernatural! Most knowledgeable reading software of all time with most connotation articles and jokes! "Bookworm Apartment" now goes on the market!! ) (Based on Google Translate)
Synthesized Context Description:
When the device is disconnected from a USB cable, the application may retrieve connection status information about a particular network type in the background (using android.permission.ACCESS_NETWORK_STATE permission).
When the device is disconnected from a USB cable, the application may retrieve the device ID in the background (using android.permission.READ_PHONE_STATE permission).
When the device is disconnected from a USB cable, the application may retrieve the unique subscriber ID of the device in the background (using android.permission.READ_PHONE_STATE permission).
When the device is disconnected from a USB cable, the application may open a connection to a URL in the background (using android.permission.INTERNET permission).