Figure 1. The feature model of malware
We identify the common building blocks for Android malware as attack features and evasion features, and propose to use feature model as the meta model to capture the malware. Different from focusing on requirements in malware ontology analysis, we consider and maintain the traceability between the features and their corresponding code in our model. One partial feature model of malware of privacy leakage in GENOME is shown in Fig.1. Note that feature-oriented domain analysis relies much on the domain knowledge of security experts, and only feature and their code relevant to attacks are manually modularized. We totally identify 266 attack features and 14 evasion features. Note that our feature model is not a complete one, but can be extended to covering new attack behaviors and evasion methods.
Table 1. AFs and EFs in the feature model of Android malware
In addition, interested readers can download the dependency relationship among features: dependency.conf
The selected features of a feature model is encoded using an array-based chromosome. Given a chromosome of length n, array indices are numbered from 0 to n −1 . Each feature (no matter AF or EF) is assigned with an array index starting from 0. Each value on the chromosome is z i such that zi ∈ Z ∧ zi ∈ {0,1} , where 0 (resp. 1) represents the absence (resp. presence) of the feature. Given a feature model M, we define a function f(M) : Fea(M) → {Z,⊥} that maps each feature f of the feature model M to an array index. f M ( f1 ) = ⊥ denotes that there is no array index that is assigned for the feature f1. Similarly, we define fM−1: Z → Fea(M) as a function that maps a given array index to the feature it represents. Thus, gene crossover is just the array exchange at a certain index, and gene mutation is bit flipping of the value at a certain index of the array.
To serve as the goals of malware generation, we propose three objective functions in the evolution of malware: aggressiveness, evasiveness and detectability. As the results of the arms race, malware
are getting more aggressive with minimum evasion features needed, but less detectable. Given a chromosome x, we represent it as a bit vector of attack and evasion features, where { fa1 ... fan} denotes the set of n attack features and { fe1 ... fem } denotes the set of m evasion features. The objective functions are defined as follows:
where || fai|| returns 1 if fai is selected and returns 0 if not.
Definition 1 Aggressiveness means the severity of damages that malware may cause to users, which is measured by the number of contained AFs and formally defined as
F1(x) = ∑ ni=1 || fai || (1)
Definition 2 Evasiveness means the efforts to hide the malicious intent and evade the detection. Attackers want to minimize such effects of using evasion techniques to evade detection. It is measured by the number of contained EFs and defined as follows
F2(x) = ∑mi=1||fei|| (2)
where kf ei k returns 1 if f ei is selected and returns 0 if not.
Given a chromosome x and the set of AMTs Sd, { d1 ... dt}, we have the following definition for the last objective function:
Definition 3 Detectability means the difficulty in detecting the malware. It can be measured by detection results of AMTs and defined as follows
F3(x) = (∑|Sd|i=1Di(x)) / | Sd| (3)
where Di(x) returns 1 if the malware of x is detected by the tool di and |Sd| denotes the number of the tools.
Rather than encode three objectives into one weighted fitness function, we treat all the three objectives equally and solve Multi-objective Optimization Problems (MOPs) using the Pareto dominance relation. A k-objective optimization problem could be written in the following form 2 (in our case, k = 3):
Minimize F = (F1(x),F2(x),...,Fk(x)) (4)
where F is a k-dimensional objective vector and Fi(x) is the value of F for ith objective.
Definition 4 Given two chromosomes x, y ∈ Bn and an objective vector F : Bn → Rk, x dominates y ( x ≺ y) if
∀i ∈ {1,...,k} Fi(x) 6 <= Fi(y) (5)
∃j ∈ {1,...,k} Fj(x) < Fj(y) (6)
otherwise x !≺ y
Definition 5 Given chromosomes x and a set of chromosomes Sx, x is non-dominated iff
∀xi ∈ Sx xi!≺ x (7)
Algorithm 1 shows how IBEA guides the feature selection. The input of this algorithm is the feature model of Android malware, the number of iterations for the generation process, and the population size. In the beginning, it creates the initial population, randomly selecting features in the feature model (line 4 to 7), otherwise we can generate new malware derived from the existing malware (line 9 to 15). First, it selects two candidates with a probability and do an ibea crossover operation (line 11) to generate a new malicious app. Second, it selects one candidate in a probability and do an
ibea mutation operation (line 14) to generate a new malicious app. Once the generation is created, all apps in this generation are evaluated by the proposed three objectives to get the fitness value (line 16,17). The algorithm utilizes the Pareto dominance relation to remove malware that is dominated by others in the evaluation (line 18 to 20). The algorithm will stop once the selected features converges (line 21) or exceeds the maximal times of iteration (line 2).