Our structural attack aims to modify malware so that it evades the target system's detection with minimal perturbation, i.e., by modifying the fewest edges and nodes. The attack considers four types of attack actions, and the decision for each action consists of determining an action type and selecting the attack objects.
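To make this decision structure concrete, the following Python sketch models an attack action as a pair of an action type and its attack objects on the function call graph; the class and field names are illustrative assumptions rather than HRAT's actual implementation.

\begin{verbatim}
from dataclasses import dataclass
from enum import Enum
from typing import Tuple

class ActionType(Enum):
    # The four structural attack actions considered by our attack.
    ADD_EDGE = 0      # add one call edge between two existing nodes
    REWIRE = 1        # redirect an existing call edge
    INSERT_NODE = 2   # add one node together with one connecting edge
    DELETE_NODE = 3   # remove one node and reconnect its callers to its callees

@dataclass
class AttackAction:
    # One attack decision: which action type to apply, and on which
    # objects (node/edge identifiers in the call graph) to apply it.
    action_type: ActionType
    objects: Tuple[int, ...]  # e.g., (caller_id, callee_id) for ADD_EDGE
\end{verbatim}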
A greedy algorithm, which chooses locally optimal solutions and combines them to approximate the globally optimal solution, would always select adding an edge at each step, because adding an edge modifies only one edge, whereas rewiring modifies two edges, inserting a node modifies one edge and one node, and deleting a node modifies at least one node and one edge. However, adding edges alone cannot always achieve the optimal perturbation. For example, if decreasing the degree centrality of certain nodes is locally optimal for the current state, adding edges cannot achieve it, because adding an edge can only increase degree centralities. Moreover, our experimental results (see~\ref{exp:single action}) also demonstrate that adding edges alone cannot achieve the desired attack performance.
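As a toy illustration of why a purely modification-count-based greedy criterion degenerates to edge addition, the sketch below ranks action types by the per-action modification counts listed above; the cost values simply restate this paragraph (node deletion is shown with its minimum cost) and are illustrative only.

\begin{verbatim}
# Per-action modification counts, as stated above; node deletion is
# listed with its minimum cost of one node plus one edge.
MODIFICATION_COST = {
    "add_edge": 1,     # one edge added
    "rewire": 2,       # two edges modified
    "insert_node": 2,  # one node and one edge added
    "delete_node": 2,  # at least one node and one edge removed
}

def greedy_choice() -> str:
    """A greedy attacker that only minimizes per-step modifications
    always picks the cheapest action type, i.e., adding an edge."""
    return min(MODIFICATION_COST, key=MODIFICATION_COST.get)

assert greedy_choice() == "add_edge"
\end{verbatim}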
Evolutionary algorithms (EA) are effective for problems whose policy spaces (i.e., the set of attack actions) are small or can be structured~\cite{sutton2018reinforcement}. In our scenario, the policy space is enormous (it grows with the number of nodes and edges in the target graph) and cannot be structured, because each attack action consists of a) selecting an action type and b) determining the attack objects. The determination of b) depends on a), and a) in turn is related to the previous actions (S.3).
HRAT therefore uses reinforcement learning, specifically deep Q-learning, to determine the attack actions on the target graph. Our policy model consists of two parts: the first part uses a neural network to learn the relation between the state and the action type, with a loss function based on the reward. Then, given the determined action type, HRAT uses gradient search to determine the attack objects. Moreover, to evaluate the effectiveness of each attack action, our reward function, i.e., Equation (7), is designed as the negative of the number of modified nodes and edges, so that it captures the impact of both the action type and the attack objects. For example, adding one edge yields a reward of -1, since only one edge is added, no matter which edge the gradient search selects. For rewiring, the reward is -3, because one edge is removed and two edges are added. For inserting a node, the reward is -2, because one node and one edge are inserted. For deleting a node, the reward depends on the node that the gradient search selects: deleting a node removes it from the target graph and builds connections from each of its callers to all of its callees, so the reward value depends on the number of the deleted node's callers and callees.
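For concreteness, the sketch below reproduces the reward accounting described above, assuming the call graph is a directed graph with networkx-style predecessor/successor queries; the exact edge counting in the node-deletion case, as well as all function and variable names, are our own illustrative assumptions rather than HRAT's actual implementation of Equation (7).

\begin{verbatim}
def action_reward(action_type: str, graph, node=None) -> int:
    """Reward of one attack action: the negative number of modified
    nodes and edges, following the values discussed above."""
    if action_type == "add_edge":
        return -1            # one edge added, regardless of which edge
    if action_type == "rewire":
        return -3            # one edge removed, two edges added
    if action_type == "insert_node":
        return -2            # one node and one edge inserted
    if action_type == "delete_node":
        # Deleting a node removes the node and its incident edges, and adds
        # one edge from each caller to each callee, so the reward depends on
        # how many callers and callees the selected node has.
        callers = list(graph.predecessors(node))
        callees = list(graph.successors(node))
        removed = 1 + len(callers) + len(callees)   # node + incident edges
        added = len(callers) * len(callees)         # reconnecting edges
        return -(removed + added)
    raise ValueError(f"unknown action type: {action_type}")
\end{verbatim}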