Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks

Abstract

Vulnerabilities are the root cause of cyber threats; thus vulnerability identification is crucial to protect software systems from attacks. However, vulnerability discovery is a challenging and tedious process that also requires specialized security expertise. In this paper, we propose a graph neural network based approach to automatically learn potential vulnerable patterns from source code at the finest, function level and detect vulnerabilities end to end. To cover various types of vulnerabilities, we combine a rich set of code semantic representations and the natural sequence of source code into a joint graph structure. To efficiently exploit the learned individual node representations for graph-level classification, we enhance the standard classification method with CNNs to extract important nodes and features. We extensively evaluate our approach on manually labeled data sets built from 4 diversified open-source C projects. The results demonstrate that our approach significantly outperforms the state of the art with an average 10.52% higher accuracy, of which the improved classifier contributes an average 4.66% accuracy gain. Meanwhile, it corroborates the advantage of the comprehensive program semantics over individual representations with an average 1.57% accuracy gain.

Datasets Introduction

Due to the space limit in the paper, we could not describe this section in more detail, so we supply the details here as supplementary material. We use 4 popular and diversified open-source projects: Linux, FFmpeg, Qemu, and Wireshark, to check whether our method can predict as many types of vulnerabilities as possible.

These four projects provide different functionalities and thus tend to have diversified types of vulnerabilities, and diversified root causes even for the same type of vulnerability; hence we use them as our base datasets. 1) The Linux kernel provides the essential services for all parts of the operating system and handles process and memory management, file systems, devices, and networking. That is why most vulnerabilities in Linux are memory corruptions and race conditions. 2) Qemu is a hosted virtual machine monitor that performs hardware virtualization. Its major types of vulnerabilities are DoS and code execution caused by overflows. 3) Wireshark is an open-source packet analyzer used for network troubleshooting, analysis, software and communications protocol development, and education. Its major types of vulnerabilities are memory leaks, DoS caused by crafted packets, and overflows. 4) FFmpeg consists of a vast software suite of libraries and programs for handling video, audio, and other multimedia files and streams. Its major type of vulnerability is overflow.

Due to the enormous number of commits in the Linux repository, only commits from 2016 and 2017 are admitted to balance the dataset. In total we have 48,687 commits with confirmed labels: 12,811 from Linux, 13,962 from FFmpeg, 11,910 from Qemu, and 10,004 from Wireshark (shown in Table 1 Data Sets Overview).

Commits Filtering

As most commits are security unrelated, we use a set of security-related keywords and regular-expression matching to exclude unrelated commits. We pick out the crucial words that frequently appear in commit messages. The word list is shown in Table 2 Keywords in Filtering Process.

We can separate this keyword list into two main categories: library-specific and general keywords. The general list contains some obvious vulnerability indicators such as divide by zero, credential hack, and crash. In the library-specific list, keywords such as KASAN or oops also indicate a vulnerability.
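The keyword-and-regular-expression filtering can be sketched as follows. This is a minimal illustration, not the exact filter used for the paper: the keyword lists below are a small hypothetical subset, since the full Table 2 word list is not reproduced here.

```python
import re

# Hypothetical subset of the Table 2 keywords, split into the two
# categories described above (general vs. library-specific).
GENERAL_KEYWORDS = ["overflow", "divide by zero", "credential", "crash"]
LIBRARY_KEYWORDS = ["kasan", "oops"]

# One case-insensitive pattern matching any keyword in either list.
PATTERN = re.compile(
    "|".join(re.escape(k) for k in GENERAL_KEYWORDS + LIBRARY_KEYWORDS),
    re.IGNORECASE,
)

def is_security_related(commit_message: str) -> bool:
    """Keep a commit only if its message matches at least one keyword."""
    return PATTERN.search(commit_message) is not None
```

For example, a commit message like "Fix buffer overflow in demuxer" passes the filter, while an unrelated message such as "Update build instructions" is excluded.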

Manual Labeling

The manual labeling process consists of two major steps:

  • Initial verification. A group of three security researchers (bachelor's degree holders) hired by our industrial collaborator examine each filtered commit. The collaborator is a security company that focuses on vulnerability identification and management. All researchers are professional security experts who have reported CVEs (Common Vulnerabilities and Exposures) of their own to the U.S. NVD (National Vulnerability Database). We ensure every commit is checked by two different security researchers.

  • Final confirmation. If the labels given to a commit by the two researchers differ or include unsure labels, the commit is forwarded to a senior researcher (PhD) for verification.

After the two-round verification, if a commit remains unsure, it is kept out of the data set.
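The two-round resolution logic can be sketched in a few lines. The label names below are hypothetical placeholders for illustration, not the actual annotation schema.

```python
from typing import Optional

# Hypothetical label values; "unsure" marks a commit a researcher
# could not decide on.
VULN, CLEAN, UNSURE = "vulnerability-fix", "not-a-fix", "unsure"

def resolve_label(first: str, second: str,
                  senior: Optional[str] = None) -> Optional[str]:
    """Two-round labeling: agreement between the two initial
    researchers wins; otherwise the senior researcher decides; a
    still-unsure commit is dropped (returns None)."""
    if first == second and first != UNSURE:
        return first
    # Disagreement or an unsure label: escalate to the senior researcher.
    if senior is not None and senior != UNSURE:
        return senior
    return None  # kept out of the data set
```

This mirrors the two steps above: only commits with a confident, agreed-upon label survive into the data set.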

At the end of the manual validation step, we obtained 23,355 vulnerability-fix commits in total. Assuming each commit fixes one vulnerability (there are rare cases where one commit fixes multiple vulnerabilities), there would be roughly 23k vulnerabilities. We have not manually classified the types of vulnerabilities yet, because doing so is highly costly and sometimes requires root cause analysis on the source code. But, roughly judging from the commit messages, we found that most vulnerabilities are memory-related, such as buffer overflows, memory leaks, crashes, and corruptions, mainly because the 4 projects are written in C/C++. The types of vulnerabilities are also project-related. For example, in the Linux kernel, many vulnerabilities are crashes and race conditions.

Partial Dataset

We release two projects, FFmpeg and Qemu, which contain raw csv data and functions extracted from the commits.

  • The csv file contains three columns: sha_id, project and vulnerability (a label indicating whether the commit is a vulnerability fix or not).

  • The function dataset is constructed from the commit id list and contains the following fields:

    • commit_id

    • project

    • func

    • target: 0 or 1 (vulnerability or not)
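Loading the released csv can be sketched with the standard library. The column names follow the description above; the sha values and rows are made-up examples, not entries from the actual release.

```python
import csv
import io

# Hypothetical sample mimicking the released csv layout
# (sha_id, project, vulnerability).
sample = io.StringIO(
    "sha_id,project,vulnerability\n"
    "973420f6,FFmpeg,1\n"
    "a1b2c3d4,Qemu,0\n"
)

def load_labels(fh):
    """Map each commit sha to its vulnerability-fix label (0 or 1)."""
    return {row["sha_id"]: int(row["vulnerability"])
            for row in csv.DictReader(fh)}

labels = load_labels(sample)
```

The function dataset with the commit_id / project / func / target fields can be read the same way, with `target` serving as the per-function label.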




Table 2: Keywords in Filtering Process

Node Representation in details

The node features are the representations of the nodes in the graphs. Devign uses a GGNN to learn the node representations. The initial node representation is briefly explained in the last paragraph of Section 3.1 (i.e., the last paragraph in the left column of page 5). It comprises two parts: one encodes the node type (as illustrated in Figure 1), and the other encodes the semantic meaning of the source code. For example, the code statement ‘a = a + b’ is an Assignment node, while the code token ‘a’ is an Identifier node. The node types are encoded using label encoding. For the semantic meaning, we tokenize the source code into tokens and use word2vec to learn an initial embedding for each token. The statement ‘a = a + b’ is parsed as <a, Operator_assignment, a, Operator_add, b>, so the embedding of the node for ‘a = a + b’ is the average of the embeddings of these 5 tokens. The embedding of the node ‘a’ is the learned embedding of the token a in the source code.

Figure: Node representation of the statement 'a = a + b'.
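The initial node representation described above can be sketched as follows. The token embeddings and dimensionality here are tiny stand-ins; in Devign they come from word2vec trained on the tokenized source code, and the node-type ids from label encoding over the node types in Figure 1.

```python
# Hypothetical 3-dimensional "word2vec" embeddings for the tokens of
# 'a = a + b'; in practice these are learned from the whole corpus.
TOKEN_EMB = {
    "a": [0.1, 0.2, 0.3],
    "b": [0.4, 0.0, 0.2],
    "Operator_assignment": [0.0, 0.1, 0.0],
    "Operator_add": [0.2, 0.2, 0.1],
}
# Label encoding of node types (subset, for illustration).
NODE_TYPE_ID = {"Assignment": 0, "Identifier": 1}

def node_embedding(node_type: str, tokens: list) -> list:
    """Initial node representation: the label-encoded node type
    concatenated with the average of the tokens' embeddings."""
    dim = len(next(iter(TOKEN_EMB.values())))
    avg = [sum(TOKEN_EMB[t][i] for t in tokens) / len(tokens)
           for i in range(dim)]
    return [float(NODE_TYPE_ID[node_type])] + avg

# The Assignment node for 'a = a + b' averages its 5 parsed tokens;
# the Identifier node 'a' just reuses the embedding of token 'a'.
stmt_node = node_embedding(
    "Assignment", ["a", "Operator_assignment", "a", "Operator_add", "b"])
ident_node = node_embedding("Identifier", ["a"])
```

Averaging keeps the statement-node embedding in the same vector space as single-token nodes, which is what lets the GGNN propagate over both kinds of nodes uniformly.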