Deep Understanding of Commits for Automated Vulnerability Identification

Abstract

Vulnerabilities are the root cause of security threats. Protecting software systems against cyber attacks depends crucially on detecting both unknown and known vulnerabilities. Although the National Vulnerability Database (NVD) keeps publishing identified vulnerabilities, the vast majority of vulnerabilities, even when fixed, are never publicly disclosed, e.g., in the open-source libraries that developers rely on heavily.

To efficiently find these private vulnerabilities at large scale and low cost, we pioneer the study of a deep neural-network-based approach built upon commits of open-source repositories.

First, we design and build commit-vulnerability datasets that include 48,687 security-related, manually labeled commits from four large-scale C libraries.

Based on the datasets, we devise and implement a deep learning-based vulnerability detection system that consists of two composite neural networks: a commit-message neural network that uses pre-trained word representations learned from massive unlabeled commits, and a code-revision neural network that takes the code before and after the revision as separate inputs and learns the difference between them at the statement level.

Our system leverages the power of the two networks for vulnerability identification.

Evaluation results show that our system significantly outperforms the state of the art. On the combined dataset, it achieves a recall as high as 90.66% and a precision of 87.31%. The F1 score is 6.07% and 12.92% higher than that of the best two benchmarks.
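For readers who want to relate the reported precision and recall to an F1 score, the standard definition (the harmonic mean of the two) can be sketched as follows. The function name `f1_score` is chosen here for illustration, and the resulting value is derived from the two figures above rather than quoted from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported for the combined dataset.
f1 = f1_score(0.8731, 0.9066)
print(f"F1 = {f1:.4f}")
```

With these inputs the F1 score comes out at roughly 0.89.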

Moreover, we deployed the pipeline and the trained model in an industrial production environment, and observations on 278,081 commits from 366 new libraries demonstrated the effectiveness of our solution.

Datasets

Our team has decided to release part of the dataset that was used in the paper and experiments for the task of classifying commits as vulnerability fixes.

The dataset consists of commits that were crawled from projects such as Wireshark, QEMU, FFmpeg and Linux. These commits have been labelled by researchers, and each label indicates whether the commit is a vulnerability fix or not. To enable a better understanding of the paper, we have released the dataset for two of the projects.

The CSV has three columns: commit_msg, patch, and vulnerability.

commit_msg: The commit message of the particular commit.
patch: The diff patch file of the commit.
vulnerability: Label that indicates whether the commit is a vulnerability fix or not.
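As a minimal sketch of how the three columns might be consumed, the snippet below parses a small in-memory sample with Python's standard csv module and separates the removed and added lines of each patch. Several things here are assumptions rather than facts about the released file: the sample rows are made up, the label encoding (1 for a vulnerability fix, 0 otherwise) is a guess, the patch is assumed to be a unified diff, and the helper names `load_commits` and `split_patch` are hypothetical.

```python
import csv
import io

# Hypothetical two-row sample in the three-column layout described above.
# Assumption: labels are encoded as 1 (vulnerability fix) and 0 (not a fix).
SAMPLE_CSV = '''commit_msg,patch,vulnerability
"fix buffer overflow in parser","--- a/p.c
+++ b/p.c
@@ -1,2 +1,2 @@
-char buf[8];
+char buf[64];
 int n;",1
"update README","--- a/README
+++ b/README
@@ -1 +1 @@
-old text
+new text",0
'''

def load_commits(handle):
    """Read all rows into dicts keyed by the CSV header."""
    return list(csv.DictReader(handle))

def split_patch(patch):
    """Return (removed, added) lines of a unified diff.

    File header lines ('---' / '+++') are skipped. This simple check would
    also skip a changed line whose own content begins with '--' or '++'.
    """
    removed, added = [], []
    for line in patch.splitlines():
        if line.startswith(("---", "+++")):
            continue
        if line.startswith("-"):
            removed.append(line[1:])
        elif line.startswith("+"):
            added.append(line[1:])
    return removed, added

rows = load_commits(io.StringIO(SAMPLE_CSV))
fixes = [r for r in rows if r["vulnerability"] == "1"]
removed, added = split_patch(fixes[0]["patch"])
print(len(rows), "commits,", len(fixes), "labelled as fixes")
print("removed:", removed)
print("added:", added)
```

To run this against the downloaded file, pass a handle opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample. If some patches are very large, the csv module's field size limit may need to be raised with `csv.field_size_limit`.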

The link below contains part of the labelled dataset.

Please download the data
