JavaScript dataset with software vulnerabilities
Authors
Ilias Kalouptsoglou, Miltiadis Siavvas, Dionysios Kehagias, Alexandros Chatzigeorgiou, and Apostolos Ampatzoglou
Abstract
Software security is a very important aspect for software development organizations that wish to provide high-quality and dependable software to their customers. A crucial part of software security is the early detection of software vulnerabilities. Vulnerability prediction is a mechanism that facilitates the identification (and, in turn, the mitigation) of vulnerabilities early in the software development cycle. The scientific community has recently focused considerable attention on developing deep learning models that use text mining techniques to predict the existence of vulnerabilities in software components. However, there are also studies that examine whether statically extracted software metrics can lead to adequate vulnerability prediction models. In this paper, both software metrics-based and text mining-based vulnerability prediction models are constructed and compared. A combination of software metrics and text tokens using deep learning models is examined as well, in order to investigate whether a combined model can lead to more accurate vulnerability prediction. For the purposes of the present study, a vulnerability dataset containing vulnerabilities from real-world software products is utilized and extended. The results of our analysis indicate that text mining-based models outperform software metrics-based models with respect to their F2-score, whereas enriching the text mining-based models with software metrics was not found to provide any added value to their predictive performance.
Keywords
vulnerability prediction; dataset extension; software metrics; text mining; machine learning; deep learning; ensemble learning
Dataset Introduction
We used a dataset supplied by Ferenc et al. (Ferenc, R.; Hegedus, P.; Gyimesi, P.; Antal, G.; Bán, D.; Gyimóthy, T. Challenging Machine Learning Algorithms in Predicting Vulnerable JavaScript Functions. 2019 IEEE/ACM 7th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE). IEEE, 2019, pp. 8–14.) for training and assessing our models. The dataset comprises source code files written in the JavaScript programming language, obtained from real-world open-source software projects hosted on GitHub, and was originally used to develop software metrics-based vulnerability prediction models. The authors gathered vulnerabilities from the Node Security Platform (NSP) and the Snyk Vulnerability Database, both of which are publicly available vulnerability databases. They retained the URL of each file with vulnerabilities and, by traversing these URLs, extracted a set of fixing commits. Using the GitHub API, they merged all the code modifications of these fixing commits into a single patch file. They also identified the earliest commit corresponding to each system's vulnerability fix and took its parent commit. All functions in the parent commit that were affected by the fixing modifications were considered vulnerable, whereas functions that were not affected by the code changes were considered non-vulnerable. The dataset can be found at the following address:
http://www.inf.u-szeged.hu/~ferenc/papers/JSVulnerabilityDataSet/
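The labeling step described above can be sketched as an interval-overlap check: a function in the parent commit is marked vulnerable if any line range changed by the fixing commits overlaps its own line range. The following is a minimal illustration; the function and variable names are our own and do not reflect the authors' actual implementation.

```python
# Hedged sketch: mark a parent-commit function as vulnerable if any
# fixing-commit change overlaps its line range. Names are illustrative.

def label_functions(functions, changed_ranges):
    """functions: list of (name, start_line, end_line) tuples.
    changed_ranges: list of (start_line, end_line) ranges modified
    by the fixing commits in the parent-commit version of the file."""
    labels = {}
    for name, f_start, f_end in functions:
        labels[name] = any(
            c_start <= f_end and c_end >= f_start  # interval overlap test
            for c_start, c_end in changed_ranges
        )
    return labels
```

A function untouched by every fixing change is labeled non-vulnerable, which mirrors the rule stated in the text.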
Dataset Extension
The following figure illustrates the whole dataset construction process. We obtained the dataset provided by Ferenc et al. in CSV format and gathered all the GitHub URLs of the dataset's files. Using these URLs, we collected the source code of these files from GitHub. Subsequently, by utilizing the start and end line information of every function, we extracted the code of each function. Each function was then tokenized to construct a list of tokens per function.
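The slicing and tokenization steps above can be sketched as follows. The regex-based tokenizer is a simplification we introduce for illustration; the actual scripts may use a proper JavaScript lexer.

```python
# Hedged sketch: extract a function body from a source file using its
# 1-indexed, inclusive start/end lines, then tokenize it. The tokenizer
# regex is illustrative, not the authors' actual implementation.
import re

def extract_function(source: str, start: int, end: int) -> str:
    lines = source.splitlines()
    return "\n".join(lines[start - 1:end])

def tokenize(code: str) -> list:
    # Split into identifiers, integer literals, and single punctuation marks.
    return re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*|\d+|\S", code)
```

For example, `tokenize("return a + 1;")` yields the token list `["return", "a", "+", "1", ";"]`.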
To extract text features, we used two text mining techniques: (i) Bag of Words (BoW) and (ii) Sequences of Tokens. As a result, we created a repository containing the source code of all methods, a CSV file with the software metrics extracted by Ferenc et al., the token sequences of each method, and the BoW representation of each method. To improve the generalizability of type-specific tokens, all comments were removed, while all integer and string literals were replaced with two unique placeholder IDs. The dataset contains 12,106 JavaScript functions, of which 1,493 are considered vulnerable.
The tokenized dataset, along with the software metrics, can be found at the following location:
The Python scripts that fetch the source code from GitHub and extract the text features are also provided, in order to facilitate future research and different feature extraction processes:
- data_fetching: downloads the source code of each function of the dataset
- data_cleansing: extracts the sequences of text tokens
- BoW_extraction: extracts the Bag of Words features