Dataset and source code for our publication, in which we examine several Transformer-based pre-trained LLMs on the downstream task of Vulnerability Prediction in software systems.
Authors
Ilias Kalouptsoglou, Miltiadis Siavvas, Apostolos Ampatzoglou, Dionysios Kehagias, and Alexander Chatzigeorgiou
Abstract
The rise of Large Language Models (LLMs) has provided new directions for addressing downstream text classification tasks, such as vulnerability prediction, where segments of source code are classified as vulnerable or not. Several recent studies have employed transfer learning to enhance vulnerability prediction, taking advantage of the prior knowledge of pre-trained LLMs. In the current study, different Transformer-based pre-trained LLMs are examined and evaluated with respect to their capacity to predict vulnerable software components. In particular, we fine-tune BERT, GPT-2, and T5 models, as well as their code-oriented variants, namely CodeBERT, CodeGPT, and CodeT5, respectively. Subsequently, we assess their performance and conduct an empirical comparison between them to identify the most accurate models for vulnerability prediction.
Keywords
Software security, Vulnerability prediction, Transfer learning, Large language models, Transformer
The tokenized dataset with the source code of Python files retrieved from GitHub can be found in the following:
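The exact preprocessing pipeline is defined in the repository; as an illustration only, the sketch below shows how a Python source file can be split into lexical tokens with the standard library's tokenize module (the function name and the choice of kept token types are our assumptions, not the published pipeline).

```python
import io
import tokenize

def tokenize_source(source: str) -> list[str]:
    """Illustrative sketch: split a Python source string into lexical tokens,
    keeping identifiers, operators, numbers, and string literals."""
    kept_types = (tokenize.NAME, tokenize.OP, tokenize.NUMBER, tokenize.STRING)
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in kept_types:
            tokens.append(tok.string)
    return tokens

# Example: a small function body yields a flat token sequence
snippet = "def add(a, b):\n    return a + b\n"
print(tokenize_source(snippet))
# → ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']
```

A token sequence like this is the typical input representation for fine-tuning the code-oriented models listed above, after being mapped to each model's own subword vocabulary.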
It is an extension of the dataset provided by Bagheri and Hegedűs in:
Bagheri, A., Hegedűs, P. (2021). A Comparison of Different Source Code Representation Methods for Vulnerability Prediction in Python. In: Paiva, A.C.R., Cavalli, A.R., Ventura Martins, P., Pérez-Castillo, R. (eds) Quality of Information and Communications Technology. QUATIC 2021. Communications in Computer and Information Science, vol 1439. Springer, Cham. https://doi.org/10.1007/978-3-030-85347-1_20
The source code that implements all the experiments described in the publication is provided in the following GitHub repository:
https://github.com/certh-ai-and-softeng-group/llmBased_vulnPrediction
We also provide the replication package as a zip file, which you may find at the link below: