Source code for our publication, in which we propose a Sequence-to-Sequence software vulnerability localization method.
Authors
Ilias Kalouptsoglou, Miltiadis Siavvas, Apostolos Ampatzoglou, Dionysios Kehagias, and Alexander Chatzigeorgiou
Abstract
The development of secure software systems depends on early and accurate vulnerability identification. Manual inspection is a time-consuming process requiring specialized knowledge. Therefore, as software complexity grows, automated solutions become essential. Vulnerability Prediction (VP) is an emerging mechanism that identifies whether software components contain vulnerabilities, commonly using Machine Learning models trained on classifying components as vulnerable or clean. Although the advances in Deep Learning and Natural Language Processing have fueled promising developments in VP, several challenges persist. A key challenge is the inability of existing models to detect the exact vulnerable code lines. Recent explainability-based approaches attempt to rank the lines based on their influence on the output of the Vulnerability Prediction Models (VPMs). However, they depend on the kind and accuracy of the file or function-level VPMs, inherit possible misleading patterns, and cannot indicate the exact code snippet that is vulnerable nor the number of vulnerable lines. To address these limitations, this study introduces an innovative approach based on fine-tuning Large Language Models on a Sequence-to-Sequence objective to return the vulnerable lines of a given function. Results on the Big-Vul dataset demonstrate the proposed method’s superior localization accuracy, marking an important advancement in fine-grained vulnerability detection.
Keywords
Vulnerability detection, Large Language Models, Sequence-to-Sequence models, Self-attention
The source code that implements all the experiments described in the publication is provided in the following GitHub repository (recommended):
https://github.com/iliaskaloup/LocVul
We also provide the replication package in a zip file. You can find the zip file at the link below:
drive.google.com/file/d/1wEJpsXbK_h3KG8f6sGzQM-2u2d1eggZj/view?usp=sharing
A zip file with the replication package that also includes the produced Vulnerability Detection Models is available at the link below:
mega.nz/file/lalj2RjZ#QnotKhA60rSO-OwcJazMRO39lmAbGXbJEPqj_pBEOU0
Guidelines (Included in the README file as well)
To replicate the analysis and reproduce the results:
Run git clone https://github.com/iliaskaloup/LocVul.git, or download the .zip file from the links above, and navigate to the cloned (or extracted) repository.
Inside the LocVul folder, in the main branch, there is a YAML file:
• torchenv.yml: It defines the Python conda virtual environment (venv) that we used.
The root directory contains 5 Python scripts and 1 folder:
• data_mining.py: It downloads the dataset and saves it as dataset.csv in the folder "data".
• vulnDet_pipeline.py: It fine-tunes CodeBERT for function-level vulnerability prediction and then employs the Self-Attention method to localize the vulnerable lines, evaluating this explainability-based approach.
• Seq2Seq_vulnDet.py: It fine-tunes CodeT5 for line-level vulnerability detection.
• seq2seq_eval.py: It executes both the CodeBERT and CodeT5 models to find vulnerable functions and the vulnerable lines inside them, evaluating the line-level performance of the Seq2Seq approach.
• visualize.py: It produces the bar charts presented in the paper that compare LocVul with the Self-Attention approach.
• jupyter/: It contains the Jupyter notebook equivalents of the Python scripts.
To construct the dataset.csv run:
python data_mining.py
To train the function-level model (CodeBERT) and evaluate the Self-Attention approach (training and inference) run:
python vulnDet_pipeline.py --seed=9 --FINE_TUNE="yes" --model_variation="microsoft/codebert-base" --checkpoint_dir="./checkpoints" --sampling="no" --REMOVE_MISSING_LINE_LABELS="yes" --EXPLAINER="ATTENTION" --EXPLAIN_ONLY_TP="no" --sort_by_lines="yes"
To evaluate the Self-Attention approach (inference only) run:
python vulnDet_pipeline.py --seed=9 --FINE_TUNE="no" --model_variation="microsoft/codebert-base" --checkpoint_dir="./checkpoints" --sampling="no" --REMOVE_MISSING_LINE_LABELS="yes" --EXPLAINER="ATTENTION" --EXPLAIN_ONLY_TP="no" --sort_by_lines="yes"
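As a rough illustration of how an explainability-based approach can turn token-level attention into a ranking of lines, the sketch below sums per-token scores for each line and sorts the lines by their totals. It is not the code in vulnDet_pipeline.py: the function name rank_lines_by_attention, the token-to-line mapping, and the example scores are hypothetical, and the aggregation used in the repository may differ.

```python
from collections import defaultdict


def rank_lines_by_attention(token_lines, token_scores):
    """Rank line indices by the sum of their tokens' attention scores.

    token_lines  -- line index of each token (hypothetical precomputed mapping)
    token_scores -- attention score of each token, same length as token_lines
    """
    if len(token_lines) != len(token_scores):
        raise ValueError("token_lines and token_scores must have equal length")

    line_scores = defaultdict(float)
    for line_idx, score in zip(token_lines, token_scores):
        line_scores[line_idx] += score

    # Highest total first; ties are broken by the original line order.
    return sorted(line_scores, key=lambda idx: (-line_scores[idx], idx))


# Illustrative values only: six tokens spread over three lines.
token_lines = [0, 0, 1, 1, 1, 2]
token_scores = [0.05, 0.10, 0.30, 0.25, 0.10, 0.20]
print(rank_lines_by_attention(token_lines, token_scores))  # [1, 2, 0]
```

A ranking of this kind orders every line of the function but does not say how many of them are vulnerable, which is one of the limitations the Sequence-to-Sequence approach aims to address.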
To train the line-level model (CodeT5) run:
python Seq2Seq_vulnDet.py --seed=9 --FINE_TUNE="yes" --model_variation="Salesforce/codet5-base" --checkpoint_dir="./checkpoints_seq2seq"
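The Sequence-to-Sequence objective trains the model to return the vulnerable lines of a given function. A minimal sketch of how one (input, target) training pair could be assembled is given below, assuming zero-based line indices and a newline-joined target. The helper build_seq2seq_pair and this formatting are hypothetical and are not taken from Seq2Seq_vulnDet.py.

```python
def build_seq2seq_pair(function_code, vulnerable_line_indices):
    """Return (source, target) strings for one training example.

    source -- the whole function, unchanged
    target -- the vulnerable lines, stripped and joined with newlines
    Indices outside the function are ignored (assumed zero-based).
    """
    lines = function_code.splitlines()
    target_lines = [
        lines[i].strip()
        for i in sorted(set(vulnerable_line_indices))
        if 0 <= i < len(lines)
    ]
    return function_code, "\n".join(target_lines)


# Illustrative example only: line 2 is labeled as vulnerable.
func = "void f(char *s) {\n  char buf[8];\n  strcpy(buf, s);\n}"
source, target = build_seq2seq_pair(func, [2])
print(target)  # strcpy(buf, s);
```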
To evaluate the Sequence-to-Sequence approach run:
python seq2seq_eval.py --seed=9 --model_variation="microsoft/codebert-base" --model_variation_seq2seq="Salesforce/codet5-base" --checkpoint_dir="./checkpoints" --checkpoint_dir_seq2seq="./checkpoints_seq2seq" --sampling="no" --REMOVE_MISSING_LINE_LABELS="yes" --ONLY_TP="no" --sort_by_lines="yes" --SIMILARITY_REPLACEMENT="yes"
To evaluate the Sequence-to-Sequence approach without the similar-line replacement mechanism that handles hallucinations, run:
python seq2seq_eval.py --seed=9 --model_variation="microsoft/codebert-base" --model_variation_seq2seq="Salesforce/codet5-base" --checkpoint_dir="./checkpoints" --checkpoint_dir_seq2seq="./checkpoints_seq2seq" --sampling="no" --REMOVE_MISSING_LINE_LABELS="yes" --ONLY_TP="no" --sort_by_lines="yes" --SIMILARITY_REPLACEMENT="no"
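To give an idea of what a similar-line replacement step might look like, the sketch below maps each generated line to its closest line in the input function, so that a hallucinated line is replaced by a line that actually exists. It assumes a character-level similarity from Python's standard difflib module and a cutoff of 0.6. The function name replace_with_similar_lines, the similarity measure, and the cutoff are assumptions for illustration, and the mechanism in seq2seq_eval.py may work differently.

```python
import difflib


def replace_with_similar_lines(generated_lines, function_code, cutoff=0.6):
    """Map each generated line to the most similar line of the function.

    Generated lines with no source line at or above the cutoff are dropped.
    """
    source_lines = [ln.strip() for ln in function_code.splitlines() if ln.strip()]
    aligned = []
    for generated in generated_lines:
        # get_close_matches returns the best candidates, best first.
        matches = difflib.get_close_matches(
            generated.strip(), source_lines, n=1, cutoff=cutoff
        )
        if matches:
            aligned.append(matches[0])
    return aligned


# Illustrative example only: the model output renames "s" to "src".
func = "void f(char *s) {\n  char buf[8];\n  strcpy(buf, s);\n}"
print(replace_with_similar_lines(["strcpy(buf, src);"], func))
# ['strcpy(buf, s);']
```

Mapping outputs back to existing lines keeps the evaluation tied to real line positions in the function. The cost is that a poorly chosen cutoff can either discard useful predictions or accept unrelated lines.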