We adopt three effective ML defenders: Support Vector Machine (SVM), XGBoost, and Random Forest. In particular, following Barushka et al. (2018), we use TF-IDF to extract features from the textual content of the emails and train the ML models to classify phishing emails.
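A minimal sketch of this pipeline using scikit-learn and xgboost is shown below; the toy emails, the hyper-parameters, and the choice of LinearSVC as the SVM are illustrative assumptions rather than the exact settings of our implementation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

# Toy data: email bodies with binary labels (1 = phishing, 0 = legitimate).
train_texts = [
    "Your account has been suspended, verify your password immediately",
    "Meeting moved to 3pm, see the attached agenda",
]
train_labels = [1, 0]

classifiers = {
    "SVM": LinearSVC(),
    "RandomForest": RandomForestClassifier(n_estimators=100),
    "XGBoost": XGBClassifier(n_estimators=100, eval_metric="logloss"),
}

for name, clf in classifiers.items():
    # TF-IDF features over the raw email text, following Barushka et al. (2018).
    model = make_pipeline(
        TfidfVectorizer(lowercase=True, stop_words="english", max_features=50_000),
        clf,
    )
    model.fit(train_texts, train_labels)
    print(name, model.predict(["Click here to reset your password"]))
```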
We adopt both the encoder-only models BERT and RoBERTa, and the decoder-only model GPT-2, to implement the defenders. For the former, we prepend the [CLS] token to the email contents and feed the sequence into the model; the last-layer embedding of the [CLS] token is used as the sentence embedding. For GPT-2, the embedding of the end-of-sentence (EOS) token is used as the sentence embedding instead.
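The following sketch shows how such a sentence embedding can be extracted with the Hugging Face transformers library; the helper name, the truncation length, and the explicit appending of the EOS token for GPT-2 are illustrative assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def sentence_embedding(model_name: str, email: str) -> torch.Tensor:
    """Return the last-layer sentence embedding of one email (illustrative helper)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()

    if "gpt2" in model_name:
        # GPT-2 adds no special tokens; append EOS and use its hidden state.
        email = email + tokenizer.eos_token

    inputs = tokenizer(email, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_dim)

    if "gpt2" in model_name:
        return hidden[0, -1]   # EOS (last) token embedding for the decoder-only model
    return hidden[0, 0]        # [CLS] / <s> token embedding for BERT / RoBERTa

# cls_vec = sentence_embedding("bert-base-uncased", "Verify your account now ...")
# eos_vec = sentence_embedding("gpt2", "Verify your account now ...")
```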
We adopt in-context learning (ICL) and Chain-of-Thought (CoT) as the LLM detection methods. In ICL, four randomly selected emails from an existing dataset (Nigerian) are used as demonstrations to teach the LLM how to detect phishing emails. In CoT, the LLM is required to reason step by step, forming a logical chain before making its decision. We also adopt the phishing email detection prompts specifically designed in ChatSpamDetector by Koide et al. (2024).
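A sketch of how the ICL prompt can be assembled is given below; the instruction wording, the placeholder demonstrations, and the function name are illustrative assumptions and are not the exact prompts used in our experiments or in ChatSpamDetector.

```python
import random

def build_icl_prompt(demonstrations, target_email):
    """Assemble a few-shot (ICL) phishing-detection prompt; wording is illustrative."""
    lines = ["You are a phishing email detector. Answer 'phishing' or 'legitimate'."]
    for email, label in demonstrations:
        lines.append(f"Email:\n{email}\nAnswer: {label}")
    lines.append(f"Email:\n{target_email}\nAnswer:")
    return "\n\n".join(lines)

# Placeholder (email, label) pairs standing in for the Nigerian dataset.
nigerian_pairs = [
    ("Dear friend, I need your help to transfer $10M ...", "phishing"),
    ("Hi team, the quarterly report is attached.", "legitimate"),
    ("Your inheritance awaits, send your bank details ...", "phishing"),
    ("Reminder: dentist appointment tomorrow at 9am.", "legitimate"),
]

demos = random.sample(nigerian_pairs, k=4)  # four randomly selected demonstrations
prompt = build_icl_prompt(demos, "Urgent: verify your account within 24 hours ...")
print(prompt)
```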
The datasets used to implement the ML defenders and PLM defenders are provided at the link.
Experimental Results
For the defenders, to maximize the generalization of the trained models (i.e., to avoid over-fitting to a specific dataset), we mix the six publicly released datasets and randomly split the combined data 8:1:1 into training, validation, and test sets. The validation set is used for model checkpoint selection and hyper-parameter selection.
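A small sketch of this 8:1:1 split is shown below, assuming the mixed corpus is a list of (text, label, source-dataset) records; the synthetic records and the random seed are illustrative.

```python
from sklearn.model_selection import train_test_split

# `records` stands in for the mixed corpus: (email_text, label, source_dataset) tuples.
# Keeping the source_dataset field allows per-dataset evaluation on the test set.
records = [(f"email {i}", i % 2, f"dataset_{i % 6}") for i in range(1000)]

# 8:1:1 split: carve out 20% for validation + test, then halve that portion.
train, valtest = train_test_split(records, test_size=0.2, random_state=42)
val, test = train_test_split(valtest, test_size=0.5, random_state=42)

print(len(train), len(val), len(test))  # 800 100 100
```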
Within the test set, we still keep track of which dataset each email originally came from, so that detection performance can be reported separately for each dataset. If a defender detects the phishing emails of a dataset with weak performance, the corresponding phishing attacks can bypass that defender effectively.
For the PLM defenders, we adopt the officially released bert-base-uncased, gpt2, and roberta-base checkpoints as the backbone networks, and refer to these defenders as BERT, GPT-2, and RoBERTa, respectively. The models are trained on a single GPU (Tesla V100) using the Adam optimizer. We set the learning rate to 3e-5 with a linear scheduler. The batch size and the maximum sequence length are set to 16 and 256, respectively, across all tasks. For the LLM defenders, we adopt GPT-4 (gpt-4-1106-preview) because of its remarkable performance.
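A minimal fine-tuning sketch consistent with these settings is given below; the number of training steps, the toy two-example batch, and the warm-up choice are illustrative assumptions, and only a single optimization step is shown.

```python
import torch
from torch.optim import Adam
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          get_linear_schedule_with_warmup)

model_name = "bert-base-uncased"  # likewise "roberta-base" or "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

if tokenizer.pad_token is None:   # GPT-2 has no pad token by default
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

optimizer = Adam(model.parameters(), lr=3e-5)            # learning rate 3e-5
num_training_steps = 1000                                # illustrative total step count
scheduler = get_linear_schedule_with_warmup(optimizer, 0, num_training_steps)

def encode(texts, labels):
    # Truncate / pad each email to the shared maximum sequence length of 256.
    batch = tokenizer(texts, truncation=True, padding=True, max_length=256,
                      return_tensors="pt").to(device)
    batch["labels"] = torch.tensor(labels, device=device)
    return batch

# One optimization step on a toy batch (the actual batch size is 16).
batch = encode(["Verify your account now ...", "Lunch at noon?"], [1, 0])
loss = model(**batch).loss
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```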
Some of the trained defenders can be found at the link.