RQ1 Overall evaluation:
DynaPD phishing kits dataset: https://drive.google.com/drive/folders/1i5BoXruRF4VrP5VLZz03FXpzWygqdvRs?usp=sharing
RQ2 Ablation study:
Annotated Alexa dataset:
https://drive.google.com/drive/folders/1bFvu6SwoS5MyXY-A2XtAEEXkZzP2dH1f?usp=sharing
RQ4 Field study:
Field study dataset: https://drive.google.com/drive/folders/1sjLJnIEvhmSWcReFpjX5Cckla9FDHAaJ?usp=sharing
Public phishing dataset:
https://drive.google.com/drive/folders/1wxHSmj_9P2FhwbR75aosARp9q8VCBmf5?usp=sharing
Brand recognition model -- Brand inference:
OCR model: PaddleOCRV3, Image captioning model: BLIP-2.
The original multilingual version of PaddleOCRV3 can support 80 languages.To select the most appropriate language, we run OCR inference in multiple languages and choose the one that yields the highest confidence score.
For the Language Model (LLM), we employ "gpt-3.5-turbo-16k" due to its capability to handle longer input sequences. We set the temperature to 0 to generate more deterministic responses.
Brand recognition -- Validation:
This step is to ensure the validity of LLM's responses.
We check whether the domain's status code is 200 (alive) or 3xx (redirection).
Additionally, we perform a Google Image search for the "d's logo" and compare the top 5 retrieved logos with the logo displayed on the webpage. A match is confirmed if the logo-matching model returns a similarity score greater than 0.83 [1] (adjustable). This step validates that the returned domain genuinely represents the brand indicated on the webpage.
CRP prediction model:
This step is to recognize the credential-taking intention of the webpage.
The LLM settings are the same as the brand recognition model.
CRP transition model:
This step is to transit to a credential-taking page if the current page is not.
For the CRP transition model, we employ the CLIP model, which uses "ViT-B/32" (Vision Transformer) as its backbone.
The model is fine-tuned using Cross-Entropy loss at a learning rate of 1×10^(−5) for 5 epochs, on a dataset comprising 3,047 webpages. During this fine-tuning phase, the text encoder branch is completely frozen, allowing only the image encoder branch to participate in gradient updates.
References:
[1] Liu, R., Lin, Y., Yang, X., Ng, S. H., Divakaran, D. M., & Dong, J. S. (2022). Inferring phishing intention via webpage appearance and dynamics: A deep vision-based approach. In 31st USENIX Security Symposium (USENIX Security 22) (pp. 1633-1650).