Phishpedia: A Hybrid Deep Learning Based Approach to Visually Identify Phishing Webpages (appeared in USENIX Security 2021)
In this work, we propose an explainable phishing identification system, Phishpedia, which (1) achieves both high identification accuracy and low runtime overhead, (2) provides causal visual annotation on the phishing webpage screenshot, and (3) does not require training on any phishing samples. Phishpedia infers the intended brand from the webpage screenshot of a URL, and reports phishing when the domain of the intended brand does not align with the landing domain of the URL.
Phishpedia significantly outperforms baseline identification approaches (URLNet, StackModel, PhishCatcher, EMD, PhishZoo, and LogoSENSE) with respect to identification accuracy and runtime overhead. We deployed Phishpedia on newly emerging domains fed from the CertStream service and discovered 1704 phishing websites (including 1133 zero-day phishing websites) within one month, significantly outperforming existing solutions.
See our GitHub repository and paper for details.
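The domain-alignment idea described above can be illustrated with a minimal sketch. This is not the released implementation: the function names (`is_aligned`, `report`) and the suffix-matching rule are assumptions made for illustration, and the list of legitimate domains per brand is assumed to come from the target list.

```python
from urllib.parse import urlparse
from typing import Iterable, Optional, Tuple


def is_aligned(url: str, brand_domains: Iterable[str]) -> bool:
    """Return True if the landing host of `url` is one of the brand's
    domains or a subdomain of one (assumed matching rule)."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in brand_domains)


def report(url: str,
           intended_brand: Optional[str],
           brand_domains: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Hypothetical decision step: phishing is reported only when a brand
    was inferred from the screenshot and the landing domain is not one of
    that brand's legitimate domains."""
    if intended_brand is None or is_aligned(url, brand_domains):
        return "Benign", None
    return "Phish", intended_brand
```

For example, under these assumptions a page that shows a PayPal logo but is hosted at `paypal.com.evil.example` would be reported as phishing, while the same logo on `www.paypal.com` would not.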
Overview
Input: A URL and its screenshot
Output: Phish/Benign, Phishing target
Step 1: Pass the screenshot to the deep object detection model to get the predicted logos and inputs (inputs are not used for later prediction, only for explanation).
Step 2: Feed the predicted logos into the deep Siamese model.
If the Siamese model reports no target, return Benign, None.
Otherwise, the Siamese model reports a target; return Phish, Phishing target.
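The two steps above can be sketched as a small driver function. This is only an outline of the control flow: `detect` and `match` are hypothetical callables standing in for the object detection model and the Siamese model, and their signatures are assumptions rather than the repository's actual API.

```python
from typing import Any, Callable, List, Optional, Tuple

# A bounding box as (x1, y1, x2, y2); the exact format is an assumption.
Box = Tuple[float, float, float, float]

Detector = Callable[[Any], Tuple[List[Box], List[Box]]]
Matcher = Callable[[str, Any, List[Box]], Optional[str]]


def identify(url: str, screenshot: Any,
             detect: Detector, match: Matcher) -> Tuple[str, Optional[str]]:
    """Run the two-step pipeline and return (label, phishing target)."""
    # Step 1: object detection yields logo boxes and input boxes.
    # Input boxes are kept only for explanation, not for prediction.
    logo_boxes, _input_boxes = detect(screenshot)

    # Step 2: the Siamese matcher returns a target brand or None.
    target = match(url, screenshot, logo_boxes)
    if target is None:
        return "Benign", None
    return "Phish", target
```

Passing the models in as callables keeps the sketch independent of any particular framework; real detector and matcher objects would be wrapped to fit these signatures.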
Phishing example
Phishing Discovery Results
Each DATABASE folder comes with a readme.csv that maps each phishing URL to the folder path containing its screenshot (shot.png).
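A lookup against readme.csv could look like the sketch below. The column names (`url`, `folder`) are placeholders, not the actual headers of the released file, and the folder path is assumed to be relative to the DATABASE folder; adjust both after inspecting the CSV.

```python
import csv
from pathlib import Path
from typing import Optional


def screenshot_for_url(database_dir: str, url: str,
                       url_col: str = "url",
                       folder_col: str = "folder") -> Optional[Path]:
    """Return the path to shot.png for `url`, or None if the URL is not
    listed. Column names are hypothetical defaults."""
    root = Path(database_dir)
    with open(root / "readme.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get(url_col) == url:
                return root / row[folder_col] / "shot.png"
    return None
```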
Link to download: https://drive.google.com/drive/folders/1X1xP0jiOfR7DcT3Mba-m0OdZ4gyowHhc?usp=sharing
Database ReadME
The database presents the results of the phishing discovery experiment. The folder contains the real phishing webpages found by EMD, PhishCatcher, PhishZoo, StackModel, URLNet, and Phishpedia.
For each tool, we provide the HTML of each webpage, its URL, and its screenshot.
Found phishing results
We list a few of the phishing webpages found by Phishpedia here.
Baseline Approaches
Please find the code for all baselines here: https://drive.google.com/drive/folders/1YpKR_Nye4E11FCbPbePAAJG4UcqkIsfZ?usp=sharing
EMD (general experiment, phishing discovery experiment)
PhishZoo (general experiment, phishing discovery experiment)
LogoSENSE (general experiment, phishing discovery experiment)
StackModel (phishing discovery experiment)
URLNet (phishing discovery experiment)
Targetlist Dataset
181 protected brands, Link to download: https://drive.google.com/file/d/1zxvXFKpLx816VfaGFISL6tod-zSEc6hY/view?usp=sharing
Phishing Dataset
29496 phishing sites, Link to download: https://drive.google.com/file/d/12ypEMPRQ43zGRqHGut0Esq2z5en0DH4g/view?usp=sharing
Phishing Dataset Targeting 5 Brands
(Bank of America, Chase Personal Banking, DHL Airways Inc., Microsoft, Paypal Inc.)
Link to download: https://drive.google.com/file/d/1EJnx9oX9wQieF7UPQJeTVg850nZsuxTi/view?usp=sharing
Benign Dataset
30649 benign sites, Link to download: https://drive.google.com/file/d/1yORUeSrF5vGcgxYrsCoqXcpOUHt-iHq_/view?usp=sharing
Labelled Logo Dataset
30649 benign sites with ground-truth logo labels, Link to download:
https://drive.google.com/file/d/1yORUeSrF5vGcgxYrsCoqXcpOUHt-iHq_/view?usp=sharing
https://drive.google.com/file/d/1bH3Yp6K1B37B_sS_MNMz7yvYcOhOu-J8/view?usp=sharing
https://drive.google.com/file/d/1u56I0IHBgM9glNJl2wcLfaihp1L_U7eD/view?usp=sharing