Phishpedia: A Hybrid Deep Learning Based Approach to Visually Identify Phishing Webpages (to appear in USENIX Security 21)

In this work, we propose an explainable phishing identification system, Phishpedia, which (1) achieves both high identification accuracy and low runtime overhead, (2) provides causal visual annotation on the phishing webpage screenshot, and (3) does not require training on any phishing samples. Phishpedia infers the intended brand from the webpage screenshot of a URL, and reports phishing when the domain of the intended brand does not match the landing domain of the URL.
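The domain alignment check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names `landing_matches_brand` and `is_phishing` and the brand-to-domain mapping are hypothetical, and the hostname suffix comparison is a simplification (a more robust check would compare registered domains using the Public Suffix List).

```python
from typing import Dict, Iterable, List, Optional
from urllib.parse import urlsplit


def landing_matches_brand(url: str, brand_domains: Iterable[str]) -> bool:
    """Return True if the URL's hostname is one of the brand's domains
    or a subdomain of one of them (simplified suffix comparison)."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in brand_domains)


def is_phishing(url: str,
                intended_brand: Optional[str],
                targetlist: Dict[str, List[str]]) -> bool:
    """Report phishing when a brand was inferred from the screenshot
    but the landing domain does not belong to that brand.

    `targetlist` is a hypothetical mapping from brand name to its
    legitimate domains; the real target list format may differ.
    """
    if intended_brand is None:
        # No brand inferred from the screenshot: nothing to compare against.
        return False
    return not landing_matches_brand(url, targetlist[intended_brand])
```

For example, with a target list entry mapping "PayPal" to "paypal.com", a page that shows the PayPal logo but lands on paypal.com.evil.example would be reported, while www.paypal.com would not.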

Phishpedia significantly outperforms baseline identification approaches (URLNet, StackModel, PhishCatcher, EMD, PhishZoo, and LogoSENSE) with respect to identification accuracy and runtime overhead. We deployed Phishpedia on newly emerging domains fed from the CertStream service and discovered 1704 phishing websites (including 1133 new zero-day phishing websites) within one month, significantly outperforming existing solutions.

See our GitHub repository and paper for details.

Overview

Input: A URL and its screenshot

Output: Phish/Benign, Phishing target

  • Step 1: Pass the screenshot to the Deep Object Detection Model to get the predicted logos and inputs (inputs are not used for later prediction, only for explanation)

  • Step 2: Pass the predicted logos to the Deep Siamese Model

    • If the Siamese model reports no target, return Benign, None

    • Otherwise, the Siamese model reports a target: return Phish, Phishing target
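The two steps above can be sketched as a small decision function. This is only an illustration of the control flow: `detect_logos` and `siamese_match` are hypothetical stand-ins for the two models, passed in as callables, and we assume here that the Siamese step only reports a target when the matched brand is inconsistent with the landing domain of the URL.

```python
from typing import Any, Callable, List, Optional, Tuple


def phishpedia_decision(
    url: str,
    screenshot: Any,
    detect_logos: Callable[[Any], Tuple[List[Any], List[Any]]],
    siamese_match: Callable[[str, List[Any]], Optional[str]],
) -> Tuple[str, Optional[str]]:
    """Return ("Phish", target) or ("Benign", None) for one URL/screenshot pair."""
    # Step 1: object detection yields logo regions and input regions.
    # The input regions are kept only for explanation, not for prediction.
    logos, _inputs = detect_logos(screenshot)

    # Step 2: the Siamese model tries to match the logos to a protected brand.
    target = siamese_match(url, logos)
    if target is None:
        return "Benign", None
    return "Phish", target
```

Passing the models as callables keeps the sketch independent of any particular detection or matching library.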

Phishing example

Phishing Discovery Results

Each DATABASE folder comes with a readme.csv that helps the user match each phishing URL to the folder path containing its screenshot (shot.png).
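As a rough sketch, the lookup from a phishing URL to its screenshot could be scripted as below. The column names `url` and `folder` are assumptions made for illustration (the actual header of readme.csv is not described here), so adjust them after opening the file; the helper name `find_screenshot` is likewise hypothetical.

```python
import csv
from pathlib import Path
from typing import Optional, Union


def find_screenshot(database_dir: Union[str, Path],
                    url: str,
                    url_col: str = "url",        # assumed column name
                    folder_col: str = "folder"   # assumed column name
                    ) -> Optional[Path]:
    """Look up `url` in <database_dir>/readme.csv and return the path
    to the matching shot.png, or None if the URL is not listed."""
    database_dir = Path(database_dir)
    with open(database_dir / "readme.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row[url_col] == url:
                return database_dir / row[folder_col] / "shot.png"
    return None
```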

Link to download: https://drive.google.com/drive/folders/1X1xP0jiOfR7DcT3Mba-m0OdZ4gyowHhc?usp=sharing

Database README

The database presents the results of the phishing discovery experiment. The folder contains the real phishing webpages found by EMD, PhishCatcher, PhishZoo, StackModel, URLNet, and Phishpedia.

For each tool, we show the HTML webpage, the URL, and the screenshot of each webpage.

Found phishing results

We list a few phishing webpages found by Phishpedia here.

Baseline Approaches

Please find the code for all baselines here: https://drive.google.com/drive/folders/1YpKR_Nye4E11FCbPbePAAJG4UcqkIsfZ?usp=sharing

  • EMD (general experiment, phishing discovery experiment)

  • PhishZoo (general experiment, phishing discovery experiment)

  • LogoSENSE (general experiment, phishing discovery experiment)

  • StackModel (phishing discovery experiment)

  • URLNet (phishing discovery experiment)

Targetlist Dataset

181 protected brands, Link to download: https://drive.google.com/file/d/1zxvXFKpLx816VfaGFISL6tod-zSEc6hY/view?usp=sharing

Phishing Dataset

29496 phishing sites, Link to download: https://drive.google.com/file/d/12ypEMPRQ43zGRqHGut0Esq2z5en0DH4g/view?usp=sharing

Phishing Dataset Targeting 5 Brands

(Bank of America, Chase Personal Banking, DHL Airways Inc., Microsoft, Paypal Inc.)

Link to download: https://drive.google.com/file/d/1EJnx9oX9wQieF7UPQJeTVg850nZsuxTi/view?usp=sharing

Labelled Logo Dataset

30649 benign samples with ground-truth logo labels, Link to download: https://drive.google.com/file/d/1L3KSWEXcnWzYdJ4hPrNEUvC8jaaNOiBa/view?usp=sharing