Experiment Structure

Dataset Structure

dataset-structure.pdf

Detailed Description with Download Link




  • CRP transition detector (Deep Learning part)

    • 1210 test set = 445 non-CRP phishing webpages + 765 non-CRP benign webpages: Sampled from Phishpedia dataset

    • 4843 training set = 1774 non-CRP phishing webpages + 3069 non-CRP benign webpages: Sampled from Phishpedia dataset

    • 10k pre-training set = 10k websites with pasted CRP buttons: All 10k come from benign websites; some are repeated because different CRP buttons are pasted onto them at different locations


  • CRP classifier and AWL detector

    • 901 test set = 901 webpages labelled with layout and CRP class: Sampled from Phishpedia dataset

    • 8109 training set = 8109 webpages labelled with layout and CRP class: Sampled from Phishpedia dataset


  • OCR-aided Siamese model

    • 2000 test set = 2000 logos cropped from 1k benign + 1k phishing webpages: Cropped from Phishpedia dataset

    • 3061 training set = logo targetlist: From the Phishpedia logo targetlist covering 277 brands; also used as the reference logo matching list in real deployment

    • 167,140 pre-training set = Logo2k+ dataset: From the Logo2k+ (AAAI'20) paper
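
The split sizes listed above for the CRP transition detector can be sanity-checked with a few lines of Python. This is only a minimal sketch: the dictionary layout, key names, and the `check_splits` helper are illustrative assumptions and are not part of the released dataset; only the counts come from the list.

```python
# Counts copied from the list above. The structure and key names are
# illustrative assumptions, not the dataset's actual file layout.
CRP_TRANSITION_SPLITS = {
    "test": {"phishing": 445, "benign": 765, "total": 1210},
    "train": {"phishing": 1774, "benign": 3069, "total": 4843},
}


def check_splits(splits):
    """Verify that class counts sum to each split total and return the phishing ratio per split."""
    ratios = {}
    for name, split in splits.items():
        if split["phishing"] + split["benign"] != split["total"]:
            raise ValueError(f"{name}: class counts do not sum to the stated total")
        ratios[name] = split["phishing"] / split["total"]
    return ratios


if __name__ == "__main__":
    for name, ratio in check_splits(CRP_TRANSITION_SPLITS).items():
        print(f"{name}: {ratio:.1%} phishing")
```

A check like this is handy after downloading the data, to confirm that the number of files in each folder matches the sizes stated here.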


Reason summary for CRP transition locator failure cases

From the table below, the most common failure reason attributable to the deep model itself (12 out of 67 cases) is the use of uncommon login keywords/icons.


The rest are due to website design (e.g. popups, complicated layouts) and interaction errors (e.g. connection timeouts, Selenium/Helium problems). Fixing them would bring only marginal improvement and may increase runtime, so we do not tackle these failure cases further and leave the refinement as future work.

| Reason Category | Count |
|-----------------------------------------------------------|-------|
| Connection timeout/Website died | 17 |
| Cannot interact properly | 4 |
| Uncommon/No login icon and no keyword | 12 |
| Popup window blocks interaction | 22 |
| No login button, noisy data | 1 |
| Selenium eager mode does not load page completely | 2 |
| Website detects automated access (bot detection) | 8 |
| Uncommon language (Persian), unable to translate | 1 |
| Total failure cases | 67 |
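
To see how the failure cases are distributed, the counts in the table can be tallied and converted to shares. The sketch below is illustrative only: the shortened reason labels and the `failure_shares` helper are assumptions introduced here, while the counts are taken from the table.

```python
# Counts copied from the table above; the labels are shortened for
# illustration and are not official category names.
FAILURE_COUNTS = {
    "Connection timeout / website died": 17,
    "Cannot interact properly": 4,
    "Uncommon or missing login icon/keyword": 12,
    "Popup window blocks interaction": 22,
    "No login button, noisy data": 1,
    "Selenium eager mode page load": 2,
    "Website detects automated access": 8,
    "Uncommon language, unable to translate": 1,
}
REPORTED_TOTAL = 67


def failure_shares(counts):
    """Return each reason's share of all failures, largest first."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {reason: n / total for reason, n in ordered}


if __name__ == "__main__":
    # The per-category counts should add up to the reported total.
    assert sum(FAILURE_COUNTS.values()) == REPORTED_TOTAL
    for reason, share in failure_shares(FAILURE_COUNTS).items():
        print(f"{share:6.1%}  {reason}")
```

Sorting by share makes it easy to separate model-side failures (uncommon login keywords/icons) from website design and interaction issues.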