Experiment Structure
Dataset Structure
Detailed Description with Download Link
Experiment dataset
25K benign webpage dataset + 25K CRP phishing webpage dataset (filter out this list of non-CRP websites first): See Table 1, from the Phishpedia dataset
3049 misleading legitimacy dataset: Collected over one week (Apr 9, 2021 to Apr 16, 2021) from Alexa-ranked sites between top 30k and top 50k, with human verification against the 3 conditions defined in Section 9.1.
CRP transition locator (hybrid) evaluation set
3310 phishing non-CRP webpage dataset: See Table 1, sampled from Phishpedia dataset
1003 wild benign non-CRP webpage dataset: Main pages collected online from 1k well-known brands
CRP transition detector (Deep Learning part)
1210 test set = 445 non-CRP phishing webpage + 765 non-CRP benign webpage dataset: Sampled from Phishpedia dataset
4843 training set = 1774 non-CRP phishing webpage + 3069 non-CRP benign webpage dataset: Sampled from Phishpedia dataset
10k pre-training set = 10k websites with pasted CRP buttons: All 10k come from benign websites; some pages appear more than once because different CRP buttons are pasted onto them at different locations
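
The pre-training set above is described as benign screenshots with CRP buttons pasted at different locations. A minimal sketch of how such paste positions and their ground-truth boxes could be sampled is given here. The function names, the screenshot size, and the button sizes are hypothetical and not taken from the original pipeline, and the image compositing itself is omitted.

```python
import random


def sample_paste_box(page_w, page_h, btn_w, btn_h, rng):
    """Pick a top-left corner so the button lies fully inside the page.

    Returns the resulting bounding box as (x1, y1, x2, y2).
    """
    if btn_w > page_w or btn_h > page_h:
        raise ValueError("button is larger than the screenshot")
    x1 = rng.randint(0, page_w - btn_w)
    y1 = rng.randint(0, page_h - btn_h)
    return (x1, y1, x1 + btn_w, y1 + btn_h)


def make_variants(page_size, button_sizes, n_variants, seed=0):
    """Generate several (button, location) variants for one benign page."""
    rng = random.Random(seed)  # seeded so the synthetic set is reproducible
    page_w, page_h = page_size
    boxes = []
    for _ in range(n_variants):
        btn_w, btn_h = rng.choice(button_sizes)  # hypothetical button crops
        boxes.append(sample_paste_box(page_w, page_h, btn_w, btn_h, rng))
    return boxes


# Assumed values for illustration only: a 1920x1080 screenshot, two button sizes.
boxes = make_variants((1920, 1080), [(120, 40), (200, 60)], n_variants=3)
for box in boxes:
    print(box)
```

Each returned box could serve both as the paste location and as the detection label for that synthetic sample, which is why the same benign page can appear several times in the set.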
CRP classifier and AWL detector
901 test set = 901 webpages labelled with layout and CRP class: Sampled from Phishpedia dataset
8109 training set = 8109 webpages labelled with layout and CRP class: Sampled from Phishpedia dataset
OCR-aided Siamese model
2000 test set: 2000 logos cropped from 1k benign + 1k phishing: Cropped from Phishpedia dataset
3061 training set: Logo targetlist: From the Phishpedia logo targetlist (277 brand logos), also used as the reference list for logo matching in real deployment
167,140 pre-training set: Logo2k+ dataset: from Logo2k+ (AAAI'20) paper
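
The composite set sizes listed above can be cross-checked with a few lines of arithmetic. The sketch below only restates counts from the list; the variable and function names are illustrative.

```python
# Counts copied from the dataset list above: (phishing, benign) non-CRP pages.
detector_test = (445, 765)
detector_train = (1774, 3069)

# CRP classifier / AWL detector sizes, also from the list above.
classifier_test, classifier_train = 901, 8109


def held_out_share(n_test, n_train):
    """Fraction of the labelled data that is held out for testing."""
    return n_test / (n_test + n_train)


detector_test_total = sum(detector_test)    # 1210
detector_train_total = sum(detector_train)  # 4843

print(detector_test_total, detector_train_total)
print(f"{held_out_share(detector_test_total, detector_train_total):.1%}")  # 20.0%
print(f"{held_out_share(classifier_test, classifier_train):.1%}")          # 10.0%
```

Under these counts, the CRP transition detector holds out roughly one fifth of its labelled pages for testing, and the CRP classifier and AWL detector hold out one tenth.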
Reason summary for CRP transition locator failure cases
From the table, the most common reason for the deep model itself to fail (12 out of 67 cases) is the use of uncommon login keywords/icons.
The remaining cases are due to website design (e.g. popups, complicated layouts) and interaction errors (e.g. connection timeouts, Selenium/Helium problems). Fixing them would bring only marginal improvement and may increase runtime, so we do not tackle these failure cases further and leave the refinement as future work.
| Reason Category | Count |
|-----------------------------------------------------------|-------|
| Connection Timeout/Website died | 17 |
| Cannot interact properly | 4 |
| Uncommon/No login-icon and no keyword | 12 |
| Popup window blocks interaction | 22 |
| No login button, noisy data | 1 |
| Selenium eager mode does not load page completely | 2 |
| Website detects the crawler as an automated engine | 8 |
| Uncommon language (Persian language), unable to translate | 1 |
| Total failure cases | 67 |
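
To make the relative weight of each category easier to see, the counts in the table can be turned into shares of the 67 failure cases. This is a small sketch; the dictionary keys are abbreviated versions of the table rows, chosen here for illustration.

```python
# Failure counts copied from the table above (labels abbreviated).
failure_counts = {
    "Connection timeout / website died": 17,
    "Cannot interact properly": 4,
    "Uncommon or missing login icon/keyword": 12,
    "Popup window blocks interaction": 22,
    "No login button, noisy data": 1,
    "Selenium eager mode page-load issue": 2,
    "Website detects automation": 8,
    "Uncommon language, unable to translate": 1,
}

total = sum(failure_counts.values())
shares = {reason: count / total for reason, count in failure_counts.items()}

# Sort categories from most to least frequent.
ranked = sorted(failure_counts.items(), key=lambda kv: kv[1], reverse=True)

for reason, count in ranked:
    print(f"{reason:<40} {count:>3} {shares[reason]:6.1%}")
print(f"{'Total':<40} {total:>3}")
```

Under these counts, popup windows and connection timeouts together account for 39 of the 67 cases, while the uncommon login keyword/icon category accounts for 12.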