Length-related Features
Website owners know that an easily remembered brand is more likely to be visited. For this reason, a website’s name, and therefore its URL, tends to be simple, short, and concise. In contrast, phishers try to disguise their malicious URLs by adding characters in the hope of confusing users. As a result, phishing URLs are commonly longer than legitimate ones. Based on this observation, several features that measure the length of parts of the URL were considered:
F1 : Length of the URL.
F2 : Length of the sub-domain.
F3 : Length of the domain.
F4 : Length of the Top-Level-Domain.
F5 : Length of the hostname.
F6 : Length of the longest token in the URL.
F7 : Length of the shortest token in the URL.
F8 : Average token length.
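The length features F1 to F8 can be sketched with the Python standard library. This is a minimal illustration under two assumptions that the text does not state: the hostname is split naively (last label as Top-Level-Domain, previous label as domain, the rest as sub-domain), and a token is a maximal run of letters and digits. The names `split_hostname` and `length_features` are hypothetical.

```python
import re
from urllib.parse import urlsplit


def split_hostname(hostname):
    """Naively split a hostname into (subdomain, domain, tld).

    Assumption: the last label is the TLD and the one before it is the
    domain. Multi-label suffixes such as "co.uk" and IP-address hosts
    would need special handling (e.g. a public suffix list).
    """
    labels = hostname.split(".")
    if len(labels) < 2:
        return "", hostname, ""
    return ".".join(labels[:-2]), labels[-2], labels[-1]


def length_features(url):
    hostname = urlsplit(url).hostname or ""
    subdomain, domain, tld = split_hostname(hostname)
    # Assumption: tokens are maximal runs of ASCII letters and digits.
    tokens = re.findall(r"[A-Za-z0-9]+", url)
    token_lengths = [len(t) for t in tokens] or [0]
    return {
        "F1": len(url),
        "F2": len(subdomain),
        "F3": len(domain),
        "F4": len(tld),
        "F5": len(hostname),
        "F6": max(token_lengths),
        "F7": min(token_lengths),
        "F8": sum(token_lengths) / len(token_lengths),
    }
```

For example, `length_features("http://login.example.com/a/index.html")` yields a sub-domain length of 5, a domain length of 7, and a Top-Level-Domain length of 3 under the naive split.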
Counting-related Features
A common masquerading technique used by phishers to confuse users is to replace some characters in the URL with digits or to introduce other symbols. Therefore, counting certain symbols can help discriminate between phishing and legitimate URLs. The counting-related features are:
F9 : Number of digits in the URL.
F10 : Number of tokens in the sub-domain.
F11 : Number of tokens in the domain.
F12 : Number of tokens in the Top-Level-Domain.
F13 : Number of special characters in the hostname.
F14 : Number of slashes in the URL.
F15 : Number of Unicode characters in the URL.
F16 : Number of dots in the URL.
F17 : Number of hyphens in the hostname.
F18 : Number of parameters in the query.
F19 : Number of subdirectories in the path.
F20 : Number of digits in the hostname.
F21 : Number of letters in the hostname.
F22 : Number of symbols in the hostname.
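A subset of the counting features can be sketched as follows. The sketch makes several assumptions that the text leaves open: "Unicode characters" is read as non-ASCII characters, a subdirectory is a non-empty path segment followed by a slash, digits are ASCII digits, and a symbol is any hostname character that is neither a letter nor a digit. F10 to F13 are omitted because they depend on how tokens and special characters are defined. The name `counting_features` is hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def counting_features(url):
    parts = urlsplit(url)
    hostname = parts.hostname or ""
    return {
        # Assumption: only ASCII digits are counted.
        "F9": sum("0" <= c <= "9" for c in url),
        "F14": url.count("/"),
        # Assumption: "Unicode characters" means non-ASCII characters.
        "F15": sum(ord(c) > 127 for c in url),
        "F16": url.count("."),
        "F17": hostname.count("-"),
        # Each key=value pair in the query string counts as one parameter.
        "F18": len(parse_qsl(parts.query, keep_blank_values=True)),
        # Assumption: a subdirectory is a non-empty segment followed by "/".
        "F19": sum(1 for s in parts.path.split("/")[:-1] if s),
        "F20": sum("0" <= c <= "9" for c in hostname),
        "F21": sum(c.isalpha() for c in hostname),
        # Assumption: a symbol is neither a letter nor a digit.
        "F22": sum(not c.isalnum() for c in hostname),
    }
```

Note that `urlsplit` strips the port from `hostname`, so digits in the port contribute to F9 but not to F20.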
HTTP/S-related Features
A URL contains more than names and parameters: it also carries instructions for the web server that indicate how to interpret the request and how to return the response. For example, from the protocol part of the URL, the web server knows whether the request is encrypted (HTTPS) or sent in plaintext (HTTP). Phishers exploit these technical aspects of the URL in their attacks, e.g., by introducing IP addresses or uncommon port numbers to confuse users. The HTTP/S-related features are:
F23 : Indicates whether the URL contains an IP address.
F24 : Indicates whether the URL uses HTTP or HTTPS.
F25 : Indicates whether the URL contains executable files.
F26 : Port number used (if it is specified in the URL).
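One possible way to compute F23 to F26 is sketched below. It rests on assumptions not given in the text: the IP address check looks only at the host part of the URL, the list of executable extensions is illustrative and not exhaustive, and a missing port is encoded as 0. The names `http_features` and `EXECUTABLE_EXTENSIONS` are hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumption: an illustrative, non-exhaustive list of executable extensions.
EXECUTABLE_EXTENSIONS = (".exe", ".bat", ".cmd", ".msi", ".scr", ".jar", ".apk")


def http_features(url):
    parts = urlsplit(url)
    hostname = parts.hostname or ""
    # Assumption: only the host is checked for being an IPv4/IPv6 address.
    try:
        ipaddress.ip_address(hostname)
        has_ip = True
    except ValueError:
        has_ip = False
    return {
        "F23": int(has_ip),
        "F24": int(parts.scheme == "https"),
        "F25": int(parts.path.lower().endswith(EXECUTABLE_EXTENSIONS)),
        # Assumption: 0 encodes "no explicit port in the URL".
        # Note: parts.port raises ValueError for a malformed port.
        "F26": parts.port or 0,
    }
```

For instance, `http_features("http://192.168.1.10:8080/update.exe")` flags the IP address and the executable file and reports port 8080.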
Natural Language Processing-related Features
URLs are strings of characters that indicate the unique addresses of resources on the Internet. Because URLs are relatively short strings, they can be treated as ordinary sentences and therefore processed using text classification techniques. The text classification task consists of determining the class of a text by extracting features that represent it. To take advantage of this technique, textual features extracted from the URL strings are added. Some of these features are:
F27 : Number of Phishing class keywords in the URL.
F28 : Number of Legitimate class keywords in the URL.
F29 : Entropy of the URL.
F30 : Entropy of the hostname.
F31 : Entropy of the path.
F32 : Phishing mutual information.
F33 : Legitimate mutual information.
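The entropy features F29 to F31 can be illustrated with a short sketch. It assumes character-level Shannon entropy in bits, which the text does not specify; the keyword and mutual information features (F27, F28, F32, F33) depend on a labeled corpus and are not sketched here. The names `shannon_entropy` and `entropy_features` are hypothetical.

```python
import math
from collections import Counter
from urllib.parse import urlsplit


def shannon_entropy(text):
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the relative frequency p of each character.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_features(url):
    parts = urlsplit(url)
    return {
        "F29": shannon_entropy(url),
        "F30": shannon_entropy(parts.hostname or ""),
        "F31": shannon_entropy(parts.path),
    }
```

A string made of a single repeated character has entropy 0, while a string of four distinct characters has entropy 2 bits, so random-looking hostnames tend to score higher.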
Rate-related Features
From the characteristics of URLs seen so far, it was noticed that URLs follow predictable patterns. Some of these patterns are ratios between URL features, e.g., the ratio of the domain length to the URL length. These ratios were measured and used to represent URLs.
F34 : Ratio of the domain length to the total URL length.
F35 : Ratio of the sub-domain length to the total URL length.
F36 : Ratio of the hostname length to the total URL length.
F37 : Ratio of the path length to the total URL length.
F38 : Ratio of the arguments length to the total URL length.
F39 : Ratio of the path length to the domain length.
F40 : Ratio of the arguments length to the domain length.
F41 : Ratio of the arguments length to the path length.
F42 : Number of Letter-Digit-Letter sequences.
F43 : Number of Digit-Letter-Digit sequences.
F44 : Ratio of the number of letters in the hostname to the total hostname length.
F45 : Ratio of the number of digits in the hostname to the total hostname length.
F46 : Ratio of the number of symbols in the hostname to the total hostname length.
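Some of these features can be sketched as follows, again under assumptions the text does not state: the "arguments" are taken to be the query string, a zero denominator yields 0.0, and a Letter-Digit-Letter (or Digit-Letter-Digit) sequence is read as three consecutive single ASCII characters, with overlapping occurrences counted. Features that need the domain or sub-domain (F34, F35, F39, F40) are left out because they depend on how the hostname is split. The names `ratio` and `ratio_features` are hypothetical.

```python
import re
from urllib.parse import urlsplit


def ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (an assumption)."""
    return numerator / denominator if denominator else 0.0


def ratio_features(url):
    parts = urlsplit(url)
    hostname = parts.hostname or ""
    letters = sum(c.isalpha() for c in hostname)
    digits = sum(c.isdigit() for c in hostname)
    symbols = len(hostname) - letters - digits
    return {
        "F36": ratio(len(hostname), len(url)),
        "F37": ratio(len(parts.path), len(url)),
        # Assumption: "arguments" means the query string without the "?".
        "F38": ratio(len(parts.query), len(url)),
        "F41": ratio(len(parts.query), len(parts.path)),
        # Lookahead patterns count overlapping three-character sequences.
        "F42": len(re.findall(r"(?=[A-Za-z][0-9][A-Za-z])", url)),
        "F43": len(re.findall(r"(?=[0-9][A-Za-z][0-9])", url)),
        "F44": ratio(letters, len(hostname)),
        "F45": ratio(digits, len(hostname)),
        "F46": ratio(symbols, len(hostname)),
    }
```

With this reading, the hostname `a1b2c.com` contains two Letter-Digit-Letter sequences (`a1b`, `b2c`) and one Digit-Letter-Digit sequence (`1b2`).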