There are many products that can notify a user when a website they navigate to is malicious, but each usually focuses on one piece of the puzzle: Uniform Resource Locator (URL) construction, metadata analysis, or natural language processing. Many types of information can be pulled from URLs and web-based content, and determining the usefulness of that information helps identify the features necessary for accurate and efficient classification. The customer requested a single solution covering a multitude of detection options rather than relying on the traditional reactive blacklist approach. This project is Phase I: Malicious URL Detection.
Methodology:
Data Ingest
Under-Sampling
Feature Extraction
Feature Encoding
Modeling & Tuning
Evaluation
Crafted URLs are often used to trick users into visiting websites that collect their information, infect their devices, and/or download malware or spyware. A carefully constructed URL obfuscates the true nature of the site, usually by mimicking a familiar-looking website or by being so complicated that it is hard to tell what it means. Often this effort is unnecessary, since the average user does not look at a URL before clicking. The goal of this project is to create a machine learning model that can predict whether a URL leads to a malicious website. In addition to URL-based detection, punctuation and content analysis will be used to analyze the content of a webpage, including HyperText Markup Language (HTML), JavaScript, and text. The results of both URL and content analysis will provide classification models that can be applied to a much larger, unlabeled dataset. Future data will be processed in real time on an NVIDIA DGX A100 Graphics Processing Unit (GPU) server in order to create a real-time application that protects users and businesses from fraud and other criminal activity.
Research Questions:
1) What features can be extracted from URL data?
2) What degree of success will this application achieve?
3) Which machine learning models perform better on this type of data?
4) Which features are more useful for analysis?
5) Will a combination of URL and Content based features improve classification models?
There are many studies available on this topic, all of which use slightly different approaches for their proposed solution. Each study depends on binary classification using various subcategories of attributes, ranging from a small number to over 80. Some key takeaways were:
categories of features:
content-based
URL-based
HTML-based
lexical
the volatility of malicious websites makes it difficult to maintain up-to-date datasets and blacklists
combinations of feature types work best
features are often redundant and can be weeded out
datasets are easy to find, but not all are equal
there is a need for a real-time solution
It is interesting to note that the results of the HTTPS connection request do not always match the actual text of the URL. The first URL in the dataset is 'http://members.tripod.com/russiastation/'. It is not malicious, but I would not recommend clicking the link just in case. The text says the URL uses HTTP, but the connection request returned a value that marked it as using HTTPS.
This dataset was collected from Mendeley as two zip files. The first contained the training dataset and was 723MB (1.97GB unzipped), holding 1,200,000 URLs and their associated features. The second contained the test dataset and was 218MB (605.6MB unzipped), holding 361,933 URLs and their associated features. The original dataset contains 12 attributes, described in the chart on the left.
Creation Process:
Web scrape using MalCrawler
Calculate length of URLs
Use IP address of webserver to get country of origin using GeoIP
Extract the code between the <script></script> tags and record its length as the 'js_len' feature
Obfuscated JS decoded using JavaScript Auto De-Obfuscator (JSADO) and Selenium Python library
Retrieve the top level domain using the Tld library
Retrieve WHOIS completion status using the WHOIS API
HTTPS status found by making an HTTPS connection to the URL and recording the response; a response of 200, 301, or 302 was recorded as HTTPS=True, otherwise HTTPS=False (see the sketch after this list)
Generate class labels using Google Safe Browsing API
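For illustration, here is a minimal sketch of the HTTPS check, assuming the requests library (an assumption on my part; the original collection code is hosted in the Mendeley repository and may differ):

```python
# Minimal sketch of the HTTPS status check described above, assuming the
# `requests` library. A response of 200, 301, or 302 marks HTTPS=True.
import requests

def https_status(url, timeout=5.0):
    """Return True if an HTTPS request to the URL answers 200, 301, or 302."""
    https_url = url.replace("http://", "https://", 1)
    if not https_url.startswith("https://"):
        https_url = "https://" + https_url
    try:
        response = requests.get(https_url, timeout=timeout, allow_redirects=False)
        return response.status_code in (200, 301, 302)
    except requests.RequestException:
        return False
```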
All of this code is hosted in the Mendeley repository linked above, and on Kaggle for open use.
Class Label Distribution
This dataset is highly unbalanced. There are 1,172,747 benign observations and 27,254 malicious observations in the training data. There are 353,872 benign observations and 8,062 malicious observations in the test data.
Random Under Sampling
To address this class imbalance, I selected a random sample of 'good' observations equal in size to the number of 'bad' observations for both the training and test data, as sketched below. This resulted in a training dataset with 27,253 observations of each class and a test dataset with 8,062 of each. While this does solve the class imbalance, it runs the risk of discarding some important information.
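A minimal sketch of this step, assuming the data lives in pandas DataFrames (train_df and test_df are hypothetical names) with a 'label' column holding 'good'/'bad' values:

```python
import pandas as pd

def undersample(df, label_col="label", seed=42):
    """Sample 'good' rows down to the number of 'bad' rows, then shuffle."""
    bad = df[df[label_col] == "bad"]
    good = df[df[label_col] == "good"].sample(n=len(bad), random_state=seed)
    return pd.concat([good, bad]).sample(frac=1, random_state=seed)

train_balanced = undersample(train_df)  # train_df/test_df are assumed names
test_balanced = undersample(test_df)
```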
Location Analysis - Training Data
Location Analysis - Test Data
The distribution of URL locations is similar between the training and test data. The maps do not distinguish between benign and malicious websites. The countries with the highest number of observations also have the highest number of malicious websites. The country names were converted to their two-letter ISO (alpha2) codes using pycountry-convert, and the alpha2 codes were then converted to the GPS coordinates of each country's center using the Nominatim API.
Note: Nominatim defaults to the United States when a query is ambiguous. A country with the two-letter abbreviation MN would plot in Minnesota instead of Mongolia unless you specify that the value should be treated as a country rather than a state. Also of note: when specifying the country for a conflicting return value, it must be lowercase (pycountry-convert returns uppercase).
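A hedged sketch of this conversion, using pycountry-convert's country_name_to_country_alpha2 helper and geopy's Nominatim wrapper (the user_agent string is hypothetical):

```python
from geopy.geocoders import Nominatim
from pycountry_convert import country_name_to_country_alpha2

geolocator = Nominatim(user_agent="malicious-url-study")  # hypothetical agent string

def country_coords(country_name):
    alpha2 = country_name_to_country_alpha2(country_name)  # e.g. 'Mongolia' -> 'MN'
    # A structured, lowercased query tells Nominatim the value is a country,
    # so 'mn' resolves to Mongolia rather than the US state of Minnesota.
    location = geolocator.geocode({"country": alpha2.lower()})
    return alpha2, (location.latitude, location.longitude) if location else None
```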
Protocol Confusion
The chart on the left shows the good and bad URL counts based on whether the HTTPS connection request marked them as HTTP or HTTPS in the training data. The one on the right shows the same counts based on the scheme extracted by URLParse, i.e. the protocol in the URL text. Further research is needed to determine which label is correct.
Correlation Between Features
'js_len' and 'js_obf_len' are highly correlated. Normally I might drop one of them, but both have high feature importance for the Random Forest Classifier, as demonstrated below.
There is also a high positive correlation between 'js_len' and 'num_hplinks', 'num_frame', 'num_script', 'num_form', and 'who_is_le'.
There is a high negative correlation between 'js_len' and 'https_le'.
There are correlations between other variables as well, mostly seen in the upper left corner of the variable correlation heatmap on the right.
Note: any variable ending in 'le' is label encoded; anything ending in 'enc' is encoded by a different method. 'country' is the encoded 'geo_loc' feature.
There is slight variation in URL length, but it is not pronounced enough to be statistically significant.
Malicious JavaScript (JS) length starts where benign stops. This heavy shift indicates that the length of JS will influence predictions.
Note that a KDE plot could not be created for the benign side: there was no variance in the data, indicating that only malicious websites have obfuscated JS.
URL Parse
URLParse is used to break a URL into its six basic components. The URL above outlines four of those; all six can be found in the table to the right.
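For reference, Python's built-in urlparse returns the six components (scheme, netloc, path, params, query, fragment); here it is applied to the first URL in the dataset:

```python
from urllib.parse import urlparse

parts = urlparse("http://members.tripod.com/russiastation/")
print(parts)
# ParseResult(scheme='http', netloc='members.tripod.com',
#             path='/russiastation/', params='', query='', fragment='')
```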
New Features
The features in the table to the right were extracted by a keyword/character search and count (a sketch follows this list).
num_hplinks counted the occurrences of the string 'href' in the 'content' feature
num_frame counted the occurrences of the string 'iframe' in the 'content' feature
num_form counted the occurrences of the string 'form' in the 'content' feature
url_spc_char_cnt counted the total occurrences of these special characters:
'[', ']', ':', '=', '&', '~', '%', ';', '!', '+', '_', '?'
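A minimal sketch of these counts, reusing the sampled DataFrame from the under-sampling sketch above (the project's exact extraction code may differ):

```python
SPECIAL_CHARS = "[]:=&~%;!+_?"

def keyword_counts(row):
    return {
        "num_hplinks": row["content"].count("href"),
        "num_frame": row["content"].count("iframe"),
        "num_form": row["content"].count("form"),
        # assumed to run over the 'url' field, per the feature name
        "url_spc_char_cnt": sum(row["url"].count(c) for c in SPECIAL_CHARS),
    }

new_features = train_balanced.apply(keyword_counts, axis=1, result_type="expand")
```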
Encoding
'who_is', 'https', and 'scheme' were label encoded
'label' was manually encoded so that 0 is good and 1 is bad
'geo_loc' and 'tld' were mean encoded
the remaining features were encoded to be 0 if there was no content, 1 if there was
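A sketch of these encoding steps under the same DataFrame assumptions as above, using sklearn's LabelEncoder for the label-encoded columns (the '_le'/'_enc' suffixes follow the naming note earlier):

```python
from sklearn.preprocessing import LabelEncoder

df = train_balanced.copy()  # reusing the sampled frame from the earlier sketch

# Label encode the categorical flags
for col in ["who_is", "https", "scheme"]:
    df[col + "_le"] = LabelEncoder().fit_transform(df[col])

# Manual target encoding: 0 = good, 1 = bad
df["label"] = df["label"].map({"good": 0, "bad": 1})

# Mean (target) encode the high-cardinality categoricals
for col in ["geo_loc", "tld"]:
    df[col + "_enc"] = df[col].map(df.groupby(col)["label"].mean())
```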
Base Model Metrics
The function returned the GB model as the best performer based on test accuracy, at 99.55%. It also had the second-highest precision at 99.76% and the third-highest recall at 99.33%. RF performed second best, with 99.54% test accuracy, the third-highest precision at 99.65%, and the second-highest recall at 99.44%. KNN was third, with 99.37% accuracy, 99.26% precision, and 99.48% recall, beating RF by 0.04% in recall. These three models, along with SVC (selected for its near-perfect precision), were chosen for hyperparameter tuning.
The confusion matrices and classification reports generated for the four selected models demonstrate that SVM has the fewest false positives, followed by GB, RF, and then KNN. However, SVM has the most false negatives by a large margin, followed by GB, KNN, and RF.
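The comparison function itself is project code; below is a hypothetical sketch of what such a helper might look like (X_train, y_train, X_test, y_test are assumed to be the encoded splits):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models = {
    "RF": RandomForestClassifier(random_state=42),
    "GB": GradientBoostingClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(),
}

for name, model in models.items():
    model.fit(X_train, y_train)  # X_train/y_train/X_test/y_test are assumed names
    preds = model.predict(X_test)
    print(f"{name}: acc={accuracy_score(y_test, preds):.4f} "
          f"prec={precision_score(y_test, preds):.4f} "
          f"rec={recall_score(y_test, preds):.4f}")
```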
Hyperparameter Tuning
GridSearchCV was used to test small ranges of hyperparameter values. The selected values were then used for a final train and validation. Random Forest had the best performance, followed by Gradient Boosting.
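A minimal GridSearchCV sketch; the parameter grid shown is illustrative, not the exact ranges tested in this study:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid only; the study's actual ranges may differ
param_grid = {"n_estimators": [100, 200, 500], "max_depth": [None, 10, 20]}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```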
Feature Importance using our Random Forest Classifier indicates 'js_len', 'js_obf_len', and 'num_script' have the most importance when determining whether a webpage is malicious or benign.
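Feature importances can be read directly off the fitted model; a short sketch, reusing the tuned estimator from the grid-search sketch above:

```python
import pandas as pd

best_rf = search.best_estimator_  # tuned model from the grid-search sketch
importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))  # 'js_len', 'js_obf_len', ... on top
```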
Below are the contribution scores for how much each feature contributed to the prediction of a given observation. On the left is an observation that was correctly predicted to be benign. The one on the right displays the contributions for an observation correctly predicted to be malicious.
Benign Prediction (Observation 2)
websites are less likely to be malicious if they do not have obfuscated JS code.
Malicious Prediction (Observation 3)
websites are more likely to be malicious if they do have obfuscated JS code.
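The exact tool used for these per-observation contribution scores is not stated; one common option for random forests is the treeinterpreter library, sketched here under that assumption:

```python
# Assumption: contributions computed with treeinterpreter, which decomposes a
# forest prediction as prediction = bias + sum(per-feature contributions).
from treeinterpreter import treeinterpreter as ti

obs = X_test.iloc[[2]]  # e.g. the benign example, observation 2
prediction, bias, contributions = ti.predict(best_rf, obs.values)
for feature, contrib in zip(X_test.columns, contributions[0]):
    print(feature, contrib)  # one contribution per class for each feature
```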
The ROC curve indicates our model has a near-perfect trade-off between sensitivity and specificity.
This precision-recall curve also shows a near-perfect trade-off between precision and recall, with an almost 90-degree angle. It is a little hard to see without zooming in, but the recall veers slightly inward from one to meet the precision line.
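A sketch of how these two curves can be generated from the tuned model's predicted probabilities (plot styling here is illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, precision_recall_curve, roc_curve

probs = best_rf.predict_proba(X_test)[:, 1]  # P(malicious)
fpr, tpr, _ = roc_curve(y_test, probs)
precision, recall, _ = precision_recall_curve(y_test, probs)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
ax1.set(xlabel="False Positive Rate", ylabel="True Positive Rate", title="ROC Curve")
ax1.legend()
ax2.plot(recall, precision)
ax2.set(xlabel="Recall", ylabel="Precision", title="Precision-Recall Curve")
plt.show()
```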
Final Model Metrics
SVC saw the largest increase in accuracy, though this is largely due to the use of scaled data. The scaled versions performed slightly worse for RF and GB, and there was no difference for KNN. SVM had the worst overall performance, even after tuning and scaling, suggesting it is not suitable for this data problem. KNN saw the largest test accuracy and precision increases after tuning, though RF and GB still had higher scores. Tuning had minimal influence on the GB scores. Overall, RF outperformed GB in accuracy and recall, scoring 99.59% and 99.50% respectively. GB scored higher in precision, though the two F1 scores were only 0.05% apart.
RF trained and tested in less than half the time of GB, though GB required only 3.83 seconds. Random Forest had more false positives (FP), and GB had more false negatives (FN). With malicious as the positive class, FPs are websites predicted to be malicious that were truly benign, and FNs are websites predicted to be benign that were truly malicious.
Conclusion:
Overall, the goal of correctly predicting malicious websites based on URL characteristics, content, and text features was a success. The scores achieved by several of these models outperformed the metrics reported in various other studies. Though this is an exciting prospect, further evaluation and peer review are needed to verify that no bias was unintentionally introduced into the model. This will be addressed in future work, in the hope that this study can be built upon to create a real-time application that protects users against malicious activity.
Future Work:
Natural Language Processing (NLP) on text content to see if sentiment analysis, topic modeling, or metrics like profanity or polarity scores affect predictions
Remove third-party retrieved features to reduce reliance on outside sources and improve real-time processing
Replicate the data retrieval outlined in the data source documentation to retrieve full text content and perform my own feature extraction and analysis
Replicate several steps of the original data collection process using the Common Crawl API, which exposes a massive unstructured dataset containing petabytes of raw web content, including metadata and text dating back to 2008
Apply advanced feature engineering to identify and extract the most relevant features for classification and reduce dimensionality
Figure out which version of the protocol information is correct (HTTP vs HTTPS)
Experiment to see whether mean encoding 'country' and 'tld' made them too correlated with the target variable
References:
Singh, A.K. "Malicious and Benign Webpages Dataset." Data in Brief, Elsevier, 12 Sept. 2020, https://www.sciencedirect.com/science/article/pii/S2352340920311987#ec-research-data.
Balogun, Abdullateef O., et al. “Improving the Phishing Website Detection Using Empirical Analysis of Function Tree and Its Variants.” Heliyon, vol. 7, no. 7, July 2021, p. e07437., https://doi.org/10.1016/j.heliyon.2021.e07437.
Barraclough, P.A., et al. “Intelligent Cyber-Phishing Detection for Online.” Computers & Security, vol. 104, May 2021, p. 102123., https://doi.org/10.1016/j.cose.2020.102123.
Hannousse, Abdelhakim, and Salima Yahiouche. “Towards Benchmark Datasets for Machine Learning Based Website Phishing Detection: An Experimental Study.” Engineering Applications of Artificial Intelligence, vol. 104, Sept. 2021, p. 104347., https://doi.org/10.1016/j.engappai.2021.104347.
Sánchez-Paniagua, Manuel, et al. “Phishing Websites Detection Using a Novel Multipurpose Dataset and Web Technologies Features.” Expert Systems with Applications, vol. 207, Nov. 2022, p. 118010., https://doi.org/10.1016/j.eswa.2022.118010.
Moedjahedy, Jimmy, et al. “CCrFS: Combine Correlation Features Selection for Detecting Phishing Websites Using Machine Learning.” Future Internet, vol. 14, no. 8, July 2022, p. 229. Crossref, https://doi.org/10.3390/fi14080229.