To create a lightweight representation of URLs, a dataset of 52,000 URLs was collected:
26,000 URLs were taken from the Alexa top websites and labelled as Legitimate. To obtain them, the Alexa domains were passed through the Heritrix web crawler to extract URLs, and the extracted URLs were then checked with VirusTotal so that only benign URLs were kept.
26,000 URLs were obtained from PhishTank and were labelled as Phishing.
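
The assembly of the two classes described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: it assumes the legitimate URLs (already crawled and filtered) and the phishing URLs are available as plain lists of strings, and the names `dedupe`, `build_dataset`, `per_class`, and the 0/1 label encoding are hypothetical choices made for the example.

```python
import random
from typing import Iterable, List, Tuple

# Assumed label encoding; the text only names the classes.
LEGITIMATE = 0
PHISHING = 1


def dedupe(urls: Iterable[str]) -> List[str]:
    """Strip whitespace, drop empty strings and repeats, keep first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


def build_dataset(
    legit_urls: Iterable[str],
    phish_urls: Iterable[str],
    per_class: int,
    seed: int = 0,
) -> List[Tuple[str, int]]:
    """Return a shuffled list of (url, label) pairs with per_class URLs per label."""
    legit = dedupe(legit_urls)
    phish = dedupe(phish_urls)
    if len(legit) < per_class or len(phish) < per_class:
        raise ValueError("not enough URLs to fill both classes")

    rows = [(url, LEGITIMATE) for url in legit[:per_class]]
    rows += [(url, PHISHING) for url in phish[:per_class]]

    # A fixed seed keeps the shuffled order reproducible between runs.
    random.Random(seed).shuffle(rows)
    return rows


if __name__ == "__main__":
    # Toy input; the dataset described above would use per_class=26000.
    legit = ["http://a.example/", "http://a.example/", "http://b.example/x"]
    phish = ["http://p.example/login", "http://q.example/verify"]
    for url, label in build_dataset(legit, phish, per_class=2):
        print(label, url)
```

Taking the same number of URLs from each source keeps the two classes balanced, which matches the 26,000 / 26,000 split reported above.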