Introduction
We designed an experiment to assess the effectiveness of PhishDecloaker on CAPTCHA-cloaked phishing websites in the wild. As our domain source we used CertStream, a service that streams real-time updates from the Certificate Transparency (CT) Log Network and announces newly issued SSL certificates. PhishDecloaker extracts domains from these logs and crawls each domain for phishing analysis.
The flow of data through the system is as follows:
1. New URLs from CertStream arrive.
2. Crawler establishes a remote browser session to visit the URL.
3. Crawler uses a webpage screenshot for CAPTCHA detection and recognition.
4. If CAPTCHAs are found:
   4.1 Crawler passes the remote session ID to the CAPTCHA solver.
   4.2 CAPTCHA solver connects to the remote browser session.
   4.3 Once the solver completes, fails, or times out, it notifies the crawler.
5. Crawler reconnects to the remote browser session.
6. Crawler checks the webpage for phishing.
7. Crawler stores the crawled data and analysis results in the database.
Figure: System design for field study. The parts marked in blue and red (CAPTCHA Detection, Recognition, Solver) are components from PhishDecloaker.
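The loop above can be sketched as follows. This is a minimal illustration assuming Playwright for the browser session; the detector, solver, phishing checker, and database interfaces are placeholders rather than the actual PhishDecloaker code.

```python
# Minimal sketch of the crawl flow above (illustrative only; the detector,
# solver, phishing checker, and database interfaces are placeholders).
from playwright.sync_api import sync_playwright

def crawl(url: str, detector, solver, phishing_checker, db) -> None:
    with sync_playwright() as p:
        # Steps 1-2: visit the newly observed URL in a browser session.
        # (The real system uses a *remote* session shared with the solver.)
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        # Step 3: run CAPTCHA detection/recognition on a page screenshot.
        captcha_type = detector.detect(page.screenshot())

        # Step 4: if a CAPTCHA is found, hand control to the solver and
        # block until it completes, fails, or times out.
        if captcha_type is not None:
            solver.solve(page, captcha_type)

        # Steps 5-7: resume control, run phishing analysis, store results.
        verdict = phishing_checker.analyze(page.screenshot(), page.url)
        db.save(url=url, captcha_type=captcha_type, verdict=verdict)
        browser.close()
```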
Experiment Setup
We designed a microservices system that combines CertStream, PhishDecloaker, PhishPedia, and PhishIntention. The system operates synchronously to facilitate debugging, observation, and research. The design choices are limited to the scope of this experiment; for practical suggestions on scaling the system up for production use cases, please see Home > Scalability.
1.) Number of CAPTCHA-Cloaked Phishing Websites
We conducted the experiment for about 3 weeks and crawled about 500,000 zero-day websites. The system detected 1,024 websites using CAPTCHA, of which 175 were reported as CAPTCHA-cloaked phishing websites. The tables below show various statistics related to the experiment.
2.) CAPTCHA Detection Performance & Overhead
Table 1: Confusion Matrix for CAPTCHA detection.
First, we analyze the performance of PhishDecloaker's detection and recognition models on webpages in the wild. By inspecting all webpages with a positive CAPTCHA prediction, we determine true and false positives; true and false negatives are estimated by sampling and inspecting a subset of webpages with a negative CAPTCHA prediction. From Table 1, precision and recall are calculated as 0.85 and 0.92, respectively.
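For reference, these values follow directly from the confusion-matrix counts; the helpers below take generic TP/FP/FN arguments rather than the actual Table 1 numbers.

```python
# Precision and recall from confusion-matrix counts (generic helpers; the
# actual counts come from Table 1 and are not reproduced here).
def precision(tp: int, fp: int) -> float:
    # Fraction of pages predicted to contain a CAPTCHA that actually do.
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Fraction of pages that contain a CAPTCHA and are flagged as such.
    return tp / (tp + fn)
```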
3.) Distribution of CAPTCHA Types
Figures 2a and 2b show the occurrence of each CAPTCHA type among captured benign webpages and CAPTCHA-cloaked phishing webpages, respectively. We observe that phishing websites prefer convenient, free CAPTCHA services such as reCAPTCHAv2 and hCaptcha.
4.) Distribution of Phishing Categories
The distribution of phishing categories employing CAPTCHA cloaking reveals a varied landscape. Among the observed cases, logistics (Canada Post, UPS, DHL, etc.) emerges as the most targeted sector with 102 instances, which aligns with periods of increased online activity and consumer transactions (i.e., Black Friday and Christmas) during the experiment. Banking follows with 21 instances, possibly indicating the adaptive nature of phishing threats against high-value targets (i.e., financial infrastructure). There are 9 recorded cases in the entertainment and gambling sector, while the government, technology, cryptocurrency, e-commerce, and telecommunication sectors face comparatively lower numbers. We were not able to determine the branding of 21 instances (categorized as "unknown") as they had been taken down by the time of analysis; however, these cases were flagged by VirusTotal (i.e., from community reports) as either phishing or malicious.
5.) Factors Affecting PhishDecloaker's Performance
Misclassification of CAPTCHAs Resembling Non-CAPTCHA Elements
The main factor affecting PhishDecloaker's performance is the misclassification of CAPTCHAs resembling non-CAPTCHA elements. Non-CAPTCHA elements (e.g., logos, buttons) were erroneously detected as text-based and press & hold CAPTCHAs, whereas new text-based CAPTCHA variants went undetected. To reduce false positives, we suggest incorporating other heuristics to confirm a CAPTCHA's presence on the page, for example by checking for input fields in the vicinity of the detection. We noticed that text-based CAPTCHAs are commonly used by non-English-language websites, and press & hold CAPTCHAs can be traced to commercial anti-bot protection services (i.e., HUMAN, PerimeterX). Our analysis did not reveal any instances of phishing websites using the aforementioned CAPTCHA types.
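One way to implement the input-field heuristic mentioned above is sketched below with Playwright; the detection format (a bounding box in page pixels) and the distance threshold are assumptions for illustration, not PhishDecloaker's actual values.

```python
# Sketch of a false-positive filter: accept a text-CAPTCHA detection only if
# an input field sits near the detected bounding box. The (x, y, w, h) format
# and the 150 px threshold are assumptions.
from playwright.sync_api import Page

def has_nearby_input(page: Page, x: float, y: float, w: float, h: float,
                     max_dist: float = 150.0) -> bool:
    cx, cy = x + w / 2, y + h / 2  # center of the detected CAPTCHA region
    for field in page.query_selector_all("input[type='text'], input:not([type])"):
        box = field.bounding_box()
        if box is None:  # element is not rendered
            continue
        fx, fy = box["x"] + box["width"] / 2, box["y"] + box["height"] / 2
        if ((fx - cx) ** 2 + (fy - cy) ** 2) ** 0.5 <= max_dist:
            return True
    return False
```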
Out-of-distribution CAPTCHA Challenges
hCaptcha periodically releases new challenge types that are conceptually different from its existing challenges. While it is possible to actively train deep learning models against each new challenge type, this is inefficient because some challenge types are temporary (i.e., for A/B testing and feature rollout). Currently, we instruct our solver to skip new challenges and request fresh ones until a compatible challenge is found.
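This skip-and-refresh strategy can be sketched as a simple retry loop; the solver and challenge interfaces below are hypothetical.

```python
# Sketch of the skip-and-refresh strategy for out-of-distribution challenges
# (hypothetical solver/challenge interfaces).
def solve_with_refresh(solver, challenge, max_refreshes: int = 5) -> bool:
    for _ in range(max_refreshes):
        if solver.can_solve(challenge):      # challenge type has a trained model
            return solver.solve(challenge)
        challenge = challenge.refresh()      # skip and request a new challenge
    return False                             # only unfamiliar challenges were served
```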
Custom CAPTCHA Variants
PhishDecloaker's performance is affected by custom CAPTCHA variants. While the system can identify and categorize these CAPTCHAs, it could not solve them due to the absence of corresponding solvers or the CAPTCHAs' incompatibility with existing solvers. For instance, we encountered cases where phishers replicate the form and functionality of reCAPTCHAv2, designed to thwart signature-based detection. Additionally, we identified a case involving slider-based CAPTCHAs whose sliders were modified to move diagonally instead of horizontally; solving this CAPTCHA redirects us to a phishing page posing as Bet365. To effectively address custom CAPTCHAs, we propose two potential directions: 1) develop a more robust model for out-of-distribution (OOD) CAPTCHA detection, maintain a set of common CAPTCHA solvers, and use human operators as a fallback for solving unknown CAPTCHAs, as in PhishDecloaker's case; 2) strive towards the development of generalist agents for web crawling tasks.
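Direction 1) can be viewed as a dispatch with a human-operator fallback, roughly as sketched below; the solver and queue interfaces are illustrative.

```python
# Sketch of direction 1): dispatch known CAPTCHA types to automated solvers
# and fall back to a human operator otherwise (illustrative interfaces).
def solve_captcha(captcha_type: str, session_id: str,
                  solvers: dict, human_queue) -> bool:
    solver = solvers.get(captcha_type)
    if solver is not None:
        try:
            if solver.solve(session_id):
                return True
        except Exception:
            pass  # automated solving failed; fall through to the human fallback
    # Unknown or unsolved CAPTCHA: queue the session for a human operator.
    human_queue.submit(session_id=session_id, captcha_type=captcha_type)
    return human_queue.wait_for_result(session_id, timeout=120)
```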
Multiple Anti-Bot Protection Layers
Websites, whether phishing or benign, often have robust anti-bot protection. These protective measures encompass various aspects, including network information (i.e., IP blacklists, geo-blocking, TLS handshakes), browser identity (i.e., user agents, request headers, canvas fingerprints, audio fingerprints, WebGL rendering, media devices), and behavioral data (i.e., visit frequency, mouse movements, keyboard presses). In such cases, despite answering the CAPTCHA challenge, our crawler will still be rejected for being suspicious. We believe that by partnering with anti-bot companies (i.e., Cloudflare's friendly-bot allowlist) and employing strategic workarounds (i.e., residential proxies), we can minimize the impact of other human authentication factors.
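As one example of such a workaround, the sketch below routes the crawler's browser session through a residential proxy and sets a realistic browser identity with Playwright; the proxy endpoint, credentials, and user agent string are placeholders.

```python
# Sketch: routing the crawler through a residential proxy and setting a
# realistic browser identity (proxy endpoint, credentials, and user agent
# are placeholders).
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=False,  # headless sessions are easier to fingerprint
        proxy={
            "server": "http://proxy.example.com:8000",
            "username": "PROXY_USER",
            "password": "PROXY_PASS",
        },
    )
    context = browser.new_context(
        user_agent=("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                    "AppleWebKit/537.36 (KHTML, like Gecko) "
                    "Chrome/120.0.0.0 Safari/537.36"),
        locale="en-US",
        viewport={"width": 1920, "height": 1080},
    )
    page = context.new_page()
    page.goto("https://example.com")
    browser.close()
```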