Background
CAPTCHA-cloaking is a recent technique in which phishing websites hide their content behind a CAPTCHA to avoid detection. It raises two issues:
Prevalence: the number of CAPTCHA-cloaked phishing sites grew almost tenfold from January 2023 to June 2023.
Severity: none of the existing phishing detection tools can identify CAPTCHA-cloaked phishing websites.
Introduction
In this work, we develop PhishDecloaker, a hybrid deep-vision system that detects, recognizes, and solves diverse CAPTCHAs, enabling existing state-of-the-art (SOTA) phishing detectors to identify CAPTCHA-cloaked phishing websites. Our experiments show that PhishDecloaker:
Recovers detection rates of phishing detectors from 0% to an average of 74.25% on CAPTCHA-cloaked phishing sites.
Generalizes to unseen CAPTCHAs with an average precision and recall of 86% and 69%, respectively.
Remains robust against various evasion attacks, including FGSM, JSMA, PGD, DeepFool, and DPatch.
Overview
Input: webpage URL
Step 1. CAPTCHA Detection: an Object Localization Network (OLN) detector.
Step 2. CAPTCHA Recognition: an OCR-aided Metric Learning network.
Step 3. CAPTCHA Solving: an arsenal of CAPTCHA solvers that utilize deep-vision and browser automation.
Output: a list of CAPTCHA bounding boxes (x_min, y_min, x_max, y_max), CAPTCHA types (string), solve statuses (boolean)
Figure 1: PhishDecloaker's system design
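The three steps above can be read as a simple pipeline from a URL to a list of results. The following is a minimal sketch of that data flow only, not PhishDecloaker's actual code: the names `CaptchaResult`, `run_pipeline`, and the `fake_*` stages are hypothetical placeholders, and it assumes one record per detected CAPTCHA.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (x_min, y_min, x_max, y_max), as in the Output description
BBox = Tuple[int, int, int, int]


@dataclass
class CaptchaResult:
    """Hypothetical record for one detected CAPTCHA."""
    bbox: BBox
    captcha_type: str
    solved: bool


def run_pipeline(
    url: str,
    detect: Callable[[str], List[BBox]],
    recognize: Callable[[str, BBox], str],
    solve: Callable[[str, BBox, str], bool],
) -> List[CaptchaResult]:
    """Chain detection, recognition, and solving for a single URL."""
    results = []
    for bbox in detect(url):                      # Step 1: CAPTCHA detection
        captcha_type = recognize(url, bbox)       # Step 2: CAPTCHA recognition
        solved = solve(url, bbox, captcha_type)   # Step 3: CAPTCHA solving
        results.append(CaptchaResult(bbox, captcha_type, solved))
    return results


# Placeholder stages standing in for the real detector, recognizer, and solvers.
def fake_detect(url: str) -> List[BBox]:
    return [(10, 20, 310, 98)]


def fake_recognize(url: str, bbox: BBox) -> str:
    return "example_type"


def fake_solve(url: str, bbox: BBox, captcha_type: str) -> bool:
    return True


results = run_pipeline(
    "https://example.test/login", fake_detect, fake_recognize, fake_solve
)
```

Passing the stages in as callables keeps the sketch independent of any particular model or browser automation library; a page without CAPTCHAs simply yields an empty list.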
Scalability
For practical deployment, we suggest decoupling the web browser from the crawler (i.e., running each as a containerized service). In this setup, each URL is loaded as an individual session in the browser. The crawler connects remotely to each session to crawl and extract information from the loaded page. If CAPTCHAs are detected on the page, the crawler sends a request to a solver, disconnects from the session, and moves on to crawl other sessions in the browser. On receiving the request, the solver connects remotely to the same browser session and interacts with the CAPTCHA. Once the CAPTCHA is solved, the solver sends a reply to the crawler. This asynchronous request-reply pattern can be implemented with message or task queues (e.g., Apache Kafka, RabbitMQ, Celery). The crawler, browser, and solver clusters can each be scaled independently and placed behind load balancers as needed, separating the page-rendering, crawling, and CAPTCHA-solving loads.
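The request-reply flow between crawler and solver can be illustrated with an in-process sketch. This is only an approximation under stated assumptions: `asyncio.Queue` stands in for a real message broker, browser sessions are reduced to hypothetical session IDs with a `has_captcha` flag, and the solver's work is simulated with a short sleep. All function names and status strings here are illustrative.

```python
import asyncio
from typing import Dict


async def solver(requests: asyncio.Queue, replies: asyncio.Queue) -> None:
    """Take solve requests off the queue and reply when each one is done."""
    while True:
        session_id = await requests.get()
        if session_id is None:  # sentinel: no more work
            break
        # Stand-in for connecting to the browser session and solving the CAPTCHA.
        await asyncio.sleep(0.01)
        await replies.put((session_id, True))


async def crawler(
    sessions: Dict[str, bool],
    requests: asyncio.Queue,
    replies: asyncio.Queue,
) -> Dict[str, str]:
    """Crawl each session; hand CAPTCHA pages to the solver and move on."""
    status: Dict[str, str] = {}
    pending = 0
    for session_id, has_captcha in sessions.items():
        if has_captcha:
            # Send a request, disconnect, and continue with other sessions.
            await requests.put(session_id)
            status[session_id] = "waiting_for_solver"
            pending += 1
        else:
            status[session_id] = "crawled"
    # Collect replies and revisit the sessions whose CAPTCHAs were handled.
    for _ in range(pending):
        session_id, solved = await replies.get()
        status[session_id] = "crawled_after_solve" if solved else "unsolved"
    await requests.put(None)  # tell the solver to shut down
    return status


async def main() -> Dict[str, str]:
    requests: asyncio.Queue = asyncio.Queue()
    replies: asyncio.Queue = asyncio.Queue()
    sessions = {"s1": False, "s2": True, "s3": False}
    solver_task = asyncio.create_task(solver(requests, replies))
    status = await crawler(sessions, requests, replies)
    await solver_task
    return status


status = asyncio.run(main())
```

In a real deployment the two queues would be replaced by broker topics or task queues, and the crawler and solver would run as separate services, which is what allows each cluster to scale on its own.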