We release the datasets used to train and test both the detection and recognition models.
CAPTCHA Detection Dataset
Usage: Training and testing the CAPTCHA detection model.
Contents: 19,680 webpage screenshots, of which 10,680 have annotated CAPTCHA bounding boxes and the remaining 9,000 have no annotations (negative examples). Sourced from both the Alexa top-1 million websites and synthetic data generation.
CAPTCHA Recognition Dataset
Usage: Training and testing the CAPTCHA recognition model.
Contents: 6,612 CAPTCHA images distributed across 38 classes. Collected by scraping demo websites, using official API keys provided by vendors, and gathering datasets contributed by the community.
CAPTCHA Open-set Dataset
Usage: Open-set testing on Phishdecloaker.
Contents: 1,500 webpage screenshots, each annotated with its CAPTCHA class, spanning 15 unseen categories. Synthetically generated.
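
The figures above can be kept in a small manifest and sanity-checked before training. The following is a minimal sketch only: the `DatasetSummary` class, its field names, and the short dataset keys are hypothetical and are not part of the released files. Only the counts are taken from the descriptions above, and a value of `None` marks a figure the descriptions do not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetSummary:
    """Hypothetical record of the figures listed for one dataset."""
    name: str
    total: int                         # number of images or screenshots
    unannotated: Optional[int] = None  # negative examples; None if not stated
    num_classes: Optional[int] = None  # CAPTCHA classes; None if not stated

    def annotated(self) -> Optional[int]:
        """Annotated items, or None when the unannotated count is unknown."""
        if self.unannotated is None:
            return None
        return self.total - self.unannotated


# Counts copied from the dataset descriptions; the keys are illustrative.
DATASETS = {
    "detection": DatasetSummary("CAPTCHA Detection Dataset", 19_680, unannotated=9_000),
    "recognition": DatasetSummary("CAPTCHA Recognition Dataset", 6_612, num_classes=38),
    "open_set": DatasetSummary("CAPTCHA Open-set Dataset", 1_500, unannotated=0, num_classes=15),
}

if __name__ == "__main__":
    for key, ds in DATASETS.items():
        print(key, ds.total, ds.annotated(), ds.num_classes)
```

For the detection dataset, `annotated()` recovers the 10,680 annotated screenshots from the total and the 9,000 negative examples, which is a quick way to confirm that a downloaded copy matches the stated split.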