A11YBench is a curated benchmark designed to evaluate automated systems that repair real-world Web Accessibility (A11Y) violations. It contains diverse, production-quality web projects along with violation reports, DOM snapshots, and execution environments that enable reproducible end-to-end evaluation.
Overview of A11YRepair:
1️⃣-2️⃣ It first groups all violations at the component and situation levels for fault localization and patch generation. 3️⃣-4️⃣ Then, it leverages chat and embedding models to locate buggy files and synthesize patch edits. 5️⃣ Meanwhile, the knowledge integration module selectively incorporates WCAG guidelines for fault localization and patch generation when necessary.
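The five stages above can be sketched as a minimal pipeline skeleton. All function and field names below are hypothetical placeholders for illustration, not A11YRepair's actual API; the model calls are stubbed out.

```python
# Hypothetical sketch of the five-stage pipeline described above.
# All names are illustrative, not A11YRepair's real API.

def group_violations(violations):
    """Stages 1-2: group violations at the component and situation levels."""
    groups = {}
    for v in violations:
        groups.setdefault((v["component"], v["situation"]), []).append(v)
    return groups

def locate_faults(group):
    """Stage 3: candidate buggy files (chat/embedding model calls stubbed)."""
    return sorted({v["file"] for v in group})

def select_knowledge(group):
    """Stage 5: selectively attach WCAG guidance when a violation needs it."""
    return [v["wcag"] for v in group if v.get("wcag")]

def generate_patch(group, files, knowledge):
    """Stage 4: synthesize patch edits for the located files (stubbed)."""
    return {"files": files, "edits": [], "wcag": knowledge}

def repair(violations):
    patches = []
    for group in group_violations(violations).values():
        files = locate_faults(group)
        knowledge = select_knowledge(group)
        patches.append(generate_patch(group, files, knowledge))
    return patches
```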
This section explains how to reproduce all experiments presented in the paper, including the main study, ablation studies, and generalizability tests.
To begin, download all repository instances provided in [A11YBench](https://sites.google.com/view/a11yrepair/a11ybench).
You may place the downloaded GitHub project repositories under:
A11YBench/Project/GitHub
After placing them, follow the installation commands (Install CMD) included inside each A11YBench project to install dependencies and launch the corresponding website.
All experiments can be reproduced using the provided shell scripts. Below we summarize the commands used for each research question.
RQ1: Overall Effectiveness
To run the main experiment that repairs Web A11Y violations for all target projects:
bash 1_main_study.sh
This script runs the full A11YRepair pipeline end-to-end.
RQ2: Ablation Study
These experiments evaluate the contribution of each system component.
1.2.1 Impact of Violation Grouping
1) Grouping Granularity.
A11YRepair_com vs. A11YRepair_cri vs. A11YRepair_sit
bash 2_ablation_study_1_A11YRepair_com.sh
bash 2_ablation_study_1_A11YRepair_cri.sh
bash 2_ablation_study_1_A11YRepair_sit.sh
2) LLM-based Refining.
A11YRepair_wor vs. A11YRepair_wr
bash 2_ablation_study_1_A11YRepair_wor.sh
bash 2_ablation_study_1_A11YRepair_wr.sh
1.2.2 Impact of Fault Localization
1) Feature Retrieval.
A11YRepair_woe vs. A11YRepair_we
bash 2_ablation_study_1_A11YRepair_woe.sh
bash 2_ablation_study_1_A11YRepair_we.sh
2) Locate Reflection.
A11YRepair_wol vs. A11YRepair_wl
bash 2_ablation_study_1_A11YRepair_wol.sh
bash 2_ablation_study_1_A11YRepair_wl.sh
1.2.3 Impact of Knowledge Integration
A11YRepair_nw vs. A11YRepair_aw vs. A11YRepair_sw
bash 2_ablation_study_1_A11YRepair_nw.sh
bash 2_ablation_study_1_A11YRepair_aw.sh
bash 2_ablation_study_1_A11YRepair_sw.sh
RQ3: Generalizability Study
To evaluate A11YRepair with different LLM backbones:
bash 3_gen_study_o4_mini.sh
bash 3_gen_study_gpt41_mini.sh
bash 3_gen_study_gpt5_mini.sh
We provide all patches generated by A11YRepair under different model configurations, along with detailed evaluation reports.
These include Web A11Y reports, patches, and execution logs for reproducible analysis.
For parameter settings, we follow the practice of Agentless and Agentless Lite; see the bash scripts (e.g., 1_main_study.sh) for details. To ensure a fair comparison, A11YRepair and all baselines use identical parameter settings, summarized below:
1️⃣ Fault Localization Grouping.
In the initial size-based grouping stage, we set the default grouping window to 1920×1080 pixels to cluster violation elements on the current webpage. During the subsequent LLM-based refining stage, the model temperature is set to 0 with a single sampling run, and the maximum number of refining rounds is limited to one.
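The size-based grouping stage can be illustrated as bucketing violation elements by which 1920×1080 window of the page they fall into. This is a simplified sketch under that assumption; the actual clustering logic may differ.

```python
# Simplified sketch of size-based grouping: bucket violation elements by
# which 1920x1080 window of the page they fall into. Element positions are
# assumed to be absolute page coordinates in pixels (illustrative only).
VIEWPORT_W, VIEWPORT_H = 1920, 1080

def group_by_viewport(elements):
    """elements: list of dicts with absolute page coordinates 'x' and 'y'."""
    groups = {}
    for el in elements:
        tile = (el["x"] // VIEWPORT_W, el["y"] // VIEWPORT_H)
        groups.setdefault(tile, []).append(el)
    return groups
```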
2️⃣ Patch Generation Grouping.
We adopt the IBM Accessibility Checker rule set (version 2025.09.03) as the criterion for clustering violation elements by the rules they violate. Each rule is further mapped to its corresponding WCAG 2.2 guideline to support rule-aware patch generation.
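Rule-based clustering with a rule-to-guideline mapping might look like the following. The two mapping entries shown are illustrative examples only, not the full IBM-rule-to-WCAG table.

```python
# Illustrative sketch: cluster violations by the checker rule they violate,
# then attach the WCAG 2.2 guideline mapped to that rule. The two mapping
# entries below are examples only, not the complete rule table.
RULE_TO_WCAG = {
    "img_alt_valid": "1.1.1 Non-text Content",
    "element_tabbable_visible": "2.4.7 Focus Visible",
}

def cluster_by_rule(violations):
    clusters = {}
    for v in violations:
        rule = v["ruleId"]
        clusters.setdefault(rule, {"wcag": RULE_TO_WCAG.get(rule), "items": []})
        clusters[rule]["items"].append(v)
    return clusters
```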
3️⃣ Fault Localization.
For the File Localization step, we set the model temperature to 1 with two sampling runs, requiring the model to return all potentially buggy source files.
In the Locate Reflection step, the temperature is set to 0 with a single sampling run, and the maximum number of reflection iterations is capped at two.
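The File Localization sampling scheme (temperature 1, two runs, union of candidates) can be sketched as follows. The chat call is stubbed out; the real system queries an LLM backbone, and the function name is hypothetical.

```python
# Sketch of the File Localization sampling scheme: temperature 1, two
# sampling runs, union of the returned candidate file lists. The model
# call is a stub passed in by the caller (illustrative only).
def localize_files(query_model, prompt, n_samples=2, temperature=1.0):
    candidates = set()
    for _ in range(n_samples):
        candidates.update(query_model(prompt, temperature=temperature))
    return sorted(candidates)
```

Taking the union across samples keeps every file that any run flagged, matching the requirement that the model return all potentially buggy source files.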
4️⃣ Patch Generation.
During patch generation, the model temperature is set to 1 with two sampling runs, and the model is instructed to output patches in a search/replace format.
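A search/replace edit, as used in Agentless-style pipelines, pairs an exact source snippet with its replacement. Applying one might look like this minimal sketch (not A11YRepair's actual patch parser):

```python
# Minimal sketch of applying a search/replace edit: the model emits an
# exact "search" snippet and its "replace"ment; the edit is applied only
# if the search text occurs verbatim in the file content.
def apply_edit(source, search, replace):
    if search not in source:
        raise ValueError("search snippet not found; edit rejected")
    return source.replace(search, replace, 1)
```

Requiring a verbatim match is what makes the format easy to validate: a patch that does not anchor to real source text is rejected rather than silently misapplied.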
5️⃣ Knowledge Integration.
We use WCAG 2.2 as the knowledge base. For both the Necessity Analysis and Technical Selection stages, the model temperature is set to 0 with a single sampling run.
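The two-stage lookup can be sketched as: first decide whether a violation group needs external guidance (Necessity Analysis), then pick the relevant WCAG 2.2 entry (Technical Selection). Both predicates below are illustrative stubs for the LLM calls made at temperature 0 with a single sample.

```python
# Sketch of Knowledge Integration: Necessity Analysis decides whether a
# group needs WCAG guidance; Technical Selection then picks the relevant
# entry. Both callables are illustrative stubs for deterministic
# (temperature 0, single-sample) LLM calls.
def integrate_knowledge(group, wcag_kb, needs_knowledge, select_entry):
    if not needs_knowledge(group):        # Necessity Analysis
        return None
    return select_entry(group, wcag_kb)   # Technical Selection
```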