1. Target Web Application Selection: The web applications we chose are consistent with those used in prior work and are widely adopted in the community. In particular, TrainTicket is the largest publicly available, deployable system in this domain. It consists of 41 microservices, spans multiple programming languages and frameworks, and has a complex inter-microservice architecture, which makes it a representative and challenging benchmark.
2. Diversity of Attacks and Constraint Violations: Our experiments cover a wide range of realistic and impactful attack types, including injection attacks, unauthorized data access, data constraint violations, and identity spoofing. From the perspective of invariant constraints, the attack scenarios are designed to violate several types of constraints, namely intra-API, API-to-API, API-to-database, and API-to-environment constraints, as well as combinations of these. A detailed breakdown of attack categories and their corresponding counts is presented in the following table. In terms of constraint granularity, our attack suite spans all three levels of violation: column-level, tuple-level, and relation-level.
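To make the three granularity levels concrete, the following minimal sketch shows one check per level. The record fields (`price`, `bought_date`, `travel_date`, `user_id`) and the function names are hypothetical and chosen only for illustration; they are not taken from TrainTicket or from our generated invariants, and the reading of each level is a simplified assumption.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]  # one database record, keyed by column name (illustrative)


def column_level_ok(order: Row) -> bool:
    """Column-level: a single column value must stay within its domain."""
    return order["price"] >= 0


def tuple_level_ok(order: Row) -> bool:
    """Tuple-level: columns of the same row must be mutually consistent.

    Dates are ISO-8601 strings here, so string comparison orders them correctly.
    """
    return order["travel_date"] >= order["bought_date"]


def relation_level_ok(orders: List[Row], users: List[Row]) -> bool:
    """Relation-level: rows must be consistent across tables.

    Here, every order must reference an existing user.
    """
    user_ids = {u["id"] for u in users}
    return all(o["user_id"] in user_ids for o in orders)
```

Under this reading, an attack that sets a negative price violates a column-level constraint, while an order that references a non-existent user violates a relation-level constraint.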
We have added an ablation study to investigate the performance gap between GPT-4o and GPT-4o-mini. Specifically, we manually analyzed 10 invariants that GPT-4o generated successfully but GPT-4o-mini failed to generate. Our findings indicate that the primary reason for GPT-4o-mini's poor performance is its limited code generation capability. Of these 10 invariants, 9 failed because of incorrect or malformed code. Notably, in these 9 cases the natural language descriptions of the invariants were generated correctly, but the model failed to translate them into executable code. This suggests that the issue lies primarily in code synthesis rather than in understanding or reasoning. We have incorporated the details of this analysis into the revised version of the paper.
All detailed prompts are available at [Link].
All examples are available at [Link].
In the "w/o deducing" ablation study, we employ LLMs to induce invariants directly from detailed logs. For a fair comparison, we augment the raw logs with joined information to form a combined log. This combined log includes not only the original data but also additional context obtained through join operations across related tables. An example of such a detailed log is shown below. In this ablation study, these combined logs replace the schema-level information.
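As a rough illustration of how a combined log can be built through a join, the sketch below attaches the related order record to each raw log entry. The tables `api_log` and `orders`, their columns, and the helper names are hypothetical and used only for this example; they do not reproduce the actual log format or schema used in our experiments.

```python
import sqlite3
from typing import Any, Dict, List


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, owner_id INTEGER, status TEXT);
        CREATE TABLE api_log (id INTEGER PRIMARY KEY, api TEXT,
                              user_id INTEGER, order_id INTEGER);
        INSERT INTO orders VALUES (10, 1, 'PAID'), (11, 2, 'PAID');
        INSERT INTO api_log VALUES (1, 'cancelOrder', 1, 10),
                                   (2, 'cancelOrder', 1, 11);
        """
    )
    return conn


def build_combined_log(conn: sqlite3.Connection) -> List[Dict[str, Any]]:
    """Augment each raw log entry with context joined from a related table."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        """
        SELECT l.api, l.user_id, l.order_id,
               o.owner_id AS order_owner_id, o.status AS order_status
        FROM api_log AS l
        LEFT JOIN orders AS o ON o.id = l.order_id  -- keep entries without a match
        ORDER BY l.id
        """
    ).fetchall()
    return [dict(r) for r in rows]


if __name__ == "__main__":
    for entry in build_combined_log(make_demo_db()):
        print(entry)
```

In this toy data, the second entry shows user 1 cancelling an order owned by user 2. That relationship is visible only after the join, which is the kind of additional context the combined log is meant to expose to the LLM.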
All experimental results are available at [Link].