Rebuttal Details

Table of Contents

Video Demonstration

RA-Q1: Demographic Analysis of Attack Types

RA-Q3: Any ablation study to figure out the impact of different LLMs?

RB-Q1: Prompts

RC-Q2: Mapping Table Between WebNorm and MINES

RA-Q1': Examples of Generated Invariants and Detected Anomalies

RA-Q3': Generating from Log Instances

RB-Q5': Experimental Results for All Datasets

Video Demonstration

video-cut-short-version-speed1-transcode.mp4

RA-Q1: Demographic Analysis of Attack Types

1. Target Web Application Selection: Our chosen web applications are consistent with those used in prior work and are widely adopted in the community. In particular, TrainTicket is the largest publicly available deployable system for this domain. It consists of 41 microservices, spans multiple programming languages and frameworks, and features a complex inter-microservice architecture, making it a highly representative and challenging benchmark.

2. Diversity of Attacks and Constraint Violations: Our experiments cover a wide range of realistic and impactful attack types, including injection attacks, unauthorized data access, data constraint violations, and identity spoofing. From the perspective of invariant constraints, the attack scenarios are designed to violate several types of constraints: intra-API constraints, API-to-API constraints, API-to-database constraints, API-to-environment constraints, and combinations of these. In terms of constraint level, our attack suite spans all three levels of violation granularity: column-level, tuple-level, and relation-level. A detailed breakdown of attack categories and their corresponding counts is presented in the following table.

RA-Q3: Any ablation study to figure out the impact of different LLMs?

We have added an ablation study to investigate the performance gap between GPT-4o and GPT-4o-mini. Specifically, we manually analyzed 10 invariants that GPT-4o generated successfully but GPT-4o-mini did not. Our findings indicate that the primary reason for GPT-4o-mini's weaker performance is its limited code generation capability. Of the 10 invariants that GPT-4o-mini failed to produce, 9 failed because of incorrect or malformed code. Notably, in these 9 cases the natural language descriptions of the invariants were generated correctly, but the model failed to translate them into executable code. This suggests that the issue lies primarily in code synthesis rather than in understanding or reasoning. We have incorporated the details of this analysis into the revised version of the paper.
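To make the distinction between a correct natural language description and a failed code translation concrete, the following is a minimal sketch and not the actual WebNorm pipeline. The invariant, the field names (`request_price`, `db_price`), and the helper `run_invariant` are hypothetical and chosen only for illustration. The sketch separates a code error (the invariant cannot be executed) from a genuine violation (the invariant executes and returns false).

```python
# Hypothetical invariant description; not taken from WebNorm's outputs.
DESCRIPTION = "The price in an order request must equal the stored ticket price."

# A well-formed translation of the description into executable code.
GOOD_SOURCE = (
    "def check(log):\n"
    "    return log['request_price'] == log['db_price']\n"
)

# A malformed translation: '=' instead of '==' is a syntax error.
BAD_SOURCE = (
    "def check(log):\n"
    "    return log['request_price'] = log['db_price']\n"
)


def run_invariant(source, log):
    """Return 'ok', 'violation', or 'code_error' for one log entry."""
    namespace = {}
    try:
        exec(source, namespace)            # expected to define check(log)
        result = namespace["check"](log)
    except Exception:
        # Covers syntax errors, a missing check(), and runtime failures.
        return "code_error"
    return "ok" if result else "violation"
```

Under this sketch, a description such as `DESCRIPTION` can be entirely correct while its translation (`BAD_SOURCE`) still yields `code_error`, which mirrors the failure pattern described above.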

RB-Q1: Prompts

All detailed prompts are available at [Link].

RC-Q2: Mapping Table Between WebNorm and MINES

The detailed attack information and the mapping between WebNorm and MINES attacks are given in the following table.

MINES Attacks vs WebNorm Attacks (Anonymous Author)

RA-Q1': Examples of Generated Invariants and Detected Anomalies

All examples are available at [Link].

RA-Q3': Generating from Log Instances

In the "w/o deducing" ablation study, we employ LLMs to induce invariants directly from detailed logs. For a fair comparison, we further augment the raw logs by incorporating joined information into a combined log. This combined log includes not only the original data but also additional context extracted through join operations across related tables. An example of such a detailed log is shown below. In this ablation study, we replace schema-level information with these combined logs.
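The join-based augmentation can be illustrated in general terms with the sketch below. It is an assumption-laden illustration, not our actual log format: the log fields, the `users` and `trains` tables, and the helper `build_combined_log` are all hypothetical names.

```python
# Hypothetical raw API log entry; field names are illustrative only.
raw_log = {"api": "createOrder", "user_id": 7, "train_id": "G1234", "price": 50.0}

# Hypothetical related tables, keyed by primary key.
users = {7: {"user_id": 7, "role": "customer"}}
trains = {"G1234": {"train_id": "G1234", "base_price": 50.0}}


def build_combined_log(log, tables):
    """Augment a raw log with rows joined from related tables.

    `tables` maps a foreign-key field in the log to (table_name, table).
    Joined columns are prefixed with the table name to avoid collisions.
    """
    combined = dict(log)  # keep the original data unchanged
    for fk_field, (table_name, table) in tables.items():
        row = table.get(log.get(fk_field))
        if row is None:
            continue  # no matching row: nothing to add for this table
        for column, value in row.items():
            combined[f"{table_name}.{column}"] = value
    return combined


combined = build_combined_log(
    raw_log, {"user_id": ("users", users), "train_id": ("trains", trains)}
)
```

Here `combined` carries the original fields plus prefixed columns such as `users.role` and `trains.base_price`, so a model reading the log sees the joined context without needing the schema.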

RB-Q5': Experimental Results for All Datasets

All experimental results are available at [Link].
