Dataset Overview
We collected raw application logs from instrumented back-ends. To obtain normal logs, we ran test cases (seeds) on the front-end to simulate user activity. For attack logs, we performed web tampering attacks on the front-end. This log collection stage was applied to two web application benchmarks: TrainTicket and NiceFish. The details of the raw logs are as follows:
Log Instrumentation
The raw logs can be grouped into two structural categories: those emitted by service methods (business logic layer) and those emitted by repository methods (persistence/database layer). Both are instrumented in the following format:
Timestamp
Method and service class
Arguments
Request headers
Execution time
Return
2024-05-20 13:38:18.463 INFO 1 --- [http-nio-18898-exec-2] s.s.LoggingAspect: Entering in Method: getLeftTicketOfInterval, Class: seat.service.SeatServiceImpl, Arguments: [Seat(travelDate=2024-05-20 00:00:00, trainNumber=Z1236, startStation=nanjing, destStation=shanghai, seatType=3, totalNum=2147483647, stations=[taiyuan, shijiazhuang, nanjing, shanghai])], Request Headers: {...}, Execution Time: [TIME] milliseconds, Return: Response(status=1, msg=Get Left Ticket of Internal Success, data=1073741823)
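To consume these raw lines programmatically, a parser along the lines below can split out the fields. This is a minimal sketch in Python: the regex and group names (method, clazz, ret, etc.) are ours and assume the field order shown above, not a schema shipped with the dataset.

import re

# Sketch: parse one instrumented log line into named fields. Assumes the
# field order shown above (Method, Class, Arguments, Request Headers,
# Execution Time, Return); the group names are ours.
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}).*?"
    r"Entering in Method: (?P<method>[^,]+), "
    r"Class: (?P<clazz>[^,]+), "
    r"Arguments: (?P<arguments>\[.*?\]), "
    r"Request Headers: (?P<headers>\{.*?\}), "
    r"Execution Time: (?P<exec_time>\S+) milliseconds, "
    r"Return: (?P<ret>.*)$"
)

def parse_log_line(line: str):
    """Return a dict of the named fields, or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None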
Normal Training Log Collection (Seeds)
We run a set of pre-defined seeds (e.g., user login, user ticket booking), together with the industry-fault cases shipped with the benchmark, to build our reference behavioral model. These seeds are executed to cover the web application's full observable behavior. The next video shows the Pay seed (Login -> Get OrderList -> Pay).
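For illustration, a seed is essentially a scripted browser session. The sketch below shows what a Pay seed could look like with Selenium; the URL and element locators are hypothetical placeholders, not the benchmark's real selectors.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Sketch of a Pay seed (Login -> Get OrderList -> Pay). All URLs and
# locators below are hypothetical placeholders.
driver = webdriver.Chrome()
driver.get("http://localhost:8080/login.html")  # hypothetical login page
driver.find_element(By.ID, "login_email").send_keys("test_user")
driver.find_element(By.ID, "login_password").send_keys("test_password")
driver.find_element(By.ID, "login_button").click()

driver.get("http://localhost:8080/order_list.html")  # hypothetical order list
driver.find_element(By.CLASS_NAME, "pay_button").click()  # pay the first order

driver.quit()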
Normal Testing Log Collection (GPT-triggered)
We generate normal testing scenarios by employing GPT-4 to simulate user interactions based on different pre-defined tasks.
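As an illustration, the sketch below asks GPT-4 for a sequence of front-end actions for one task; the prompt wording and the action format are our assumptions, not the exact prompts used in the experiments.

from openai import OpenAI

# Sketch: GPT-triggered normal-scenario generation. The prompt wording and
# action vocabulary are illustrative assumptions.
client = OpenAI()  # reads OPENAI_API_KEY from the environment

task = "Book a ticket from Shanghai to Nanjing for tomorrow, then pay for it."
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You simulate a normal user of a train-booking website. "
                    "Output one front-end action per line, e.g. "
                    "'click #search_button' or 'type #from_station Shanghai'."},
        {"role": "user", "content": task},
    ],
)
actions = response.choices[0].message.content.splitlines()
# Each action line is then replayed against the front-end by a browser driver.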
Attack Log Collection
No Input Type Validation (common sense): Tampering with the URL, request body, and Location address mainly happens when input type validation is missing, so any datatype entered by a "user" is processed as-is. For example, attackers can manipulate navigation URLs to include malicious payloads, redirecting users to attacker-controlled sites; more seriously, an attacker can enter a negative baggage weight into a luggage consignment form and thereby evade the consignment payment.
No Database Validation (data transition): Most websites do implement input type validation; however, tampering can still succeed when the input is not verified against the database. Attackers can manipulate URLs and asynchronous request bodies to submit malicious data to the server. For example, an attacker can tamper with API.payment and change the price from 100 to 10 (see the request sketch after this list). If the server never checks the submitted price against the price actually recorded for that UserID, the attacker obtains the membership service while evading the higher price.
Bypass Database Validation (shared database): Even when an application is equipped with database verification, a sophisticated attacker can still bypass it. Most back-ends embed the database validation inside loop functions and fall back to the initial value when the UserID does not belong to any loop, so an attacker can manipulate the validation logic to sidestep every loop. For example, an attacker can adjust the request body and enumerate the loop functions so that the modified price value for a UserID never enters any loop, bypassing the database validation and achieving the malicious behavior.
Bypass Input Result Validation (flow): Even when an application can detect the database-bypass vulnerabilities above, an attacker can manipulate the logic flow to tamper with the results of the validation process. For example, the attacker might set a breakpoint in the JavaScript code where the server response is processed. When execution pauses at the breakpoint, the attacker can modify the response data, changing a validation flag from false to true. This alteration tricks the front end into believing that the data passed the validation checks, allowing the attacker to alter the application's behavior.
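To make the No Database Validation category concrete, the sketch below replays a payment request with a tampered price, as an attacker would after intercepting the asynchronous request. The endpoint path and JSON fields are hypothetical placeholders, not the benchmark's real payment API.

import requests

# Sketch of the "No Database Validation" tamper: replay a legitimate payment
# request with the price lowered from 100 to 10. The endpoint and fields are
# hypothetical placeholders.
payload = {
    "userId": "user-4321",
    "orderId": "order-8765",
    "price": 10,  # legitimate price was 100; tampered before submission
}
resp = requests.post(
    "http://localhost:8080/api/payment",  # hypothetical endpoint
    json=payload,
    timeout=10,
)
# A server without database validation accepts this request, because it never
# compares payload["price"] against the price stored for this orderId.
print(resp.status_code, resp.text)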
Processed Logs (for Baselines)
WebNorm's constraint-learning phase can use the raw logs as-is, consulting the event graph to select logs based on event relations. However, the baselines used in our experiments require logs parsed into a specific format.
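As an illustration, the sketch below converts the fields produced by parse_log_line (see the parsing sketch above) into the simplified Input / API / Output triples shown under "Instrumented Logs" at the end of this section; treating that triple layout as the baselines' input schema is our assumption.

# Sketch: flatten a parsed raw log into an Input / API / Output triple.
# Assumes the triple layout shown in the example at the end of this
# section is the format the baselines consume.
def to_baseline_record(fields: dict) -> str:
    return (
        f"Input = {fields['arguments']}\n"
        f"API = {fields['clazz']}\n"
        f"Output = {fields['ret']}\n"
    )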
/nicefish-spring-boot/nicefish-cms
# Run the instrumented back-end, keeping only the aspect/proxy log lines
$ stdbuf -oL -eL mvn spring-boot:run 2>&1 | stdbuf -oL grep "com.nicefish.cms.service.LoggingAspect : \|o.t.ehcachedx.monitor.util.RestProxy :" | tee my_process_info.log

/NiceFish-React
# Start the front-end against the dev back-end
$ npm run start:dev-backend

# Build without running tests, then drive the browser headlessly to collect logs
$ mvn clean install -DskipTests
$ xvfb-run --auto-servernum --server-num=1 python logs_collector.py

# Clean up leftover browser processes after collection
$ pkill -x 'chrome'
$ pkill -x 'chromedriver'
----------------------------------------------------------------------------------------------------
Instrumented Logs
Input = RebookInfo (oldTripId=D1345, tripId=D1345, seatType=2, date=2024-06-03)
API = Rebook.service.RebookServiceImpl
Output = Response (orderMoneyDifference=27.5)