To further evaluate the effectiveness of RulePilot, we conducted a controlled user study focusing on natural language–to–rule conversion. The goal was to assess how users with varying levels of security expertise perform when writing detection rules, both with and without the assistance of our system.
This section walks through one representative rule generation task in detail.
Study Setup
Participants were asked to convert a real-world detection scenario into a syntactically valid and logically correct SIEM rule. The task was based on an official Splunk detection rule categorized under Endpoint / Privilege Escalation, titled:
Original Description:
This analytic detects instances where a user adds themselves to an Active Directory (AD) group. This activity is a common indicator of privilege escalation, where a user attempts to gain unauthorized access to higher privileges or sensitive resources. By monitoring AD logs, this detection identifies such suspicious behavior, which could be part of a larger attack strategy aimed at compromising critical systems and data.
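The condition in this description reduces to comparing the account that performed a group-membership change with the account that was added. A minimal Python sketch of that check follows. The event dictionaries and their values are hypothetical, and the field names user, src_user, and Group_Name are borrowed from the rules shown later in this section rather than from any guaranteed log schema.

```python
from collections import defaultdict

# Hypothetical, hand-written events for illustration only.
# Assumption: `src_user` is the account that made the change and
# `user` is the account that was added to the group.
events = [
    {"src_user": "alice", "user": "alice", "Group_Name": "Domain Admins"},
    {"src_user": "bob", "user": "carol", "Group_Name": "Helpdesk"},
    {"src_user": "dave", "user": "dave", "Group_Name": "Administrators"},
    {"src_user": "alice", "user": "alice", "Group_Name": "Domain Admins"},
]


def is_self_addition(event):
    """Return True when the actor and the added member are the same account."""
    actor = event.get("src_user")
    return actor is not None and actor == event.get("user")


def count_self_additions(events):
    """Count self-additions per (actor, group) pair."""
    counts = defaultdict(int)
    for event in events:
        if is_self_addition(event):
            counts[(event["src_user"], event["Group_Name"])] += 1
    return dict(counts)


self_additions = count_self_additions(events)
```

Events that lack an actor field are treated as non-matches here, which is one possible design choice; a production rule would also need to restrict the input to group-membership change events.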
Participants
General Users (Novice Group)
Participants in this group were upper-level undergraduate or graduate students majoring in Computer Science, with no formal background in cybersecurity or SIEM operation. While they were proficient in general-purpose querying languages such as SQL, they had no prior exposure to SIEM-specific rule grammars (e.g., SPL or KQL), nor experience in writing security detection rules. They represent technically capable but domain-agnostic users, reflecting the growing need for accessible security tooling in broader IT roles.
Junior Analysts (Entry-Level SOC Group)
This group consisted of interns from a partner cybersecurity company, all of whom were majoring in Cybersecurity. All junior analysts had:
Prior coursework or training in intrusion detection systems, log analysis, and SIEM architecture.
Familiarity with SQL syntax and basic SPL constructs, including search, eval, and where clauses.
However, their hands-on experience in constructing operational detection rules from natural language or abstract descriptions was minimal, and none had written production-grade rules before.
Metrics Collected
For each participant, we recorded the following:
Time Taken: Time (in minutes) to produce a complete rule
Final Rule Output: The rule as written, or as generated and then modified
Syntax Validity: Whether the rule passes vendor-side syntax checks (e.g., Splunk)
Logical Alignment: Whether the rule logic matches the original description (as judged by an expert and an LLM evaluator)
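The four metrics can be pictured as one record per participant and condition. The sketch below is an illustrative schema only: the class name StudyRecord, its field names, and the passed helper are assumptions introduced here, not part of the study tooling, and the example values are placeholders rather than study data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyRecord:
    """One result for one participant group and condition (hypothetical schema)."""
    group: str             # e.g. "General Users"
    with_rulepilot: bool   # whether RulePilot assistance was available
    minutes: float         # Time Taken
    rule_text: str         # Final Rule Output
    syntax_valid: bool     # Syntax Validity
    logic_aligned: bool    # Logical Alignment


def passed(record: StudyRecord) -> bool:
    """Assumed convention: a rule is usable only if it is valid AND aligned."""
    return record.syntax_valid and record.logic_aligned


# Placeholder values, not study data.
example = StudyRecord(
    group="example group",
    with_rulepilot=True,
    minutes=5.0,
    rule_text="index=example | stats count",
    syntax_valid=True,
    logic_aligned=False,
)
```

Keeping syntax validity and logical alignment as separate fields mirrors the way the two are reported separately for each condition below.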
General Users without RulePilot
Task Execution Timeline:
0–5 min: Read the description and attempted to identify relevant actions (e.g., "user adds themselves to AD group")
5–11 min: Searched online for “Splunk rule examples” and read the Splunk official documentation on search, index, and eval
11–19 min: Drafted a basic query using guessed field names (user, group, add)
19–23 min: Revised syntax multiple times after encountering errors in the SPL sandbox
Total Time: 23 minutes
Reference Materials:
Splunk documentation (https://docs.splunk.com)
Stack Overflow and GitHub Gist for sample detection rules
Final Rule Output:
index=wineventlog_security
| eval is_self_add = if(user = TargetUserName, 1, 0)
| stats count(eval(is_self_add=1)) as self_add_count by user, TargetUserName, Group_Name
| where self_add_count > 0
Syntax Evaluation:
❌ if(user = TargetUserName, 1, 0) uses = instead of ==, which is invalid in SPL eval expressions.
⚠️ Uses the field TargetUserName, which is not a standard field name and was likely guessed from naming intuition; the actual field might be src_user or embedded in the raw message.
Logic Alignment:
❌ Fails due to wrong field names and syntax errors, so the rule will not execute as intended.
❌ Lacks context filtering (e.g., restricting the search to AD logs) and has no safeguards against noise or irrelevant events.
General Users with RulePilot
Task Execution Timeline:
0–2 min: Copied and pasted the description into the RulePilot UI
2–3 min: Reviewed generated rule, made one minor field name adjustment
Total Time: 3 minutes
Reference Materials:
None beyond RulePilot suggestion
Final Rule Output (via RulePilot):
index=wineventlog
| eval self_addition_flag=if(user == src_user, 1, 0)
| stats min(_time) as _time, dc(user) as usercount, values(user) as user, values(user_category) as user_category, values(src_user_category) as src_user_category, values(dvc) as dvc, values(ip_address) as ip_address by signature, Group_Name, src_user
| eval anomaly_score = if(self_addition_flag==1 AND usercount==1, 5, 0)
| where anomaly_score > 3
| rename ip_address as Source_IP
Syntax Evaluation:
✅ No syntax errors
✅ Passed test execution in a Splunk environment
Logic Alignment:
✅ Captured key condition user == src_user (self-addition)
✅ Included aggregation to support auditing/alerting
Junior Analysts without RulePilot
Task Execution Timeline:
0–3 min: Read the rule description and extracted key intent: "user adds self to privileged group"
3–10 min: Drafted initial query based on known fields and iteratively tested for SPL errors
10–15 min: Applied group filtering and attempted to simulate logical self-add condition
Total Time: 15 minutes
Reference Materials:
Splunk documentation (https://docs.splunk.com)
Stack Overflow and GitHub Gist for sample detection rules
Final Rule Output:
index=wineventlog_security
| eval is_self_add = if(user == src_user, "true", "false")
| search is_self_add="true" AND Group_Name IN ("Domain Admins", "Administrators")
| stats count by user, Group_Name, src_user, _time
Syntax Evaluation:
⚠️ Lacks additional filters such as eventtype, source, or Channel, which may introduce noise or irrelevant matches.
⚠️ The eval followed by search is_self_add="true" is redundant; the comparison could be applied directly with where user == src_user.
Logic Alignment:
⚠️ Does not incorporate time-window correlation or cross-event logic, limiting its effectiveness in detecting multi-step attacks.
❌ Assumes standardized field names (user, src_user, Group_Name) without accounting for variations across log sources or environments.
Junior Analysts with RulePilot
Task Execution Timeline:
0–2 min: Copied and pasted the description into the RulePilot UI
2–3 min: Reviewed generated rule, made one minor field name adjustment
Total Time: 3 minutes
Reference Materials:
None beyond RulePilot suggestion
Final Rule Output (via RulePilot):
index=wineventlog_security
| eval self_addition_flag=if(user == src_user, 1, 0)
| stats min(_time) as _time, dc(user) as usercount, values(user) as user, values(user_category) as user_category, values(src_user_category) as src_user_category, values(dvc) as dvc, values(ip_address) as ip_address by signature, Group_Name, src_user
| eval anomaly_score = if(usercount==1, 5, 0)
| where anomaly_score > 3
| rename ip_address as Source_IP
Syntax Evaluation:
✅ No syntax errors
✅ Passed test execution in a Splunk environment
Logic Alignment:
✅ Accurately models "self-addition" behavior using user == src_user.
✅ Introduces a scoring mechanism (anomaly_score) to distinguish high-risk events, reflecting prioritization logic.
✅ Aggregates relevant metadata (e.g., dvc, user_category, ip_address) to support downstream triage and investigation.
✅ Uses a distinct count (dc(user)) so that repeated log entries for the same user are counted once, reducing false positives from repetitive logs.
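The total times reported for the four conditions above allow a simple comparison of time saved. The short Python sketch below computes the relative reduction for each group; only the four minute values come from this section, and the helper name reduction_percent is introduced here for illustration.

```python
# Total times (minutes) as reported in this section.
times = {
    "General Users": {"without": 23, "with": 3},
    "Junior Analysts": {"without": 15, "with": 3},
}


def reduction_percent(without, with_tool):
    """Relative time saved, as a percentage of the unassisted time."""
    return 100.0 * (without - with_tool) / without


reductions = {
    group: round(reduction_percent(t["without"], t["with"]), 1)
    for group, t in times.items()
}
# General Users: 87.0, Junior Analysts: 80.0
```

On these figures the assisted condition took roughly 87% less time for general users and 80% less for junior analysts, with both groups finishing in 3 minutes.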