BACKGROUND PROMPT
You are a data scientist working for a cybersecurity company.
Your task is to analyze the given {rule_type} rule description and generate key insights from it.
## Below are the instructions for the task:
- You will be provided with a rule description.
- You will be asked several questions to break down the rule description and generate key logic.
- You need to select the most appropriate question to answer based on the rule description.
- If a question is not relevant to the description, you do not need to generate analysis output for it.
## Below is the description of the rule:
{description}
## Below are the functions of the questions:
- Understanding the problem: Understand the monitoring objectives and outputs mentioned in the description.
- Identifying data sources: Identify the data sources mentioned in the description.
- Defining initial filters: Define the filtering conditions that need to be applied.
- Extracting relevant fields: Identify the fields to extract from the logs.
- Performing data aggregation: Determine the metrics to calculate and the fields to group by.
- Calculating derived metrics: Define any additional metrics and their calculation logic.
- Filtering anomalies: Define how anomalies are identified and filtered.
QUESTION LIST
_UNDERSTANDING_PROBLEM_PROMPT = '''
What are the monitoring objectives and outputs mentioned in the description? What fields and values need to be filtered?
'''
_IDENTIFY_DATA_SOURCE_PROMPT = '''
What data sources are mentioned in the description? Is the index or sourcetype explicitly stated?
'''
_DEFINE_INITIAL_FILTERS_PROMPT = '''
What filtering conditions need to be applied? Are there any values or logs that should be excluded?
'''
_EXTRACT_RELEVANT_FIELDS_PROMPT = '''
Which fields need to be extracted from the logs? Do any new helper fields need to be created?
'''
_PERFORM_DATA_AGGREGATION_PROMPT = '''
What metrics need to be calculated? Which fields should the data be grouped by?
'''
_CALCULATE_DERIVED_METRICS_PROMPT = '''
Are there any additional metrics (e.g., success rate or anomaly score) that need to be computed? What is the calculation logic?
'''
_FILTER_ANOMALIES_PROMPT = '''
How are anomalies defined? What filtering conditions or thresholds need to be applied?
'''
_OPTIMIZE_OUTPUT_PROMPT = '''
What fields should be included in the final output? Do any fields need to be renamed or macros applied?
'''
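These question templates are plain strings, one per analysis step. A thin helper (names here are assumed for illustration, not taken from the source) can pair each step with its prompt so the most appropriate question can be selected per rule description:

```python
# Hypothetical helper: maps each analysis step to its question prompt.
# Only three steps are shown; the remaining prompts follow the same pattern.
QUESTION_PROMPTS = {
    "understand_problem": (
        "What are the monitoring objectives and outputs mentioned in the "
        "description? What fields and values need to be filtered?"
    ),
    "identify_data_source": (
        "What data sources are mentioned in the description? Is the index "
        "or sourcetype explicitly stated?"
    ),
    "define_initial_filters": (
        "What filtering conditions need to be applied? Are there any values "
        "or logs that should be excluded?"
    ),
}

def select_question(step: str) -> str:
    """Return the question prompt for a given analysis step."""
    try:
        return QUESTION_PROMPTS[step]
    except KeyError:
        raise ValueError(f"unknown analysis step: {step}") from None
```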
DSL GENERATION PROMPT
You are an expert in Splunk rules and DSL writing.
Your task is to generate a DSL rule based on the {rule_type} rule description and the analysis of the description.
## Below are the instructions for the task:
- You will be provided with the rule description and one part of the analysis, which describes the requirements for the rule.
- You may need to generate SEVERAL DSL rules to achieve the desired functionality.
- Generate DSL rules only for the information you are given.
- Note that the DSL rules you generate will be combined to form the final rule, so avoid duplicating functionality.
## Below is the structure of the DSL rule:
<KEYWORD> | PARAMS {{ key1 = value1, key2 = value2, ... }} | MODULES {{ keyA = valueA, ... }}
## Below is the description of each part of the DSL rule:
- KEYWORD: describes the function of the pipe stage (main classification).
- PARAMS: defines the required parameters of the command, such as source data, regular expression, filter conditions, etc.
- MODULES: describes optional parameters and extended functions, such as grouping, sorting, aggregate functions, subqueries, etc.
## Below is the description of the KEYWORD:
{keyword}
## Below is the description of the rule:
{rule_description}
## Below is the structure of your output:
```plaintext
<YOUR_GENERATED_DSL_RULE_1>
<YOUR_GENERATED_DSL_RULE_2>
...
```
DSL OPTIMIZE PROMPT
You are an expert in Splunk rules and DSL design.
Your task is to optimize the following DSL rule by removing unnecessary parts and optimizing the structure.
## Below is the instruction for the task:
- You can REMOVE, MODIFY, MERGE, or ADD parts to the DSL rule to make it more concise and efficient.
- The optimized rule should have clear logic and contain only the essential parts required to achieve the desired functionality.
- Similar functionalities can be combined or simplified.
## Below is the structure of the DSL rule:
<KEYWORD> | PARAMS {{ key1 = value1, key2 = value2, ... }} | MODULES {{ keyA = valueA, ... }}
## Below is the description of each part of the DSL rule:
- KEYWORD: describes the function of the pipe stage (main classification).
- PARAMS: defines the required parameters of the command, such as source data, regular expression, filter conditions, etc.
- MODULES: describes optional parameters and extended functions, such as grouping, sorting, aggregate functions, subqueries, etc.
## Below is the description of the KEYWORD:
{keyword}
## Below is the structure of your output:
```plaintext
<YOUR_OPTIMIZED_DSL_RULE>
```
## Below is an example of an excellent optimized DSL rule:
### Description:
This analytic is designed to identify attempts to exploit a server-side template injection vulnerability in CrushFTP, designated as CVE-2024-4040. This severe vulnerability enables unauthenticated remote attackers to access and read files beyond the VFS Sandbox, circumvent authentication protocols, and execute arbitrary commands on the affected server. The issue impacts all versions of CrushFTP up to 10.7.1 and 11.1.0 on all supported platforms. It is highly recommended to apply patches immediately to prevent unauthorized access to the system and avoid potential data compromises. The search specifically looks for patterns in the raw log data that match the exploitation attempts, including READ or WRITE actions, and extracts relevant information such as the protocol, session ID, user, IP address, HTTP method, and the URI queried. It then evaluates these logs to confirm traces of exploitation based on the presence of specific keywords and the originating IP address, counting and sorting these events for further analysis.
### Output:
```plaintext
EXTRACT | PARAMS {{ source = "_raw", regex = "\[(?<protocol>HTTPS|HTTP):(?<session_id>[^\:]+):(?<user>[^\:]+):(?<src_ip>\d+\.\d+\.\d+\.\d+)\] (?<action>READ|WROTE): \*(?<http_method>[A-Z]+) (?<uri_query>[^\s]+) HTTP/[^\*]+\*" }}
TRANSFORM | PARAMS {{ target = "message", condition = "match(_raw, 'INCLUDE') and isnotnull(src_ip)", action = "if(condition, 'traces of exploitation by ' . src_ip, 'false')" }}
FILTER | PARAMS {{ condition = "message != 'false'" }}
RENAME | PARAMS {{ fields = {{ "host" : "dest" }} }}
AGGREGATE | PARAMS {{ action = "count", fields = "_time, dest, source, message, src_ip, http_method, uri_query, user, action" }}
SORT | PARAMS {{ fields = "_time", order = "DESC" }}
```
KEYWORD LIST
DSL_KEYWORD = {
"FILTER": "Filters data based on conditions or field values.",
"EXTRACT": "Extracts fields or values from raw data or JSON-like structures.",
"TRANSFORM": "Performs calculations or transformations on fields, including renaming or formatting.",
"LOOKUP": "Joins external lookup tables with the current dataset to enrich data.",
"AGGREGATE": "Groups and summarizes data by applying statistical functions like count, sum, min, or max.",
"RENAME": "Renames fields to simplify field names or align with conventions.",
"JOIN": "Combines results from different datasets or subqueries based on common fields.",
"SORT": "Sorts results based on specified fields in ascending or descending order.",
"APPEND": "Appends additional data or results from a subquery to the current dataset.",
"FILL": "Fills null or missing values in fields with default or calculated values.",
"DEDUP": "Removes duplicate records based on specified fields.",
"OUTPUT": "Formats or outputs results for display or export.",
"BUCKET": "Groups data into discrete ranges or intervals, such as time or numeric ranges.",
"APPLY": "Applies pre-trained models or predefined rules to the data for evaluation.",
"DEBUG": "Used for debugging queries or analyzing performance issues."
}
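A detail worth noting about these templates: the doubled braces (`{{ }}`) are `str.format` escapes, while single-brace fields such as `{rule_type}` and `{keyword}` are substitution placeholders. A minimal sketch (the variable names are illustrative, not from the source) shows how a template renders:

```python
# The doubled braces in the templates are str.format escapes: after the
# placeholders are filled in, they render as single literal braces in the
# DSL structure shown to the model.
DSL_STRUCTURE = "<KEYWORD> | PARAMS {{ key1 = value1 }} | MODULES {{ keyA = valueA }}"

template = "Generate a {rule_type} rule.\n## Structure:\n" + DSL_STRUCTURE
filled = template.format(rule_type="Splunk")
# 'filled' now contains "PARAMS { key1 = value1 }" with single braces.
```

This is why every literal brace in the prompt bodies above appears doubled; a single unescaped brace would raise an error at `.format()` time.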
RULE GENERATE FROM DSL PROMPT
You are a security analyst working for a cybersecurity company.
Your task is to generate a {rule_type} rule based on the DSL rule and the analysis of the description.
## Below are the instructions for the task:
- You will be provided with the DSL rule.
- You will also be given an introduction to the DSL rule.
- Based on the DSL rule and its introduction, generate the corresponding {rule_type} rule.
- Generate the rule only for the given information.
## Below is the structure of the DSL rule:
<KEYWORD> | PARAMS {{ key1 = value1, key2 = value2, ... }} | MODULES {{ keyA = valueA, ... }}
## Below is the description of each part of the DSL rule:
- KEYWORD: describes the function of the pipe stage (main classification).
- PARAMS: defines the required parameters of the command, such as source data, regular expression, filter conditions, etc.
- MODULES: describes optional parameters and extended functions, such as grouping, sorting, aggregate functions, subqueries, etc.
## Below is the description of the KEYWORD:
{keyword}
## Below is the structure of your output:
```spl
<YOUR_GENERATED_RULE>
```
## Below is an example of input and output:
### Input:
```plaintext
FILTER | PARAMS {{ index = "auth_logs", source = "WinEventLog:Security", earliest = -30m }} | MODULES {{ "Aggregate login attempts", "Detect brute force login attempts" }}
```
### Output:
```spl
index="auth_logs" source="WinEventLog:Security" earliest=-30m | stats count by src_ip | where count > 10
```
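The generation prompts above all ask the model to wrap its answer in a fenced block (```` ```plaintext ```` or ```` ```spl ````). A small extraction helper (an assumed utility, not part of the source) can pull the rule out of the raw LLM response:

```python
import re

def extract_fenced(text: str, lang: str = "spl") -> str:
    """Extract the first ```<lang> fenced block from an LLM response."""
    match = re.search(rf"```{lang}\s*\n(.*?)```", text, re.DOTALL)
    if match is None:
        raise ValueError(f"no ```{lang} block found in response")
    return match.group(1).strip()
```

The non-greedy `(.*?)` stops at the first closing fence, so any prose the model appends after the block is discarded.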
OPTIMIZE RULE PROMPT
You are an expert in SIEM rule optimization, specializing in Splunk SPL queries. Your task is to improve a given Splunk rule based on a provided ground truth rule.
## Input:
* Splunk Rule (to be improved): A Splunk SPL query that may have inefficiencies, missing filters, or suboptimal logic.
* Ground Truth Rule: A well-crafted reference SPL query that demonstrates best practices in detection logic.
## Task:
* Analyze the ground truth rule and identify its core detection logic, including key filtering conditions, field usage, and structure.
* Improve the given Splunk rule by incorporating the essential logic from the ground truth while retaining its original intent as much as possible.
* Change the rule, in whole or in part, as needed to bring it closer to the ground truth.
* Avoid changing the length or complexity of the original rule unnecessarily; if the rule is overly long, you may simplify it.
* Preserve the original structure of the rule (pipes, commands, and functions) wherever possible.
* Ensure the optimized rule remains efficient, accurate, and aligned with SIEM detection best practices.
## Output:
Provide the improved SPL query.
## Output Format:
```spl
<YOUR_IMPROVED_SPL_RULE>
```
LLM EVALUATION PROMPTS
LLM_EVALUATION_PROMPTS = {
"Logical Consistency": '''
Compare the logical structure of the following two Splunk rules. Do they follow the same detection logic? List any differences in conditions, operators, or filters.
Respond with a score from 0 (completely inconsistent) to 1 (fully consistent) and explain any detected differences.
Your output should be in JSON format:
{{
"score": SCORE,
"explanation": "EXPLANATION"
}}
Below is an example of the input and output:
Input:
```spl
`crushftp` | rex field=_raw "\[(?<protocol>HTTPS|HTTP):(?<session_id>[^\:]+):(?<user>[^\:]+):(?<src_ip>\d+\.\d+\.\d+\.\d+)\] (?<action>READ|WROTE): \*(?<http_method>[A-Z]+) (?<uri_query>[^\s]+) HTTP/[^\*]+\*" | eval message=if(match(_raw, "INCLUDE") and isnotnull(src_ip), "traces of exploitation by " . src_ip, "false") | search message!=false | rename host as dest | stats count by _time, dest, source, message, src_ip, http_method, uri_query, user, action | sort -_time| `crushftp_server_side_template_injection_filter`
```
Output:
{{
"score": 0.9,
"explanation": "The first rule uses 'eval' to create a message field, while the second rule uses 'rex' to extract fields. Both rules filter out false messages, but the first rule has an additional 'rename' step."
}}
''',
"Syntax Correctness": '''
Check if the following Splunk rule follows proper SPL syntax and best practices. Point out any errors or inefficient expressions and suggest improvements.
Respond with a score from 0 (incorrect syntax) to 1 (perfect syntax) and list any improvements.
Your output should be in JSON format:
{{
"score": SCORE,
"explanation": "EXPLANATION"
}}
Below is an example of the input and output:
Input:
```spl
`crushftp` | rex field=_raw "\[(?<protocol>HTTPS|HTTP):(?<session_id>[^\:]+):(?<user>[^\:]+):(?<src_ip>\d+\.\d+\.\d+\.\d+)\] (?<action>READ|WROTE): \*(?<http_method>[A-Z]+) (?<uri_query>[^\s]+) HTTP/[^\*]+\*" | eval message=if(match(_raw, "INCLUDE") and isnotnull(src_ip), "traces of exploitation by " . src_ip, "false") | search message!=false | rename host as dest | stats count by _time, dest, source, message, src_ip, http_method, uri_query, user, action | sort -_time| `crushftp_server_side_template_injection_filter`
```
Output:
{{
"score": 0.8,
"explanation": "The rule has a valid syntax, but the 'eval' function could be simplified. The 'search' command could also be optimized to reduce processing time."
}}
''',
"Readability & Maintainability": '''
Evaluate the readability and maintainability of the following Splunk rule. Is the rule easy to understand and modify? Identify any unnecessary complexity.
Rate from 0 (hard to read and maintain) to 1 (clear and maintainable), and explain your reasoning.
Your output should be in JSON format:
{{
"score": SCORE,
"explanation": "EXPLANATION"
}}
Below is an example of the input and output:
Input:
```spl
`crushftp` | rex field=_raw "\[(?<protocol>HTTPS|HTTP):(?<session_id>[^\:]+):(?<user>[^\:]+):(?<src_ip>\d+\.\d+\.\d+\.\d+)\] (?<action>READ|WROTE): \*(?<http_method>[A-Z]+) (?<uri_query>[^\s]+) HTTP/[^\*]+\*" | eval message=if(match(_raw, "INCLUDE") and isnotnull(src_ip), "traces of exploitation by " . src_ip, "false") | search message!=false | rename host as dest | stats count by _time, dest, source, message, src_ip, http_method, uri_query, user, action | sort -_time| `crushftp_server_side_template_injection_filter`
```
Output:
{{
"score": 0.85,
"explanation": "The rule is generally readable, but the use of 'eval' could be simplified. The 'rex' command is clear, but the regex could be commented for better understanding."
}}
''',
"Condition Coverage": '''
Does the following generated rule cover all the conditions specified in the ground truth rule? Highlight any missing or extra fields, filters, or logic.
Rate from 0 (poor coverage) to 1 (full coverage) and explain the differences found.
Your output should be in JSON format:
{{
"score": SCORE,
"explanation": "EXPLANATION"
}}
Below is an example of the input and output:
Input:
```spl
`crushftp` | rex field=_raw "\[(?<protocol>HTTPS|HTTP):(?<session_id>[^\:]+):(?<user>[^\:]+):(?<src_ip>\d+\.\d+\.\d+\.\d+)\] (?<action>READ|WROTE): \*(?<http_method>[A-Z]+) (?<uri_query>[^\s]+) HTTP/[^\*]+\*" | eval message=if(match(_raw, "INCLUDE") and isnotnull(src_ip), "traces of exploitation by " . src_ip, "false") | search message!=false | rename host as dest | stats count by _time, dest, source, message, src_ip, http_method, uri_query, user, action | sort -_time| `crushftp_server_side_template_injection_filter`
```
Output:
{{
"score": 0.9,
"explanation": "The rule covers most conditions, but it does not include a check for the 'action' field being 'READ' or 'WROTE'. The regex is comprehensive, but the logic could be more explicit."
}}
''',
"False Positive & False Negative Risk": '''
Analyze the following rule for its potential risk of false positives and false negatives. Is it too strict or too loose in detecting real threats?
Respond with a score from 0 (high risk of false positives/negatives) to 1 (low risk) and explain your evaluation.
Your output should be in JSON format:
{{
"score": SCORE,
"explanation": "EXPLANATION"
}}
Below is an example of the input and output:
Input:
```spl
`crushftp` | rex field=_raw "\[(?<protocol>HTTPS|HTTP):(?<session_id>[^\:]+):(?<user>[^\:]+):(?<src_ip>\d+\.\d+\.\d+\.\d+)\] (?<action>READ|WROTE): \*(?<http_method>[A-Z]+) (?<uri_query>[^\s]+) HTTP/[^\*]+\*" | eval message=if(match(_raw, "INCLUDE") and isnotnull(src_ip), "traces of exploitation by " . src_ip, "false") | search message!=false | rename host as dest | stats count by _time, dest, source, message, src_ip, http_method, uri_query, user, action | sort -_time| `crushftp_server_side_template_injection_filter`
```
Output:
{{
"score": 0.75,
"explanation": "The rule is somewhat strict in filtering messages, which may lead to false negatives if the 'INCLUDE' keyword is not present. However, it does a good job of identifying potential exploitation attempts, which reduces the risk of false positives."
}}
''',
"Execution Efficiency": '''
Evaluate how efficiently the following Splunk rule would run on large datasets. Are there any operations or filters that could slow down execution? Suggest possible optimizations.
Provide a score from 0 (inefficient) to 1 (highly efficient), and suggest improvements if necessary.
Your output should be in JSON format:
{{
"score": SCORE,
"explanation": "EXPLANATION"
}}
Below is an example of the input and output:
Input:
```spl
`crushftp` | rex field=_raw "\[(?<protocol>HTTPS|HTTP):(?<session_id>[^\:]+):(?<user>[^\:]+):(?<src_ip>\d+\.\d+\.\d+\.\d+)\] (?<action>READ|WROTE): \*(?<http_method>[A-Z]+) (?<uri_query>[^\s]+) HTTP/[^\*]+\*" | eval message=if(match(_raw, "INCLUDE") and isnotnull(src_ip), "traces of exploitation by " . src_ip, "false") | search message!=false | rename host as dest | stats count by _time, dest, source, message, src_ip, http_method, uri_query, user, action | sort -_time| `crushftp_server_side_template_injection_filter`
```
Output:
{{
"score": 0.8,
"explanation": "The rule uses 'stats' and 'sort', which can be resource-intensive on large datasets. Consider using 'dedup' before 'stats' to reduce the number of events processed. The regex in 'rex' is also complex and could be optimized for performance."
}}
'''
}
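Each evaluation prompt asks for a JSON object with a `score` in [0, 1] and an `explanation`. Since model replies often wrap the JSON in prose, a tolerant parser plus an unweighted aggregate (a sketch with assumed function names, not from the source) might look like:

```python
import json
import re

def parse_evaluation(response: str) -> dict:
    """Pull the JSON object (score + explanation) out of an evaluator reply."""
    match = re.search(r"\{.*\}", response, re.DOTALL)  # greedy: first { to last }
    if match is None:
        raise ValueError("no JSON object found in response")
    result = json.loads(match.group(0))
    if not 0.0 <= float(result["score"]) <= 1.0:
        raise ValueError("score out of range [0, 1]")
    return result

def average_score(results: dict) -> float:
    """Unweighted mean over the per-criterion score dicts."""
    return sum(float(r["score"]) for r in results.values()) / len(results)
```

A weighted mean would be a natural variant if some criteria (e.g., Condition Coverage) should dominate the overall rating.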
SCORE PROMPT
You are a rule reflection and optimization assistant.
You need to evaluate the following Splunk rule against the given criteria, using the description of the ideal rule.
Here is the rule to evaluate:
{rule}
Below is the description of the ideal rule:
{ideal_rule}
Rate it on:
1) Logical Coherence (0-1)
2) Syntax Validation (0-1)
3) Execution Feasibility (0-1)
Provide a JSON with your ratings and a short comment:
{{
"logical_coherence": 0.0,
"syntax_validation": 0.0,
"execution_feasibility": 0.0,
"comment": "Your analysis"
}}