Human Study Materials
This study evaluates the realism and practical utility of the ChimeraLog dataset in comparison with two widely used insider threat datasets: CERT v6.2 and TWOS. The goal is to determine whether logs generated by Chimera realistically reflect real-world enterprise activities and insider threat scenarios.
We invited five independent experts, each with at least five years of experience in security and artificial intelligence, from esteemed universities and leading security companies. All of the experts possess deep familiarity with insider threat scenarios within large corporations. Each expert individually evaluated the 100 sampled log snippets for each dataset. The experts were presented with the same set of log entries, which were shuffled to prevent bias.
Number of Experts: Five independent domain experts participated in the human evaluation.
Qualification: We recruited experts with at least five years of professional experience in cybersecurity or industrial AI practices who are affiliated with reputable universities or leading security technology firms. Participation was voluntary, with no financial incentives provided, to minimize potential bias.
Ethical Approval and Informed Consent: All participants provided written informed consent after being informed of the study’s purpose, procedures, and their right to withdraw at any time, in accordance with ethical principles. The detailed invitation letter template for the participants is listed below.
Anonymity: All information about the experts was anonymized during the summarization and analysis stages of our research. Evaluation responses were aggregated to prevent the identification of individual participants.
Since it is infeasible to manually inspect every entry of our collected logs and the existing datasets, we follow established research methodologies and apply stratified sampling for human assessment. Each expert independently reviewed 100 stratified samples (50 benign, 50 attack) from each of ChimeraLog, CERT, and TWOS, covering four application-level modalities (Logon, Email, Web history, and File operation).
Sample allocation follows Neyman allocation principles, whereby the number of selected entries for each stratum is determined based on within-stratum variability. Specifically, $M$ denotes the total number of log modalities, $N^{(c)}_m$ represents the total number of logs in modality $m$ for class $c$, and $n^{(c)}$ is the target sample size for each class $c \in \{\text{Benign}, \text{Attack}\}$, which in our study is 50. Since the within-stratum standard deviation $S^{(c)}_m$ is not available a priori, we fall back on proportional allocation based on stratum size, $n^{(c)}_m = n^{(c)} \cdot N^{(c)}_m / \sum_{k=1}^{M} N^{(c)}_k$, rounding the number of selected entries $n^{(c)}_m$ to the nearest integer for each modality and class.
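The proportional allocation described above can be illustrated with a minimal Python sketch. The function name `proportional_allocation` and the stratum sizes are hypothetical, chosen only for illustration; they are not the actual log counts of any dataset in the study. Half-up rounding is assumed here, since the rounding rule for ties is not specified.

```python
import math


def proportional_allocation(stratum_sizes, target):
    """Split `target` samples across strata in proportion to stratum size.

    stratum_sizes: dict mapping modality name -> number of logs N_m in that
                   modality for one class (benign or attack).
    target:        per-class sample size n (50 in the study).
    """
    total = sum(stratum_sizes.values())
    if total <= 0:
        raise ValueError("stratum sizes must sum to a positive number")
    # Round half up explicitly; Python's built-in round() rounds ties to even.
    return {
        modality: math.floor(target * size / total + 0.5)
        for modality, size in stratum_sizes.items()
    }


# Hypothetical stratum sizes for one class (NOT the real dataset counts).
benign_counts = {
    "Logon": 4000,
    "Email": 3000,
    "Web history": 2000,
    "File operation": 1000,
}
allocation = proportional_allocation(benign_counts, target=50)
print(allocation)
```

Because each stratum is rounded independently, the allocated counts may in general sum to slightly more or less than the target; a small adjustment to the largest stratum is one common way to restore the exact total.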
We used a 5-point Likert scale with clearly labeled response options (e.g., “Very Unrealistic” to “Very Realistic”) to ensure consistency and ease of understanding.
Each statement is single-focused, targeting one dimension (e.g., realism, coherence) to avoid confusion and double-barreled questions.
Items were pilot-tested with all authors to refine their clarity and reduce the risk of misinterpretation. To minimize response biases such as acquiescence, the questions are neutrally phrased and responses are kept anonymous.
Questionnaire Template
Scale (1 = Very Unrealistic … 5 = Very Realistic)
For benign logs:
The timestamps and event frequency align with a typical workday rhythm.
Overall, I would believe this log segment was captured in a real production environment.
For attack logs:
The combination of benign and suspicious events reflects realistic routine and abnormal log patterns.
Attack-related entries show coherent intent and progression, rather than randomness or artificial generation.
Optional:
If the log data is considered not realistic, please specify the reason:
We compute, for each participant, the average rating across the questions for each dataset, and present the results in the figure below. The x-axis lists the evaluated datasets, with the three ChimeraLog scenarios shown as separate entries, and the y-axis shows the average ratings. The results show that experts rated all three organizational scenarios simulated in ChimeraLog as highly realistic, comparable to the real-world TWOS dataset. Specifically, the five participating experts gave ChimeraLog an average realism score of 4.20, only marginally lower than the 4.25 assigned to TWOS. This suggests that experts perceive the logs in both datasets as highly natural and realistic.
In contrast, the CERT dataset received consistently low scores, with an average of 1.78, reflecting the experts’ view that its logs lack realism. The main criticism was that CERT focuses on system graph construction and populates its logs with randomly generated, semantically impoverished content.
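The aggregation behind these averages can be sketched as follows. This is one plausible reading of the procedure (average each expert's ratings per dataset, then average across experts), not a confirmed description of the analysis code. The function name `average_by_dataset`, the record layout, and the ratings are hypothetical, and the placeholder dataset names "A" and "B" do not correspond to any dataset in the study.

```python
from collections import defaultdict
from statistics import mean


def average_by_dataset(ratings):
    """Average Likert scores per dataset via per-expert means.

    ratings: iterable of (expert_id, dataset, question_id, score) tuples,
             where score is an integer from 1 to 5.
    """
    # Step 1: collect every score an expert gave to a dataset.
    per_expert = defaultdict(list)
    for expert, dataset, _question, score in ratings:
        per_expert[(dataset, expert)].append(score)

    # Step 2: reduce each expert's scores to one mean per dataset.
    per_dataset = defaultdict(list)
    for (dataset, _expert), scores in per_expert.items():
        per_dataset[dataset].append(mean(scores))

    # Step 3: average the per-expert means for each dataset.
    return {dataset: mean(means) for dataset, means in per_dataset.items()}


# Hypothetical ratings for two placeholder datasets (illustrative only).
ratings = [
    ("E1", "A", "Q1", 4), ("E1", "A", "Q2", 5),
    ("E2", "A", "Q1", 4), ("E2", "A", "Q2", 3),
    ("E1", "B", "Q1", 2), ("E2", "B", "Q1", 1),
]
result = average_by_dataset(ratings)
print(result)
```

Averaging per expert first gives every expert equal weight even if some skip optional items; pooling all scores directly would instead weight experts by the number of items they answered.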
Figure: Results of the realism study by human experts. The y-axis corresponds to the average ratings (1 refers to very unrealistic; 5 refers to very realistic). The x-axis represents each dataset.
Figure: Example of the email communication data in three datasets