CAShift Dataset
We collect system logs from three widely used open-source applications (WordPress, Joomla, and Jinja2). In addition to normal logs, CAShift contains attack logs based on known CVE vulnerabilities in these three applications. To cover shift scenarios, we collect WordPress logs under three different versions and three different cloud container runtime environments. We also include 20 CVE vulnerabilities from various components of cloud-native systems, and replay each one in its affected component and version to collect the corresponding logs. The collected datasets are then used to evaluate how well log-based anomaly detection (LAD) methods handle normality shifts. Finally, continuous learning methods are employed to select important data for adapting LAD models to the new data distribution.
Dataset information
Scripts to capture system call logs in the cloud (normal and attacks)
Proof of Concept (PoC) and CVE information for 20 attack scenarios
Vulnerabilities included in CAShift Dataset
Dataset Constitution
We use logs from a Kubernetes system running containerd and runc with WordPress 6.2 as [Base logs]. The shift logs are labeled as follows: cloud application Jinja2 as [App-1] and Joomla as [App-2]; WordPress version 4.8 as [Version-1] and WordPress version 5.6 as [Version-2]; cloud runtime containerd with gVisor as [Arch-1] and cri-o with runc as [Arch-2].
All Application Shift and Version Shift logs are collected on the Kubernetes system using containerd and runc container runtimes.
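The scenario labels above can be written down as a small lookup table, which makes it easy to check which dimension each shift changes relative to the base logs. The following is only an illustrative sketch: the `Scenario` class, its field names, and the `changed_fields` helper are hypothetical and not part of the dataset scripts. Values that this README does not state (for example, the Jinja2 and Joomla versions, or the WordPress version used for the Arch shifts) are left as `None` rather than guessed.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record for one CAShift log scenario."""
    application: str
    app_version: Optional[str]   # None: not stated in this README
    container_manager: str       # "containerd" or "cri-o"
    low_level_runtime: str       # "runc" or "gVisor"


BASE = Scenario("WordPress", "6.2", "containerd", "runc")

SCENARIOS = {
    "Base": BASE,
    # Application shifts: same runtime stack, different application.
    "App-1": Scenario("Jinja2", None, "containerd", "runc"),
    "App-2": Scenario("Joomla", None, "containerd", "runc"),
    # Version shifts: same runtime stack, older WordPress releases.
    "Version-1": Scenario("WordPress", "4.8", "containerd", "runc"),
    "Version-2": Scenario("WordPress", "5.6", "containerd", "runc"),
    # Architecture shifts: WordPress on a different runtime stack.
    "Arch-1": Scenario("WordPress", None, "containerd", "gVisor"),
    "Arch-2": Scenario("WordPress", None, "cri-o", "runc"),
}


def changed_fields(label: str) -> list:
    """Return the fields whose known value differs from the base scenario."""
    scenario = SCENARIOS[label]
    return [
        f.name
        for f in fields(Scenario)
        if getattr(scenario, f.name) is not None
        and getattr(scenario, f.name) != getattr(BASE, f.name)
    ]
```

For instance, `changed_fields("Arch-2")` reports only the container manager, while `changed_fields("Version-1")` reports only the application version.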
Common attack surfaces in cloud systems
Quantitative Analysis of CAShift
We use t-SNE, a commonly used unsupervised dimensionality-reduction method, to illustrate the embedding distribution of the logs. Specifically, we first randomly sample 100 logs from each shift scenario and all attack logs, and then use the pre-trained BERT-base model to produce an embedding for each log sample.
t-SNE visualization of shift logs compared with attack logs and normal logs
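The sampling, embedding, and projection steps described above could be sketched roughly as follows. This is not the script used to produce the figure. It assumes each log has already been rendered as a single text string, that `bert-base-uncased` from Hugging Face Transformers stands in for "the pre-trained BERT-base model", and that the `[CLS]` token vector serves as the log embedding; the function names and all parameters are hypothetical.

```python
import random


def sample_logs(logs_by_group, n=100, seed=0):
    """Randomly draw up to n logs from each group; return (texts, labels)."""
    rng = random.Random(seed)
    texts, labels = [], []
    for group, logs in logs_by_group.items():
        picked = rng.sample(logs, min(n, len(logs)))
        texts.extend(picked)
        labels.extend([group] * len(picked))
    return texts, labels


def embed_and_project(texts, model_name="bert-base-uncased", batch_size=16):
    """Embed each log with BERT, then project the embeddings to 2-D with t-SNE."""
    # Heavy dependencies are imported lazily so sampling works without them.
    import numpy as np
    import torch
    from sklearn.manifold import TSNE
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()

    chunks = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(
                texts[i:i + batch_size],
                padding=True,
                truncation=True,
                max_length=512,
                return_tensors="pt",
            )
            # Assumption: take the [CLS] token vector as the log embedding.
            chunks.append(model(**batch).last_hidden_state[:, 0].numpy())
    embeddings = np.concatenate(chunks)

    # t-SNE requires perplexity to be smaller than the number of samples.
    perplexity = min(30, len(texts) - 1)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    return tsne.fit_transform(embeddings)
```

A typical use would pass a dictionary keyed by scenario label (for example `"Base"`, `"App-1"`, `"Attack"`) to `sample_logs`, feed the returned texts to `embed_and_project`, and color the resulting 2-D points by the returned labels.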