Datasets
FlowBench: Anomaly Detection Benchmark Dataset on Scientific Workflows
FlowBench is a benchmark dataset designed for anomaly detection in computational workflows. It includes execution traces from distributed infrastructures with systematically injected, labeled anomalies and provides both raw execution logs and a parsed version. The dataset is accompanied by tools and sample code for parsing, loading, and processing the data, supporting various supervised and unsupervised anomaly detection models using PyTorch.
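To illustrate how the parsed execution traces might be loaded before being handed to a PyTorch model, here is a minimal sketch in plain Python. The column names (`job_id`, `cpu_time`, `bytes_read`, `label`) and the inline sample are hypothetical; substitute the actual schema from FlowBench's parsed files.

```python
import csv
import io

# Hypothetical parsed-trace excerpt; the real FlowBench columns may differ.
SAMPLE = """job_id,cpu_time,bytes_read,label
job_1,12.5,1048576,normal
job_2,93.1,2097152,cpu_anomaly
"""

def load_parsed_traces(fh):
    """Parse a FlowBench-style CSV into (feature_vector, label) pairs."""
    rows = []
    for rec in csv.DictReader(fh):
        features = [float(rec["cpu_time"]), float(rec["bytes_read"])]
        rows.append((features, rec["label"]))
    return rows

rows = load_parsed_traces(io.StringIO(SAMPLE))
print(rows[0])  # first job's (features, label) pair
```

The resulting list of `(features, label)` pairs can be wrapped directly in a `torch.utils.data.Dataset` for either supervised training or, with labels withheld, unsupervised anomaly detection.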
TCP Conflict Study: A Dataset on TCP Conflict Using Multiple TCP CCA and AQM Algorithms
This dataset evaluates TCP Congestion Control Algorithms (CCAs) such as BBRv1, BBRv2, CUBIC, Reno, and HTCP under AQMs like FIFO, RED, and FQ-CoDel. It captures performance metrics for bottleneck bandwidths up to 25 Gbps, focusing on fairness, link utilization, and retransmissions across varying queue lengths and inter/intra-CCA scenarios. This dataset is highly versatile, containing pcap packet trace details, iperf traces, and ping traces. Key parameters include packet-level traces, RTT, CWND, throughput, and goodput, supporting research on TCP fairness and performance in diverse network setups.
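Since fairness across competing flows is one of the headline metrics here, a common way to summarize the per-flow throughputs in these traces is Jain's fairness index, sketched below; the example throughput values are illustrative only.

```python
def jain_index(throughputs):
    """Jain's fairness index over per-flow throughputs:
    1.0 means a perfectly equal split; 1/n means one flow takes all."""
    n = len(throughputs)
    total = sum(throughputs)
    return total * total / (n * sum(x * x for x in throughputs))

# Two flows sharing a bottleneck: equal shares are perfectly fair.
print(jain_index([12.5, 12.5]))  # 1.0
# An inter-CCA scenario where one flow dominates scores lower.
print(round(jain_index([20.0, 5.0]), 3))
```

The same function applies whether the throughputs come from iperf summaries or from rates reconstructed out of the pcap traces.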
RADT (Real-world Advanced Dataset): A TCP Dataset for Studying Fairness in Network Links
RADT extends exploration to bottleneck bandwidths up to 40 Gbps and includes performance data for BBRv3 alongside BBRv1, BBRv2, CUBIC, Reno, and HTCP under AQMs like FIFO, RED, and FQ-CoDel. It also includes key parameters such as packet-level traces, RTT, CWND, throughput, and goodput. RADT focuses on real-world scalability and efficiency, offering a concise and specialized dataset for evaluating and optimizing next-generation TCP algorithms in high-speed networks. Additionally, it serves as a valuable resource for developing AI-based CCAs through advanced training data.
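Because RADT reports both throughput and goodput, it is worth being precise about the distinction: throughput counts every byte on the wire, while goodput excludes retransmitted bytes that deliver no new application data. A hedged sketch, with illustrative numbers rather than values from the dataset:

```python
def throughput_and_goodput(total_bytes, retrans_bytes, duration_s):
    """Throughput counts all bytes sent; goodput subtracts
    retransmissions, which carry no new application data."""
    throughput = total_bytes * 8 / duration_s              # bits/s
    goodput = (total_bytes - retrans_bytes) * 8 / duration_s
    return throughput, goodput

# Example: 1 GB transferred in 2 s, of which 50 MB was retransmitted.
tput, gput = throughput_and_goodput(1_000_000_000, 50_000_000, 2.0)
print(tput, gput)  # 4e9 vs 3.8e9 bits/s
```

The gap between the two numbers is one simple loss-sensitivity signal when comparing CCAs such as BBRv3 and CUBIC at 40 Gbps.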
Graph Neural Network for Anomaly Detection and Classification in Scientific Workflows
This dataset contains workflow execution traces, including adjacency lists that represent dependencies between workflow nodes and raw data characterizing job executions under various scenarios. It features workflows with normal and anomalous jobs, capturing anomalies such as CPU, HDD, and loss-related issues. The dataset is designed to support graph-level and node-level anomaly detection and includes detailed graph statistics such as the number of nodes, edges, and anomaly distributions across workflows like 1000genome and wind-clustering-casa.
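The per-workflow graph statistics mentioned above can be derived directly from an adjacency list and per-node labels. The sketch below uses a hypothetical four-node workflow and label names; the dataset's actual file format and anomaly categories may differ.

```python
from collections import Counter

# Hypothetical workflow DAG as an adjacency list, plus per-node labels.
adjacency = {
    "individuals_1": ["merge"],
    "individuals_2": ["merge"],
    "merge": ["analysis"],
    "analysis": [],
}
labels = {
    "individuals_1": "normal",
    "individuals_2": "cpu",
    "merge": "normal",
    "analysis": "hdd",
}

def graph_stats(adj, node_labels):
    """Count nodes and edges, and tally anomalies by type."""
    nodes = len(adj)
    edges = sum(len(children) for children in adj.values())
    anomalies = Counter(l for l in node_labels.values() if l != "normal")
    return nodes, edges, anomalies

print(graph_stats(adjacency, labels))
```

This same node/edge representation is what a graph neural network consumes, with the labels serving as node-level targets or, aggregated, as a graph-level target.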
Large Language Models for Anomaly Detection in Computational Workflows: from Supervised Fine-Tuning to In-Context Learning
This dataset includes workflow execution traces with systematically injected anomalies, both labeled and unlabeled, designed to support multiple anomaly detection approaches. It provides raw execution logs, parsed versions for easier analysis, and data formatted for fine-tuning and in-context learning tasks. It enables researchers to explore LLM-based supervised fine-tuning as well as zero-shot and few-shot anomaly detection strategies in computational workflow scenarios.
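For the in-context learning side, the core mechanic is assembling labeled traces into a few-shot prompt that ends with an unlabeled query. A minimal sketch; the trace fields, label vocabulary, and template wording here are assumptions, not the dataset's actual prompt format.

```python
# Hypothetical labeled traces used as in-context examples.
EXAMPLES = [
    ("cpu_time=12.5 bytes_read=1048576", "normal"),
    ("cpu_time=93.1 bytes_read=1048576", "anomalous"),
]

def build_few_shot_prompt(query_trace, examples=EXAMPLES):
    """Format labeled traces as in-context examples, then append the
    unlabeled query for the LLM to complete."""
    lines = ["Classify each workflow job trace as normal or anomalous."]
    for trace, label in examples:
        lines.append(f"Trace: {trace}\nLabel: {label}")
    lines.append(f"Trace: {query_trace}\nLabel:")
    return "\n\n".join(lines)

prompt = build_few_shot_prompt("cpu_time=88.0 bytes_read=2097152")
print(prompt)
```

Dropping the labeled examples from the same template yields the zero-shot variant, which is the main axis of comparison the dataset supports.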
These datasets are made available under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
Please use the following citation format when referencing these datasets:
Authors. Dataset Title. [Version]. Available at: [URL]