Datasets
FlowBench: Anomaly Detection Benchmark Dataset on Scientific Workflows
FlowBench is a benchmark dataset designed for anomaly detection in computational workflows. It includes execution traces from distributed infrastructures with systematically injected, labeled anomalies and provides both raw execution logs and a parsed version. The dataset is accompanied by tools and sample code for parsing, loading, and processing the data, supporting various supervised and unsupervised anomaly detection models using PyTorch.
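To illustrate how the parsed execution traces might be loaded before being handed to a PyTorch model, here is a minimal sketch in plain Python. The column names (`job_id`, `cpu_time`, `bytes_read`, `label`) and the inline sample are hypothetical; substitute the actual schema from FlowBench's parsed files.

```python
import csv
import io

# Hypothetical parsed-trace excerpt; the real FlowBench columns may differ.
SAMPLE = """job_id,cpu_time,bytes_read,label
job_1,12.5,1048576,normal
job_2,93.1,2097152,cpu_anomaly
"""

def load_parsed_traces(fh):
    """Parse a FlowBench-style CSV into (feature_vector, label) pairs."""
    rows = []
    for rec in csv.DictReader(fh):
        features = [float(rec["cpu_time"]), float(rec["bytes_read"])]
        rows.append((features, rec["label"]))
    return rows

rows = load_parsed_traces(io.StringIO(SAMPLE))
print(rows[0])  # first job's (features, label) pair
```

The resulting list of `(features, label)` pairs can be wrapped directly in a `torch.utils.data.Dataset` for either supervised training or, with labels withheld, unsupervised anomaly detection.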
TCP Conflict Study: A Dataset on TCP Conflict Using Multiple TCP CCA and AQM Algorithms
This dataset evaluates TCP Congestion Control Algorithms (CCAs) such as BBRv1, BBRv2, CUBIC, Reno, and HTCP under AQMs like FIFO, RED, and FQ-CoDel. It captures performance metrics for bottleneck bandwidths up to 25 Gbps, focusing on fairness, link utilization, and retransmissions across varying queue lengths and inter/intra-CCA scenarios. This dataset is highly versatile, containing pcap packet trace details, iperf traces, and ping traces. Key parameters include packet-level traces, RTT, CWND, throughput, and goodput, supporting research on TCP fairness and performance in diverse network setups.
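Since fairness across competing flows is one of the headline metrics here, a common way to summarize the per-flow throughputs in these traces is Jain's fairness index, sketched below; the example throughput values are illustrative only.

```python
def jain_index(throughputs):
    """Jain's fairness index over per-flow throughputs:
    1.0 means a perfectly equal split; 1/n means one flow takes all."""
    n = len(throughputs)
    total = sum(throughputs)
    return total * total / (n * sum(x * x for x in throughputs))

# Two flows sharing a bottleneck: equal shares are perfectly fair.
print(jain_index([12.5, 12.5]))  # 1.0
# An inter-CCA scenario where one flow dominates scores lower.
print(round(jain_index([20.0, 5.0]), 3))
```

The same function applies whether the throughputs come from iperf summaries or from rates reconstructed out of the pcap traces.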
RADT (Real-world Advanced Dataset): A TCP Dataset for Studying Fairness in Network Links
RADT extends exploration to bottleneck bandwidths up to 40 Gbps and includes performance data for BBRv3 alongside BBRv1, BBRv2, CUBIC, Reno, and HTCP under AQMs like FIFO, RED, and FQ-CoDel. It also includes key parameters such as packet-level traces, RTT, CWND, throughput, and goodput. RADT focuses on real-world scalability and efficiency, offering a concise and specialized dataset for evaluating and optimizing next-generation TCP algorithms in high-speed networks. Additionally, it serves as a valuable resource for developing AI-based CCAs through advanced training data.
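Because RADT reports both throughput and goodput, it is worth being precise about the distinction: throughput counts every byte on the wire, while goodput excludes retransmitted bytes that deliver no new application data. A hedged sketch, with illustrative numbers rather than values from the dataset:

```python
def throughput_and_goodput(total_bytes, retrans_bytes, duration_s):
    """Throughput counts all bytes sent; goodput subtracts
    retransmissions, which carry no new application data."""
    throughput = total_bytes * 8 / duration_s              # bits/s
    goodput = (total_bytes - retrans_bytes) * 8 / duration_s
    return throughput, goodput

# Example: 1 GB transferred in 2 s, of which 50 MB was retransmitted.
tput, gput = throughput_and_goodput(1_000_000_000, 50_000_000, 2.0)
print(tput, gput)  # 4e9 vs 3.8e9 bits/s
```

The gap between the two numbers is one simple loss-sensitivity signal when comparing CCAs such as BBRv3 and CUBIC at 40 Gbps.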
Graph Neural Network for Anomaly Detection and Classification in Scientific Workflows
This dataset contains workflow execution traces, including adjacency lists that represent dependencies between workflow nodes and raw data characterizing job executions under various scenarios. It features workflows with normal and anomalous jobs, capturing anomalies such as CPU, HDD, and loss-related issues. The dataset is designed to support graph-level and node-level anomaly detection and includes detailed graph statistics such as the number of nodes, edges, and anomaly distributions across workflows like 1000genome and wind-clustering-casa.
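The per-workflow graph statistics mentioned above can be derived directly from an adjacency list and per-node labels. The sketch below uses a hypothetical four-node workflow and label names; the dataset's actual file format and anomaly categories may differ.

```python
from collections import Counter

# Hypothetical workflow DAG as an adjacency list, plus per-node labels.
adjacency = {
    "individuals_1": ["merge"],
    "individuals_2": ["merge"],
    "merge": ["analysis"],
    "analysis": [],
}
labels = {
    "individuals_1": "normal",
    "individuals_2": "cpu",
    "merge": "normal",
    "analysis": "hdd",
}

def graph_stats(adj, node_labels):
    """Count nodes and edges, and tally anomalies by type."""
    nodes = len(adj)
    edges = sum(len(children) for children in adj.values())
    anomalies = Counter(l for l in node_labels.values() if l != "normal")
    return nodes, edges, anomalies

print(graph_stats(adjacency, labels))
```

This same node/edge representation is what a graph neural network consumes, with the labels serving as node-level targets or, aggregated, as a graph-level target.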
Large Language Models for Anomaly Detection in Computational Workflows: from Supervised Fine-Tuning to In-Context Learning
This dataset includes workflow execution traces with systematically injected anomalies, both labeled and unlabeled, designed to support multiple anomaly detection approaches. It provides raw execution logs, parsed versions for easier analysis, and data formatted for fine-tuning and in-context learning tasks. It enables researchers to explore LLM-based supervised fine-tuning as well as zero-shot and few-shot anomaly detection strategies in computational workflow scenarios.
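For the in-context learning side, the core mechanic is assembling labeled traces into a few-shot prompt that ends with an unlabeled query. A minimal sketch; the trace fields, label vocabulary, and template wording here are assumptions, not the dataset's actual prompt format.

```python
# Hypothetical labeled traces used as in-context examples.
EXAMPLES = [
    ("cpu_time=12.5 bytes_read=1048576", "normal"),
    ("cpu_time=93.1 bytes_read=1048576", "anomalous"),
]

def build_few_shot_prompt(query_trace, examples=EXAMPLES):
    """Format labeled traces as in-context examples, then append the
    unlabeled query for the LLM to complete."""
    lines = ["Classify each workflow job trace as normal or anomalous."]
    for trace, label in examples:
        lines.append(f"Trace: {trace}\nLabel: {label}")
    lines.append(f"Trace: {query_trace}\nLabel:")
    return "\n\n".join(lines)

prompt = build_few_shot_prompt("cpu_time=88.0 bytes_read=2097152")
print(prompt)
```

Dropping the labeled examples from the same template yields the zero-shot variant, which is the main axis of comparison the dataset supports.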
These datasets are made available under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
Please use the following citation format when referencing these datasets:
Authors. Dataset Title. [Version]. Available at: [URL]