Three full days of presentations, lightning talks, community discussions, and open forum sessions.
📅 June 9–11, 2026
🕘 9:00 am – 5:00 pm daily
📍 David Rubenstein Forum, University of Chicago
🌐 In person + remote attendance
Submissions are now closed. The sessions below represent contributions confirmed so far — more will be added as the program is finalized. If you would still like to contribute a session, reach out directly to the program committee co-chair, Nichamon, at nichamon@ogc.us.
Presentation — 30 minutes: Deployment experience, results, operational findings, and new LDMS capabilities shared with the community.
Lightning Talk — 15 minutes: Short, focused presentations suited for early-stage work, new tools, implementation experience, or topics that benefit from a concise format.
Focus Area Discussion — 60 or 90 minutes: Structured group discussions — not presentations. The goal is to surface shared challenges, generate community-wide understanding, and identify directions for future work. Every attendee has a voice.
Demo / Hands-on — 30 or 60 minutes: Walk through a working system, tool, or workflow with the community. Hands-on participation encouraged.
Tutorial — 30, 60, or 90 minutes: Focused learning opportunities on topics of practical value to the LDMS community.
Presentations share deployment experience, results, operational findings, and new LDMS capabilities with the community. Each presentation is 30 minutes.
Refine: A Robust Approach to Unsupervised Anomaly Detection for Production HPC Systems
Traditional unsupervised anomaly detection assumes clean training data, but real HPC environments are contaminated by anomalies from resource contention, software bugs, and hardware failures. Refine is a VAE-based framework that iteratively removes high-error samples during training. It achieves an F1-score of 0.88 with up to 10% anomaly contamination, and 100% accuracy on production data from the Eclipse cluster.
NERSC-Sponsored Enhancements to LDMS 2025
A presentation of LDMS enhancements delivered in 2025 with NERSC sponsorship, covering job state triggered sampling, high frequency samplers, storage plugin stats, export of file data to metric sets and streams, a stream-to-message guide, UNCORE counters, and WORKFLOW_ID support.
LDMSD: 2026 in Review and the Road to 2027
A two-part session covering what was delivered and learned in 2026 — new capabilities, infrastructure improvements, and lessons from community deployments — then looking ahead to the planned direction for LDMSD in 2027, including open problems the community is working through and what sites can expect in the coming year.
LDMS Configuration, Monitoring, and Load Balancing with Maestro
An overview of using Maestro for LDMS configuration management, monitoring, and load balancing — covering how Maestro simplifies operating LDMS at scale across complex deployments.
Overview of Advanced Configuration Best Practices on Advanced Technology Platforms
This talk covers establishing production LDMS observability on HPE EX systems across heterogeneous endpoints: RPM-based compute hosts and Slingshot switches running Debian on arm64. It contrasts two mature deployment patterns, a Kubernetes-centric model (CSM) at SNL and a Cloud-Init model (OpenCHAMI) at LANL, and situates them within a broader pipeline from automated package builds through downstream storage and analysis tooling.
Lightning talks are short, focused presentations — 15 minutes each — suited for early-stage work, new tools, implementation experience, or topics that benefit from a concise format.
Improving HPC System Administration, Efficiency, and Uptime Using LDMS Metrics and Time Series Analysis
Administrators of large HPC clusters often lack the information needed to diagnose failing jobs in time to prevent downtime. This talk presents a method using dynamic time warping (DTW) on LDMS-collected metrics — with fault injection via the HPAS suite to simulate real-world anomalies — to detect abnormal system behavior and potentially cancel problematic jobs before they cause system crashes.
Portable, Scalable Tools for HPC-Scale Monitoring Data Exploration Without a Supercomputer
Exploring LDMS monitoring data has typically required cluster-scale resources. This talk presents a portable, modular pipeline — ingestion, normalization, validation, metric selection, and analysis — that enables job-level performance characterization on a standard workstation, and extends to job grouping and workload characterization intended to inform low-latency schedulers like Slurm.
PID Data Analysis Tooling at Hyperscale
The volume of per-process data LDMS collects requires modern tooling to turn into actionable insight. This talk covers tooling being developed at Sandia to analyze PID data at scale — per-job executables, node hours, user frequency — and what system administrators, application owners, and code development managers can do with this data to optimize platforms and prioritize development decisions.
Containerized Development Workflows for Ansible-Configured Monitoring Data Collection and Analysis Clusters
Using containers to develop, validate, and test an Ansible-based LDMS deployment workflow enables rapid iteration, automated CI/CD validation, and parallel development without shared environment conflicts. This talk covers the container-based deployment architecture for LDMS services, its practical benefits, and key limitations that have prompted evaluation of virtual machines as an alternative.
Containerizing Grafana
Separating Grafana from the host system and packaging its dependencies in a container enables independent updates, simplified dependency management, and consistency across development and production. This talk covers the containerized deployment strategy in practice and future work toward a pod-based architecture incorporating analysis scripts and backend services.
Looks Real, Acts Real? Evaluating Synthetic HPC Logs
Synthetic HPC logs supplement real datasets, preserve privacy, and generate more diverse workloads for tasks like outlier detection. This talk evaluates five generative approaches — CTGAN, TVAE, CopulaGAN, DoppelGANger, and diffusion models — across logs from four HPC systems, assessing both distributional fidelity and practical utility for downstream tasks including machine learning applications and scheduling algorithms.
From Deployment and Debug to Documentation
Debugging sessions often produce working solutions whose steps are poorly documented or lost. A session-capture script records the debugging process using built-in Linux utilities, generating a structured report that an AI agent ingests into a searchable Confluence page — building a reusable knowledge base aggregated across sessions.
Focus area sessions are structured group discussions — not presentations. The goal is to surface shared challenges, generate community-wide understanding, and identify directions for future work. Every attendee has a voice.
Managing LDMSD Configuration Across a Complex HPC Center with YAML and Ansible — 60 minutes
A production experience report on using LDMSD's YAML configuration features and Ansible to manage the daemon hierarchy across a data analysis cluster and ten HPC systems, covering the responsibility-per-cluster approach and use of etcd, Maestro, LDMS, SOS, and OWS. An open community discussion on configuration management challenges across sites follows.
LDMS Security for Multi-Tenant Monitoring: Working Group Report and Community Discussion — 90 minutes
Running ldmsd as a sidecar inside a job environment is a technically sound architecture for per-job metric collection — but authentication across the container/aggregator boundary is an open problem. A working group formed in March 2026 reports on four identified challenges: identity verification, Munge token portability, UID/GID namespace isolation, and the lack of multi-tenant access control. Two candidate approaches — certificate-based authentication and Unix domain socket transport — are under investigation. The session gathers site input and experience.
Per-Job Monitoring: Data Collection and Delivery Requirements — 90 minutes
A continuation of the 2026 per-job monitoring series. Part one covers PID tracking realities — a prerequisite for GPU metrics, CPU per-core attribution, and multi-node job correlation — and invites sites to share experience with PID visibility and network/RDMA metric attribution. Part two is an open community discussion on access policy: who should see per-job monitoring data, at what granularity, and what that means for how LDMS must be designed.
Each day includes open forum sessions whose topics are determined on the day, based on what the community wants to discuss. These are not scheduled in advance — they emerge from the conversations happening at the conference. Every attendee has a voice in setting the agenda.
Have questions or want to contribute? Contact the program committee co-chair, Nichamon, at nichamon@ogc.us.