LDMSCON2026

PROGRAM

📅 June 9–11, 2026
🕘 9:00 am – 5:00 pm daily
📍 David Rubenstein Forum, University of Chicago
🌐 In person + remote attendance

Three full days of presentations, lightning talks, group discussions, and networking. The schedule is final but may still be adjusted as needed; this page will be kept up to date. The full schedule is below.

Session Formats

Presentation — 30 minutes: Deployment experience, results, operational findings, and new LDMS capabilities shared with the community.

Lightning Talk — 15 minutes: Short, focused presentations suited for early-stage work, new tools, implementation experience, or topics that benefit from a concise format.

Focus Area Discussion — 30, 60 or 90 minutes: Structured group discussions — not presentations. The goal is to surface shared challenges, generate community-wide understanding, and identify directions for future work. Every attendee has a voice.

Presentations

Presentations share deployment experience, results, operational findings, and new LDMS capabilities with the community. Each presentation is 30 minutes, including Q&A.

LDMSD: 2026 in Review and 2027 Planning Speaker: Tom Tucker

A two-part session covering what was delivered and learned in 2026 — new capabilities, infrastructure improvements, and lessons from community deployments — then looking ahead to the planned direction for LDMSD in 2027, including open problems the community is working through and what sites can expect in the coming year.

NERSC-Sponsored Enhancements to LDMS 2025 Speaker: John Stile

A presentation of LDMS enhancements delivered in 2025 with NERSC sponsorship, covering job state triggered sampling, high frequency samplers, storage plugin stats, export of file data to metric sets and streams, a stream-to-message guide, UNCORE counters, and WORKFLOW_ID support.

Overview of Advanced Configuration Best Practices on Advanced Technology Platforms Speaker: Mathew Phillips

Covers establishing production LDMS observability on HPE EX systems across heterogeneous endpoints: RPM-based compute hosts and Slingshot switches running Debian on arm64. Contrasts two mature deployment patterns, a Kubernetes-centric model (CSM) at SNL and a Cloud-Init model (OpenCHAMI) at LANL, and situates them within a broader pipeline from automated package builds through downstream storage and analysis tooling.

Containerizing Grafana Speaker: Sara Walton

Separating Grafana from the host system and packaging its dependencies in a container enables independent updates, simplified dependency management, and consistency across development and production. Covers the containerized deployment strategy and future work toward a pod-based architecture incorporating analysis scripts and backend services.

User-Based LDMS Deployments Speaker: Jim Brandt

At Sandia, users are beginning to realize the benefits of monitoring both system and application utilization and performance characteristics. This talk presents how user-driven configuration, storage, and access management for add-on telemetry can be accomplished across a variety of scenarios, along with considerations for provisioning common analysis and visualization tooling.

LDMS Scorecarding Speaker: Vanessa Surjadidjaja

An introduction to job scorecarding — enabling users to view their job's usage of CPU, memory, and I/O as percentages. Covers how Scorecards Report enables attribution of performance impacts on production application runtimes, scaling across ten systems, and the recent incorporation of PostgreSQL into the scorecard pipeline.

LDMS Daemon Stats: Monitoring the Monitoring Speaker: Jim Brandt

A look at the statistics available for monitoring the health of LDMS data collection, transport, and storage infrastructure — how to use them to diagnose conditions, and work in progress to enable always-on analysis and low-latency notification of problems using the same tools available to users.

Refine: A Robust Approach to Unsupervised Anomaly Detection for Production HPC Systems Speaker: Efe Sencan

Traditional unsupervised anomaly detection assumes clean training data, but data from real HPC environments is contaminated by anomalies from resource contention, software bugs, and hardware failures. Refine is a VAE-based framework that iteratively removes high-error samples during training. It achieves an F1-score of 0.88 with up to 10% anomaly contamination, and 100% accuracy on production data from the Eclipse cluster.

Use of LDMS Framework for Low-Latency Feedback Loops Speaker: Jim Brandt

The LDMS message bus facility enables bi-directional event-based data flow. This talk presents work in progress to use the new message channel service for sending application progress and performance data to an analysis cluster and enabling feedback in the reverse direction — along with use cases and on-demand setup and teardown of job-based message channels.

Maestro Speaker: Nick Tucket

An overview of using Maestro for LDMS configuration management, monitoring, and load balancing — covering how Maestro simplifies operating LDMS at scale across complex deployments.

PID Data Analysis Tooling at Hyperscale Speaker: Alex Knigge

Turning the volume of per-process data that LDMS collects into actionable insight requires modern tooling. This talk covers tooling being developed at Sandia to analyze PID data at scale (per-job executables, node hours, user frequency) and what system administrators, application owners, and code development managers can do with this data to optimize platforms and prioritize development decisions.

Lightning Talks

Lightning talks are short, focused presentations — 15 minutes each, including Q&A — suited for early-stage work, new tools, implementation experience, or topics that benefit from a concise format.

Always-On Detection of Application Wait and Imbalance with LDMS Speaker: Damian Dechev

OS-level CPU utilization metrics often report 100% occupancy whether a thread is computing or spinning on a lock. This talk presents an LDMS plugin that uses hardware PMU counters — CPU cycles, retired instructions, and cache misses — to derive IPC in real time and detect active idling that conventional metrics cannot see, validated against MPI benchmarks on Sapphire Rapids processors.
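The IPC derivation described above can be sketched in a few lines. This is a minimal illustration, not the speaker's plugin: the `PmuSample` type, the function names, and both thresholds are hypothetical, and a real sampler would read the counters from the hardware rather than take them as arguments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PmuSample:
    """Cumulative counter values at one sampling instant (hypothetical layout)."""
    cycles: int
    instructions: int
    cache_misses: int


def ipc_between(prev: PmuSample, curr: PmuSample) -> Optional[float]:
    """Instructions per cycle over the interval between two samples."""
    d_cycles = curr.cycles - prev.cycles
    d_instructions = curr.instructions - prev.instructions
    if d_cycles <= 0:
        # No elapsed cycles (or a counter reset): IPC is undefined.
        return None
    return d_instructions / d_cycles


def looks_like_active_idle(prev: PmuSample, curr: PmuSample, cpu_util: float,
                           ipc_threshold: float = 0.3,
                           util_threshold: float = 0.95) -> bool:
    """Flag intervals where the OS reports a busy CPU but IPC is very low.

    Both thresholds are illustrative assumptions, not values from the talk.
    """
    ipc = ipc_between(prev, curr)
    return ipc is not None and cpu_util >= util_threshold and ipc < ipc_threshold
```

A thread spinning on a lock retires few instructions per cycle, so it would be flagged here even though OS-level utilization reads 100%. Suitable thresholds depend on the processor and workload.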

Exploration of Memory Bandwidth Measurements with PerfEvent2 and Other LDMS Plugins Speaker: Quintin Jimenez

An investigation of Sapphire Rapids performance using perfevent2, a recently published LDMS sampler that collects data from both core and uncore counters for memory-bandwidth visibility on modern Intel processors. Compares perfevent2 against other LDMS samplers and LIKWID, and characterizes measurement overhead on testbed benchmarks and production applications.

Containerized Development Workflows for Ansible-Configured Monitoring Data Collection and Analysis Clusters Speaker: Christopher Sullivan

Using containers to develop, validate, and test an Ansible-based LDMS deployment workflow enables rapid iteration, automated CI/CD validation, and parallel development without shared environment conflicts. Covers the container-based deployment architecture, practical benefits, and key limitations prompting evaluation of virtual machines as an alternative.

From Deployment and Debug to Documentation Speaker: Ben Griego

Debugging sessions often produce working solutions whose steps are poorly documented or lost. A session-capture script records the debugging process using built-in Linux utilities, generating a structured report that an AI agent ingests into a searchable Confluence page — building a reusable knowledge base aggregated across sessions.

Improving HPC System Administration, Efficiency, and Uptime Using LDMS Metrics and Time Series Analysis Speaker: Nathan Nail

Administrators of large HPC clusters often lack the information needed to diagnose failing jobs in time to prevent downtime. This talk presents a method using dynamic time warping (DTW) on LDMS-collected metrics — with fault injection via the HPAS suite to simulate real-world anomalies — to detect abnormal behavior and potentially cancel problematic jobs before they cause system crashes.
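For readers unfamiliar with dynamic time warping, the textbook form of the distance can be sketched as follows. This is a generic illustration under stated assumptions, not the speaker's method: the function name is hypothetical, the series are plain lists of floats, and the absolute difference is assumed as the local cost.

```python
import math
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    # cost[i][j] is the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances while b holds its position
                cost[i][j - 1],      # b advances while a holds its position
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]
```

Because the alignment may stretch or compress time, a metric trace that matches a reference run but is slightly delayed scores close to zero, while a trace with a different shape scores high. How a score is turned into an anomaly decision is left open here.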

Portable, Scalable Tools for HPC-Scale Monitoring Data Exploration Without a Supercomputer Speaker: Joao Rafael Da Silva Coelho

Exploring LDMS monitoring data has typically required cluster-scale resources. This talk presents a portable, modular pipeline — ingestion, normalization, validation, metric selection, and analysis — that enables job-level performance characterization on a standard workstation, and extends to job grouping and workload characterization intended to inform low-latency schedulers like Slurm.

Looks Real, Acts Real? Evaluating Synthetic HPC Logs Speaker: Ana Solorzano

Synthetic HPC logs can supplement real datasets, preserve privacy, and provide more diverse workloads for tasks like outlier detection. This talk evaluates five generative approaches (CTGAN, TVAE, CopulaGAN, DoppelGANger, and diffusion models) across logs from four HPC systems, assessing both distributional fidelity and practical utility for downstream tasks including machine learning and scheduling algorithms.

Focus Area Discussions

Focus area sessions are structured group discussions. The goal is to surface shared challenges, generate community-wide understanding, and identify directions for future work. Every attendee has a voice.

Managing LDMSD Configuration Across a Complex HPC Center with YAML and Ansible Facilitator: Gary Lawson

A production experience report on using LDMSD's YAML configuration features and Ansible to manage the daemon hierarchy across a data analysis cluster and ten HPC systems. It covers the responsibility-per-cluster approach and the use of etcd, Maestro, LDMS, SOS, and OWS, and is followed by an open community discussion on configuration management challenges across sites.

LDMS Security for Multi-Tenant Monitoring: Working Group Report and Community Discussion Facilitator: Narate Taerat

Running ldmsd as a sidecar inside a job environment is a technically sound architecture for per-job metric collection — but authentication across the container/aggregator boundary is an open problem. A working group formed in March 2026 reports on four identified challenges: identity verification, Munge token portability, UID/GID namespace isolation, and the lack of multi-tenant access control. Two candidate approaches — certificate-based authentication and Unix domain socket transport — are under investigation. The session gathers site input and experience.

Per-Job Monitoring: Data Collection and Delivery Requirements Facilitator: Nichamon Naksinehaboon

As HPC nodes increasingly run multiple concurrent jobs, node-level metrics alone no longer answer the questions that matter: how much CPU did Job 123 actually use, and was my job slow because it was co-scheduled with a memory-intensive neighbor?

This session reports on per-job monitoring investigations with LDMS across three prior user group talks: the multi-tenant framing and tenant UUID tagging approach; per-job CPU, memory, IO, and network metrics via cgroup v2 and /proc/<pid>; and per-job GPU monitoring using DCGM for exclusive and MIG allocation modes under Slurm and Flux. For each resource type we summarize what is collectible, at what granularity, which configuration prerequisites apply, and where the gaps remain.

The second part examines what it takes to deliver that data: running a sidecar ldmsd alongside jobs, the effect of namespace isolation on metric visibility, and the boundary-crossing challenges when host-side and job-side identifiers diverge.

The session is structured to generate discussion, not deliver finished designs. Two embedded discussion pauses invite the audience to surface site-specific constraints, validate assumptions, and help prioritize what to build first.

Characterizing Back-end Storage Performance for Monitoring & Analysis Facilitators: Kathleen Shoga & Jim Brandt

Different sites use widely varying storage technologies for monitoring data aggregation and analysis. This session seeks to identify use-case scenarios, key metrics, and methodologies for benchmarking and comparing storage and database technologies in a use-case-relevant way. Expected outcomes include high-level use-case descriptions, common performance metrics, and ideas for low-overhead pre-purchase testing.

Community Sessions

LDMS User Feedback: Tell Us What You Think Facilitator: Jim Brandt

A structured community input session — share your experience with LDMS: what's working, what's challenging, and what you'd like to see improved.

LDMS User Feedback: What We Heard Facilitator: Jim Brandt

Follow-up to the morning's community input session. The facilitator shares the results and key themes from the feedback collected, and opens the floor for discussion.
