15-minute Lightning Talks
Application-Aware Performance Monitoring with Kokkos Tools + LDMS Software Integration
Presenter: Vivek Kale, Sandia National Laboratories
We present enhancements and integrations of the Kokkos performance-portable parallel programming ecosystem that leverage performance monitoring capabilities via LDMS. We highlight (1) analysis of application-specific variables in scientific computing via the Kokkos ecosystem and (2) new build system updates for packaging and automatic software configuration. For (1), we will show modifications and ongoing work in the Kokkos ecosystem that enable performance monitoring strategies to incorporate scientific domain knowledge rather than only traditional metrics (a simple example: molecules/second instead of FLOPS). The modifications involve the Kokkos Tools GPU vendor connectors, e.g., the NVIDIA nvtx-connector, and changes to the Kokkos backends and associated programming models such as OpenMP, OpenACC, and CUDA, to correlate these metrics with low-level profiling. We will show how we can effectively capture application-specific variables of a Trilinos-inspired Kokkos solver program via modifications in the Kokkos ecosystem. For (2), we present recent updates to the Kokkos ecosystem's software integration with LDMS. Specifically, we discuss (a) a drafted integration of the LDMS connector directly into the Kokkos Tools repository, (b) a Spack package for Kokkos Tools enabling integration of LDMS via Spack's depends_on functionality, and (c) generalization of utilities from the LDMS Kokkos Tools connector, particularly sampling and filtering, for use by other Kokkos Tools connectors. Finally, we note how this work impacts the Application Data Collection (ADC) project: we show how our extensions can be utilized by ADC, and how our Kokkos Tools + LDMS software integration efforts also make integration of Kokkos Tools with ADC much easier.
Deployment of LDMS on CSM Systems
Presenter: M. Aiden Phillips, Los Alamos National Laboratory
LA-UR-25-25152
We will present the building and installation of the Lightweight Distributed Metric Service (LDMS) on Cray System Management (CSM) systems. We will discuss issues with the prepackaged version of LDMS that ships with CSM systems and explain why it is more efficient to build and deploy LDMS yourself instead of using the prepackaged version. Another facet of the build process that may interest other users is building LDMS RPMs inside containers for multiple Linux distributions; we also discuss future work on emulating the ARM64 architecture in containers to build those RPMs. This leads into a discussion of using Ansible alongside Jinja templates to deploy the samplers, and the use of replica sets to deploy the Kubernetes pods for aggregators on the management plane. For data collection, the Kafka plugin is used to transfer data to Splunk for short-term visualization and to a VAST database for long-term storage.
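As a rough illustration of the Ansible-plus-Jinja deployment pattern described above, the hypothetical task list below renders a sampler configuration from a Jinja template and ensures the daemon is running. The template name, install path, and systemd unit name are illustrative assumptions, not the ones used at LANL.

```yaml
# Hypothetical Ansible tasks; file paths, the template name, and the
# systemd unit name are illustrative assumptions.
- name: Render ldmsd sampler configuration from a Jinja template
  ansible.builtin.template:
    src: ldmsd.sampler.conf.j2          # assumed per-cluster template
    dest: /etc/ldms/ldmsd.sampler.conf  # assumed install path
    mode: "0644"

- name: Ensure the ldmsd sampler service is running
  ansible.builtin.systemd:
    name: ldmsd-sampler                 # assumed unit name
    state: started
    enabled: true
```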
Determining Optimal MPI Communication Function Using LDMS Feedback Loop and ML
Presenter: Vanessa Surjadidjaja, Sandia National Laboratories
This presentation will provide a technical discussion of the ongoing development of tools to help HPC/MPI users determine their optimal MPI communication functions. Many users rely on trial and error, submitting multiple jobs and benchmarking to determine which communication function best suits their needs. However, for MPI users in the HPC space, job allocations do not necessarily guarantee the same nodes or machine racks. While there are ways to request specific nodes, such jobs must wait for every specified node to become free. Here, we put forward the idea of using LDMS metric sets with a machine learning backend. Using LDMS metric sets, we can apply a model previously trained on data from other runs to predict how the current run will perform with its implemented MPI function. This offers users an informed choice of MPI message passing functions. We plan to incorporate LDMS's feedback loop capabilities to help users automate the process of cancelling suboptimal jobs, changing their MPI message passing function, rebuilding, and resubmitting jobs.
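A minimal sketch of the prediction step, assuming per-run features have already been summarized from LDMS metric sets into a table; the feature names, label column, and file are illustrative assumptions, not part of LDMS or of the presented tool.

```python
# Hypothetical sketch: predict which MPI communication function will
# perform best for the current allocation, using features derived from
# LDMS metric sets collected during previous runs.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Historical runs: per-run features (e.g., network and memory counters)
# plus the best-performing MPI function observed for that run.
history = pd.read_csv("ldms_run_history.csv")   # assumed export of LDMS data
features = ["avg_net_bw", "peak_mem_bw", "nodes", "msg_size"]
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(history[features], history["best_mpi_function"])

# Features for the current allocation, summarized from live LDMS sets.
current = pd.DataFrame([{"avg_net_bw": 9.2, "peak_mem_bw": 180.0,
                         "nodes": 64, "msg_size": 65536}])
print("Suggested MPI function:", model.predict(current)[0])
```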
Incentive-Based Power Efficiency Mechanism on the Fugaku Supercomputer
Presenter: Devesh Tiwari, Northeastern University
This talk will describe the deployment and operational experience of a novel incentive-based power-control strategy on the Fugaku supercomputer. This incentive-based program, termed Fugaku Points, provides knobs that let users apply power control functions to improve the overall power efficiency of the supercomputer, advancing HPC sustainability in terms of its environmental implications. We will discuss new operational opportunities, challenges, and future directions. This work was led by Ana.
Job Grouping Based Intelligent Resource Prediction Framework
Presenter: Beste Oztop, Boston University
In large-scale computing systems, High-Performance Computing (HPC) users estimate and request batch job resources to the best of their knowledge. Manual resource requests can impact overall system efficiency and resource utilization due to over- and underestimation. Overestimation of batch job resources leads to increased wait time, job backfilling inefficiencies, wasted compute power, and unused memory resources. Underestimation of the required execution time, number of processors, or memory size, on the other hand, can lead to early job terminations. As we enter the exascale era, it is more important than ever to utilize resources efficiently, reduce poor job scheduling, and increase overall system efficiency. Existing resource managers in HPC systems lack built-in mechanisms for predicting the resource requirements of batch jobs. We address this challenge by designing a machine learning-based resource recommendation framework that uses historical batch job information from workload managers to predict three key parameters for batch jobs: the execution time, the maximum memory size, and the maximum number of CPU cores required. In contrast to existing work in the literature focusing on resource prediction, we introduce a resource suggestion framework that considers both underestimates and overestimates in the resource prediction of batch jobs. We group similar jobs and train individual regression models for each group to provision the necessary resources. We then give suggestions and a recommended batch job script to the HPC user. Our framework outperforms a baseline with no grouping mechanism, achieving over 98% success in eliminating underpredictions while reducing overpredictions.
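A minimal sketch of the group-then-regress idea, assuming job features extracted from workload-manager accounting history; the column names, cluster count, and safety margin are illustrative assumptions, not the framework's actual design choices.

```python
# Sketch: cluster similar jobs, then train one regressor per (group, target).
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor

jobs = pd.read_csv("job_history.csv")               # assumed accounting export
feats = ["req_cpus", "req_mem_gb", "req_time_min"]  # what the user requested
targets = ["used_time_min", "used_mem_gb", "used_cpus"]

# Group similar jobs by their requested-resource profile.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
jobs["group"] = kmeans.fit_predict(jobs[feats])

# One regression model per (group, target) pair.
models = {}
for g, part in jobs.groupby("group"):
    for t in targets:
        models[(g, t)] = GradientBoostingRegressor(random_state=0).fit(
            part[feats], part[t])

# A small margin biases against underprediction (which kills jobs);
# overprediction only wastes some resources.
MARGIN = 1.10
def suggest(job, group):
    """job: one-row DataFrame containing the feature columns."""
    return {t: MARGIN * models[(group, t)].predict(job[feats])[0]
            for t in targets}

new_job = jobs[feats].head(1)
print(suggest(new_job, int(kmeans.predict(new_job)[0])))
```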
LDMS in the clouds
Presenter: Jim Brandt, Sandia National Laboratories
LDMS can be easily deployed in a cloud environment. This lightning talk describes how a cloud deployment configured for "burst to cloud" scenarios can incorporate the same LDMS infrastructure used in local clusters, including aggregating data to a common storage and analytics cluster and feeding results back to application and system software processes.
Autonomous Kokkos Performance Optimization Through APEX-Kokkos Integration and Monitoring Feedback Loops
Presenter: Vivek Kale, Sandia National Laboratories
We present an autonomous performance optimization framework leveraging APEX (Autonomic Performance Environment for eXascale) integration with Kokkos Tools to enable intelligent autotuning through continuous performance feedback. Our approach first uses profiling to augment APEX autotuning capabilities for Kokkos; we then establish closed-loop performance optimization of Kokkos application programs in which system-wide monitoring drives automatic parameter adjustments for scientific applications. Our methodology extends the new autotuning features in Kokkos (released in Kokkos 4.5 in December 2024) by integrating APEX's autonomic capabilities with the Kokkos Tools Tuning Interface. APEX's nested autotuning support enables sophisticated search strategies for complex optimization scenarios in Kokkos applications, including execution policy selection performed simultaneously with optimization of internal parameters. The APEX-Kokkos autotuning capabilities enable runtime tuning of algorithmic-level parameters (e.g., solver selection) coupled with tuning of Kokkos library parameters, e.g., team size and vector length. Our extensions also include initial low-level hardware profiling phases via the Kokkos Tools GPU vendor connectors and the PAPI connector; these augment APEX's autotuning decisions with detailed performance characterization, enabling more informed tuning strategies. Building on this, we consider autotuning via feedback from performance monitoring data from a tool like LDMS. To do so, we develop a simplified stand-in library for LDMS to demonstrate the core feedback mechanisms from the LDMS Kokkos Tools connector to the Kokkos Tools APEX connector. This approach allows us to validate the fundamental concepts of continuous performance feedback integration without the complexity of the full LDMS infrastructure. We note that this library can be used to direct performance feedback from any profiling events external to the Kokkos runtime system, e.g., MPI profiling data, to tune Kokkos applications. We demonstrate the APEX Kokkos Tools connector and the stand-in library on Kokkos applications motivated by DOE exascale science applications, especially those pertinent to Sandia: (a) a Kokkos matrix multiplication and (b) a Kokkos MueLu multigrid solver based on Trilinos. Using the APEX connector by itself (without the performance monitoring stand-in) offers a 2.4x speedup of the matrix multiplication on NERSC's Perlmutter. Using the performance monitoring stand-in ensures system-wide performance awareness with minimal, near-zero overhead, demonstrating the feasibility of tuning Kokkos application parameters based on monitoring data. The new Kokkos autotuning features via APEX, along with the prototyped extensions for feedback from LDMS, address critical challenges in exascale computing, where manual tuning becomes impractical due to system complexity and application diversity. Key contributions include: (1) incorporating APEX runtime autotuning capabilities within Kokkos Tools, (2) augmenting the APEX Kokkos autotuning of (1) with low-level profiling, (3) nested autotuning strategies for Kokkos that optimize algorithmic choices and Kokkos execution parameters together, and (4) a proof-of-concept stand-in library demonstrating LDMS feedback loop concepts for use in Kokkos runtime autotuning with APEX.
Combining APEX's autonomic decision-making with Kokkos' portability and our monitoring proof-of-concept allows us to create intelligent systems that adapt computational strategies based on application execution as well as real-time conditions on an HPC system. This represents progress toward self-optimizing scientific computing environments, with our proof-of-concept validating the potential for using LDMS to optimize Kokkos applications.
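To make the closed-loop idea concrete, here is a toy tuner — deliberately not the APEX or Kokkos Tools API — that tries candidate values of a tuning parameter, measures each, and keeps the best. In the framework described above, the measurement would come from profiling or monitoring feedback rather than a local timer, and the parameter would be a real Kokkos execution-policy knob.

```python
# Toy closed-loop tuner illustrating the feedback concept only.
import time

def run_kernel(team_size: int) -> None:
    # Stand-in workload whose cost depends on the tuning parameter.
    n = 1_000_000 // team_size
    sum(i * i for i in range(n))

best, best_t = None, float("inf")
for team_size in (32, 64, 128, 256):   # candidate parameter values
    t0 = time.perf_counter()
    run_kernel(team_size)
    dt = time.perf_counter() - t0      # feedback: here a timer; in the
    if dt < best_t:                    # framework, monitoring/profiling data
        best, best_t = team_size, dt
print(f"selected team_size={best} ({best_t:.4f}s)")
```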
Characterizing HPC Codes and Associated Resource Consumption
Presenter: Benjamin Allan, Sandia National Laboratories
In the evolving landscape of modeling and simulation (ModSim) analyses, understanding the intricate relationship between application behavior and resource consumption is critical to improving workflow efficiency. The Application Data Collection (ADC) project is designed to facilitate this understanding by providing a comprehensive framework for collecting and analyzing the utilization of codes and their impact on resource allocation. By merging data from the physics codes with node utilization metrics, ADC aims to provide a foundation that can be mined for insights to enhance operational efficiency and inform strategic decision-making. The ADC project is designed to collect and provide access to data in a secure, robust, and flexible manner. Data points are semi-structured, support domain-specific data schemas, and may be collected at various time points and implementation layers. In addition to collecting discrete data points, ADC supports the discovery and documentation of formal and informal workflows used by ModSim analysts. ADC facilitates both standard and ad-hoc queries. Query flexibility is crucial for accommodating a diverse group of data consumers. Furthermore, the ability to apply advanced data mining techniques powered by artificial intelligence and machine learning (AI/ML) allows for deeper insights into usage patterns and resource consumption trends. Another key feature of the ADC architecture is its plugin-based approach, which allows for seamless extensions to other existing data collectors, such as Adiak and Caliper from Lawrence Livermore National Laboratory (LLNL). This modular design enhances the adaptability of the system and fosters collaboration across research teams and institutions. The ADC project represents a step forward in understanding the utilization of codes and their implications for resource consumption. With its innovative data collection system, AI/ML capabilities, and plugin-based architecture, ADC is an asset for ModSim analysts and resource planners seeking to optimize the utilization and acquisition of computational resources.
Common Job Reporting Implementation
Presenter: Benjamin Schwaller, Sandia National Laboratories
This presentation describes an implementation of Slurm and LDMS data ingest and how those data streams are used to create per-job, one-line summarizations in a PostgreSQL database. The goal of this database is to provide common job summarizations across LDMS installations and give users a common output describing the resource usage statistics of their jobs. The implementation is intended to be deployable at all sites through use of the common analysis framework. This presentation will engage the audience on summarization figures of merit, with the aim of creating newer, better metrics. Future work in this area will be to build common center-level reporting based on these summaries and analyses to highlight possible resource usage issues and patterns to users.
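An illustrative sketch of the per-job one-line summarization step: join Slurm job records with LDMS node samples, reduce each job to a single summary row, and write the result to PostgreSQL. The file, table, column names, and connection string are assumptions for illustration, not the schema of the actual implementation.

```python
# Sketch: one summary row per job, appended to a PostgreSQL table.
import pandas as pd
from sqlalchemy import create_engine

slurm = pd.read_csv("slurm_jobs.csv")    # job_id, user, start_time, end_time, nodes
ldms = pd.read_csv("ldms_samples.csv")   # job_id, timestamp, cpu_util, mem_used_gb

summary = (ldms.groupby("job_id")
               .agg(avg_cpu_util=("cpu_util", "mean"),
                    max_mem_gb=("mem_used_gb", "max"))
               .reset_index()
               .merge(slurm, on="job_id"))          # one line per job

engine = create_engine("postgresql://ldms:secret@dbhost/jobsummary")  # assumed DSN
summary.to_sql("job_summaries", engine, if_exists="append", index=False)
```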
How To Compile a Sampler Plugin Outside of LDMS
Presenter: Christopher Morrone, Lawrence Livermore National Laboratory
Maestro
Presenter: Nick Tucker, Open Grid Computing
This presentation covers the Maestro application. Maestro is an LDMS configuration, monitoring, and load balancing application. The talk will cover how Maestro works, Maestro's features, deploying Maestro using a YAML configuration file, Maestro's load balancing methods, and multi-instance Maestro.
Monitoring System Deployment Management Modernization at Sandia
Presenter: Jennifer Green, Sandia National Laboratories
Monitoring system deployment management is a critical yet often overlooked aspect of ensuring continuous data collection in complex infrastructures. At Sandia National Laboratories, we have focused on enhancing the configuration management, packaging, testing, and deployment of these tools to ensure maintainability, correctness, and portability. Our efforts extend to supporting plugin installations for customer use to collect LDMS Streams data, post-processing and collection tools for automated analyses, and support of various Linux distributions, architecture targets, and specialized hardware, where the software must operate and be evaluated before being considered for production use. With an increased emphasis on continuous monitoring of the data center, our current and projected capabilities are providing key insights to stakeholders and building momentum to broaden the monitoring infrastructure to cover all production resources. This transition necessitates a turn-key solution and support framework, including building and positioning key skills, utilizing conventional and novel deployment tools and techniques, and generalizing specialized scripts and configurations. This presentation will provide an overview of our packaging, testing, and deployment pipeline, along with our project roadmap as we continue to modernize and scale our infrastructure.
NERSC LDMS @ 0.1Hz to 1Hz sample rates
Presenter: John Stile, National Energy Research Scientific Computing Center
People using LDMS metrics need confidence in the frequency of the data to draw meaningful conclusions. This presentation highlights observed bottlenecks in the data pipeline and describes how horizontal scaling was implemented to address them. As a result, we achieved a stable 1 Hz sampling rate across more than 5,000 nodes, collecting approximately 38,000 metrics per minute on Perlmutter. We also discuss ongoing efforts to enhance pipeline observability and ensure the delivery of actionable insights. I will share our Helm chart, the scripts used to scale and populate ldmsd configuration files, the script for collecting ldmsd metrics, and our dashboard for monitoring the system.
New Features in LDMS V4.5
Presenter: Tom Tucker, Open Grid Computing
LDMS v4.5 introduces changes that affect core system operations, plugin development, configuration methods, and documentation. This presentation provides an overview of selected features and changes from v4.4, helping attendees understand the transition considerations and new capabilities.
Overview of the new LDMSD Plugin API
Presenter: Christopher Morrone, Lawrence Livermore National Laboratory
Platform Agnostic Power and Energy Monitoring with the Variorum LDMS Plugin
Presenter: Kathleen Shoga, Lawrence Livermore National Laboratory
The integration of Variorum with the Lightweight Distributed Metric Service (LDMS) provides a scalable and flexible infrastructure for comprehensive power and performance monitoring in high-performance computing (HPC) systems. Variorum is a cross-platform library that enables fine-grained access to hardware-level power and energy metrics across diverse architectures, including Intel, AMD, and NVIDIA platforms. LDMS is a lightweight, pluggable monitoring framework designed for efficient collection and transport of time-series performance data. By integrating Variorum as a metric plugin within LDMS, LDMS users can access high-resolution power and energy telemetry data, enabling improved system diagnostics, energy-aware scheduling, and performance optimization. This integration supports extensibility for emerging architectures and is a critical step toward holistic, energy-conscious HPC monitoring solutions. The presentation will include background on Variorum, instructions on deployment of the Variorum LDMS plugin, and a short demo of the Variorum LDMS plugin.
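For orientation on what deploying an LDMS sampler plugin typically involves, the snippet below shows the standard ldmsd load/config/start pattern. The plugin name "variorum_sampler" and its option set are assumptions for illustration; consult the plugin's documentation presented in this talk for the actual interface.

```
# Standard ldmsd sampler pattern; the plugin name and options shown are
# assumptions, not the documented Variorum plugin interface.
load name=variorum_sampler
config name=variorum_sampler producer=${HOSTNAME} instance=${HOSTNAME}/variorum
start name=variorum_sampler interval=1000000 offset=0
```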
Practical Data Analysis
Presenter: Serge Polevitzky, FedData
LDMS collects a great deal of data, but do any of us really, truly understand what the data is trying to tell us? Looking at a chart may help, but viewing a chart without appreciating the history of the same values won't yield much benefit. Nor will we get much benefit from looking at charts without a statistical basis for interpretation. This talk presents some ideas on using statistics to help the system analyst or computer center director understand what the data is trying to tell them. For example: Where is congestion occurring? What constitutes congestion? How long or severe is the congestion? What are the coincidental events (remembering that "correlation does not imply causation")? And is congestion defined one way at one site and differently at another (if so, how do we "share experiences or data")?
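A small sketch of one such statistical lens: flag congestion-like anomalies as points that sit far from the recent history of the same metric, rather than eyeballing a chart. The file, metric name, window, and threshold are illustrative assumptions; "congestion" here is defined purely statistically, and a given site may well choose a different definition.

```python
# Sketch: rolling z-score to flag values far above the recent baseline.
import pandas as pd

df = pd.read_csv("fabric_metrics.csv", parse_dates=["time"])  # assumed export
s = df.set_index("time")["congestion_events"].sort_index()

window = s.rolling("15min")                 # recent history of the same metric
zscore = (s - window.mean()) / window.std()

anomalies = s[zscore > 3.0]                 # > 3 sigma above recent baseline
print(anomalies.head())
```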
Using the bi-directional capability of LDMS Stream to provide low-latency feedback to applications
Presenter: Ann Gentile, Sandia National Laboratories
User applications can benefit from knowledge of performance-impacting conditions on shared resource components such as networks and file systems. While an analysis system with global access to a site's compute system components can identify such conditions, getting that information back to a user application's processes can be difficult or impossible. However, since the LDMS ecosystem spans data transport from user processes to last-level aggregators that can be co-located with runtime analyses, there exists a pathway for direct information flow from these analyses back to application processes. This talk presents work done at Sandia National Laboratories to exploit the bi-directional communication paths provided by the LDMS Stream feature.
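As a conceptual illustration of the application side of this pathway — with the actual LDMS Stream receive call stubbed out, since the real bindings and message schema are deployment-specific assumptions — the loop below receives analysis verdicts and adapts application behavior:

```python
# Conceptual sketch only: recv_feedback() is a stub standing in for an
# LDMS Stream receive on a subscribed feedback stream, and the JSON
# schema is an assumption.
import json
import time

def recv_feedback() -> str:
    """Stub for receiving one message from a subscribed feedback stream."""
    time.sleep(1)
    return json.dumps({"lustre_congested": True, "severity": 0.8})

for _ in range(3):  # a real application would loop for its lifetime
    advice = json.loads(recv_feedback())
    if advice.get("lustre_congested") and advice["severity"] > 0.5:
        # e.g., defer a checkpoint until the shared file system recovers
        print("feedback: file system congested; deferring checkpoint")
```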
Configuring LDMS Clusters with YAML
Presenter: Nick Tucker, Open Grid Computing
This presentation will cover configuration of an entire LDMS cluster within a single YAML file, as well as configuring individual ldmsd instances with YAML. The presentation will go over different cluster configuration implementations, such as advertisers and producer listeners, as well as a more conventional producer configuration. It will also cover using the ldmsd_yaml_parser to generate a single daemon's v4 configuration file, as well as passing the YAML file directly to an ldmsd using the -y parameter. This presentation will not be "hands on", although example YAML configuration files will be provided to attendees.
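To give a feel for the single-file approach, here is a rough sketch of a cluster YAML covering samplers and one aggregator. The key names are approximations recalled from published examples, so treat the exact schema as an assumption and rely on the example files provided with this talk.

```yaml
# Approximate shape only; verify key names against the ldmsd_yaml_parser
# documentation and the example files from this talk.
daemons:
  - names: "sampler-[1-4]"        # four sampler daemons
    hosts: "node-[1-4]"
    endpoints:
      - names: "node-[1-4]-ep"
        ports: 411
        xprt: sock
  - names: "agg-1"
    hosts: "admin1"
    endpoints:
      - names: "agg-1-ep"
        ports: 10411
        xprt: sock
samplers:
  - daemons: "sampler-[1-4]"
    plugins:
      - name: meminfo             # standard sampler plugin
        interval: "1s"
aggregators:
  - daemons: "agg-1"
    peers:
      - daemons: "sampler-[1-4]"
        reconnect: "20s"
```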
Contributing to LDMS Documentation: A Hands-On Tutorial
Presenter: Sara Walton, Sandia National Laboratories
This tutorial will provide a practical introduction to contributing to LDMS documentation, with a focus on improving accessibility and maintainability for new and existing developers. Participants will learn the guidelines for documentation development, including file organization, naming conventions, and required sections. The session will also cover testing and validating documentation updates, as well as how to navigate and understand the LDMS documentation structure on Read the Docs.
LDMS Rail and Message Service
Presenter: Narate Taerat, Open Grid Computing
When an ldmsd (e.g., an L2 aggregator) collects a large number of sets from another ldmsd (e.g., an L1 aggregator) over a single connection, it may not be able to keep up with the update interval, because a single thread processes update completions (and possibly storage) for that connection. Manually configuring the L2 aggregator with multiple connections (producers) to the same L1 ldmsd, with matching updaters, can mitigate the problem, but the configuration can be complicated (e.g., the set names must be known at configuration time). LDMS Rail, a new feature in the LDMS library that bundles multiple connections to a peer, is introduced to handle this issue. In addition to LDMS Rail, we also introduce the LDMS Message service, which replaces the `ldmsd_stream` feature and addresses issues identified in `ldmsd_stream`, e.g., unlimited buffer growth. In this tutorial, we will walk through the motivation and concepts behind LDMS Rail and LDMS Message, programming with them in C and Python, and how to configure the related parameters in ldmsd.
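To illustrate why a rail helps, here is a toy back-of-the-envelope model (not LDMS code): completions for all sets on one connection are serialized on one thread, while a rail of N connections spreads the same sets over N completion threads. The set count and per-completion cost are purely illustrative assumptions.

```python
# Toy model of rail scaling; all numbers are illustrative.
import math

SETS = 10_000       # sets collected from one peer
COST = 0.0002       # seconds to process one update completion (assumed)
INTERVAL = 1.0      # desired update interval in seconds

def completion_time(n_connections: int) -> float:
    # Sets are spread roughly evenly across the rail's connections,
    # each serviced by its own completion thread.
    per_conn = math.ceil(SETS / n_connections)
    return per_conn * COST

for n in (1, 2, 4, 8):
    t = completion_time(n)
    status = "keeps up" if t <= INTERVAL else "falls behind"
    print(f"rail of {n} connection(s): {t:.2f}s per interval -> {status}")
```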
LDMS Peer Daemon Advertisement: Simplifying Producer-Aggregator Configuration
Presenter: Nichamon Naksinehaboon, Open Grid Computing
The LDMS Peer Daemon Advertisement feature simplifies ldmsd aggregator configuration by automatically creating producers based on advertisements that peer daemons send to announce their existence. In standard deployments, administrators must manually configure each producer on the aggregator, creating significant overhead. This manual configuration is nearly impossible in dynamic environments where monitoring targets frequently change. This tutorial presents a hands-on introduction to LDMS Peer Daemon Advertisement, which enables samplers to advertise their presence to aggregators, which in turn automatically establish connections based on configurable criteria. Participants will learn to configure both sampler/peer daemons and aggregators to leverage this feature. The tutorial demonstrates hostname pattern matching and IP range filtering to control which producers are automatically added to aggregators. Advanced configurations will cover improving the scalability of data transfer using connection rails and implementing message quotas to manage resource utilization. Through practical exercises, attendees will configure samplers to advertise themselves, set up aggregators to accept advertisements with various filtering options, and monitor the status of advertisements and connections. The tutorial will also show how to manage the automatically created producers on an aggregator and monitor their statuses.
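For orientation before the hands-on exercises, the sketch below shows the general shape of the two sides of the feature as recalled from the ldmsd documentation; verify the exact command and attribute names against the man pages for your LDMS version.

```
# Sampler side: advertise this daemon to an aggregator (names assumed).
advertiser_add name=adv xprt=sock host=agg1 port=411
advertiser_start name=adv

# Aggregator side: accept advertisements, filtered by hostname pattern.
prdcr_listen_add name=computes regex=^node[0-9]+$
prdcr_listen_start name=computes
```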
Understanding deployed LDMS infrastructure using daemon statistics and log messages
Presenter: James Brandt, Sandia National Laboratories
LDMS v4.5.x has upgraded performance counters, statistics, and messaging. This tutorial explores how these new capabilities can be utilized by administrators to understand how well provisioned their LDMS infrastructure is relative to their data gathering, transport, and storage technology needs based on a deployed configuration. It will also provide insight into how a deployment might be modified to better accommodate current and planned loads in the event performance limits are being reached on some components.
Streamlined LDMS deployment for standup and simple configurations
Moderators: Jim Brandt and Evan Donato, Sandia National Laboratories
Configuring a monitoring infrastructure for large-scale systems incorporating multiple compute, memory, network, and storage technologies can be challenging. However, for initial identification of weak elements and a high-level understanding of how memory, processors, network, and storage are utilized across a system, a canned and very streamlined approach may be appropriate. This BoF will present a demonstration of how this can be accomplished, followed by an interactive session to: 1) determine how attractive such an approach would be; 2) identify key attributes, raw or derived, that people would like such an approach to present; 3) identify scenarios for use of this approach; 4) identify desired changes and extensions to the presented approach; and 5) identify the relative desire for containerized vs. bare-metal sampler deployment, and why.
LDMS Data Analysis Q&A
Moderator: Ben Schwaller, Sandia National Laboratories
What do we do with all the data we collect? This BoF will explore the challenges of using the data collected through LDMS and work with the audience to ponder paths forward. There will also be time dedicated to discussing new use cases for LDMS data and how to pair it with other sources of data.
Delivering Monitoring-driven Knowledge to End Users
Panelists: John Stile, National Energy Research Scientific Computing Center
Deborah Rezenka, Los Alamos National Laboratory
Kathleen Shoga, Lawrence Livermore National Laboratory
Scot Swan, Sandia National Laboratories
Moderator: Jim Brandt, Sandia National Laboratories
This panel will examine the gap between collecting large volumes of monitoring data and delivering meaningful value to end users. Panelists will discuss what system administrators and application users actually need from monitoring systems and the challenges they face in accessing and using that data in operational HPC environments.
Key discussion areas:
- System Administrator Perspectives: Essential metrics, alerts, and insights system administrators require from monitoring data to effectively manage HPC resources and identify issues
- Application User Insights: Critical monitoring data job submitters need to understand application behavior, identify bottlenecks, and improve efficiency
- Data Accessibility Challenges: Barriers and unmet requirements that prevent end users from effectively accessing and using monitoring information