November 2025
Times are United States Central Time
Making Bouquets from Blossoms: Standard-informed Communication Innovations in MPI Advance
Large-scale programming systems and the communication systems that support them have transformed dramatically in the past decade with the emergence of new computer architectures and new computational problems to address. Many of these innovations blossomed outside existing communication standards and community implementations, allowing for rapid innovation, but this also resulted in communication interfaces whose important semantic aspects were sometimes only implicitly defined by a single implementation, making them hard to port, reason about, or extend.
In this talk, I discuss our research on standard-informed and standard-informing communication system innovation in the open source MPI Advance communication library. This research focuses on integrating insights from modern hardware, other communication systems, and the MPI standard itself to address the needs of modern systems and applications. Importantly, our MPI Advance abstractions are self-contained bouquets of innovation in separate areas such as GPU triggering, system topology awareness, and new language interfaces, rather than a fully integrated communication system. This allows MPI Advance packages to be easily deployed both in existing applications and on top of existing communication libraries and systems; it also makes them ripe for refinement and integration into production communication system implementations and standards. Finally, I also discuss the importance of the core MPI standard concepts in informing and guiding our research, and how the MPI community could adapt to further encourage innovation around the standard.
Patrick Bridges is a Professor of Computer Science and Director of the Center for Advanced Research Computing at the University of New Mexico. Bridges received his Bachelor of Science in Computer Science from Mississippi State University in 1994 and his Ph.D. in Computer Science from the University of Arizona in 2002. Since joining the University of New Mexico in January 2003, his research has focused on system software for high performance computing systems, particularly operating systems and communication systems software. He and his collaborators and students have researched and developed key system software innovations in HPC application performance measurement and monitoring, virtualization of large-scale computing systems, and optimization of communication system performance for high-end computing systems and applications, resulting in over 150 publications in these areas and over $15M in research funding from a wide range of sponsors. He has also served the HPC community in a wide range of technical leadership roles at conferences such as SC, HPDC, and IEEE Cluster, and is a member of the ACM and IEEE.
Refreshments available in the common area.
Mike Söhner, Christoph Niethammer
MPI provides a flexible C API to communicate data of various types between a set of distributed processes over high-speed interconnects in HPC systems. Data buffers are described using MPI datatypes, which specify the type and layout of the data to be transmitted. To construct these datatypes, users must manually describe the memory layout of buffer elements via the MPI API. However, modern applications are typically written in object-oriented C++, which offers significant advantages over C, including type safety and metaprogramming capabilities. In this work, we introduce a new C++ API and datatype engine that leverage C++ language features such as concepts, ranges, and the upcoming reflection facilities to extract the necessary datatype information for the user at compile time. This approach simplifies the user’s work, enhances code safety by eliminating manual datatype construction, and offers previously unavailable possibilities. Our measurements demonstrate that this interface introduces no performance overhead and, in some cases, even improves performance.
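For context on the manual construction this abstract refers to, the sketch below shows the existing C-style way of describing a struct with MPI_Type_create_struct. It is a generic illustration, not the paper's new C++ interface, which would derive this information automatically at compile time.

// Minimal sketch of today's manual MPI datatype construction for a struct.
// This is the C-style approach the paper aims to replace, not the new C++ API.
#include <mpi.h>
#include <cstddef>

struct Particle {
    double position[3];
    int    id;
};

MPI_Datatype make_particle_type() {
    int          blocklengths[2]  = {3, 1};
    MPI_Aint     displacements[2] = {offsetof(Particle, position),
                                     offsetof(Particle, id)};
    MPI_Datatype types[2]         = {MPI_DOUBLE, MPI_INT};

    MPI_Datatype tmp, particle_type;
    MPI_Type_create_struct(2, blocklengths, displacements, types, &tmp);
    // Account for trailing padding so arrays of Particle have the correct extent.
    MPI_Type_create_resized(tmp, 0, sizeof(Particle), &particle_type);
    MPI_Type_commit(&particle_type);
    MPI_Type_free(&tmp);
    return particle_type;
}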
Amirhossein Sojoodi, Mohammad Akbari, Hamed Sharifian, Ali Farazdaghi, Ryan E. Grant, Ahmad Afsahi
Optimizing GPU-to-GPU communication is a key challenge for improving performance in MPI-based HPC applications, especially when utilizing multiple communication paths. This paper presents a novel performance model for intra-node multi-path GPU communication within the MPI+UCX framework, aimed at determining the optimal configuration for distributing a single P2P communication across multiple paths. By considering factors such as link bandwidth, pipeline overhead, and stream synchronization, the model identifies an efficient path distribution strategy, reducing communication overhead and maximizing throughput. Through extensive experiments on various topologies, we demonstrate that our model accurately finds theoretically optimal configurations, achieving significant improvements in performance, with an average error of less than 6% in predicting the optimal configuration for very large messages.
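The paper's actual model is not reproduced here. As rough intuition for what such a model optimizes, the simplified sketch below (our own assumption, ignoring pipeline overhead and stream-synchronization effects) splits a message across paths in proportion to link bandwidth so that per-path transfer times bytes_i / bandwidth_i are balanced.

// Illustrative only: a naive multi-path split proportional to link bandwidth,
// which balances per-path transfer time bytes_i / bandwidth_i. The paper's
// model additionally accounts for pipelining and stream synchronization.
#include <cstddef>
#include <vector>

std::vector<std::size_t> split_message(std::size_t total_bytes,
                                       const std::vector<double>& bandwidth_GBps) {
    double total_bw = 0.0;
    for (double b : bandwidth_GBps) total_bw += b;

    std::vector<std::size_t> chunk(bandwidth_GBps.size());
    std::size_t assigned = 0;
    for (std::size_t i = 0; i + 1 < bandwidth_GBps.size(); ++i) {
        chunk[i] = static_cast<std::size_t>(total_bytes * (bandwidth_GBps[i] / total_bw));
        assigned += chunk[i];
    }
    chunk.back() = total_bytes - assigned;  // remainder goes to the last path
    return chunk;
}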
James B. White III
Near the full scale of exascale supercomputers, latency can dominate the cost of all-to-all communication even for very large message sizes. We describe GPU-aware all-to-all implementations designed to reduce latency for large message sizes at extreme scales, and we present their performance using 65536 tasks (8192 nodes) on the Frontier supercomputer at the Oak Ridge Leadership Computing Facility. Two implementations perform best for different ranges of message size, and all outperform the vendor-provided MPI_Alltoall. Our results show promising options for improving implementations of MPI_Alltoall_init.
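For readers unfamiliar with the persistent-collective interface the abstract alludes to, a minimal MPI-4 usage pattern is sketched below. This is a generic illustration, not the authors' implementation; GPU awareness comes from passing device buffers to a GPU-aware MPI library.

// Generic MPI-4 persistent all-to-all pattern (not the paper's implementation).
// sendbuf/recvbuf may be GPU device pointers when the MPI library is GPU-aware.
#include <mpi.h>

void exchange_many_times(const double* sendbuf, double* recvbuf,
                         int count_per_rank, MPI_Comm comm, int iterations) {
    MPI_Request req;
    MPI_Alltoall_init(sendbuf, count_per_rank, MPI_DOUBLE,
                      recvbuf, count_per_rank, MPI_DOUBLE,
                      comm, MPI_INFO_NULL, &req);
    for (int it = 0; it < iterations; ++it) {
        MPI_Start(&req);                       // setup was done once, up front
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Request_free(&req);
}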
Thomas Erbesdobler, Amir Raoofy, Ehab Saleh, Josef Weidendorfer
Programmable smart network devices are heavily used by cloud providers, but typically not for HPC. However, they provide opportunities for offloading computations, in particular collective operations, which are important for data-intensive workloads in classic HPC and ML training. In this paper, we present a prototype called mpitofino that enables offloading MPI collectives (in particular reductions) onto smart switches over an Ethernet fabric. We target Intel’s programmable Ethernet switches equipped with a Tofino ASIC, and we use the P4 programming language to process collective packets on the chip’s low-latency data path. We demonstrate how the flexibility of P4 enables us to use RoCEv2 as the protocol, utilizing RDMA hardware support on the nodes’ NICs. Furthermore, we implement mpitofino as a collective provider in Open MPI and discuss its desirable scaling characteristics. Finally, we demonstrate that mpitofino can achieve data throughput close to the 100 Gbit/s line rate.
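Because mpitofino plugs in as an Open MPI collective provider, the application-facing side is unchanged; the sketch below (illustrative, not taken from the paper) simply shows the kind of reduction call that such a provider can execute in the switch data path instead of on the hosts.

// Application view of an offloadable reduction (illustrative only).
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    float local[4]  = {1.f, 2.f, 3.f, 4.f};
    float global[4];
    // With a switch-offloaded collective provider, this reduction can be
    // computed on the switch rather than on the participating hosts.
    MPI_Allreduce(local, global, 4, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}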
Lunch is not provided by the workshop, so go check out St. Louis and we will reconvene at 2:00.
Shannon Kinkead, Jackson Wesley, Whit Schonbein, David DeBonis, Matthew Dosanjh, Amanda Bienz
Performant all-to-all collective operations in MPI are critical to fast Fourier transforms, transposition, and machine learning applications. There are many existing implementations of all-to-all exchanges on emerging systems, with the achieved performance dependent on many factors, including message size, process count, architecture, and parallel system partition. This paper presents novel all-to-all algorithms for emerging many-core systems. Further, the paper presents a performance analysis against existing algorithms and the system MPI, with the novel algorithms achieving up to a 3x speedup over the system MPI on 32 nodes of a state-of-the-art Sapphire Rapids system.
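The paper's novel many-core algorithms are not reproduced here. As a point of reference, a classical nonblocking implementation of the all-to-all pattern, the kind of baseline such work is typically compared against, can be sketched as follows.

// Classical nonblocking all-to-all baseline (illustrative; not one of the
// paper's novel algorithms). Each rank posts one receive and one send per
// peer and then waits for all of them.
#include <mpi.h>
#include <cstddef>
#include <vector>

void alltoall_baseline(const char* sendbuf, char* recvbuf,
                       int bytes_per_rank, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    std::vector<MPI_Request> reqs(2 * size);
    for (int peer = 0; peer < size; ++peer) {
        MPI_Irecv(recvbuf + (std::size_t)peer * bytes_per_rank, bytes_per_rank,
                  MPI_BYTE, peer, 0, comm, &reqs[peer]);
        MPI_Isend(sendbuf + (std::size_t)peer * bytes_per_rank, bytes_per_rank,
                  MPI_BYTE, peer, 0, comm, &reqs[size + peer]);
    }
    MPI_Waitall(2 * size, reqs.data(), MPI_STATUSES_IGNORE);
}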
John Biddiscombe, Mikael Simberg, Raffaele Solca, Alberto Invernizzi, Auriane Reverdell, Rocco Meli, Joseph Schuchart
Integrating asynchronous MPI messaging with tasking runtimes requires careful handling of request polling and dispatching of associated completions to participating threads. The new C++26 Senders (std::execution) library offers a flexible collection of interfaces and templates for schedulers, algorithms, and adaptors to work with asynchronous functions, and it makes explicit the mechanism for transferring execution from one context to another, which is essential for high performance. We have implemented the major features of the Senders API in the pika tasking runtime and used them to wrap asynchronous MPI calls so that messaging operations become nodes in the execution graph with the same calling semantics as other operations. The API allows us to easily experiment with different methods of message scheduling and completion dispatching. We present insights from our implementation on how application performance is affected by design choices surrounding the placement, scheduling, and execution of polling and completion tasks using Senders.
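pika's Senders-based MPI wrappers are not shown here. As a conceptual sketch of the underlying pattern (our own simplification, using plain MPI calls rather than pika's API), a nonblocking operation becomes an asynchronous graph node by pairing the posted request with a continuation that a polling task invokes on completion.

// Conceptual sketch only: pairing an MPI request with a continuation that a
// polling loop fires on completion. Runtimes such as pika hide this pattern
// behind Senders so the MPI operation composes like any other async node.
#include <mpi.h>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

struct PendingOp {
    MPI_Request request;
    std::function<void()> on_complete;   // the "continuation" / downstream node
};

std::vector<PendingOp> pending;

template <typename Continuation>
void async_irecv(void* buf, int count, MPI_Datatype type, int src, int tag,
                 MPI_Comm comm, Continuation&& cont) {
    PendingOp op;
    MPI_Irecv(buf, count, type, src, tag, comm, &op.request);
    op.on_complete = std::forward<Continuation>(cont);
    pending.push_back(std::move(op));
}

// Called periodically from a progress/polling task.
void poll_mpi() {
    for (std::size_t i = 0; i < pending.size(); ) {
        int done = 0;
        MPI_Test(&pending[i].request, &done, MPI_STATUS_IGNORE);
        if (done) {
            pending[i].on_complete();          // hand completion to the runtime
            pending.erase(pending.begin() + i);
        } else {
            ++i;
        }
    }
}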
Refreshments in the common area.
MPI at Exascale and Beyond: Challenges and Progress from MPICH
We are now firmly in the exascale era of high-performance computing, and MPI remains the cornerstone of large-scale scientific software. As a testament to MPI’s success, MPICH is the most recent recipient of the ACM Software System Award. The award “recognizes MPICH for powering 30 years of progress in computational science and engineering by providing scalable, robust, and portable communication software for parallel computers”. And yet, MPI faces numerous challenges to stay relevant in today’s computing landscape; in particular, the ascendance of AI/ML challenges MPI’s central role in large-scale computing. In this talk, we will share recent advances from the MPICH team at Argonne National Laboratory and discuss how these efforts address today’s challenges while preparing MPI for the future.
Moderator: Joseph Schuchart (Stony Brook University)
Panelists:
Edgar Gabriel (AMD)
Maria Garzaran (Intel)
Rich Graham (NVIDIA)
Hari Subramoni (The Ohio State University)