November 2024
Times are United States Eastern Time
9:00 - 10:00 : Featured Speaker: Nikoli Dryden
Communication in Deep Learning and the Role of MPI (slides)
Abstract: Deep learning workloads are reaching unprecedented scales, which has brought renewed focus to the increasingly critical role of communication. In this talk, I provide an overview of the communication patterns and operations common in state-of-the-art training and inference workloads. I then discuss the current status of MPI in deep learning frameworks and some of the pain points in its use before concluding with some ideas on MPI’s role and future in deep learning.
Nikoli Dryden is a research scientist in the Informatics Group of the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory (LLNL). Previously, he was an ETH Postdoctoral Fellow at ETH Zurich, working with Professor Torsten Hoefler in the Scalable Parallel Computing Laboratory. He completed his PhD in computer science at the University of Illinois Urbana-Champaign, advised by Professor Marc Snir. During his PhD and postdoc, Nikoli worked heavily with members of CASC and many other collaborators. Nikoli's research focuses on the intersection of high-performance computing and machine learning. He is particularly interested in scalable training of deep neural networks and applying neural networks to scientific and computational simulation datasets. He also works on parallel algorithms and runtimes, graph analytics, and communication and performance optimization.
10:00 - 10:30 : Coffee Break
Refreshments available in the common area.
10:30 - 11:00 : Modes, Persistence and Orthogonality: Blowing MPI Up
The MPI specification provides a restricted form of persistence in point-to-point and collective communication operations that purportedly enables libraries to amortize precomputation and setup costs over longer sequences of identical communication operations. Because of the way that MPI has chosen to represent semantics and modes of communication, further additions and modifications to the MPI specification have often come, and continue to come, at the cost of a (combinatorial) blow-up in the number of interface functions. We discuss how to exploit orthogonality and separation of concerns more thoroughly to prevent the proliferation of concrete interface functions while still providing essentially the same persistence as current MPI and without any additional burden on library implementers. We introduce new variants of persistence, which we call pairwise and relaxed persistence. Our concrete proposals contribute to the discussion about why MPI is so huge and what could or should be done about that.
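The persistence pattern the abstract builds on separates a one-time, potentially expensive setup phase from many cheap activations of the same operation. A minimal stdlib-only Python sketch of that pattern (the class and its methods are illustrative stand-ins, not part of any MPI binding):

```python
# Sketch of the persistent-operation pattern: pay setup cost once,
# then start/complete the same operation many times (cf. MPI's
# MPI_Send_init / MPI_Start / MPI_Wait sequence).
class PersistentOp:
    def __init__(self, payload):
        # One-time "init" phase: precompute whatever the repeated
        # operation needs; a sort stands in for expensive setup here.
        self.plan = sorted(payload)
        self.started = False

    def start(self):
        # Cheap per-iteration activation.
        self.started = True

    def wait(self):
        # Completion; the operation may be started again afterwards.
        assert self.started
        self.started = False
        return self.plan

op = PersistentOp([3, 1, 2])
for _ in range(3):          # reuse the same setup across iterations
    op.start()
    result = op.wait()
```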
11:00 - 11:30 : Improving MPI Language Support Through Custom Datatype Serialization
Jake Tronge, Joseph Schuchart, Lisandro Dalcin, Howard Pritchard
Exascale applications are increasingly being written in modern languages such as Python, Julia, C++, and Rust. The Message Passing Interface (MPI), the de facto standard for parallel computing, only defines interfaces for C and Fortran, languages that are very different from these modern languages, which often contain more complex types and representations incompatible with MPI. The existing derived datatype interface is widely used by older applications, but it handles poorly any type that contains multiple pointers, requires application-specific initialization, or needs custom serialization. Applications written in these languages can still use MPI, but at the cost of complicated address manipulation or high overhead. This work proposes a new datatype interface for MPI that gives the application more control over buffer packing and the wire representation. We built a prototype for this interface, demonstrating it with Rust, Python, and C++, highlighting key concerns of each language and showing the improvements provided.
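The style of interface the abstract describes, where the application supplies its own pack/unpack routines rather than describing memory layout with derived datatypes, can be sketched in stdlib Python. The names and the pickle-based wire format here are illustrative assumptions, not the paper's actual API:

```python
import pickle

# A pointer-rich type that MPI derived datatypes describe poorly:
# each node holds references (pointers) to its children.
class Tree:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

# Application-defined serialization hooks. In a custom-datatype
# interface these would be registered callbacks; here they are
# plain functions controlling the wire representation directly.
def pack(obj):
    return pickle.dumps(obj)

def unpack(buf):
    return pickle.loads(buf)

# "Send" side: pack into one contiguous byte buffer that MPI could
# transmit as MPI_BYTE.
msg = pack(Tree(1, [Tree(2), Tree(3)]))
# "Receive" side: reconstruct the pointer structure from the bytes.
tree = unpack(msg)
```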
11:30 - 12:00 : MPI Progress for All
Hui Zhou, Robert Latham, Ken Raffenetti, Yanfei Guo, Rajeev Thakur
The progression of communication in the Message Passing Interface (MPI) is not well defined, yet it is critical for applications to achieve effective computation and communication overlapping. The opaque nature of MPI progress poses significant challenges in advancing MPI within HPC practices. First, the lack of clarity hinders the development of explicit guidelines for enhancing computation and communication overlap in applications. Second, it prevents MPI from seamlessly integrating with contemporary programming paradigms. Third, it limits the extension of MPI functionalities from user space. In this paper, we examine the role of MPI progress by analyzing the implementation of MPI messaging. We generalize the asynchronous communication pattern and identify key factors influencing application performance. We propose a set of MPI extensions designed to enable users to construct and manage an efficient progress engine explicitly. We compare our approach to previous efforts in the field, highlighting its reduced complexity and increased effectiveness.
GPUs have become the dominant type of accelerator for high-performance computing and artificial intelligence. To support these systems, new communication libraries have emerged, such as NCCL and NVSHMEM, providing stream-based semantics and GPU-initiated communication. Some of the best-performing communication libraries are unfortunately vendor-specific, and may use load-store semantics that have been traditionally underused in the application community. Moreover, MPI has yet to define explicit GPU support mechanisms, making it difficult to deploy the message-passing communication model efficiently on GPU-based systems. MPI 4.0 introduced Partitioned point-to-point communication, which facilitates hybrid-programming models. Partitioned communication is designed to allow GPUs to trigger data movement through a persistent channel. We extend MPI Partitioned to provide intra-kernel GPU-initiated communication and partitioned collectives, augmenting MPI with techniques used in vendor-specific libraries. We evaluate our designs on an NVIDIA GH200 Grace Hopper Superchip testbed to understand the benefits of GPU-initiated communication on NVLink and InfiniBand networks.
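Partitioned communication, which the abstract extends, lets independent parts of one message buffer be marked ready concurrently (MPI_Pready in MPI 4.0), e.g. by different GPU threads. A stdlib-only Python sketch of that readiness protocol, with threads standing in for GPU kernel threads; the class is illustrative, not an MPI binding:

```python
import threading

# Sketch of partitioned-send readiness: one message buffer split into
# partitions, each marked ready independently (cf. MPI_Psend_init +
# MPI_Pready); the send can complete once all partitions are ready.
class PartitionedSend:
    def __init__(self, nparts):
        self.ready = [False] * nparts
        self.lock = threading.Lock()

    def pready(self, i):
        # Called by whichever worker produced partition i.
        with self.lock:
            self.ready[i] = True

    def all_ready(self):
        with self.lock:
            return all(self.ready)

send = PartitionedSend(4)
workers = [threading.Thread(target=send.pready, args=(i,))
           for i in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
```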
12:30 - 2:00 : Lunch Break
Note: Lunch is not provided by the workshop, so go check out Atlanta and we will reconvene at 2:00.
Beatnik is a novel open-source mini-application that exercises the complex communication patterns often found in production codes but rarely found in benchmarks or mini-applications. It simulates 3D Rayleigh-Taylor instabilities based on Pandya and Shkoller's Z-Model formulation using the Cabana performance portability framework. This paper presents both the high-level design and important implementation details of Beatnik, along with four benchmark setups for evaluating different aspects of HPC communication system performance. Evaluation results demonstrate Beatnik's scalability on modern accelerator-based systems using weak and strong scaling tests up to 1024 GPUs, along with Beatnik's ability to expose communication challenges in modern systems and solver libraries.
Message matching is a critical process ensuring the correct delivery of messages in distributed and HPC environments. The advent of SmartNICs presents an opportunity to develop offloaded message-matching approaches that leverage this on-NIC programmable accelerator, retaining the flexibility of software-based solutions (e.g., tailoring to application matching behaviors or specialization for non-MPI matching semantics) while freeing up CPU resources. This can be especially beneficial for intensive I/O systems, such as those protected with PQC. In this work, we propose a bin-based MPI message approach, Optimistic Tag Matching, explicitly designed for the lightweight, highly parallel architectures typical of on-path SmartNICs. We analyze several MPI applications, showing how most of them present a matching behavior suitable for offloading with the proposed strategy (i.e., low queue depths). Additionally, we show how, in those scenarios, offloaded optimistic matching maintains message rates comparable to traditional on-CPU MPI message matching while freeing up CPU resources.
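The bin-based matching idea in the abstract can be illustrated with a simplified stdlib-only Python sketch: posted receives are hashed by (source, tag) into bins so an arriving message searches only one short queue, which is what keeps per-bin queue depths low on a lightweight SmartNIC core. This is an assumption-laden simplification, not the paper's exact Optimistic Tag Matching scheme (wildcards and unexpected-message queues are omitted):

```python
from collections import defaultdict, deque

NBINS = 16  # illustrative bin count

def bin_of(source, tag):
    # Hash (source, tag) into a bin; messages with identical
    # source and tag always land in the same bin, so MPI's
    # posted-order matching is preserved within each bin.
    return hash((source, tag)) % NBINS

class MatchEngine:
    def __init__(self):
        self.posted = defaultdict(deque)   # bin -> FIFO of receives

    def post_recv(self, source, tag, buf_id):
        self.posted[bin_of(source, tag)].append((source, tag, buf_id))

    def match(self, source, tag):
        # Incoming message: scan only its own bin, in posted order.
        q = self.posted[bin_of(source, tag)]
        for entry in q:
            if entry[0] == source and entry[1] == tag:
                q.remove(entry)
                return entry[2]
        return None   # no posted receive: an unexpected message

m = MatchEngine()
m.post_recv(0, 7, "bufA")
m.post_recv(1, 7, "bufB")
```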
3:00 - 3:30 : Coffee Break
Refreshments available in the common area.
3:30 - 4:15 : Invited Speaker: Keith Underwood
MPI and Ultra Ethernet: How It Was Designed to Work
4:15 - 5:30 : Panel: The Future of MPI in the Era of AI
Panelists:
Anthony Skjellum (Tennessee Tech)
Purushotham Bangalore (University of Alabama)
Keith Underwood (HPE)
Nikoli Dryden (LLNL)
Moderator: Matthew Dosanjh (Sandia National Laboratories)