In a typical High-Performance Computing (HPC) system, compute resources are separated from persistent storage by a shared network infrastructure. This architecture was driven by the compute-intensive scientific applications that usually run on these systems, and storage resources were often seen as a peripheral component of the overall system design. Modern scientific applications, however, must process data of large volume, velocity, and variety, leading to an explosion of data requirements and increased complexity of use. Many applications spend a significant fraction of their overall execution time performing I/O, which makes storage a vital component of performance. The evolution of modern storage technologies is driven by the proliferation of larger, more complex scientific instruments and sensor networks that collect extreme amounts of data.
However, as we move towards the exascale era, most of these storage systems face significant challenges in performance, scalability, and complexity, as well as limited metadata services. Together these create the so-called I/O bottleneck, which reduces scientific productivity. The intense and periodic nature of I/O operations, combined with extreme amounts of data, stresses the underlying storage systems beyond their limits.
I/O is now the biggest challenge in modern supercomputers, and storage subsystems are called upon to efficiently support both compute- and data-intensive workloads.
A new memory and storage hierarchy is a reality and modern storage subsystems need to adapt and efficiently support advanced I/O buffering.
In the age of Big Data, extracting meaningful knowledge by analyzing data has become crucial to scientific discovery. The increasing ability of powerful HPC systems to run data-intensive problems at larger scale, at higher resolution, and with more elements gave birth to a new category of computing, namely High-Performance Data Analytics (HPDA), which involves sufficient data volumes and algorithmic complexity to require HPC resources. However, performing data analysis using HPC resources can lead to performance and energy inefficiencies. Traditional offline analysis results in excessive data movement, which in turn causes unnecessary energy costs. Alternatively, performing data analysis inside the compute nodes can eliminate this redundant I/O, but it can waste expensive compute resources and slow down the simulation job due to interference. In fact, the tools and cultures of HPC and Big Data have diverged, to the detriment of both. This divergence led HPC sites to employ separate computing and data analysis clusters. There is no "one solution for all" approach. Modern scientific workflows require both high-performance computing and high-performance data processing power, so a successful merging of the two is required.
Traditionally, I/O is performed with memory-to-disk endpoints. Furthermore, scientific applications often demonstrate bursty I/O behavior. Typically, in HPC workloads, short, intensive phases of I/O activity, such as checkpointing and restart, occur periodically between longer computation phases. The intense and periodic nature of these I/O operations stresses the underlying parallel file system (PFS) and thus stalls the application. To alleviate the performance gap between main memory and the remote disk-based PFS, modern storage subsystems are going through extensive changes, adding new levels of memory and storage. Modern supercomputer designs employ new hardware technologies, such as Non-Volatile RAM (NVRAM), Non-Volatile Memory Express (NVMe) devices, Solid-State Drives (SSD), and dedicated shared buffering nodes (e.g., burst buffers), in a heterogeneous layered memory and storage hierarchy, which we call the Deep Memory and Storage Hierarchy (DMSH). Each layer of the DMSH is an independent system that requires expertise to manage, and the lack of automated data movement between tiers is a significant burden currently left to the users. The underlying storage system will need to be updated to handle the transition to a multi-tiered I/O configuration.
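To make the idea of automated placement across a layered hierarchy concrete, the following is a minimal sketch and not a description of any existing system. It assumes that tiers are ordered from fastest to slowest and that each write goes to the first tier with enough free space. The `Tier` class, the `place` function, the tier ordering, and the capacities are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    """One layer of a hypothetical memory and storage hierarchy."""
    name: str
    capacity: int  # total capacity in arbitrary units (illustrative only)
    used: int = 0

    def free(self) -> int:
        return self.capacity - self.used


def place(tiers: list, size: int) -> str:
    """Place a write of `size` units in the first tier that has room.

    `tiers` is assumed to be ordered from fastest to slowest, so a write
    spills to a slower tier only when every faster tier is too full.
    """
    for tier in tiers:
        if tier.free() >= size:
            tier.used += size
            return tier.name
    raise RuntimeError("no tier has enough free capacity")


# Illustrative hierarchy; the order and sizes are assumptions, not measurements.
hierarchy = [
    Tier("NVRAM", capacity=4),
    Tier("NVMe", capacity=8),
    Tier("burst buffer", capacity=16),
    Tier("PFS", capacity=10**9),
]

# A burst of four equally sized writes, e.g. one checkpoint phase.
placements = [place(hierarchy, 3) for _ in range(4)]
```

In this sketch the burst fills the fast tiers first and reaches the slower ones only once the faster ones run out of room. A real policy would also need to drain data toward the PFS between bursts.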
Today's post-petascale supercomputers have far exceeded the million-core threshold. Such tremendous computational power is used to run scientific experiments at an unprecedented scale. Furthermore, most modern supercomputers have moved from the paradigm of one large application using the entire machine to one where many smaller applications run concurrently. Yet this increasing level of parallelism makes it challenging to sustain a higher I/O demand. Large-scale applications already suffer individually from the mismatch between computation and storage performance, which leads to a loss of efficiency in I/O-intensive phases. A further challenge in HPC storage is contention between large-scale applications that concurrently access a shared storage resource. Because resources such as compute nodes, networks, and the remote PFS are shared, performance variability is observed. This phenomenon is called cross-application I/O interference and is common in most HPC sites. While computing and network resources can be shared effectively by state-of-the-art job schedulers, the same cannot be said of storage resources. In fact, I/O congestion, within and across independent jobs, is one of the main problems for future HPC machines.
TABIOS: A Distributed Task-based I/O System
In the era of data-intensive computing, large-scale applications, in both the scientific and the BigData communities, demonstrate unique I/O requirements, leading to a variety of storage solutions which are often incompatible with one another. In this project, we investigate how to support a wide variety of conflicting I/O workloads under a single storage system. We introduce the idea of a DataTask, a new data representation, and we present TABIOS: a new, distributed, DataTask-based I/O system. TABIOS supports heterogeneous storage resources, offers storage elasticity, promotes in-situ analytics via data provisioning, and boosts I/O performance by up to 17x via asynchronous I/O. TABIOS demonstrates the effectiveness of storage consolidation to support the convergence of HPC and BigData workloads on a single platform.
TABIOS Software Stack
TABIOS demonstrates the following contributions:
- the effectiveness of storage malleability, where resources can grow/shrink based on the workload.
- how to effectively use asynchronous I/O with mixed media and various configurable storage options.
- how to support resource heterogeneity based on the targeted hardware configuration supporting a variety of storage resources under the same platform.
- the effectiveness of data provisioning, enabling in-situ data analytics and process-to-process data migration.
- how to support a diverse set of conflicting I/O workloads, from HPC to BigData analytics, on a single platform, through effective storage consolidation.
TABIOS High-Level Architecture
TABIOS achieves these contributions by transforming each I/O request into a configurable unit called a DataTask, which is a tuple of an operation and a pointer to user data. DataTasks are pushed from the application to a distributed queue served by a scheduler. TABIOS workers (i.e., storage servers) execute DataTasks independently. The TABIOS architecture is fully decoupled and distributed. Using DataTasks, TABIOS can offer software-defined storage services and QoS guarantees for a variety of workloads on different storage architectures.
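The flow described above (application, queue, scheduler, workers) can be sketched in a few lines. This is a single-process illustration using threads and in-memory queues, not the TABIOS implementation or its API. The `DataTask` fields, the operation names `"write"` and `"read"`, the `key` field, the hash-based assignment of tasks to workers, and the `run` helper are all assumptions made for the example.

```python
import queue
import threading
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DataTask:
    """An operation paired with a reference to user data (hypothetical layout)."""
    operation: str     # "write" or "read"; names are assumptions
    key: str           # hypothetical identifier for the target data
    data: bytes = b""  # stands in for the pointer to user data


STOP = None  # sentinel that tells the scheduler and workers to shut down


def worker(inbox, store, results):
    """Storage-server stand-in: executes its own tasks independently."""
    while True:
        task = inbox.get()
        if task is STOP:
            break
        if task.operation == "write":
            store[task.key] = task.data
        elif task.operation == "read":
            results[task.key] = store.get(task.key)


def scheduler(task_queue, inboxes):
    """Drains the shared queue and assigns each task to one worker.

    Hashing the key keeps all tasks for one key on the same worker, in
    order. This is one simple policy chosen for the sketch.
    """
    while True:
        task = task_queue.get()
        if task is STOP:
            for inbox in inboxes:
                inbox.put(STOP)
            break
        index = zlib.crc32(task.key.encode()) % len(inboxes)
        inboxes[index].put(task)


def run(tasks, num_workers=2):
    task_queue = queue.Queue()
    inboxes = [queue.Queue() for _ in range(num_workers)]
    stores = [{} for _ in range(num_workers)]
    results = {}  # read results; single-key dict writes are safe in CPython
    threads = [
        threading.Thread(target=worker, args=(inboxes[i], stores[i], results))
        for i in range(num_workers)
    ]
    threads.append(threading.Thread(target=scheduler, args=(task_queue, inboxes)))
    for thread in threads:
        thread.start()
    # The application side only enqueues; it does not wait on each task.
    for task in tasks:
        task_queue.put(task)
    task_queue.put(STOP)
    for thread in threads:
        thread.join()
    return results


demo = run([
    DataTask("write", "a", b"alpha"),
    DataTask("write", "b", b"beta"),
    DataTask("read", "a"),
    DataTask("read", "b"),
])
```

Because the application only enqueues tasks, it is decoupled from the workers that eventually perform the I/O, which is what makes asynchronous execution possible. In a distributed deployment the queues and workers would live on separate nodes.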