GPFS: A Shared-Disk File System for Large Computing Clusters

Abstract:

GPFS is IBM’s parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters. GPFS is used on many of the largest supercomputers in the world. GPFS was built on many of the ideas developed in the academic community over the last several years, particularly distributed locking and recovery technology. To date it has been a matter of conjecture how well these ideas scale. We have had the opportunity to test those limits in the context of a product that runs on the largest systems in existence. While many existing ideas scaled well, new approaches were necessary in several key areas. This paper describes GPFS and discusses how distributed locking and recovery techniques were extended to scale to large clusters.

Introduction:

Since the beginning of computing, there have always been problems too big for the largest machines of the day. This situation persists even with today’s powerful CPUs and shared-memory multiprocessors. Advances in communication technology have allowed large numbers of machines to be aggregated into computing clusters of effectively unbounded processing power and storage capacity, which can be used to solve much larger problems than any single machine could. Because clusters are composed of independent and effectively redundant computers, they have a potential for fault tolerance. This makes them suitable for classes of problems in which reliability is paramount. As a result, there has been great interest in clustering technology over the past several years.

One fundamental drawback of clusters is that programs must be partitioned to run on multiple machines, and it is difficult for these partitioned programs to cooperate or share resources. Perhaps the most important such resource is the file system. In the absence of a cluster file system, the individual components of a partitioned program must share cluster storage in an ad hoc manner. This typically complicates programming, limits performance, and compromises reliability.

GPFS is a parallel file system for cluster computers that provides, as closely as possible, the behavior of a general-purpose POSIX file system running on a single machine. GPFS evolved from the Tiger Shark multimedia file system [1]. GPFS scales to the largest clusters that have been built, and is used on six of the ten most powerful supercomputers in the world, including the largest, ASCI White at Lawrence Livermore National Laboratory. GPFS successfully satisfies the needs for throughput, storage capacity, and reliability of the largest and most demanding problems.

Traditional supercomputing applications, when run on a cluster, require parallel access from multiple nodes to a single file shared across the cluster. Other applications, including scalable file and Web servers and large digital libraries, are characterized by interfile parallel access. In the latter class of applications, the data in individual files is not necessarily accessed in parallel, but because the files reside in common directories and allocate space on the same disks, file system data structures (metadata) are still accessed in parallel. GPFS supports fully parallel access to both file data and metadata. In truly large systems, even administrative actions, such as adding or removing disks from a file system or rebalancing files across disks, involve a great amount of work. GPFS performs its administrative functions in parallel as well.

GPFS achieves its extreme scalability through its shared-disk architecture (Figure 1) [2]. A GPFS system consists of the cluster nodes, on which the GPFS file system and the applications that use it run, connected to the disks or disk subsystems over a switching fabric. All nodes in the cluster have equal access to all disks. Files are striped across all disks in the file system – several thousand disks in the largest GPFS installations. In addition to balancing load on the disks, striping achieves the full throughput of which the disk subsystem is capable.
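As a rough illustration of the striping described above, the following sketch maps a byte offset within a file to a disk and a block on that disk using simple round-robin placement. The block size, disk count, and function names are assumptions made for this example, not the actual GPFS layout policy.

```c
/* Sketch of round-robin block striping; block size and disk count are
 * hypothetical, and the real GPFS placement policy may differ. */
#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE (256 * 1024)   /* illustrative file system block size */
#define NUM_DISKS  8              /* illustrative number of disks */

/* Map a byte offset within a file to a (disk, block-on-disk) pair. */
static void map_offset(uint64_t file_offset, int *disk, uint64_t *disk_block)
{
    uint64_t logical_block = file_offset / BLOCK_SIZE;
    *disk       = (int)(logical_block % NUM_DISKS);  /* round-robin across disks */
    *disk_block = logical_block / NUM_DISKS;         /* position within that disk */
}

int main(void)
{
    int disk;
    uint64_t disk_block;

    /* Consecutive file blocks land on consecutive disks, balancing load. */
    for (uint64_t off = 0; off < 2 * (uint64_t)BLOCK_SIZE * NUM_DISKS;
         off += BLOCK_SIZE) {
        map_offset(off, &disk, &disk_block);
        printf("offset %12llu -> disk %d, block %llu\n",
               (unsigned long long)off, disk, (unsigned long long)disk_block);
    }
    return 0;
}
```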

The switching fabric that connects file system nodes to disks may consist of a storage area network (SAN), e.g., Fibre Channel or iSCSI. Alternatively, individual disks may be attached to some number of I/O server nodes that allow access from file system nodes through a software layer running over a general-purpose communication network, such as IBM’s Virtual Shared Disk (VSD) running over the SP switch. Regardless of how shared disks are implemented, GPFS assumes only a conventional block I/O interface, with no particular intelligence at the disks.
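The following sketch shows what such a conventional block I/O interface might look like from the file system’s point of view: whole-block reads and writes against an opaque device handle, identical whether the handle wraps a SAN-attached LUN or a network shared-disk client such as VSD. The structure and field names are illustrative assumptions, not the actual GPFS or VSD programming interface.

```c
/* Illustrative sketch of a conventional block I/O interface; the names
 * below are hypothetical and do not correspond to the real GPFS or VSD
 * interfaces. */
#include <stddef.h>
#include <stdint.h>

typedef struct shared_disk shared_disk_t;

struct shared_disk {
    size_t block_size;     /* fixed block size presented by the device */
    void  *impl;           /* SAN device handle, VSD client state, etc. */
    /* Read one block into buf; returns 0 on success. */
    int (*read_block)(shared_disk_t *d, uint64_t block, void *buf);
    /* Write one block from buf; returns 0 on success. */
    int (*write_block)(shared_disk_t *d, uint64_t block, const void *buf);
};

/* The file system layer is written against this interface alone, so the
 * same code path works whether blocks travel over Fibre Channel, iSCSI,
 * or a software shared-disk layer on a general-purpose network. */
int copy_block(shared_disk_t *src, shared_disk_t *dst,
               uint64_t block, void *scratch)
{
    if (src->read_block(src, block, scratch) != 0)
        return -1;
    return dst->write_block(dst, block, scratch);
}
```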

Parallel read and write accesses to shared disks from multiple nodes in the cluster must be properly synchronized, or both user data and file system metadata will become corrupted. GPFS uses distributed locking to synchronize access to shared disks. Its distributed locking protocols ensure that file system consistency is maintained regardless of the number of nodes simultaneously reading from and writing to the file system, while still allowing the parallelism necessary to achieve maximum throughput.
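The sketch below illustrates the ordering that distributed locking imposes on a write: a node first obtains a token covering the byte range it intends to modify, and only then issues the disk I/O. The token-manager and I/O routines here are simulated stubs with hypothetical names; they stand in for the real GPFS token protocol, which is not shown.

```c
/* Minimal sketch of write synchronization via byte-range locking, assuming
 * a token-manager interface (acquire_write_token) and a raw block-write
 * routine; all names are hypothetical, not the real GPFS API. The stubs
 * only simulate the protocol so the example is self-contained. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t start, end; } byte_range_t;

/* Stub: in a real cluster this would contact the token manager, which may
 * first revoke conflicting tokens held by other nodes before granting. */
static byte_range_t acquire_write_token(int file_id, uint64_t start, uint64_t end)
{
    printf("acquire write token for file %d, range [%llu, %llu)\n",
           file_id, (unsigned long long)start, (unsigned long long)end);
    byte_range_t granted = { start, end };   /* grant exactly what was asked */
    return granted;
}

/* Stub: perform the actual disk I/O for the range (omitted here). */
static int write_blocks(int file_id, uint64_t offset, const void *buf, size_t len)
{
    (void)file_id; (void)offset; (void)buf; (void)len;
    return 0;
}

/* A node must hold a write token covering the byte range before it touches
 * the shared disks; once granted, the token is cached locally, so repeated
 * writes by the same node require no further lock traffic. */
static int synchronized_write(int file_id, uint64_t offset,
                              const void *buf, size_t len)
{
    byte_range_t granted = acquire_write_token(file_id, offset, offset + len);
    int rc = write_blocks(file_id, offset, buf, len);
    (void)granted;   /* kept in the local token cache, not released here */
    return rc;
}

int main(void)
{
    char data[4096] = { 0 };
    return synchronized_write(42, 0, data, sizeof data);
}
```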

This paper describes the overall architecture of GPFS, details some of the features that contribute to its performance and scalability, explains its approach to achieving parallelism and data consistency in a cluster environment, outlines its design for fault tolerance, and presents performance data.

Summary and Conclusions:

GPFS was built on many of the ideas developed in the academic community over the last several years, particularly distributed locking and recovery technology. To date it has been a matter of conjecture how well these ideas scale. We have had the opportunity to test those limits in the context of a product that runs on the largest systems in existence. We found that distributed locking scales quite well. Nevertheless, several significant changes to conventional file system data structures and locking algorithms yielded large gains in performance, both for parallel access to a single large file and for parallel access to large numbers of small files. We describe a number of techniques that make distributed locking work in a large cluster: byte-range token optimizations, dynamic selection of meta nodes for managing file metadata, segmented allocation maps, and allocation hints.
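As a rough sketch of the segmented allocation map idea, the code below divides the free-block bitmap into independently lockable segments and has each node start its search at a segment it nominally owns, so that concurrent allocations from different nodes rarely touch the same segment. The data layout, segment sizes, and the in-process flag standing in for a distributed lock are assumptions for illustration only.

```c
/* Sketch of a segmented allocation map: each segment of the free-block
 * bitmap can be locked independently, so nodes normally allocate from
 * different segments and rarely contend. Names and sizes are illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_SEGMENTS    16
#define BLOCKS_PER_SEG  1024

typedef struct {
    uint8_t bitmap[BLOCKS_PER_SEG / 8];  /* one bit per disk block */
    bool locked;                          /* stands in for a distributed lock */
} alloc_segment_t;

static alloc_segment_t alloc_map[NUM_SEGMENTS];

/* Allocate one block, preferring the segment "owned" by this node so that
 * nodes do not fight over the same region of the allocation map. */
static int64_t alloc_block(int node_id)
{
    for (int i = 0; i < NUM_SEGMENTS; i++) {
        int seg = (node_id + i) % NUM_SEGMENTS;    /* start at our own segment */
        if (alloc_map[seg].locked)
            continue;                              /* someone else is using it */
        alloc_map[seg].locked = true;              /* acquire the segment lock */
        for (int b = 0; b < BLOCKS_PER_SEG; b++) {
            if (!(alloc_map[seg].bitmap[b / 8] & (1u << (b % 8)))) {
                alloc_map[seg].bitmap[b / 8] |= (uint8_t)(1u << (b % 8));
                alloc_map[seg].locked = false;
                return (int64_t)seg * BLOCKS_PER_SEG + b;
            }
        }
        alloc_map[seg].locked = false;             /* segment full, try the next */
    }
    return -1;   /* file system full */
}

int main(void)
{
    /* Two different nodes allocate from different segments. */
    printf("node 0 got block %lld\n", (long long)alloc_block(0));
    printf("node 5 got block %lld\n", (long long)alloc_block(5));
    return 0;
}
```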

One might similarly question whether conventional availability technology scales. Obviously there are more components to fail in a large system. Compounding the problem, large clusters are so expensive that their owners demand high availability. Add to this the fact that file systems of tens of terabytes are simply too large to back up and restore. Again, we found the basic technology to be sound. The surprises came in the measures that were necessary to provide data integrity and availability. GPFS replication was implemented because, at the time, RAID was more expensive than replicated conventional disk. RAID has taken over as its price has come down, but even its high level of integrity is not sufficient to guard against the loss of a hundred-terabyte file system.

Existing GPFS installations show that our design scales up to the largest supercomputers in the world and provides the fault tolerance and system management functions needed to manage such large systems. Nevertheless, we expect the continued evolution of technology to demand ever more scalability. The recent interest in Linux clusters with inexpensive PC nodes drives the number of components up still further. The price of storage has decreased to the point that customers are seriously interested in petabyte file systems. This trend makes file system scalability an area of research interest that will continue for the foreseeable future.