Global View Resilience (GVR) is a new approach that exploits a global view data model (global naming of data, consistency, and distributed layout), adding reliability to globally visible distributed arrays. Key novel features in GVR include:
With a global versioned array as a portable abstraction, GVR enables application programmers to manage reliability (and its overhead) in a flexible, portable fashion, tapping their deep scientific and application code insights. We will research algorithms and a runtime that map and adapt the application/system’s reliability deployment based on application-specified reliability priorities. The unified error handling framework lets applications supply error detection (checking) and recovery routines that handle diverse classes of errors with a single application-level recovery mechanism. This architecture enables applications and systems to work in concert, exploiting semantics (algorithmic or even scientific domain) and key capabilities (e.g., fast error detection in hardware) to dramatically increase the range of errors that can be detected and corrected.
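To make the idea of a versioned array with application-supplied checking and recovery concrete, here is a minimal single-process sketch. It is not the GVR API: the class `VersionedArray` and the methods `version_inc`, `set_check`, and `verify_or_recover` are hypothetical names chosen for illustration, and the sketch ignores distribution, global naming, and storage placement entirely. It assumes that recovery means rolling back to the newest stored version that passes the application's own check.

```python
from typing import Callable, List, Optional


class VersionedArray:
    """Toy stand-in for a multi-version array (hypothetical; not the GVR API)."""

    def __init__(self, size: int) -> None:
        self.current: List[float] = [0.0] * size
        self.versions: List[List[float]] = []  # snapshots, oldest first
        self.check: Optional[Callable[[List[float]], bool]] = None

    def put(self, index: int, value: float) -> None:
        self.current[index] = value

    def get(self, index: int) -> float:
        return self.current[index]

    def version_inc(self) -> int:
        """Snapshot the current contents and return the new version number."""
        self.versions.append(list(self.current))
        return len(self.versions) - 1

    def set_check(self, check: Callable[[List[float]], bool]) -> None:
        """Register an application-defined validity check (returns True if data is sane)."""
        self.check = check

    def verify_or_recover(self) -> bool:
        """Run the check; on failure, restore the newest version that passes.

        Returns True if a rollback happened, False if the data was already valid.
        """
        if self.check is None or self.check(self.current):
            return False
        for snapshot in reversed(self.versions):
            if self.check(snapshot):
                self.current = list(snapshot)
                return True
        raise RuntimeError("no stored version passes the application check")


# Usage: the application encodes a domain invariant (values stay in [0, 1]).
arr = VersionedArray(4)
arr.set_check(lambda data: all(0.0 <= x <= 1.0 for x in data))
for i in range(4):
    arr.put(i, 0.25)
arr.version_inc()      # preserve a known-good version
arr.put(2, 1e30)       # simulate a silent corruption
recovered = arr.verify_or_recover()
```

The point of the sketch is the division of labor: the application contributes the invariant and decides when to create versions, while the array abstraction owns storage of old versions and the rollback itself.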
Background: Meeting the reliability needs of exascale applications on the projected system sizes and silicon technologies requires a new approach in scientific computing—the integration of resilience as an essential element of the computing model. Reliable computing at exascale is a programming model and programming system challenge. Current approaches to reliability do not allow applications to express their reliability needs.
Global naming of distributed data yields programmability benefits that include simpler expression of algorithms and decoupling of computation and data structure across increasingly complex (irregular, variable, degraded) hardware. Under current approaches to reliability, however, applications are unable to bring their computation semantics, or their deeper scientific domain semantics, to bear on error detection and correction. Consequently, many errors are “silent” (“latent” or undetected), and many detected errors go uncorrected.
Research areas: global-view data, programmer-guided resilience and recovery, multi-version storage, non-volatile memory, flexible error detection and recovery, compression, and co-design
We gratefully acknowledge support for the GVR project from the U.S. Department of Energy, Office of Science / ASCR under awards DE-SC0008603/57K68-00-145.
People: Hajime Fujita, Zachary Rubenstein, Ziming Zheng, Aiman Fang, Nan Dun, Fan Yang, Andrew A. Chien (UChicago), James Dinan, Pavan Balaji, Pete Beckman, Kamil Iskra, Jeff Hammond, Wesley Bland (Argonne), Rob Schreiber (HP), Guoming Lu (UESTC)