Global View Resilience (GVR) is a new approach that exploits a global view data model (global naming of data, consistency, and distributed layout), adding reliability to globally visible distributed arrays. Key novel features in GVR include:
- multi-version arrays with each versioning rate controlled separately by the application (multi-stream)
- flexible multi-version recovery
- unified error signalling and handling for flexible cross-layer error recovery.
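The first two features above can be illustrated with a minimal single-process sketch. It is not the GVR API: the class `VersionedArray`, its methods `commit`, `restore`, and `versions`, and the helper `run_demo` are hypothetical names chosen for illustration, and a plain Python list stands in for a distributed array. The sketch shows two arrays versioned at different rates chosen by the application (multi-stream), each retaining a bounded set of versions that can be restored individually.

```python
class VersionedArray:
    """Toy stand-in for a multi-version array (hypothetical, not the GVR API)."""

    def __init__(self, size, max_versions=4):
        self.data = [0.0] * size          # current contents
        self.max_versions = max_versions  # how many old versions to retain
        self.snapshots = {}               # version number -> saved copy
        self.next_version = 0

    def commit(self):
        """Preserve the current contents as a new version and return its number."""
        version = self.next_version
        self.snapshots[version] = list(self.data)
        self.next_version += 1
        # Discard the oldest versions once the retention limit is exceeded.
        while len(self.snapshots) > self.max_versions:
            del self.snapshots[min(self.snapshots)]
        return version

    def restore(self, version):
        """Roll the current contents back to a retained version."""
        self.data = list(self.snapshots[version])

    def versions(self):
        """Return the retained version numbers, oldest first."""
        return sorted(self.snapshots)


def run_demo(steps=10):
    # Two arrays, each with its own versioning rate set by the application.
    positions = VersionedArray(4, max_versions=3)  # versioned every step
    mesh = VersionedArray(4, max_versions=3)       # versioned every 5 steps
    for step in range(steps):
        for i in range(4):
            positions.data[i] += 1.0
            mesh.data[i] += 0.5
        positions.commit()
        if (step + 1) % 5 == 0:
            mesh.commit()
    return positions, mesh


if __name__ == "__main__":
    positions, mesh = run_demo()
    print(positions.versions(), mesh.versions())
    positions.restore(positions.versions()[0])  # recover from an older version
    print(positions.data)
```

Keeping several versions matters when an error is detected late: the most recent version may already contain the corruption, so recovery can fall back to an older one.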
With a global versioned array as a portable abstraction, GVR enables application programmers to manage reliability (and its overhead) in a flexible, portable fashion, tapping their deep scientific and application code insights. We will research algorithms and a runtime that map and adapt the application/system’s reliability deployment based on application-specified reliability priorities. The unified error handling framework lets applications supply error detection (checking) and recovery routines, so that diverse classes of errors can be handled by a single application recovery routine. This architecture enables applications and systems to work in concert, exploiting semantics (algorithmic or even scientific domain) and key capabilities (e.g., fast error detection in hardware) to dramatically increase the range of errors that can be detected and corrected.
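One way to picture unified error signalling is sketched below. This is an illustration under stated assumptions, not GVR's interface: `ErrorDescriptor`, `ErrorHub`, and all method names are hypothetical. Errors found by an application-level check and errors reported by another layer (here labelled "hardware") are described in a common form and routed to the same application recovery routine.

```python
from dataclasses import dataclass


@dataclass
class ErrorDescriptor:
    """Common description of an error, whichever layer reports it (hypothetical)."""
    source: str    # reporting layer, e.g. "hardware" or "application"
    kind: str      # error class, e.g. "memory_corruption"
    location: int  # index of the affected element


class ErrorHub:
    """Routes errors from any layer to one application recovery routine."""

    def __init__(self):
        self.checks = []
        self.recovery = None

    def add_check(self, check):
        self.checks.append(check)

    def set_recovery(self, recovery):
        self.recovery = recovery

    def signal(self, err):
        """Entry point that any layer can call to report an error."""
        if self.recovery is None:
            raise RuntimeError("no recovery routine registered")
        return self.recovery(err)

    def run_checks(self, state):
        """Run application checks and signal every error they find."""
        results = []
        for check in self.checks:
            err = check(state)
            if err is not None:
                results.append(self.signal(err))
        return results


def demo():
    good = [1.0, 2.0, 3.0]  # known-good version of the data
    state = list(good)
    hub = ErrorHub()

    def nonnegative_check(values):
        # Application semantics: these quantities can never be negative.
        for i, x in enumerate(values):
            if x < 0:
                return ErrorDescriptor("application", "range_violation", i)
        return None

    def recover(err):
        # Single recovery routine: restore the affected element.
        state[err.location] = good[err.location]
        return err.source

    hub.add_check(nonnegative_check)
    hub.set_recovery(recover)

    state[1] = -5.0                     # silent corruption, caught by the check
    sources = hub.run_checks(state)
    state[2] = float("nan")             # corruption reported by another layer
    sources.append(hub.signal(ErrorDescriptor("hardware", "memory_corruption", 2)))
    return state, sources
```

In this sketch the recovery routine does not need to know which layer detected the error; it only needs the descriptor, which is the property the unified framework aims for.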
Background: Meeting the reliability needs of exascale applications on the projected system sizes and silicon technologies requires a new approach in scientific computing: the integration of resilience as an essential element of the computing model. Reliable computing at exascale is a programming model and programming system challenge. Current approaches to reliability do not allow applications to express their reliability needs. Moreover, applications are unable to bring their computation semantics, or further scientific domain semantics, to bear on error detection and correction. Consequently, many errors are "silent" ("latent" or undetected), and many detected errors go uncorrected.
Global naming of distributed data yields programmability benefits that include simpler expression of algorithms and decoupling of computation and data structure across increasingly complex (irregular, variable, degraded) hardware.
Research areas: global-view data, programmer-guided resilience and recovery, multi-version storage, non-volatile memory, flexible error detection and recovery, compression, and co-design
Nan Dun, Hajime Fujita, John Tramm, Andrew A. Chien, and Andrew R. Siegel. Data Decomposition in Monte Carlo Particle Transport Simulations using Global View Arrays, UChicago CS Tech Report 2014-09, May 2014.
Hajime Fujita, Nan Dun, Zachary Rubenstein, and Andrew A. Chien. Log-Structured Global Array for Efficient Multi-Version Snapshots, UChicago CS Tech Report 2014-08, May 2014.
The GVR Team, How Applications Use GVR: Use Cases, University of Chicago, Computer Science Technical Report 2014-06.
The GVR Team, Global View Resilience, API Documentation R0.8.1-rc0, University of Chicago, Computer Science Technical Report 2014-05.
Aiman Fang and Andrew A. Chien, "Applying GVR to Molecular Dynamics: Enabling Resilience for Scientific Computations", Tech Report, University of Chicago, Dept of Computer Science, CS-TR-2014-04, April 2014.
Ziming Zheng, Andrew A. Chien, Keita Teranishi, "Fault Tolerance in an Inner-Outer Solver: a GVR-enabled Case Study", in Proceedings of VECPAR 2014, July 2014, Eugene, Oregon. Proceedings available from Springer-Verlag Lecture Notes in Computer Science.
Z. Rubenstein, "Error Checking and Snapshot-based Recovery in Preconditioned Conjugate Gradient Solver", Masters Thesis, University of Chicago, Department of Computer Science, March 2014.
Z. Rubenstein, J. Dinan, H. Fujita, Z. Zheng, A. Chien, "Error Checking and Snapshot-Based Recovery in a Preconditioned Conjugate Gradient Solver", University of Chicago, Department of Computer Science Technical Report 2013-11, December 2013.
Wesley Bland, Aurelien Bouteiller, Thomas Herault, Joshua Hursey, George Bosilca, and Jack J. Dongarra. An evaluation of User-Level Failure Mitigation support in MPI. Computing, 95(12):1171–1184, 2013.
Ziming Zheng, Zachary Rubenstein, and Andrew A. Chien, GVR-Enabled Trilinos: An Outside-In Approach for Resilient Computing, in the SIAM Conference on Parallel Processing, February 2014, Portland, Oregon.
Ziming Zheng, Andrew A. Chien, Mark Hoemmen, Keita Teranishi, "Fault Tolerance in an Inner-Outer Solver: a GVR-enabled Case Study", available as Technical Report from University of Chicago Department of Computer Science, CS-TR-2014-01, January 2014.
Guoming Lu, Ziming Zheng, and Andrew A. Chien, When are Multiple Checkpoints Needed?, in 3rd Workshop for Fault-tolerance at Extreme Scale (FTXS), at IEEE Conference on High Performance Distributed Computing, June 2013, New York, New York.
Hajime Fujita, Robert Schreiber, Andrew A. Chien, It's Time for New Programming Models for Unreliable Hardware, to appear in ASPLOS 2013 Provocative Ideas session, March 18, 2013.
Sean Hogan, Jeff Hammond, and Andrew A. Chien, An Evaluation of Difference and Threshold Techniques for Efficient Checkpointing, 2nd Workshop on Fault-Tolerance at Extreme Scale FTXS 2012 at DSN 2012, June 2012, Boston, Massachusetts.
People: Hajime Fujita, Zachary Rubenstein, Ziming Zheng, Aiman Fang, Nan Dun, Fan Yang, Andrew A. Chien (UChicago), James Dinan, Pavan Balaji, Pete Beckman, Kamil Iskra, Jeff Hammond, Wesley Bland (Argonne), Rob Schreiber (HP), Guoming Lu