PASA Research - Large-Scale Parallel System Resilience

Application-Centric, Reliable and Efficient High Performance Computing

Mission-critical scientific simulations (e.g., climate and fluid dynamics simulations) and enterprise workloads (e.g., search and encryption) running on large-scale computing systems are threatened by the growing number of faults and errors in hardware and software. Understanding the vulnerability of these large-scale applications is important for minimizing performance and power overhead. Lack of knowledge about application vulnerability is a major bottleneck to execution efficiency and jeopardizes HPC simulation capabilities. Previous work relies on random fault injection or detailed architecture analysis to evaluate application vulnerability. Both approaches can be slow and inaccurate. A large gap remains between the needs of reliable and efficient HPC and what current methodologies can provide.
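To make the random fault injection baseline mentioned above concrete, here is a minimal sketch in Python. It is an illustration only and not the tooling used in the publications listed on this page: the kernel (`dot`), the helper names (`flip_bit`, `injection_campaign`), and the corruption threshold `rel_tol` are all assumptions chosen for the example. It flips one random bit in one random input element per trial and reports the fraction of trials whose output deviates from the fault-free result.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit (0..63) of its IEEE-754 double encoding flipped."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def dot(xs, ys):
    """Toy kernel standing in for an application (hypothetical example workload)."""
    return sum(x * y for x, y in zip(xs, ys))


def injection_campaign(xs, ys, trials, rel_tol=1e-6, seed=0):
    """Estimate the fraction of single-bit faults in `xs` that corrupt the output.

    A trial counts as corrupted when the result is NaN or differs from the
    fault-free ("golden") result by more than `rel_tol` relative error.
    """
    rng = random.Random(seed)
    golden = dot(xs, ys)
    corrupted = 0
    for _ in range(trials):
        faulty = list(xs)
        i = rng.randrange(len(faulty))          # random data element
        faulty[i] = flip_bit(faulty[i], rng.randrange(64))  # random bit
        result = dot(faulty, ys)
        is_nan = result != result
        if is_nan or abs(result - golden) > rel_tol * abs(golden):
            corrupted += 1
    return corrupted / trials


if __name__ == "__main__":
    rate = injection_campaign([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], trials=1000)
    print(f"estimated corruption rate: {rate:.3f}")
```

Because each trial is one full run of the kernel and the estimate is a sample proportion, its standard error shrinks only as the square root of the number of trials. This is one reason injection campaigns on real applications become slow when a tight estimate is needed.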

This research explores a new methodology to understand application vulnerability. It investigates new analytical and statistical models to quantify and characterize application vulnerability based on a novel metric and application semantics (including algorithm semantics and data semantics).

Research Outcome:

[IISWC'20] Luanzheng Guo, Giorgis Georgakoudis, Konstantinos Parasyris, Ignacio Laguna, and Dong Li. MATCH: An MPI Fault Tolerance Benchmark Suite. In IEEE International Symposium on Workload Characterization.
[IPDPS'19] Luanzheng Guo and Dong Li. MOARD: Modeling Application Resilience to Transient Faults on Data Objects. In 33rd IEEE International Parallel and Distributed Processing Symposium.
[SC'18] Luanzheng Guo, Dong Li, Ignacio Laguna, and Martin Schulz. FlipTracker: Understanding Natural Error Resilience in HPC Applications. In 30th ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[ICPP'18] Kai Wu, Wenqian, Qiang Guan, Nathan DeBardeleben, and Dong Li. Modeling Application Resilience in Large Scale Parallel Execution. In 47th International Conference on Parallel Processing.
[SC'17] Kai Wu, Qiang Guan, Nathan DeBardeleben, and Dong Li. Characterization and Comparison of Application Resilience for Serial and Parallel Codes. Poster in ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[SC'16] Luanzheng Guo, Jing Liang, and Dong Li. Understanding Ineffectiveness of Application-Level Fault Injection. Poster in ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (Nominated as the best poster, 2.9% of all poster submissions).
[SC'14] Yu, L., Li, D., Mittal, S., and Vetter, J. S. Quantitatively Modeling Application Resiliency with the Data Vulnerability Factor. In International Conference for High Performance Computing, Networking, Storage and Analysis (Nominated as the best student paper).
[SC'13] Li, D., Chen, Z., Wu, P., and Vetter, J. S. Rethinking Algorithm-Based Fault Tolerance with a Cooperative Software-Hardware Approach. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[SC'12] Li, D., Vetter, J. S., and Yu, W. Classifying Soft Error Vulnerabilities in Extreme-Scale Scientific Applications Using a Binary Instrumentation Tool. In International Conference for High Performance Computing, Networking, Storage and Analysis.

This research is supported by: