Application-Centric, Reliable and Efficient High Performance Computing

Mission-critical scientific simulations (e.g., climate simulation and fluid dynamics simulation) and enterprise workloads (e.g., search and encryption) running on large-scale computing systems are jeopardized by the increase of faults and errors in hardware and software. Understanding the vulnerability of these large-scale applications is important to minimize performance and power. Lack of the knowledge of application vulnerability forms a major bottleneck of execution efficiency, and jeopardizes HPC simulation capabilities. Previous works rely on random fault injection or detailed architecture analysis to evaluate application vulnerability. They can be slow and inaccurate. There is a big gap between the needs of reliable and efficient HPC and what the current methodologies can provide.

This research explores a new methodology to understand application vulnerability. It investigates new analytical and statistical models to quantify and characterize application vulnerability based on a novel metric and application semantics (including algorithm semantics and data semantics).

Research Outcome:


This research is supported by:

                            NSF 4-Color Bitmap Logo      LLNL Logo.pngOak Ridge National Laboratory logo.svgLos Alamos National Laboratory