FTXS 2014

Fault Tolerance for HPC at eXtreme Scales (FTXS) Workshop

Workshop Motivation
FTXS is a workshop aimed at identifying looming problems and discussing promising research solutions in the area of fault tolerance for High Performance Computing (HPC), with particular attention to extreme-scale "leadership class" supercomputers.

For the HPC community, scaling in the number of processing elements has superseded the historical trend of Moore's Law scaling in processor frequencies. This progression from single-core to multi-core and many-core will be further complicated by the community's imminent migration from traditional homogeneous architectures to heterogeneous ones. As a consequence of these trends, the HPC community is facing rapid increases in the number, variety, and complexity of components, and must therefore overcome increases in aggregate fault rates, fault diversity, and the complexity of isolating root causes.

Recent analyses demonstrate that HPC systems experience simultaneous (often correlated) failures. In addition, statistical analyses suggest that silent soft errors can no longer be ignored, because growth in component counts, memory size, and data paths (including networks) makes the probability of silent data corruption (SDC) non-negligible. The HPC community has serious concerns regarding this issue, and application users are less confident that they can rely on a correct answer to their computations. Other studies have indicated a growing divergence between the failure rates experienced by applications and the rates seen by the system hardware and software. At Exascale, some scenarios project failure rates reaching one failure per hour. This conflicts with the current checkpointing approach to fault tolerance, which requires up to 30 minutes to restart a parallel execution on the largest systems, so restart alone could consume up to half of the time between failures. Lastly, stabilization periods for the largest systems are already significant, and the possibility that these could increase in length is of great concern. During the Approaching Exascale report at SC11, DOE program managers identified resilience as a black swan: the most difficult under-addressed issue facing HPC.

Past and Upcoming FTXS Workshops

In association with the 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012), Boston, Massachusetts

Questions?

Please address FTXS workshop questions to Nathan DeBardeleben, Los Alamos National Laboratory (ndebard@lanl.gov).