FTXS 2015

Fault Tolerance for HPC at eXtreme Scales (FTXS) Workshop

FTXS 2015 will be returning to HPDC as part of The 24th International ACM Symposium on High Performance Distributed Computing in Portland, Oregon, June 15-19, 2015.  We look forward to another great workshop with your help!

Advanced Micro Devices (AMD) is sponsoring the first FTXS best paper award!  This award will be chosen by the PC and awarded at FTXS 2015.  It will include a prize to be determined by AMD at a later date.

Workshop Motivation
FTXS is a workshop aimed at identifying looming fault-tolerance problems and discussing promising research solutions in High Performance Computing (HPC), with a particular focus on extreme-scale "leadership class" supercomputers.

For the HPC community, scaling in the number of processing elements has superseded the historical trend of Moore's Law scaling in processor frequencies. This progression from single-core to multi-core and many-core will be further complicated by the community's imminent migration from traditional homogeneous architectures to heterogeneous ones. As a consequence of these trends, the HPC community faces rapid increases in the number, variety, and complexity of components, and must therefore overcome increases in aggregate fault rates, fault diversity, and the complexity of isolating root causes.

Recent analyses demonstrate that HPC systems experience simultaneous (often correlated) failures. In addition, statistical analyses suggest that silent soft errors can no longer be ignored, because increases in the number of components, memory size, and data paths (including networks) make the probability of silent data corruption (SDC) non-negligible. The HPC community has serious concerns regarding this issue, and application users are less confident that they can rely on a correct answer to their computations. Other studies have indicated a growing divergence between the failure rates experienced by applications and the rates seen by the system hardware and software. At Exascale, some scenarios project failure rates reaching one failure per hour. This conflicts with the current checkpointing approach to fault tolerance, which can require up to 30 minutes to restart a parallel execution on the largest systems. Lastly, stabilization periods for the largest systems are already significant, and the possibility that these could increase in length is of great concern. During the Approaching Exascale report at SC11, DOE program managers identified resilience as a black swan: the most difficult under-addressed issue facing HPC.

Past and Upcoming FTXS Workshops

FTXS 2012, in association with the 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012), Boston, Massachusetts


Please address FTXS workshop questions to Nathan DeBardeleben, Los Alamos National Laboratory (ndebard@lanl.gov)