FTXS Workshop

Fault Tolerance for HPC at eXtreme Scales (FTXS) Workshop

NEWS

FTXS 2023 will be part of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC23) in Denver, CO, USA, November 12-17, 2023. We hope to host another great workshop with your help!

Workshop Motivation

FTXS is a workshop aimed at identifying looming problems and discussing promising research solutions in fault tolerance for High Performance Computing (HPC), particularly for extreme-scale "leadership class" supercomputers.

For the HPC community, scaling in the number of processing elements has superseded the historical trend of Moore's Law scaling in processor frequencies. This progression from single-core to multi-core and many-core will be further complicated by the community's imminent migration from traditional homogeneous architectures to heterogeneous ones. As a consequence of these trends, the HPC community is facing rapid increases in the number, variety, and complexity of components, and must therefore overcome increases in aggregate fault rates, fault diversity, and the complexity of isolating root causes.

Recent analyses demonstrate that HPC systems experience simultaneous (often correlated) failures. In addition, statistical analyses suggest that silent soft errors can no longer be ignored, because increases in component counts, memory size, and data paths (including networks) make the probability of silent data corruption (SDC) non-negligible. The HPC community has serious concerns regarding this issue, and application users are less confident that they can rely on a correct answer to their computations. Other studies have indicated a growing divergence between failure rates experienced by applications and rates seen by the system hardware and software. At Exascale, some scenarios project failure rates reaching one failure per hour. This conflicts with the current checkpointing approach to fault tolerance, which requires up to 30 minutes to restart a parallel execution on the largest systems: at one failure per hour, restarts alone could consume up to half of the available time. Lastly, stabilization periods for the largest systems are already significant, and the possibility that these could increase in length is of great concern. During the Approaching Exascale report at SC11, DOE program managers identified resilience as a black swan - the most difficult under-addressed issue facing HPC.

Past and Upcoming FTXS Workshops

FTXS 2010

in association with The 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2010) - Chicago, Illinois

FTXS 2012

in association with The 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012) - Boston, Massachusetts

FTXS 2013

in association with The 22nd International ACM Symposium on High Performance Parallel and Distributed Computing (HPDC'13) - New York City, New York

FTXS 2014

in association with The 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2014) - Atlanta, Georgia

FTXS 2015

in association with The 24th International ACM Symposium on High Performance Parallel and Distributed Computing (HPDC'15) - Portland, Oregon

FTXS 2016

in association with The 25th International ACM Symposium on High Performance Parallel and Distributed Computing (HPDC'16) - Kyoto, Japan

FTXS 2017

in association with The 26th International ACM Symposium on High Performance Parallel and Distributed Computing (HPDC'17) - Washington, D.C.

FTXS 2018

in association with The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18) - Dallas, TX

FTXS 2019

in association with The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC19) - Denver, CO

FTXS 2020

in association with The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC20) - Atlanta, GA (VIRTUAL)

FTXS 2021

in association with The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC21) - St. Louis, MO

FTXS 2022

in association with The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC22) - Dallas, TX

FTXS 2023

proposed for inclusion in The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC23) - Denver, CO

Questions?

Please address questions about the FTXS workshop series to Nathan DeBardeleben, Los Alamos National Laboratory (ndebard@lanl.gov).