FTXS 2024
Workshop on Fault Tolerance for HPC at eXtreme Scale
Looking for FTXS 2025? https://sites.google.com/view/ftxs2025
Authors are invited to submit original papers on the research and practice of fault-tolerance in extreme-scale distributed systems (primarily HPC systems, but including grid and cloud systems).
Resilience and fault-tolerance remain a major concern for supercomputing and advances in this area are needed. Therefore, we are broadly interested in forward-looking papers that seek to characterize and mitigate the impact of faults.
[October 23, 2024] Program has been finalized (see below).
[October 9, 2024] Tentative program for the workshop is included below
[August 6, 2024] SUBMISSION UPDATE: All papers must be submitted by August 8, 2024 (AoE). However, submitted papers may continue to be updated until August 11, 2024 (AoE)
[July 25, 2024] Paper deadline has been extended to August 8, 2024 (AoE)
[June 26, 2024] Karthik Pattabiraman from the University of British Columbia has accepted our invitation to be our workshop's Featured Speaker. He will discuss his work on understanding fault tolerance and resilience for Machine Learning. More details will be coming shortly.
[May 29, 2024] Our workshop will be held on Friday, November 22, 2024.
[May 7, 2024] Updated to indicate that extended abstracts WILL NOT be published based on SC24 publication limits.
[March 20, 2024] Our proposal was accepted! FTXS 2024 will be in conjunction with SC24. 🎉🎉🎉
Paper submissions open: July 1, 2024
Paper submission closes: August 8, 2024 (EXTENDED from August 1, 2024; new submissions are not permitted after this deadline, but submitted papers may continue to be updated until August 11, 2024)
Author notification: September 5, 2024
Camera-ready papers: September 27, 2024
Workshop: Friday, November 22, 2024 @ 8:30am - 12:00pm EST
All deadlines are Anywhere on Earth (AoE). The workshop start and end times are Eastern Standard Time (EST).
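For authors in other time zones, the conversion can be sketched in a few lines of Python. This is only an illustration: it assumes that an AoE deadline falls at 23:59:59 at the end of the stated day (the page gives no time of day), and the helper name `deadline_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Anywhere on Earth (AoE) is UTC-12; Eastern Standard Time (EST) is UTC-5.
AOE = timezone(timedelta(hours=-12), "AoE")
EST = timezone(timedelta(hours=-5), "EST")


def deadline_utc(year: int, month: int, day: int) -> datetime:
    """Return the last second of an AoE calendar day, expressed in UTC.

    Assumption: the deadline is the end of the stated day in AoE.
    """
    end_of_day = datetime(year, month, day, 23, 59, 59, tzinfo=AOE)
    return end_of_day.astimezone(timezone.utc)


# Extended paper deadline: August 8, 2024 (AoE).
paper_deadline = deadline_utc(2024, 8, 8)

# Workshop start: Friday, November 22, 2024 at 8:30am EST.
workshop_start = datetime(2024, 11, 22, 8, 30, tzinfo=EST).astimezone(timezone.utc)

print("Paper deadline (UTC):", paper_deadline.isoformat())
print("Workshop start (UTC):", workshop_start.isoformat())
```

Under that assumption, the extended paper deadline corresponds to just before noon UTC on August 9, 2024, and the workshop starts at 13:30 UTC.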
WHEN : Friday, November 22, 2024 @ 8:30am - 12:00pm EST
WHERE : Atlanta, GA, USA
VENUE : Georgia World Congress Center
REGISTRATION : Register to attend SC24 HERE (registration opens July 10, 2024)
SUBMISSION : Papers should be submitted at: https://submissions.supercomputing.org/
UPDATES : Follow us on Twitter ( @ftxsworkshop ) for the latest news and updates on the workshop
QUESTIONS : contact Scott Levy (sllevy@sandia.gov) or Bo Fang (bo.fang@pnnl.gov)
Abstract: Machine learning (ML) has increasingly been adopted in safety-critical systems such as Autonomous vehicles (AVs) and industrial robotics. In these domains, reliability and safety are important considerations, and hence it is critical to ensure the resilience of ML systems to faults and errors. This also applies to ML systems deployed in the HPC context. On the other hand, soft errors are becoming more frequent in commodity computer systems due to the effects of technology scaling and reduced supply voltages. Further, traditional solutions for masking hardware faults such as Triple-Modular Redundancy (TMR) are prohibitively expensive in terms of their energy and performance overheads. Therefore, there is a compelling need to provide low-cost error resilience to ML applications on commodity HPC platforms.
I will present three directions we have explored in my research group towards this goal. First, we experimentally assessed the resilience of ML applications to soft errors via fault injection. We found that even a single bit flip due to a soft error can lead to misclassification in Deep Neural Network (DNN) applications. Such misclassifications can result in safety violations. However, not all errors result in safety violations, and hence it is sufficient to protect the DNN from the ones that do. Unfortunately, finding all possible errors that result in safety violations is a very compute-intensive task. Second, we proposed BinFI, a fault injection approach that efficiently injects critical faults that are highly likely to result in safety violations, by leveraging the DNN’s properties. Finally, we proposed Ranger, an approach to protect DNNs from critical faults without causing any loss in their accuracies, and with minimal performance overheads. I will conclude by presenting some of our ongoing work as well as the future challenges in this area.
This is joint work with my students and colleagues at UBC, as well as industry collaborators.
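To make the single-bit-flip claim concrete, the following is a minimal sketch of what one flipped bit can do to a 32-bit floating-point value such as a DNN weight. It is not BinFI or Ranger, and it is not taken from the talk; the helper `flip_bit` and the example value are hypothetical.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) in the float32 encoding of value."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (encoded,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-bit mask, then reinterpret the result as a float32.
    (corrupted,) = struct.unpack("<f", struct.pack("<I", encoded ^ (1 << bit)))
    return corrupted


weight = 0.5                 # hypothetical DNN weight
low = flip_bit(weight, 0)    # lowest mantissa bit: negligible change
high = flip_bit(weight, 30)  # highest exponent bit: enormous change

print(f"original:      {weight}")
print(f"bit 0 flipped:  {low}")
print(f"bit 30 flipped: {high}")
```

Flipping the lowest mantissa bit leaves the value essentially unchanged, while flipping the highest exponent bit turns 0.5 into 2^127. This is one way to see why some faults are benign and others are critical.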
Bio: Karthik Pattabiraman is a Professor of Electrical and Computer Engineering (ECE) at the University of British Columbia (UBC). He received a PhD in 2009 in Computer Science from the University of Illinois at Urbana-Champaign (UIUC), an MS in Computer Science also from UIUC in 2004, and B. Tech. from the University of Madras, India, in 2001. Before joining UBC in 2010, he was a postdoctoral researcher at Microsoft Research (MSR), Redmond. Karthik’s research interests are in dependable computer systems, high-performance computing, cyber-physical systems and software security. Karthik has won awards such as the Inaugural IEEE Rising Star in Dependability Award, UIUC CS department’s early career alumni achievement award, UBC-wide Killam mentoring excellence award, UBC-wide Killam Faculty Research Prize and the Killam Faculty Research Fellowship, NSERC Discovery Accelerator Supplement (DAS) in Canada, and the William Carter PhD Dissertation Award. Karthik is the vice-chair of the IFIP Working Group (WG) 10.4 on Dependable Computing and Fault-tolerance, and a member of the steering committee of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). He is a distinguished member of the ACM, a distinguished contributor of the IEEE Computer Society, and a professional engineer (P.Eng.) in Canada. More info at: https://blogs.ubc.ca/karthik/
FTXS 2024 will be held on Friday, November 22, 2024. The tentative schedule for the workshop is provided below. All times are Eastern Standard Time (UTC-5): the time zone of Atlanta, GA, USA where the conference will be held. Papers will be published by the IEEE.
[8:30am] Opening remarks
[8:30-9:30am] Featured Speaker
Error-Resilient Machine Learning for HPC: Challenges and Opportunities
Karthik Pattabiraman (Univ. of British Columbia)
[9:30-10:00am]
Achieving High-Performance Fault-Tolerant Routing in HyperX Interconnection Networks
Camarero, Cano, Martínez, Beivide
[10:00-10:30am] SC24 Morning break
[10:30-11:00am]
From Failure to Insight: Analyzing Disk Breakdowns in Large-Scale HPC Environments
George, Wang, Hanley, Ransom, Bent, Zimmer
[11:00-11:30am]
Octopus: Experiences with a Hybrid Event-Driven Architecture for Distributed Scientific Computing
Pan, Chard, Zhou, Kamatar, Vescovi, Hayot-Sasson, Bauer, Gonthier, Chard, Foster
[11:30am-12:00pm]
Checkpointing strategies for a fixed-length execution
Benoit, Perotin, Robert, Vivien
[12:00pm] Closing remarks
FTXS is broadly interested in research on characterizing and mitigating the impact of faults on HPC systems. We are particularly interested in papers that address issues related to the following developments in extreme-scale systems:
Artificial Intelligence and Machine Learning (AI/ML): Significant research has recently been published (including at SC23) on how AI/ML can be leveraged to improve the performance of extreme-scale systems. In the context of fault tolerance and resilience, AI/ML applications have the potential to exhibit novel failure modes during both training and inference. Additionally, AI/ML may help to mitigate failures either by predicting when and where failures may occur or by reducing the impact of failures that do occur. Our understanding of AI/ML along these two dimensions of fault tolerance is developing rapidly and is an important area of research.
System Heterogeneity: Modern HPC systems increasingly include GPUs, FPGAs, and other types of accelerators. New networking devices like Data Processing Units (DPUs) and SmartNICs are also starting to be deployed. However, there are many resilience and fault tolerance issues associated with these devices that still need to be resolved. Papers at prominent recent conferences (including SC20, ICS 2019, and IEEE Cluster 2018) demonstrate that understanding the fault tolerance implications of heterogeneous compute devices is an important and active area of research.
Computing Paradigms: Novel non-von Neumann computing paradigms, including quantum and neuromorphic computing, have attracted significant research interest. Recent publications demonstrate that understanding the fault tolerance implications of these computing paradigms is also an area of active research.
Machine Learning: Algorithms that rely on elements of machine learning are becoming more and more prevalent on HPC systems. Understanding how these algorithms react and respond to the frequency and variety of faults that occur on HPC systems is critical to ensuring that they continue to provide accurate and timely answers.
Additional topics of interest include, but are not limited to:
Techniques for both addressing faults in AI/ML applications and using AI/ML to mitigate faults
Algorithmic-Based Fault Tolerance (ABFT) techniques to address undetected (silent) errors
Silent data corruption (SDC) detection / correction techniques
Novel fault-tolerance techniques and implementations
Failure data analysis and field studies
Power, performance, resilience (PPR) assessments / tradeoffs
Emerging hardware and software technology for resilience
Advances in reliability monitoring, analysis, and control of highly complex systems
Failure prediction, error preemption, and recovery techniques
Fault-tolerant programming models
Models for software and hardware reliability
Metrics and standards for measuring, improving, and enforcing effective fault-tolerance
Scalable Byzantine fault-tolerance and security from single-fault and fail-silent violations
Atmospheric evaluations relevant to HPC systems (terrestrial neutrons, temperature, voltage, etc.)
Near-threshold-voltage implications and evaluations for reliability
Benchmarks and experimental environments including fault injection
Frameworks and APIs for fault-tolerance and fault management
Submissions are solicited in the following categories:
Regular papers presenting innovative ideas improving the state of the art or discussing the issues seen on existing extreme-scale systems, including some form of analysis and evaluation. Regular papers must be at least six (6) pages and should not exceed eleven (11) pages including all text, appendices, figures, and references. Accepted regular papers that meet these requirements will be published.
Extended abstracts presenting preliminary results, proposing disruptive ideas, or challenging assumptions in the field. The inclusion of some form of preliminary results is encouraged. Extended abstract papers should not exceed four (4) pages, including all text, figures, and references. Extended abstracts will be evaluated separately and given shorter oral presentations. Given minimum publication requirements imposed by SC24, extended abstracts WILL NOT be published.
Papers must be submitted at https://submissions.supercomputing.org/ and must conform to the requirements established by IEEE at: https://www.ieee.org/conferences/publishing/templates.html. LaTeX and MS Word templates are also available at this link.
Subject to publisher constraints, our workshop will publish all submissions accepted for inclusion in our workshop.
Reproducibility is an important component of extreme-scale system research. However, the goal of our workshop is to encourage and facilitate discussion of novel approaches and preliminary results. As a result, it may not always be feasible to release reproducibility artifacts. Moreover, to the greatest extent possible, we want to minimize unnecessary obstacles to socializing new ideas. Therefore, while we encourage authors to make their work as public and reproducible as possible, we do not explicitly require it.
As part of SC24, FTXS is fully committed to addressing diversity and inclusivity at our workshop and in the larger HPC fault tolerance community (see here for more information on SC24's commitment to inclusivity and diversity). As a first step, we used an anonymous survey to collect demographic information about our Program Committee so that we can measure our progress and be held accountable by the HPC community. The results of this survey are included below. Survey respondents were allowed to decline to answer any of the questions. Because our committee comprises a relatively small number of people, we are not releasing exact numbers in an effort to protect their privacy.
Approximately 2/3 of our Program Committee completed our anonymous demographics survey
GEOGRAPHY
Approximately 2/5 of respondents reported North America as their primary work location
Approximately 1/6 of respondents reported Europe as their primary work location
GENDER
Approximately 9/10 of respondents identify as male
RACIAL & ETHNIC GROUPS
Approximately 1/10 of respondents identified as a racial or ethnic minority where they work
Scott Levy - Sandia National Laboratories
Bo Fang - Pacific Northwest National Laboratory
Keita Teranishi - Oak Ridge National Laboratory
John Daly - Laboratory for Physical Sciences
Rizwan A. Ashraf - Pacific Northwest National Laboratory
Aurelien Bouteiller - University of Tennessee
Jon Calhoun - Clemson University
Chris Cantwell - Imperial College, London
Zizhong Chen - University of California, Riverside
James Elliott - Sandia National Laboratories
Christian Engelmann - Oak Ridge National Laboratory
Wilfried Gansterer - University of Vienna
Qiang Guan - Kent State University
Zhiling Lan - University of Illinois Chicago
Jackson Mayo - Sandia National Laboratories
Nicolas Morales - Sandia National Laboratories
Thomas Naughton - Oak Ridge National Laboratory
Bogdan Nicolae - Argonne National Laboratory, Illinois Institute of Technology
Yves Robert - ENS Lyon, University of Tennessee
Thomas Ropars - Grenoble Alpes University, France
Lipeng Wan - Georgia State University