FTXS 2025
Workshop on Fault Tolerance for HPC at eXtreme Scale
(CANCELLED)
Authors are invited to submit original papers on the research and practice of fault-tolerance in extreme-scale distributed systems (primarily HPC systems, but including grid and cloud systems).
Resilience and fault-tolerance remain major concerns for supercomputing, and advances in this area are needed. Therefore, we are broadly interested in forward-looking papers that seek to characterize and mitigate the impact of faults.
[July 10, 2025] FTXS 2025 CANCELLED. We have made the very difficult decision to cancel the workshop this year.
[June 13, 2025] The submission deadline has been extended to June 26, 2025
[May 23, 2025] Submission details have been added below
[May 7, 2025] The submission dates and the date of the workshop have been added below.
[April 5, 2025] Our proposal was accepted! FTXS 2025 will be held in conjunction with ICPP 2025 in San Diego. 🎉🎉🎉
Paper submissions open: May 19, 2025
Paper submission closes: June 26, 2025 (extended from June 19, 2025)
Author notification: July 17, 2025
Camera-ready papers: July 31, 2025
Workshop: September 8, 2025 (start and end time TBD)
All deadlines are Anywhere-on-earth (AoE); the workshop start and end times are in Pacific Daylight Time (UTC−07:00).
WHEN : September 8, 2025
WHERE : San Diego, CA, USA
VENUE : Catamaran Resort Hotel
REGISTRATION : Register to attend ICPP 2025 HERE (registration is not yet open)
SUBMISSION : Papers should be submitted at: https://ssl.linklings.net/conferences/icpp/
UPDATES : Follow us on Twitter ( @ftxsworkshop ) for the latest news and updates on the workshop
QUESTIONS : contact Scott Levy (sllevy@sandia.gov) or Bo Fang (bo.fang@pnnl.gov)
TBD
FTXS is broadly interested in research on characterizing and mitigating the impact of faults on HPC systems. We are particularly interested in papers that address issues related to the following developments in extreme-scale systems:
Artificial Intelligence and Machine Learning (AI/ML): Significant research has recently been published (including at SC23) on how AI/ML can be leveraged to improve the performance of extreme-scale systems. In the context of fault tolerance and resilience, AI/ML applications have the potential to exhibit novel failure modes during both training and inference. Additionally, AI/ML may help to mitigate failures either by predicting when and where failures may occur or by reducing the impact of failures that do occur. Our understanding of AI/ML along these two dimensions of fault tolerance is developing rapidly and is an important area of research.
System Heterogeneity: Modern HPC systems increasingly include GPUs, FPGAs, and other types of accelerators. New networking devices like Data Processing Units (DPUs) and SmartNICs are also starting to be deployed. However, there are many resilience and fault tolerance issues associated with these devices that still need to be resolved. Papers at prominent recent conferences (including SC20, ICS 2019, and IEEE Cluster 2018) demonstrate that understanding the fault tolerance implications of heterogeneous compute devices is an important and active area of research.
Computing Paradigms: Novel non-von Neumann computing paradigms, including quantum and neuromorphic computing, have attracted significant research interest. Recent publications demonstrate that understanding the fault tolerance implications of these computing paradigms is also an area of active research.
Machine Learning: Algorithms that rely on elements of machine learning are becoming more and more prevalent on HPC systems. Understanding how these algorithms react and respond to the frequency and variety of faults that occur on HPC systems is critical to ensuring that they continue to provide accurate and timely answers.
Additional topics of interest include, but are not limited to:
Techniques for both addressing faults in AI/ML applications and using AI/ML to mitigate faults
Algorithmic-Based Fault Tolerance (ABFT) techniques to address undetected (silent) errors
Silent data corruption (SDC) detection / correction techniques
Novel fault-tolerance techniques and implementations
Failure data analysis and field studies
Power, performance, resilience (PPR) assessments / tradeoffs
Emerging hardware and software technology for resilience
Advances in reliability monitoring, analysis, and control of highly complex systems
Failure prediction, error preemption, and recovery techniques
Fault-tolerant programming models
Models for software and hardware reliability
Metrics and standards for measuring, improving, and enforcing effective fault-tolerance
Scalable Byzantine fault-tolerance and security from single-fault and fail-silent violations
Atmospheric evaluations relevant to HPC systems (terrestrial neutrons, temperature, voltage, etc.)
Near-threshold-voltage implications and evaluations for reliability
Benchmarks and experimental environments including fault injection
Frameworks and APIs for fault-tolerance and fault management
Submissions are solicited in the following categories:
Regular papers presenting innovative ideas improving the state of the art or discussing the issues seen on existing extreme-scale systems, including some form of analysis and evaluation. Regular papers should not exceed ten (10) pages, not including references. Accepted regular papers that meet these requirements will be published.
Extended abstracts presenting preliminary results, proposing disruptive ideas, or challenging assumptions in the field. The inclusion of some form of preliminary results is encouraged. Extended abstract papers should not exceed four (4) pages, not including references. Extended abstracts will be evaluated separately and given shorter oral presentations.
Papers shall be submitted at https://ssl.linklings.net/conferences/icpp/ and must conform to the requirements established by IEEE at: https://www.ieee.org/conferences/publishing/templates.html. LaTeX and MS Word templates are also available at this link.
Subject to publisher constraints, the workshop will publish all accepted submissions.
FTXS is fully committed to addressing diversity and inclusivity at our workshop and in the larger HPC fault tolerance community. We use anonymous surveys to collect demographic information about our Program Committee to ensure that we can measure our progress and so that we can be held accountable by the HPC community. The results of this survey will be added below.
Scott Levy - Sandia National Laboratories
Bo Fang - Pacific Northwest National Laboratory
Keita Teranishi - Oak Ridge National Laboratory
John Daly - Laboratory for Physical Sciences
Rizwan A. Ashraf Pacific Northwest National Laboratory
Aurelien Bouteiller University of Tennessee
Jon Calhoun Clemson University
Chris Cantwell Imperial College, London
Zizhong Chen University of California, Riverside
James Elliott Sandia National Laboratories
Christian Engelmann Oak Ridge National Laboratory
Claudia Fohry University of Kassel
Qiang Guan Kent State University
Hemanth Kolla Sandia National Laboratories
Zhiling Lan University of Illinois Chicago
Nicolas Morales Sandia National Laboratories
Thomas Naughton Oak Ridge National Laboratory
Bogdan Nicolae Argonne National Laboratory, Illinois Institute of Technology
Yves Robert ENS Lyon, University of Tennessee
Thomas Ropars Grenoble Alpes University, France
Lipeng Wan Georgia State University