FTXS 2023
Workshop on Fault Tolerance for HPC at eXtreme Scale
Workshop Overview
Authors are invited to submit original papers on the research and practice of fault-tolerance in extreme-scale distributed systems (primarily HPC systems, but including grid and cloud systems).
Resilience and fault-tolerance remain a major concern for supercomputing and advances in this area are needed. Therefore, we are broadly interested in forward-looking papers that seek to characterize and mitigate the impact of faults.
News & Announcements
[11/10/2023] The SC23 presenter template is available HERE
[10/02/2023] Tentative workshop schedule has been added below.
[8/15/2023] SUBMISSION UPDATE: All papers must be submitted by August 16, 2023 AoE. However, we will allow papers that have been submitted by this deadline to be updated until August 18, 2023 AoE.
[8/08/2023] We have extended the deadline; papers are now due August 16, 2023 AoE. This will be the only extension.
[7/18/2023] We're excited to announce that Paolo Rech from Università di Trento (Italy) and UFRGS (Brazil) will be our Featured Speaker. He'll present his work on reliability in quantum computing systems. Details forthcoming.
[5/16/2023] We have added the members of our Program Committee (see below).
[5/15/2023] We have made minor changes to the page limits and to the author notification and camera-ready deadlines (see below) to comply with SC23 proceedings requirements.
[3/17/2023] FTXS 2023 has been accepted as a half-day workshop to be included as part of the SC23 program! We are waiting to learn which day our workshop will be held on.
Important Dates
Paper submissions open: July 1, 2023
Paper submission closes: August 16, 2023 (extended from August 10, 2023)
Author notification: September 7, 2023
Camera-ready papers: September 28, 2023
Workshop: November 12, 2023 (2:00-5:30pm MST)
All dates are Anywhere on Earth (AoE)
Workshop Details
WHEN : November 12, 2023 (2:00-5:30pm MST)
WHERE : Denver, CO, USA
VENUE : Colorado Convention Center (Room 605)
REGISTRATION : Register to attend SC23 HERE (registration opens July 12, 2023)
SUBMISSION : Papers should be submitted at: https://submissions.supercomputing.org/
UPDATES : Follow us on Twitter (@ftxsworkshop) for the latest news and updates on the workshop
QUESTIONS : contact Scott Levy (sllevy@sandia.gov)
Workshop Program
FTXS 2023 will be held on Sunday, November 12, 2023. The tentative schedule for the workshop is provided below. All times are Mountain Standard Time (GMT-7), the time zone of Denver, CO, USA, where the conference will be held. Papers will be published by the ACM and will be available in the ACM Digital Library.
[2:00-2:01pm] Opening remarks
[2:01-3:00pm] Featured Speaker
Quantum Computing Reliability: Problems, Tools, and Potential Solutions
Professor Paolo Rech (Università di Trento)
Abstract: Quantum computing is a new computational paradigm, expected to revolutionize the computing field in the next few years.
Qubits, the atomic units of a quantum circuit, exploit the quantum physics properties to increase the parallelism and speed of computation.
Unfortunately, qubits are both intrinsically noisy and highly susceptible to external sources of faults, such as ionizing radiation.
The reported qubit error rate is so high that researchers are questioning the large-scale adoption of quantum computers, and it forces impractical mitigation solutions such as installing the quantum computer in underground caves.
Innovative solutions to improve the reliability of quantum applications are then highly necessary.
In the talk, after providing the background needed to understand quantum computing basics and an overview of the vulnerabilities of the available quantum technologies, we will present the available hardening solutions and the open challenges that need to be addressed. We will consider both intrinsic noise, which has a predictable and incremental effect, and radiation-induced transient faults, which are stochastic and modify the qubit in an unpredictable way. Based on the latest studies and radiation experiments performed on real quantum machines, we will show how to model a transient fault in a qubit and how to inject this fault into a quantum circuit to track its propagation. We will discuss the vulnerability of qubits and of circuits, identifying the most critical parts and the main causes of output corruption. Finally, we will provide an overview of the open (reliability) challenges in quantum computing to stimulate further studies and solutions.
[3:00-3:30pm] SC23 Afternoon Break
[3:30-3:55pm] Regular Paper 1
Optimizing Write Performance for Checkpointing to Parallel File Systems Using LSM-Trees
Bulut, Wright
[3:55-4:20pm] Regular Paper 2
Recovery from Silent Data Corruption via Spatial Data Prediction
Guernsey, Placke, Poulos, Calhoun
[4:20-4:40pm] Short Paper Lightning Talks
(4:20-4:26pm) Short Paper 1
Disk Failure Trends in Alpine Storage System
George, Hanley, Oral
(4:26-4:33pm) Short Paper 2
Using Benford's Law to Identify Unusual Failure Regions
Ferreira, Levy
(4:33-4:40pm) Short Paper 3
Dynamic Selective Protection of Sparse Iterative Solvers via ML Prediction of Soft Error Impacts
Chen, Verrecchia, Sun, Booth, Raghavan
[4:40-5:05pm] Regular Paper 3
Evaluating the Resiliency of Posits for Scientific Computing
Schlueter, Calhoun, Poulos
[5:05-5:29pm] Regular Paper 4
When to checkpoint at the end of a fixed-length reservation?
Barbut, Benoit, Herault, Robert, Vivien
[5:29-5:30pm] Closing remarks
Workshop Topics
FTXS is broadly interested in research on characterizing and mitigating the impact of faults on HPC systems. We are particularly interested in papers that address issues related to the following developments in extreme-scale systems:
Storage Devices: The storage hierarchy on HPC systems continues to increase in depth and complexity. SSDs and NVMe add high-speed node-local (or rack-local) persistent storage that can be used to improve the performance of checkpoint/restart or otherwise facilitate application resilience. Continuing to efficiently exploit these devices remains critical for extreme-scale HPC systems. Moreover, the recent availability of Non-Volatile Memory Modules (NVMMs) has begun to blur the line between memory and storage. The implications of this blurring for fault tolerance on extreme-scale systems are still being explored.
System Heterogeneity: Modern HPC systems increasingly include GPUs, FPGAs, and other types of accelerators. New networking devices like Data Processing Units (DPUs) and SmartNICs are also starting to be deployed. However, there are many resilience and fault tolerance issues associated with these devices that still need to be resolved. Papers at prominent recent conferences (including SC20, ICS 2019, and IEEE Cluster 2018) demonstrate that understanding the fault tolerance implications of heterogeneous compute devices is an important and active area of research.
Computing Paradigms: Novel non-von Neumann computing paradigms, including quantum and neuromorphic computing, have attracted significant research interest. Recent publications demonstrate that understanding the fault tolerance implications of these computing paradigms is also an area of active research.
Machine Learning: Algorithms that rely on elements of machine learning are becoming more and more prevalent on HPC systems. Understanding how these algorithms react and respond to the frequency and variety of faults that occur on HPC systems is critical to ensuring that they continue to provide accurate and timely answers.
Additional topics of interest include, but are not limited to:
Algorithmic-Based Fault Tolerance (ABFT) techniques to address undetected (silent) errors
Silent data corruption (SDC) detection / correction techniques
Novel fault-tolerance techniques and implementations
Failure data analysis and field studies
Power, performance, resilience (PPR) assessments / tradeoffs
Emerging hardware and software technology for resilience
Advances in reliability monitoring, analysis, and control of highly complex systems
Failure prediction, error preemption, and recovery techniques
Fault-tolerant programming models
Models for software and hardware reliability
Metrics and standards for measuring, improving, and enforcing effective fault-tolerance
Scalable Byzantine fault-tolerance and security from single-fault and fail-silent violations
Atmospheric evaluations relevant to HPC systems (terrestrial neutrons, temperature, voltage, etc.)
Near-threshold-voltage implications and evaluations for reliability
Benchmarks and experimental environments including fault injection
Frameworks and APIs for fault-tolerance and fault management
Submission Details
Submissions are solicited in the following categories:
Regular papers presenting innovative ideas improving the state of the art or discussing the issues seen on existing extreme-scale systems, including some form of analysis and evaluation. Regular papers must be at least six (6) pages and should not exceed eleven (11) pages including all text, appendices, figures, and references. Accepted regular papers that meet these requirements will be published (subject to publisher constraints).
Extended abstracts presenting preliminary results, proposing disruptive ideas, or challenging assumptions in the field. The inclusion of some form of preliminary results is encouraged. Extended abstract papers should not exceed four (4) pages, including all text, figures, and references. Extended abstracts will be evaluated separately and given shorter oral presentations.
Submissions shall be submitted to https://submissions.supercomputing.org/ and must conform to the requirements established by ACM at: https://www.acm.org/publications/proceedings-template. LaTeX and MS Word templates are also available at this link.
Publication
Subject to publisher constraints, our workshop will publish all submissions accepted for inclusion in our workshop.
Reproducibility
Reproducibility is an important component of extreme-scale system research. However, the goal of our workshop is to encourage and facilitate discussion of novel approaches and preliminary results. As a result, it may not always be feasible to release reproducibility artifacts. Moreover, to the greatest extent possible, we want to minimize unnecessary obstacles to socializing new ideas. Therefore, while we encourage authors to make their work as public and reproducible as possible, we do not explicitly require it.
Diversity & Inclusivity
As part of SC23, FTXS is fully committed to addressing diversity and inclusivity at our workshop and in the larger HPC fault tolerance community (see here for more information on SC23's commitment to inclusivity and diversity). As a first step, we used an anonymous survey to collect demographic information about our Program Committee so that we can measure our progress and be held accountable by the HPC community. The results of this survey are included below. Because our committee comprises a relatively small number of people, we are not releasing exact numbers in an effort to protect their privacy.
Approximately 2/3 of our Program Committee completed our anonymous demographic survey
GEOGRAPHY
Approximately 5/6 of respondents reported North America as their primary work location
Approximately 1/6 of respondents reported Europe as their primary work location
GENDER
Approximately 3/4 of respondents identify as male
Approximately 1/4 of respondents identify as female
RACIAL & ETHNIC GROUPS
No respondents identified as a racial or ethnic minority where they work
Workshop Chair
Scott Levy - Sandia National Laboratories
Workshop Organizing Committee
Keita Teranishi - Oak Ridge National Laboratory
John Daly - Laboratory for Physical Sciences
Conflict Chairs
Qiang Guan - Kent State University
Bogdan Nicolae - Argonne National Laboratory
Program Committee
Aurelien Bouteiller - University of Tennessee, Knoxville
Chris Cantwell - Imperial College, London
Zizhong Chen - UC Riverside
Florina M. Ciorba - University of Basel
James Elliott - Sandia National Laboratories
Christian Engelmann - Oak Ridge National Laboratory
Bo Fang - Pacific Northwest National Laboratory
Wilfried Gansterer - University of Vienna
Qiang Guan - Kent State University
Haewon Jeong - Harvard University
Zhiling Lan - Illinois Institute of Technology
Jackson Mayo - Sandia National Laboratories
Thomas Naughton - Oak Ridge National Laboratory
Bogdan Nicolae - Argonne National Laboratory
Yves Robert - ENS Lyon, University of Tennessee
Thomas Ropars - University of Grenoble
Lipeng Wan - Georgia State University
Panruo Wu - University of Houston
Zhengji Zhao - NERSC / Lawrence Berkeley Laboratory