FTXS 2023

Workshop on Fault Tolerance for HPC at eXtreme Scale

Workshop Overview

Authors are invited to submit original papers on the research and practice of fault tolerance in extreme-scale distributed systems (primarily HPC systems, but also grid and cloud systems).

Resilience and fault tolerance remain major concerns for supercomputing, and advances in this area are needed. We are therefore broadly interested in forward-looking papers that seek to characterize and mitigate the impact of faults.

News & Announcements

Important Dates

All dates are Anywhere on Earth (AoE).

Workshop Details

Workshop Program

FTXS 2023 will be held on Sunday, November 12, 2023. The tentative schedule for the workshop is provided below. All times are Mountain Standard Time (GMT-7), the time zone of Denver, CO, USA, where the conference will be held. Papers will be published by the ACM and will be available in the ACM Digital Library.

Quantum Computing Reliability: Problems, Tools, and Potential Solutions

Professor Paolo Rech (Università di Trento)

Abstract: Quantum computing is a new computational paradigm, expected to revolutionize the computing field in the next few years.

Qubits, the atomic units of a quantum circuit, exploit the properties of quantum physics to increase the parallelism and speed of computation.

Unfortunately, qubits are both intrinsically noisy and highly susceptible to external sources of faults, such as ionizing radiation.

The reported qubit error rate is so high that researchers are questioning the large-scale adoption of quantum computers, and it forces impractical mitigation measures such as installing quantum computers in underground caves.

Innovative solutions to improve the reliability of quantum applications are therefore greatly needed.

In the talk, after providing the background needed to understand the basics of quantum computing and an overview of the vulnerabilities of the available quantum technologies, we will present the available hardening solutions and the open challenges that need to be addressed. We will consider both intrinsic noise, which has a predictable and incremental effect, and radiation-induced transient faults, which are stochastic and modify the qubit in an unpredictable way. Based on the latest studies and radiation experiments performed on real quantum machines, we will show how to model transient faults in a qubit and how to inject such faults into a quantum circuit to track their propagation. We will discuss the vulnerability of qubits and of circuits, identifying the most critical parts and the main causes of output corruption. Finally, we will provide an overview of the open reliability challenges in quantum computing to stimulate further studies and solutions.

Optimizing Write Performance for Checkpointing to Parallel File Systems Using LSM-Trees

Bulut, Wright

Recovery from Silent Data Corruption via Spatial Data Prediction

Guernsey, Placke, Poulos, Calhoun

Disk Failure Trends in Alpine Storage System

George, Hanley, Oral

Using Benford's Law to Identify Unusual Failure Regions

Ferreira, Levy 

Dynamic Selective Protection of Sparse Iterative Solvers via ML Prediction of Soft Error Impacts

Chen, Verrecchia, Sun, Booth, Raghavan

Evaluating the Resiliency of Posits for Scientific Computing

Schlueter, Calhoun, Poulos

When to checkpoint at the end of a fixed-length reservation?

Barbut, Benoit, Herault, Robert, Vivien

Workshop Topics

FTXS is broadly interested in research on characterizing and mitigating the impact of faults on HPC systems. We are particularly interested in papers that address issues related to the following developments in extreme-scale systems:

Additional topics of interest include, but are not limited to: 

Submission Details

Submissions are solicited in the following categories:

Papers must be submitted through https://submissions.supercomputing.org/ and must conform to the requirements established by ACM at: https://www.acm.org/publications/proceedings-template. LaTeX and MS Word templates are also available at this link.

Publication

Subject to publisher constraints, our workshop will publish all submissions accepted for inclusion in our workshop.

Reproducibility

Reproducibility is an important component of extreme-scale system research. However, the goal of our workshop is to encourage and facilitate discussion of novel approaches and preliminary results. As a result, it may not always be feasible to release reproducibility artifacts. Moreover, to the greatest extent possible, we want to minimize unnecessary obstacles to socializing new ideas. Therefore, while we encourage authors to make their work as public and reproducible as possible, we do not explicitly require it.

Diversity & Inclusivity

As part of SC23, FTXS is fully committed to addressing diversity and inclusivity at our workshop and in the larger HPC fault tolerance community (see here for more information on SC23's commitment to inclusivity and diversity). As a first step, we used an anonymous survey to collect demographic information about our Program Committee so that we can measure our progress and be held accountable by the HPC community. The results of this survey are included below. Because our committee comprises a relatively small number of people, we are not releasing exact numbers in an effort to protect their privacy.

Workshop Chair

Scott Levy - Sandia National Laboratories

Workshop Organizing Committee

Keita Teranishi - Oak Ridge National Laboratory

John Daly - Laboratory for Physical Sciences

Conflict Chairs

Qiang Guan - Kent State University

Bogdan Nicolae - Argonne National Laboratory

Program Committee

Aurelien Bouteiller - University of Tennessee, Knoxville

Chris Cantwell - Imperial College, London

Zizhong Chen - UC Riverside

Florina M. Ciorba - University of Basel

James Elliott - Sandia National Laboratories

Christian Engelmann - Oak Ridge National Laboratory

Bo Fang - Pacific Northwest National Laboratory

Wilfried Gansterer - University of Vienna

Qiang Guan - Kent State University

Haewon Jeong - Harvard University

Zhiling Lan - Illinois Institute of Technology

Jackson Mayo - Sandia National Laboratories

Thomas Naughton - Oak Ridge National Laboratory

Bogdan Nicolae - Argonne National Laboratory

Yves Robert - ENS Lyon, University of Tennessee

Thomas Ropars - University of Grenoble

Lipeng Wan - Georgia State University

Panruo Wu - University of Houston

Zhengji Zhao - NERSC / Lawrence Berkeley Laboratory