FTXS 2022

Workshop on Fault Tolerance for HPC at eXtreme Scale


Workshop Overview

Authors are invited to submit original papers on the research and practice of fault-tolerance in extreme-scale distributed systems (primarily HPC systems, but including grid and cloud systems).

Resilience and fault-tolerance remain a major concern for supercomputing, and advances in this area are needed. Therefore, we are broadly interested in forward-looking papers that seek to characterize and mitigate the impact of faults.

News & Announcements

Important Dates

All dates are Anywhere on Earth (AoE).

Workshop Details

Workshop Program

FTXS 2022 will be held on Monday, November 14, 2022. Currently, the workshop is planned to be a hybrid event with the caveat that all of the components (remote and in-person) will be presented live. The tentative schedule for the workshop is provided below. All times are Central Standard Time (GMT-6), the time zone of Dallas, TX, USA, where the conference will be held. Papers will be published by the IEEE Computer Society and will be available in IEEE Xplore.

Silent data corruptions at scale

Dr. Harish Dixit (Facebook)

ClusterLog: Clustering Logs for Effective Log-based Anomaly Detection (slides)

Egersdoerfer, Zhang, Dai

Recovery of Distributed Iterative Solvers for Linear Systems Using Non-Volatile RAM (slides)

Fridman, Snir, Levin, Hendler, Attiya, Oren

Towards Precision-Aware Fault Tolerance Approaches for Mixed-Precision Applications (slides)

Fang, Hari, Tsai, Li, Gopalakrishnan, Laguna, Barker, Li

ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms (slides)

Hübner, Hespe, Sanders, Stamatakis

Implicit Actions and Non-blocking Failure Recovery with MPI (slides)

Bouteiller, Bosilca


Workshop Topics

FTXS is broadly interested in research on characterizing and mitigating the impact of faults on HPC systems. We are particularly interested in papers that address issues related to the following developments in extreme-scale systems:

Additional topics of interest include, but are not limited to: 

Submission Details

Submissions are solicited in the following categories:

Submissions must be made through https://submissions.supercomputing.org/ and must conform to IEEE conference proceedings style. IEEE templates are available at: www.ieee.org/conferences/publishing/templates.html.

Publication

Subject to publisher constraints, our workshop will publish all submissions accepted for inclusion in our workshop. We will publish our proceedings in cooperation with the IEEE Computer Society.

Reproducibility

Reproducibility is an important component of extreme-scale system research. However, the goal of our workshop is to encourage and facilitate discussion of novel approaches and preliminary results. As a result, it may not always be feasible to release reproducibility artifacts. Moreover, to the greatest extent possible, we want to minimize unnecessary obstacles to socializing new ideas. Therefore, while we encourage authors to make their work as public and reproducible as possible, we do not explicitly require it. Authors may submit Artifact Description (AD) and Artifact Evaluation (AE) appendices consistent with the SC22 Reproducibility Initiative. We are currently working to provide review of the AD/AE appendices that are submitted.

Diversity & Inclusivity

As part of SC22, FTXS is fully committed to addressing diversity and inclusivity at our workshop and in the larger HPC fault tolerance community (see here for more information on SC22's commitment to inclusivity and diversity). As a first step, we used an anonymous survey to collect demographic information about our Program Committee so that we can measure our progress and be held accountable by the HPC community. The results of this survey are included below. Because our committee comprises a relatively small number of people, we are not releasing exact numbers in an effort to protect their privacy.

Workshop Chair

Scott Levy - Sandia National Laboratories

Conflicts Chair

Qiang Guan - Kent State University

Workshop Organizing Committee

Keita Teranishi - Sandia National Laboratories

John Daly - Laboratory for Physical Sciences

Program Committee

Aurelien Bouteiller - University of Tennessee, Knoxville

Chris Cantwell - Imperial College, London

Florina M. Ciorba - University of Basel

James Elliott - Sandia National Laboratories

Christian Engelmann - Oak Ridge National Laboratory

Bo Fang - Pacific Northwest National Laboratory

Wilfried Gansterer - University of Vienna

Qiang Guan - Kent State University

Amina Guermouche - Telecom SudParis

Haewon Jeong - Harvard University

Gokcen Kestor - Pacific Northwest National Laboratory

Zhiling Lan - Illinois Institute of Technology

Maria J. Martin - Universidade da Coruña

Jackson Mayo - Sandia National Laboratories

Bogdan Nicolae - Argonne National Laboratory

Yves Robert - ENS Lyon, University of Tennessee

Thomas Ropars - University of Grenoble

Lipeng Wan - Oak Ridge National Laboratory

Panruo Wu - University of Houston