FTXS 2021

Workshop on Fault Tolerance for HPC at eXtreme Scale

Looking for FTXS 2022? You can find it at: https://sites.google.com/view/ftxs2022

Workshop Overview

Authors are invited to submit original papers on the research and practice of fault-tolerance in extreme-scale distributed systems (primarily HPC systems, but including grid and cloud systems).

Resilience and fault tolerance remain major concerns for supercomputing, and advances in this area are needed. Therefore, we are broadly interested in forward-looking papers that seek to characterize and mitigate the impact of faults.

Held in conjunction with:

In cooperation with:

News & Announcements

  • The submission deadline has been extended to September 5.

  • We are thrilled to announce that Dr. Catherine Schuman from Oak Ridge National Laboratory (ORNL) will be the featured speaker at FTXS 2021! She will discuss her work on fault tolerance and resilience in neuromorphic systems. Watch this space for details.

  • We are also excited to announce that Dr. Devesh Tiwari from Northeastern University will be giving an invited talk at FTXS 2021! He will discuss his work on reliability and fault tolerance in quantum computing systems. Watch this space for details.

Remote Presentation Details

Guidance and resources for preparing your remote presentation are contained in the following documents:

Workshop Schedule

FTXS 2021 will be held on Sunday, November 14, 2021. Currently, the workshop is planned as a fully virtual event, with all components held remotely. The tentative schedule for the workshop is provided below. All times are Central Standard Time (GMT-6), the time zone of St. Louis, MO, USA, where the conference will be held. Links to the extended abstracts are posted below. Regular papers will be published in IEEE Xplore.

    • [9:00-9:05am] Opening remarks

    • [9:05-10:05am] Featured speaker

Fault Tolerance and Resilience in Neuromorphic Systems

Dr. Catherine Schuman (Oak Ridge National Laboratory)

    • [10:05-10:30am] BREAK

    • [10:30-11:00am] Regular Paper (slides)

Doubt and Redundancy Kill Soft Errors—Towards Detection and Correction of Silent Data Corruption in Task-based Numerical Software

Samfass, Weinzierl, Reinarz, Bader

    • [11:00-11:30am] Regular Paper (slides)

Assessing the Use Cases of Persistent Memory in High-Performance Scientific Computing

Oren, Fridman

    • [11:30am-12:00pm] Regular paper (slides)

Statistical Framework for Two-Party Acceptance Testing of HPC Systems for Reliability

DeBardeleben, Burr, Penton, Walker, Loncaric, Jones

Accelerating checkpoint/restart with lossy methods

Ildes, Kastoras, Keller, Bautista Gomez

    • [12:15-12:30pm] Extended abstract (paper)

Characterizing Per-node Memory Failures Using Benford’s Law

Ferreira, Levy

    • [12:30-2:00pm] LUNCH BREAK

    • [2:00-3:00pm] Invited Speaker

Making Erroneous Executions on Quantum Computers Meaningful

Dr. Devesh Tiwari, Northeastern University

    • [3:00-3:30pm] BREAK

    • [3:30-4:00pm] Regular paper

Incorporating Fault-Tolerance Awareness into System-Level Modeling and Simulation

Johnson, Lam

    • [4:00-4:30pm] Regular paper (slides)

Relaxed Replication for Energy Efficient and Resilient GPU Computing

Miao, Calhoun, Ge

    • [4:30-4:35pm] Closing remarks

Important Dates

  • Paper submissions open: July 1, 2021

  • Paper submission closes: September 5, 2021 (extended from August 27, 2021)

  • Author notification: September 30, 2021 (originally September 27, 2021)

  • Camera-ready papers: October 15, 2021

  • Workshop: November 14, 2021

All dates are Anywhere on Earth (AoE).

Workshop Details

  • WHEN : November 14, 2021

  • WHERE : St. Louis, MO, USA

  • VENUE : America's Center (Room TBD)

  • REGISTRATION : Register to attend SC21 HERE (registration is scheduled to open July 14, 2021)

  • SUBMISSION : Papers should be submitted at: https://submissions.supercomputing.org/

  • UPDATES : Follow us on Twitter (@ftxsworkshop) for the latest news and updates on the workshop

  • QUESTIONS : contact Scott Levy (sllevy@sandia.gov)

Workshop Topics

FTXS is broadly interested in research on characterizing and mitigating the impact of faults on HPC systems. We are particularly interested in papers that address issues related to the following developments in extreme-scale systems:

  • Storage Devices: The storage hierarchy on HPC systems continues to increase in depth and complexity. SSDs and NVMe add high-speed node-local (or rack-local) persistent storage that can be used to improve the performance of checkpoint/restart or otherwise facilitate application resilience. Continuing to efficiently exploit these devices remains critical for extreme-scale HPC systems. Moreover, the recent availability of Non-Volatile Memory Modules (NVMMs) has begun to blur the line between memory and storage. The implications of this blurring for fault tolerance on extreme-scale systems are still being explored.

  • System Heterogeneity: Modern HPC systems increasingly include GPUs, FPGAs, and other types of accelerators. New networking devices like Data Processing Units (DPUs) and SmartNICs are also starting to be deployed. However, there are many resilience and fault tolerance issues associated with these devices that still need to be resolved. Papers at prominent recent conferences (including SC20, ICS 2019, and IEEE Cluster 2018) demonstrate that understanding the fault tolerance implications of heterogeneous compute devices is an important and active area of research.

  • Computing Paradigms: Novel non-von Neumann computing paradigms, including quantum and neuromorphic computing, have attracted significant research interest. Recent publications demonstrate that understanding the fault tolerance implications of these computing paradigms is also an area of active research.

  • Machine Learning: Algorithms that rely on elements of machine learning are becoming more and more prevalent on HPC systems. Understanding how these algorithms react and respond to the frequency and variety of faults that occur on HPC systems is critical to ensuring that they continue to provide accurate and timely answers.

Additional topics of interest include, but are not limited to:

  • Algorithm-Based Fault Tolerance (ABFT) techniques to address undetected (silent) errors

  • Silent data corruption (SDC) detection / correction techniques

  • Novel fault-tolerance techniques and implementations

  • Failure data analysis and field studies

  • Power, performance, resilience (PPR) assessments / tradeoffs

  • Emerging hardware and software technology for resilience

  • Advances in reliability monitoring, analysis, and control of highly complex systems

  • Failure prediction, error preemption, and recovery techniques

  • Fault-tolerant programming models

  • Models for software and hardware reliability

  • Metrics and standards for measuring, improving, and enforcing effective fault-tolerance

  • Scalable Byzantine fault-tolerance and security from single-fault and fail-silent violations

  • Atmospheric evaluations relevant to HPC systems (terrestrial neutrons, temperature, voltage, etc.)

  • Near-threshold-voltage implications and evaluations for reliability

  • Benchmarks and experimental environments including fault injection

  • Frameworks and APIs for fault-tolerance and fault management

Submission Details

Submissions are solicited in the following categories:

  • Regular papers presenting innovative ideas improving the state of the art or discussing the issues seen on existing extreme-scale systems, including some form of analysis and evaluation. Regular papers should be at least six (6) pages but should not exceed ten (10) pages including all text, appendices, figures, and references. Accepted regular papers that meet these requirements will be published in cooperation with IEEE TCHPC (subject to publisher conditions regarding the number of papers accepted by the workshop).

  • Extended abstracts presenting preliminary results, proposing disruptive ideas, or challenging assumptions in the field. The inclusion of some form of preliminary results is encouraged. Extended abstracts should not exceed three (3) pages, not including references. Extended abstracts will be evaluated separately and given shorter oral presentations. They will be posted on our website but will NOT be published in the workshop proceedings.

Papers must be submitted at https://submissions.supercomputing.org/ and must conform to the IEEE conference proceedings style. IEEE templates are available at: www.ieee.org/conferences/publishing/templates.html.

Diversity & Inclusivity

As part of SC21, FTXS is fully committed to addressing diversity and inclusivity at our workshop and in the larger HPC fault tolerance community (see here for more information on SC21's commitment to inclusivity and diversity). As a first step, we used an anonymous survey to collect demographic information about our Program Committee so that we can measure our progress and be held accountable by the HPC community. The results of this survey are included below. Because our committee comprises a relatively small number of people, we are not releasing exact numbers in an effort to protect their privacy.

  • Approximately 75% of our Program Committee completed our anonymous demographic survey

  • GEOGRAPHY

    • Approximately 1/2 of respondents reported North America as their primary work location

    • Approximately 1/2 of respondents reported Europe as their primary work location

  • GENDER

    • Approximately 3/4 of respondents identify as male

    • Approximately 1/4 of respondents identify as female

  • RACIAL & ETHNIC GROUPS

    • Approximately 4/5 of respondents do not identify as a racial or ethnic minority where they work

    • Approximately 1/5 of respondents do identify as a racial or ethnic minority where they work

Workshop Chair

Scott Levy - Sandia National Laboratories

Conflicts Chair

Qiang Guan - Kent State University

Workshop Organizing Committee

Keita Teranishi - Sandia National Laboratories

John Daly - Laboratory for Physical Sciences

Program Committee

Aurelien Bouteiller - University of Tennessee, Knoxville

Chris Cantwell - Imperial College, London

Florina M. Ciorba - University of Basel

James Elliott - Sandia National Laboratories

Christian Engelmann - Oak Ridge National Laboratory

Bo Fang - Pacific Northwest National Laboratory

Wilfried Gansterer - University of Vienna

Qiang Guan - Kent State University

Amina Guermouche - Telecom SudParis

Haewon Jeong - Harvard University

Gokcen Kestor - Pacific Northwest National Laboratory

Zhiling Lan - Illinois Institute of Technology

Maria J. Martin - Universidade da Coruña

Jackson Mayo - Sandia National Laboratories

Bogdan Nicolae - Argonne National Laboratory

Sarunya Pumma - AMD

Paolo Rech - UFRGS, Politecnico di Torino

Yves Robert - ENS Lyon, University of Tennessee

Thomas Ropars - University of Grenoble

Lipeng Wan - Oak Ridge National Laboratory

Panruo Wu - University of Houston