FTXS 2022
Workshop on Fault Tolerance for HPC at eXtreme Scale
Workshop Overview
Authors are invited to submit original papers on the research and practice of fault-tolerance in extreme-scale distributed systems (primarily HPC systems, but including grid and cloud systems).
Resilience and fault tolerance remain major concerns for supercomputing, and advances in this area are needed. Therefore, we are broadly interested in forward-looking papers that seek to characterize and mitigate the impact of faults.
News & Announcements
The SC22 Presenter Slide Template is available here
Harish Dixit from Facebook will be our Featured Speaker, discussing work at Facebook on silent data corruption in large-scale systems!
IEEE Computer Society has agreed to publish our proceedings!
FTXS 2022 has been accepted to SC22 in November in Dallas, TX, USA!
Important Dates
Paper submissions open: July 1, 2022
Paper submissions close: August 18, 2022 (extended from August 11, 2022)
Author notification: September 15, 2022 (previously September 8, 2022, then September 13, 2022)
Camera-ready papers: October 7, 2022
Workshop: Monday, November 14, 2022 (1:30-5:00pm CST)
All dates are Anywhere on Earth (AoE)
Workshop Details
WHEN : Monday, November 14, 2022 (1:30-5:00pm CST)
WHERE : Dallas, TX, USA
VENUE : Kay Bailey Hutchison Convention Center (Room C143-149, map)
REGISTRATION : Register to attend SC22 HERE (registration is open; early registration closes October 14, 2022)
SUBMISSION : Papers should be submitted at: https://submissions.supercomputing.org/
UPDATES : Follow us on Twitter (@ftxsworkshop) for the latest news and updates on the workshop
QUESTIONS : contact Scott Levy (sllevy@sandia.gov)
Workshop Program
FTXS 2022 will be held on Monday, November 14, 2022. Currently, the workshop is planned to be a hybrid event with the caveat that all of the components (remote and in-person) will be presented live. The tentative schedule for the workshop is provided below. All times are Central Standard Time (GMT-6), the time zone of Dallas, TX, USA, where the conference will be held. Papers will be published by the IEEE Computer Society and will be available in IEEE Xplore.
[1:30-1:35pm] Opening remarks
[1:35-2:35pm] Featured speaker
Silent data corruptions at scale
Dr. Harish Dixit (Facebook)
[2:35-3:00pm] Regular paper
ClusterLog: Clustering Logs for Effective Log-based Anomaly Detection (slides)
Egersdoerfer, Zhang, Dai
[3:00-3:30pm] SC22 Afternoon break
[3:30-3:55pm] Regular paper
Recovery of Distributed Iterative Solvers for Linear Systems Using Non-Volatile RAM (slides)
Fridman, Snir, Levin, Hendler, Attiya, Oren
[3:55-4:10pm] Short paper
Towards Precision-Aware Fault Tolerance Approaches for Mixed-Precision Applications (slides)
Fang, Hari, Tsai, Li, Gopalakrishnan, Laguna, Barker, Li
[4:10-4:35pm] Regular paper
ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms (slides)
Hübner, Hespe, Sanders, Stamatakis
[4:35-5:00pm] Regular paper
Implicit Actions and Non-blocking Failure Recovery with MPI (slides)
Bouteiller, Bosilca
[5:00pm] Closing remarks
Workshop Topics
FTXS is broadly interested in research on characterizing and mitigating the impact of faults on HPC systems. We are particularly interested in papers that address issues related to the following developments in extreme-scale systems:
Storage Devices: The storage hierarchy on HPC systems continues to increase in depth and complexity. SSDs and NVMe add high-speed node-local (or rack-local) persistent storage that can be used to improve the performance of checkpoint/restart or otherwise facilitate application resilience. Continuing to efficiently exploit these devices remains critical for extreme-scale HPC systems. Moreover, the recent availability of Non-Volatile Memory Modules (NVMMs) has begun to blur the line between memory and storage. The implications of this blurring for fault tolerance on extreme-scale systems are still being explored.
System Heterogeneity: Modern HPC systems increasingly include GPUs, FPGAs, and other types of accelerators. New networking devices like Data Processing Units (DPUs) and SmartNICs are also starting to be deployed. However, there are many resilience and fault tolerance issues associated with these devices that still need to be resolved. Papers at prominent recent conferences (including SC20, ICS 2019, and IEEE Cluster 2018) demonstrate that understanding the fault tolerance implications of heterogeneous compute devices is an important and active area of research.
Computing Paradigms: Novel non-von Neumann computing paradigms, including quantum and neuromorphic computing, have attracted significant research interest. Recent publications demonstrate that understanding the fault tolerance implications of these computing paradigms is also an area of active research.
Machine Learning: Algorithms that rely on elements of machine learning are becoming more and more prevalent on HPC systems. Understanding how these algorithms react and respond to the frequency and variety of faults that occur on HPC systems is critical to ensuring that they continue to provide accurate and timely answers.
Additional topics of interest include, but are not limited to:
Algorithm-Based Fault Tolerance (ABFT) techniques to address undetected (silent) errors
Silent data corruption (SDC) detection / correction techniques
Novel fault-tolerance techniques and implementations
Failure data analysis and field studies
Power, performance, resilience (PPR) assessments / tradeoffs
Emerging hardware and software technology for resilience
Advances in reliability monitoring, analysis, and control of highly complex systems
Failure prediction, error preemption, and recovery techniques
Fault-tolerant programming models
Models for software and hardware reliability
Metrics and standards for measuring, improving, and enforcing effective fault-tolerance
Scalable Byzantine fault-tolerance and security from single-fault and fail-silent violations
Atmospheric evaluations relevant to HPC systems (terrestrial neutrons, temperature, voltage, etc.)
Near-threshold-voltage implications and evaluations for reliability
Benchmarks and experimental environments including fault injection
Frameworks and APIs for fault-tolerance and fault management
Submission Details
Submissions are solicited in the following categories:
Regular papers presenting innovative ideas improving the state of the art or discussing the issues seen on existing extreme-scale systems, including some form of analysis and evaluation. Regular papers should not exceed ten (10) pages including all text, appendices, and figures, but excluding references. Accepted regular papers that meet these requirements will be published (subject to publisher constraints).
Extended abstracts presenting preliminary results, proposing disruptive ideas, or challenging assumptions in the field. The inclusion of some form of preliminary results is encouraged. Extended abstract papers should not exceed four (4) pages, not including references. Extended abstracts will be evaluated separately and given shorter oral presentations. We are currently working with a publisher to provide an opportunity to publish extended abstracts.
Papers must be submitted at https://submissions.supercomputing.org/ and must conform to IEEE conference proceedings style. IEEE templates are available at: www.ieee.org/conferences/publishing/templates.html.
Publication
Subject to publisher constraints, our workshop will publish all submissions accepted for inclusion in our workshop. We will publish our proceedings in cooperation with the IEEE Computer Society.
Reproducibility
Reproducibility is an important component of extreme-scale system research. However, the goal of our workshop is to encourage and facilitate discussion of novel approaches and preliminary results. As a result, it may not always be feasible to release reproducibility artifacts. Moreover, to the greatest extent possible, we want to minimize unnecessary obstacles to socializing new ideas. Therefore, while we encourage authors to make their work as public and reproducible as possible, we do not explicitly require it. Authors may submit Artifact Description (AD) and Artifact Evaluation (AE) appendices consistent with the SC22 Reproducibility Initiative. We are currently working to provide review of the AD/AE appendices that are submitted.
Diversity & Inclusivity
As part of SC22, FTXS is fully committed to addressing diversity and inclusivity at our workshop and in the larger HPC fault tolerance community (see here for more information on SC22's commitment to inclusivity and diversity). As a first step, we used an anonymous survey to collect demographic information about our Program Committee so that we can measure our progress and be held accountable by the HPC community. The results of this survey are included below. Because our committee comprises a relatively small number of people, we are not releasing exact numbers in an effort to protect their privacy.
Approximately 50% of our Program Committee completed our anonymous demographic survey
GEOGRAPHY
Approximately 3/5 of respondents reported North America as their primary work location
Approximately 2/5 of respondents reported Europe as their primary work location
GENDER
Approximately 4/5 of respondents identify as male
Approximately 1/5 of respondents identify as female
RACIAL & ETHNIC GROUPS
No respondents identified as a racial or ethnic minority where they work
Workshop Chair
Scott Levy - Sandia National Laboratories
Conflicts Chair
Qiang Guan - Kent State University
Workshop Organizing Committee
Keita Teranishi - Sandia National Laboratories
John Daly - Laboratory for Physical Sciences
Program Committee
Aurelien Bouteiller - University of Tennessee, Knoxville
Chris Cantwell - Imperial College, London
Florina M. Ciorba - University of Basel
James Elliott - Sandia National Laboratories
Christian Engelmann - Oak Ridge National Laboratory
Bo Fang - Pacific Northwest National Laboratory
Wilfried Gansterer - University of Vienna
Qiang Guan - Kent State University
Amina Guermouche - Telecom SudParis
Haewon Jeong - Harvard University
Gokcen Kestor - Pacific Northwest National Laboratory
Zhiling Lan - Illinois Institute of Technology
Maria J. Martin - Universidade da Coruña
Jackson Mayo - Sandia National Laboratories
Bogdan Nicolae - Argonne National Laboratory
Yves Robert - ENS Lyon, University of Tennessee
Thomas Ropars - University of Grenoble
Lipeng Wan - Oak Ridge National Laboratory
Panruo Wu - University of Houston