
The 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2019

WHEN?   Friday, November 22, 2019 8:30a-12:00p
WHERE?   Denver, CO, USA
VENUE? Colorado Convention Center (Room 301-302-303)
IN ASSOCIATION WITH? The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC19)
REGISTER See SC19 Registration page (SC19 registration opens July 11, 2019)
PAST FTXSs See sidebar for previous 8 FTXSs
QUESTIONS? Contact Scott Levy (sllevy@sandia.gov)

Workshop Program
FTXS 2019 will be held on Friday, November 22 in Room 301-302-303 (map).  The schedule is provided below.  The name of the presenter of each paper is in italics.

8:40-8:45 Opening remarks
   Session I  (chair: Scott Levy)
8:45-9:10 Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications
Losada, Bouteiller, Bosilca
[ slides ]
9:10-9:35 Enforcing Crash Consistency of Scientific Applications in Non-Volatile Main Memory Systems
Coy, Zhang
[ slides ]
9:35-10:00 FaultSight: A Fault Analysis Tool for HPC Researchers
Horn, Fulp, Calhoun, Olson
[ slides ]
10:00-10:30 coffee break
   Session II (chair: Scott Levy)
10:30-10:55 Self-stabilizing Connected Components
Sao, Engelmann, Eswar, Green, Vuduc
[ slides ]
10:55-11:20 Evaluating Compiler IR-Level Selective Instruction Duplication with Realistic Hardware Errors
Chang, Li, Erez
[ slides ]
11:20-11:45 Node-failure-resistant preconditioned conjugate gradient method without replacement nodes
Pachajoa, Pacher, Gansterer
[ slides ]
11:45-12:00 Closing remarks

Essential Submission Information

Get a printable copy of our CFP here

Workshop Topics

Authors are invited to submit original papers on the research and practice of fault-tolerance in extreme-scale distributed systems (primarily HPC systems, but including grid and cloud systems). Resilience and fault-tolerance remain major concerns for supercomputing. Advances in this area are needed so that applications can compute accurate answers (or answers within an acceptable error tolerance) in a timely and efficient manner despite degradations or failures of platform components, both hardware and software.

Topics include, but are not limited to:
  • Failure data analysis and field studies
  • Power, performance, resilience (PPR) assessments / tradeoffs
  • Novel fault-tolerance techniques and implementations
  • Emerging hardware and software technology for resilience
  • Silent data corruption (SDC) detection / correction techniques
  • Advances in reliability monitoring, analysis, and control of highly complex systems
  • Failure prediction, error preemption, and recovery techniques
  • Fault-tolerant programming models
  • Models for software and hardware reliability
  • Metrics and standards for measuring, improving, and enforcing effective fault-tolerance
  • Scalable Byzantine fault-tolerance and security from single-fault and fail-silent violations
  • Atmospheric evaluations relevant to HPC systems (terrestrial neutrons, temperature, voltage, etc.)
  • Near-threshold-voltage implications and evaluations for reliability
  • Benchmarks and experimental environments including fault injection
  • Frameworks and APIs for fault-tolerance and fault management

Submission Details

Submissions are solicited in the following categories:
  • Regular papers presenting innovative ideas improving the state of the art or discussing the issues seen on existing extreme-scale systems, including some form of analysis and evaluation.
  • Extended abstracts proposing disruptive ideas and challenging assumptions in the field, including some form of preliminary results.
Extended abstracts will be evaluated separately and given shorter oral presentations.

Submissions shall be sent electronically and must conform to SC18 proceedings style. Regular papers should not exceed ten (10) pages including all text, appendices, figures, and references. Extended abstract papers should not exceed six (6) pages.  Please note that we have only placed a limit on the maximum number of pages that a submission may contain.  Papers that are clear, coherent, and complete (with the understanding that the submission may represent a work-in-progress) but are shorter than this maximum are encouraged.

Papers should be submitted at https://submissions.supercomputing.org. A sample submission form is available (TBD).

Submitted papers will be peer-reviewed and will receive a minimum of three reviews. Accepted papers will be published by IEEE TCHPC (and included in IEEE Xplore).

Authors are encouraged to include reproducibility artifacts as described on the conference website:
https://sc19.supercomputing.org/submit/sc-reproducibility-initiative
Inclusion of reproducibility artifacts is optional.

Important Dates

Submission of papers: September 3, 2019 (anywhere-on-earth; extended from August 27, 2019)
Author notification: September 27, 2019 (anywhere-on-earth)
Camera ready papers: October 11, 2019
Workshop: November 22, 2019

Workshop Co-chairs

Scott Levy - Sandia National Laboratories
Nathan DeBardeleben - Los Alamos National Laboratory

Workshop Organizing Committee

Keita Teranishi – Sandia National Laboratories
John Daly – Laboratory for Physical Sciences

Program Committee

Rizwan Ashraf — Oak Ridge National Laboratory
Leonardo Bautista-Gomez — Barcelona Supercomputing Center
Aurelien Bouteiller — University of Tennessee
Chris Cantwell — Imperial College, London
Florina M. Ciorba — University of Basel
James Elliott — Sandia National Laboratories
Christian Engelmann — Oak Ridge National Laboratory
Kurt B. Ferreira — Sandia National Laboratories
Wilfried Gansterer — University of Vienna
Qiang Guan — Kent State University
Sudhanva Gurumurthi — Advanced Micro Devices Inc
Zhiling Lan — Illinois Institute of Technology
Naoya Maruyama — Lawrence Livermore National Laboratory
Jackson Mayo — Sandia National Laboratories
Bogdan Nicolae — Argonne National Laboratory
Yves Robert — ENS Lyon, University of Tennessee
Abhinav Vishnu — Advanced Micro Devices (AMD) Inc
Panruo Wu — University of Houston

Illustration of Denver skyline is a derivative of Denver Skyline by Hogs555, used under CC BY-SA 3.0