The 3rd Fault Tolerance for HPC at eXtreme Scale (FTXS) 2013


Workshop Agenda

The presenter is underlined; (slides) and (REF) references are linked where available.

8:30 - 8:45  Welcome, Logistics and FTXS History
    Nathan DeBardeleben - Los Alamos National Laboratory

SESSION: Algorithms and Applications

8:45 - 9:30  INVITED TALK - Toward Resilient Algorithms and Applications (slides)
    Mike Heroux - Sandia National Laboratories
9:30 - 10:00  Fault Tolerance Using Lower Fidelity Data in Adaptive Mesh Applications (slides) (REF)
    Anshu Dubey, Prateeti Mohapatra and Klaus Weide
10:00 - 10:30  Break

SESSION: Hardware Issues

10:30 - 11:15  INVITED TALK - Circuits for Resilient Systems (slides)
    Vivek De - Intel
11:15 - 11:45  Neutron Sensitivity and Software Hardening Strategies for Matrix Multiplication and FFT on Graphics Processing Units (slides) (REF)
    Paolo Rech, Laercio Pilla, Francesco Silvestri, Philippe Navaux and Luigi Carro
11:45 - 1:30  Lunch - on your own

SESSION: Injection, Detection, and Replication

1:30 - 2:00  Using Unreliable Virtual Hardware to Inject Errors in Extreme-Scale Systems (slides) (REF)
    Scott Levy, Matthew G. F. Dosanjh, Patrick G. Bridges and Kurt B. Ferreira
2:00 - 2:30  Fault Detection in Multi-Core Processors Using Chaotic Maps (slides) (REF)
    Nageswara Rao
2:30 - 3:00  Replication for Send-Deterministic MPI HPC Applications (slides) (REF)
    Arnaud Lefray, Thomas Ropars and André Schiper
3:00 - 3:30  Break

SESSION: Energy and Checkpointing

3:30 - 4:00  Energy-aware I/O Optimization for Checkpoint and Restart on a NAND Flash Memory System (slides) (REF)
    Takafumi Saito, Kento Sato, Hitoshi Sato and Satoshi Matsuoka
4:00 - 4:30  When is Multi-version Checkpointing Needed (slides) (REF)
    Guoming Lu, Ziming Zheng and Andrew A. Chien

SESSION: Wrap-up discussion, conclusions, takeaways, action items, next steps

Submission Essential Information

Submissions are expected in the following categories:

  • Regular papers presenting innovative ideas improving the state of the art
  • Experience papers discussing the issues seen on existing extreme-scale systems, including some form of analysis and evaluation
  • Extended abstracts proposing disruptive ideas in the field, including some form of preliminary results

Authors are invited to submit unpublished, original work of no more than 8 pages of double-column, single-spaced text in 10-point type on 8.5 x 11 inch pages (including all text, figures, and references), per the ACM 8.5 x 11 manuscript guidelines (document templates can be found at

Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library. Submission implies the willingness of at least one of the authors to register and present the paper.

Submit a paper here.

Important Dates

Submission of papers: February 18, 2013 - 11:59 PM EST (deadline extended from February 11, 2013)
Author notification: March 18, 2013
Camera ready papers: April 15, 2013
Workshop: June 18, 2013

Workshop Topics

Assuming hardware and software errors will be inescapable at extreme scale, this workshop will consider aspects of fault tolerance particular to extreme scale that include, but are not limited to:

    • Quantitative assessments of cost in terms of power, performance, and resource impacts of fault-tolerant techniques, such as checkpoint restart, that are redundant in space, time or information
    • Novel fault-tolerance techniques and implementations of emerging hardware and software technologies that guard against silent data corruption (SDC) in memory, logic, and storage and provide end-to-end data integrity for running applications
    • Studies of hardware / software tradeoffs in error detection, failure prediction, error preemption, and recovery
    • Advances in monitoring, analysis, and control of highly complex systems
    • Highly scalable fault-tolerant programming models
    • Metrics and standards for measuring, improving and enforcing the need for and effectiveness of fault-tolerance
    • Failure modeling and scalable methods of reliability, availability, performability and failure prediction for fault-tolerant HPC systems
    • Scalable Byzantine fault tolerance and security from single-fault and fail-silent violations
    • Benchmarks and experimental environments, including fault-injection and accelerated lifetime testing, for evaluating performance of resilience techniques under stress

Workshop Organizers

Nathan DeBardeleben - Los Alamos National Laboratory
Jon Stearley - Sandia National Laboratories
Franck Cappello - INRIA and University of Illinois at Urbana-Champaign

Program Committee

Rob Aulwes – Los Alamos National Laboratory
Aurélien Bouteiller – University of Tennessee, Knoxville
Greg Bronevetsky - Lawrence Livermore National Laboratory
Clayton Chandler – Department of Defense
Robert Clay – Sandia National Laboratories
John Daly - Department of Defense
Christian Engelmann – Oak Ridge National Laboratory
Felix Salfner - SAP Innovation Center Potsdam
Kurt Ferreira – Sandia National Laboratories
Ana Gainaru – University of Illinois at Urbana-Champaign
Leonardo Bautista Gomez – Tokyo Institute of Technology
Hideyuki Jitsumoto – The University of Tokyo
Rakesh Kumar - University of Illinois, Urbana-Champaign
Zhiling Lan – Illinois Institute of Technology
Naoya Maruyama – Tokyo Institute of Technology
Kathryn Mohror – Lawrence Livermore National Laboratory
Bogdan Nicolae – IBM Research – Ireland
Rolf Riesen – IBM Research – Ireland
Yves Robert - ENS Lyon
Thomas Ropars - EPFL
Mitsuhisa Sato – University of Tsukuba
Stephen Scott – Tennessee Tech University and Oak Ridge National Laboratory
Vilas Sridharan – AMD, Inc.
Roel Wuyts - Intel ExaScience Lab


Please address FTXS workshop questions to Nathan DeBardeleben, Los Alamos National Laboratory (