The 2nd Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS 2012)


 WHEN?  June 25, 2012
 WHERE?  Boston, Massachusetts - USA
 VENUE?  Boston Park Plaza Hotel
 IN ASSOCIATION WITH  42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)
 CALL FOR PAPERS  FTXS 2012 Call for Papers (CFP) available on Google Drive
 ATTENDEE LIST & SCHEDULE  FTXS 2012 Attendee List and Schedule (PDF, 2 pages). 73 attendees pre-registered.

Workshop Agenda

In the agenda below, slides (slides) and references (REF) are linked where available.

 8:30 - 8:35 | Welcome, Logistics and FTXS History | FTXS Organizers
 8:35 - 9:00 | INVITED TALK - Report From the Inter-Agency Workshop on HPC Resilience (slides) | John Daly - Department of Defense / Center for Exceptional Computing
 9:00 - 9:25 | Chaotic-Identity Maps for Robustness Estimation of Exascale Computations (slides) (REF) | Nageswara Rao
 9:25 - 9:50 | Asynchronous Checkpoint Migration with MRNet in the Scalable Checkpoint / Restart Library (slides) (REF) | Kathryn Mohror, Adam Moody and Bronis de Supinski
 9:50 - 10:00 | Facilitated Discussion
 10:00 - 10:30 | Break
 10:30 - 10:55 | Does Partial Replication Pay Off? (slides) (REF) | Jon Stearley, Kurt Ferreira, David Robinson, Dorian Arnold, Patrick Bridges, Jim Laros, Kevin Pedretti and Rolf Riesen
 10:55 - 11:20 | Energy Considerations in Checkpointing and Fault Tolerance Protocols (slides) (REF) | Mohammed el Mehdi DIOURI, Olivier GLÜCK, Laurent LEFEVRE and Franck CAPPELLO
 11:20 - 11:45 | A Programming Model for Resilience in Extreme Scale Computing (slides) (REF) | Saurabh Hukerikar, Pedro C. Diniz and Robert F. Lucas
 11:45 - 12:00 | Facilitated Discussion
 12:00 - 1:30 | Lunch - PROVIDED
 1:30 - 1:55 | ROSE::FFTransform - A Source-to-Source Transformation Framework for Exascale Fault-Tolerance Research (slides) (REF) | Jacob Lidman, Daniel Quinlan, Chunhua Liao and Sally McKee
 1:55 - 2:20 | A Message-Logging Protocol for Multicore Systems (slides) (REF) | Esteban Meneses, Xiang Ni and Laxmikant V. Kalé
 2:20 - 2:45 | An Evaluation of Difference and Threshold Techniques for Efficient Checkpoints (slides) (REF) | Sean Hogan, Andrew Chien and Jeff Hammond
 2:45 - 3:00 | Facilitated Discussion
 3:00 - 3:30 | Break
 3:30 - 3:55 | On the Complexity of Scheduling Checkpoints for Computational Workflows (slides) (REF) | Yves Robert, Frédéric Vivien and Dounia Zaidouni
 3:55 - 4:20 | Design and Implementation of a Hardware Checkpoint/Restart Core (slides) (REF) | Ashwin Mendon, Ron Sass, Zachary Baker and Justin Tripp
 4:20 - 4:45 | A Scalable Double In-Memory Checkpoint and Restart Scheme Towards Exascale (slides) (REF) | Gengbin Zheng, Xiang Ni and Laxmikant Kale
 4:45 - 5:00 | Wrap-up discussion, conclusions, takeaways, action items, next steps

Submission Essential Information

Submissions are expected in the following categories:

  • Regular papers presenting innovative ideas improving the state of the art
  • Experience papers discussing the issues seen on existing extreme-scale systems, including some form of analysis and evaluation
  • Extended abstracts proposing disruptive ideas in the field, including some form of preliminary results
Submissions shall be sent electronically, must conform to the IEEE conference proceedings style, and should not exceed six pages including all text, appendices, and figures. Use US Letter format, not A4.

All papers will be published, as workshop papers, in the DSN 2012 proceedings and on IEEE Xplore.
Authors are invited to submit papers with unpublished, original work of not more than 8 pages of double-column text using single-spaced 10-point type on 8.5 x 11 inch pages (including all text, figures, and references), as per ACM 8.5 x 11 manuscript guidelines (document templates can be found at

Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library. Submission implies the willingness of at least one of the authors to register and present the paper.

Important Dates

Submission of papers: March 24, 2012 - 11:59 PM EST (DEADLINE EXTENDED from March 16, 2012)
Author notification: April 13, 2012
Camera ready papers: April 27, 2012
Workshop: June 25, 2012

Workshop Topics

Assuming hardware and software errors will be inescapable at extreme scale, this workshop will consider aspects of fault tolerance particular to extreme scale that include, but are not limited to:

    • Quantitative assessments of cost in terms of power, performance, and resource impacts of fault-tolerant techniques, such as checkpoint/restart, that are redundant in space, time or information
    • Novel fault-tolerance techniques and implementations of emerging hardware and software technologies that guard against silent data corruption (SDC) in memory, logic, and storage and provide end-to-end data integrity for running applications
    • Studies of hardware / software tradeoffs in error detection, failure prediction, error preemption, and recovery
    • Advances in monitoring, analysis, and control of highly complex systems
    • Highly scalable fault-tolerant programming models
    • Metrics and standards for measuring, improving and enforcing the need for and effectiveness of fault-tolerance
    • Failure modeling and scalable methods of reliability, availability, performability and failure prediction for fault-tolerant HPC systems
    • Scalable Byzantine fault tolerance and security from single-fault and fail-silent violations
    • Benchmarks and experimental environments, including fault-injection and accelerated lifetime testing, for evaluating performance of resilience techniques under stress

Workshop Organizers

Nathan DeBardeleben - Los Alamos National Laboratory
Jon Stearley - Sandia National Laboratories
Franck Cappello - INRIA and University of Illinois at Urbana-Champaign

Program Committee

George Bosilca - University of Tennessee, Knoxville
Greg Bronevetsky - Lawrence Livermore National Laboratory
John Daly - Department of Defense
Christian Engelmann - Oak Ridge National Laboratory
Kurt Ferreira - Sandia National Laboratories
Ana Gainaru - University of Illinois, Urbana-Champaign
Hideyuki Jitsumoto - University of Tokyo
Zbigniew Kalbarczyk - University of Illinois, Urbana-Champaign
Rakesh Kumar - University of Illinois, Urbana-Champaign
Zhiling Lan - Illinois Institute of Technology
Bogdan Nicolae - INRIA
Yves Robert - ENS Lyon
Roel Wuyts - Intel ExaScience Lab (Leuven, Belgium) and KU Leuven (Leuven, Belgium)
Felix Salfner - SAP Innovation Center Potsdam
Mitsuhisa Sato - University of Tsukuba
Stephen Scott - Oak Ridge National Laboratory and Tennessee Tech University


