The 4th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2014

Workshop Agenda



Time - Title - Authors, Speaker, or Moderator

8:00 Welcome and opening remarks - Nathan DeBardeleben
8:15 Scale Changes Everything: Fault Tolerance at the Exascale (slides) - Dr. Lucy Nowell, DOE ASCR Program Manager
SESSION 1: Checkpoint/Restart Modeling and Message Logging - Moderator: Nathan DeBardeleben
9:00 Coarse-grained Energy Modeling of Rollback/Recovery Mechanisms (slides) - Dewan Ibtesham, David Debonis, Kurt Ferreira and Dorian Arnold
9:30 Grid-Oriented Process Clustering System for Partial Message Logging (slides) - Hideyuki Jitsumoto, Yuki Todoroki, Yutaka Ishikawa and Mitsuhisa Sato
10:00 Break
SESSION 2: Application and Algorithm Resiliency - Moderator: Robert Clay
10:30 Evaluating the Error Resilience of Parallel Programs (slides) - Bo Fang, Karthik Pattabiraman, Matei Ripeanu and Sudhanva Gurumurthi
11:00 Comparison Criticality in Sorting Algorithms (slides) - Thomas Jones and David Ackley
11:30 Lunch Break
SESSION 3: Hardware - Reliability Studies and Tailored Resilience Techniques - Moderator: Franck Cappello
13:15 Radiation Sensitivity of High Performance Computing Applications on Kepler-Based GPGPUs (slides) - Daniel A. G. Oliveira, Caio B. Lunardi, Laércio L. Pilla, Paolo Rech, Philippe Navaux and Luigi Carro
13:45 HeteroCheckpoint: Efficient Checkpointing for Accelerator-based Systems (slides) - Sudarsun Kannan, Naila Farooqui, Ada Gavrilovska and Karsten Schwan
14:15 Harnessing Unreliable Cores In Heterogeneous Architecture: The PyDac Programming Model and Runtime (slides) - Bin Huang, Ron Sass, Nathan DeBardeleben and Sean Blanchard
14:45 Break
SESSION 4: Resiliency in HPC Messaging - Moderator: Nathan DeBardeleben
15:00 Design and Evaluation of FA-MPI, A Transactional Resilience Scheme for Non-blocking MPI (slides) - Amin Hassani, Anthony Skjellum and Ron Brightwell
15:30 Extreme-scale viability of collective communication for resilient task scheduling and work stealing (slides) - Jeremiah Wilke, Janine Bennett, Hemanth Kolla, Keita Teranishi, Nicole Slattengren and John Floren
16:00 Open Discussion, Felt Needs, Next Steps, etc. - All Attendees
16:45 Closing Remarks - FTXS 2014 Organizers
17:00 Adjourned

Submission Essential Information

Submissions are expected in the following categories:
  • Regular papers presenting innovative ideas improving the state of the art
  • Experience papers discussing the issues seen on existing extreme-scale systems, including some form of analysis and evaluation
  • Extended abstracts proposing disruptive ideas in the field, including some form of preliminary results
Authors are invited to submit unpublished, original work of no more than 6 pages. Please follow the IEEE style guidelines and templates for US Letter.

Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the IEEE digital library. Submission implies the willingness of at least one of the authors to register and present the paper.

Submit a paper here.

Important Dates

Submission of papers: March 7, 2014 (extended to March 14, 2014)
Author notification: March 21, 2014 (extended to April 11, 2014)
Camera ready papers: April TBA, 2014
Workshop: June 23, 2014

Workshop Topics

Assuming hardware and software errors will be inescapable at extreme scale, this workshop will consider aspects of fault tolerance particular to extreme scale that include, but are not limited to:
  • Quantitative assessments of cost in terms of power, performance, and resource impacts of fault-tolerance techniques, such as checkpoint/restart, that are redundant in space, time or information
  • Novel fault-tolerance techniques and implementations of emerging hardware and software technologies that guard against silent data corruption (SDC) in memory, logic, and storage and provide end-to-end data integrity for running applications
  • Studies of hardware / software tradeoffs in error detection, failure prediction, error preemption, and recovery
  • Advances in monitoring, analysis, and control of highly complex systems
  • Highly scalable fault-tolerant programming models
  • Metrics and standards for assessing the need for fault tolerance and for measuring, improving and enforcing its effectiveness
  • Failure modeling and scalable methods of reliability, availability, performability and failure prediction for fault-tolerant HPC systems
  • Scalable Byzantine fault tolerance and security from single-fault and fail-silent violations
  • Benchmarks and experimental environments, including fault-injection and accelerated lifetime testing, for evaluating performance of resilience techniques under stress
  • Frameworks and APIs for fault tolerance and fault management

Workshop Organizers

Nathan DeBardeleben - Los Alamos National Laboratory
Franck Cappello - Argonne National Laboratory and the University of Illinois at Urbana-Champaign
Robert Clay - Sandia National Laboratories

Program Committee

Rob Aulwes – Los Alamos National Laboratory
Aurélien Bouteiller – University of Tennessee Knoxville
Greg Bronevetsky - Lawrence Livermore National Laboratory
John Daly - Department of Defense
Christian Engelmann – Oak Ridge National Laboratory
Kurt Ferreira – Sandia National Laboratories
Ana Gainaru – University of Illinois at Urbana-Champaign
Leonardo Bautista Gomez – Tokyo Institute of Technology
Hideyuki Jitsumoto – The University of Tokyo
Zhiling Lan – Illinois Institute of Technology
Naoya Maruyama – RIKEN Advanced Institute for Computational Science
Kathryn Mohror – Lawrence Livermore National Laboratory
Bogdan Nicolae – IBM Research – Ireland
Rolf Riesen – IBM Research – Ireland
Yves Robert - ENS Lyon
Thomas Ropars - EPFL
Stephen Scott – Tennessee Tech University and Oak Ridge National Laboratory
Vilas Sridharan – AMD, Inc.
Abhinav Vishnu - Pacific Northwest National Laboratory
Roel Wuyts - Intel ExaScience Lab