The 6th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2016

LAST-MINUTE UPDATE: HPDC registration opens at noon, but FTXS starts at 10:50am on Tuesday, May 31st. HPDC has said attendees can pick up their badges during a break, so please don't let the lack of a badge stop you from coming to FTXS in the late morning.

UPDATE: Submission deadline extended to February 19th, 2016 (see below for instructions)

WHEN?   May 31st, 2016
WHERE?   Kyoto, Japan
VENUE? Kyoto International Community House
ROOM? Seminar Room
CO-LOCATED WITH? The 25th International Symposium on High Performance Parallel and Distributed Computing (HPDC)
REGISTER See HPDC web site    
PAST FTXSs See sidebar for 2010, 2012, 2013, 2014, and 2015
CALL FOR PAPERS FTXS 2016 Call for Papers (CFP) available on Google Drive

Workshop Keynote

Fumiyoshi Shoji - RIKEN, Advanced Institute for Computational Science
Director - Operations and Computer Technologies Division
More information to be announced later

Workshop Agenda

Six high-quality papers, plus a keynote by Fumiyoshi Shoji, Director of the Operations and Computer Technologies Division at RIKEN's Advanced Institute for Computational Science, make for an exciting day at FTXS 2016!



Authors, Speaker, or Moderator

10:50am  Welcome and opening remarks  Nathan DeBardeleben
  SESSION 1 Moderator: Nathan DeBardeleben 
11:00am Fault Tolerance in the Parareal Method (slides)
Allan Nielsen and Jan Hesthaven
11:30am A Self-Correcting Connected Components Algorithm (slides)
Piyush Sao, Oded Green, Chirag Jain and Richard Vuduc
12:00pm Lunch
  SESSION 2: Keynote Moderator: Atsushi Hori
1:30pm Keynote: The K Computer and Its Failures (slides)
Fumiyoshi Shoji

SESSION 3 Moderator: Keita Teranishi
2:30pm ULFM-MPI Implementation of a Resilient Task-Based Partial Differential Equations Preconditioner (slides)
Francesco Rizzi, Karla Morris, Khachik Sargsyan, Paul Mycek, Cosmin Safta, Olivier Le Maitre, Omar Knio and Bert Debusschere
3:00pm Adding Fault Tolerance to NPB Benchmarks Using ULFM (slides)
Zachary Parchman, Geoffroy Vallee, Thomas Naughton, Christian Engelmann and David E. Bernholdt 
3:30pm Break
  SESSION 4  Moderator: Nathan DeBardeleben 
4:00pm An Examination of the Impact of the Failure Distribution on Coordinated Checkpoint/Restart (slides)
Scott Levy and Kurt Ferreira
4:30pm In-Situ Mitigation of Silent Data Corruption in PDE Solvers (slides)
Maher Salloum, Jackson Mayo and Rob Armstrong
5:00pm Closing Remarks FTXS 2016 Organizers
5:10pm Adjourned

Submission Essential Information

Submissions are expected in the following categories:
  • Regular papers presenting innovative ideas improving the state of the art and experience papers discussing the issues seen on existing extreme-scale systems, including some form of analysis and evaluation
  • Extended abstracts proposing disruptive ideas in the field, including some form of preliminary results
Extended abstracts will be evaluated separately and given shorter oral presentations.

Authors are invited to submit unpublished, original work of at most eight (8) pages for regular papers and four (4) to six (6) pages for extended abstracts. Please follow the US Letter guidelines for the ACM Proceedings Style.

Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the IEEE digital library. Submission implies the willingness of at least one of the authors to register and present the paper.

Important Dates

Submission of papers: February 13th, 2016 (EXTENDED: February 19th, 2016)
Author notification: March 12th, 2016 (EXTENDED: March 21st, 2016, due to the submission extension)
Camera-ready papers: March 27th, 2016 (UPDATE FROM HPDC: April 19th, 2016)
Workshop: May 31st, 2016

Workshop Topics

Topics include, but are not limited to:
  • Failure data analysis and field studies
  • Power, performance, resilience (PPR) assessments / tradeoffs
  • Novel fault-tolerance techniques and implementations
  • Emerging hardware and software technology for resilience
  • Silent data corruption (SDC) detection / correction techniques
  • Advances in reliability monitoring, analysis, and control of highly complex systems
  • Failure prediction, error preemption, and recovery techniques
  • Fault-tolerant programming models
  • Models for software and hardware reliability
  • Metrics and standards for measuring, improving, and enforcing effective fault-tolerance
  • Scalable Byzantine fault-tolerance and security from single-fault and fail-silent violations
  • Atmospheric evaluations relevant to HPC systems (terrestrial neutrons, temperature, voltage, etc.)
  • Near-threshold-voltage implications and evaluations for reliability
  • Benchmarks and experimental environments including fault injection
  • Frameworks and APIs for fault-tolerance and fault management

Workshop Chairs

Nathan DeBardeleben – Los Alamos National Laboratory

Workshop Organizing Committee

Keita Teranishi – Sandia National Laboratories
Atsushi Hori – RIKEN AICS

Program Committee

Leonardo Bautista Gomez – Barcelona Supercomputing Center
Bogdan Nicolae – IBM Ireland
Aurélien Bouteiller – University of Tennessee Knoxville
Henri Casanova – University of Hawaiʻi at Manoa
Zizhong Chen – University of California, Riverside
Robert Clay – Sandia National Laboratories
John Daly – Department of Defense
James Elliott – Sandia National Laboratories
Christian Engelmann – Oak Ridge National Laboratory
Kurt Ferreira – Sandia National Laboratories
Qiang Guan – Los Alamos National Laboratory
Sudhanva Gurumurthi – IBM
Saurabh Hukerikar – Oak Ridge National Laboratory
Hideyuki Jitsumoto – Tokyo Institute of Technology
Zhiling Lan – Illinois Institute of Technology
Scott Levy – University of New Mexico
Naoya Maruyama – RIKEN AICS
Yves Robert – ENS Lyon
Anthony Skjellum – Auburn University
Vilas Sridharan – AMD, Inc.
Peter Strazdins – Australian National University
Abhinav Vishnu – Pacific Northwest National Laboratory