Advanced Micro Devices (AMD) is sponsoring the first FTXS best paper award!  This award will be chosen by the PC and awarded at FTXS 2015.  The prize is a new PlayStation 4!

The FTXS 2015 best paper award went to "Resilient Matrix Multiplication of Hierarchical Semi-Separable Matrices" by Brian Austin, Eric Roman and Xiaoye Li.  The runner-up was "A Principled Approach to HPC Event Monitoring" by Alireza Goudarzi, Dorian Arnold, Darko Stefanovic, Kurt Ferreira and Guy Feldman.  Brian Austin donated the PlayStation 4 to the runner-up presenter, Alireza Goudarzi.

FTXS 2015 in Portland, OR

The 5th Fault Tolerance for HPC at eXtreme Scale (FTXS) 2015

Workshop Keynote

Failures in Large-Scale Systems: Insights from the Field
Sudhanva Gurumurthi
AMD Research, Advanced Micro Devices, Inc.
University of Virginia

The use of highly scaled technologies and large component counts poses significant reliability challenges for large-scale systems. Knowledge of the failures that occur in such systems is valuable for driving RAS design decisions by component and system vendors, and it helps the operators of those systems improve resilience. Field studies play a key role in providing insights into the types of failures that occur in real systems, especially at scale. This talk will highlight the value of such studies and discuss implications for future exascale systems using data from failure analyses of supercomputers and cloud data centers.

Sudhanva Gurumurthi is a Senior Researcher at AMD, where he directs projects on resiliency and reliability. He used to be a tenured Associate Professor in the Computer Science Department at the University of Virginia and is currently a Visiting Associate Professor in that department. Sudhanva is a recipient of the NSF CAREER Award and several research awards from the NSF and industry. He received his PhD from Penn State, and is a Senior Member of the IEEE and the ACM.

Workshop Agenda

Nine high-quality papers plus the Federated Computing Research Conference plenary talk make for a long day at FTXS 2015!



Authors, Speaker, or Moderator

9:00am Welcome and opening remarks
Nathan DeBardeleben
SESSION 1: Keynote (Moderator: TBD)
9:15am Keynote: Failures in Large-Scale Systems: Insights from the Field (slides)
Sudhanva Gurumurthi, AMD and UVA
SESSION 2: Logging and Monitoring (Moderator: TBD)


A Principled Approach to HPC Event Monitoring (slides)

Alireza Goudarzi, Dorian Arnold, Darko Stefanovic, Kurt Ferreira and Guy Feldman

LogDiver: A Tool for Measuring Resilience of Extreme-Scale Systems and Applications (slides)

Catello Di Martino, Saurabh Jha, Zbigniew Kalbarczyk, William Kramer and Ravishankar Iyer
11:00am Break
11:20am FCRC Plenary Session
12:30pm Break (Lunch)
SESSION 3: Resilient Algorithms and Libraries (Moderator: TBD)


Resilient Matrix Multiplication of Hierarchical Semi-Separable Matrices (slides)

Brian Austin, Eric Roman and Xiaoye Li

Voltage Overscaling Algorithms for Energy-Efficient Workflow Computations With Timing Errors (slides)

Aurélien Cavelan, Yves Robert, Hongyang Sun and Frédéric Vivien

Empirical Studies of the Soft Error Susceptibility of Sorting Algorithms to Statistical Fault Injection (slides)

Qiang Guan, Nathan DeBardeleben, Sean Blanchard and Song Fu
3:30pm Evolving the message passing programming model via a fault-tolerant, object-oriented transport layer (slides)
Jeremiah Wilke, Janine Bennett, Keita Teranishi, Hemanth Kolla, David Hollman and Nicole Slattengren
4:00pm Break
SESSION 4: Other Topics (Moderator: TBD)

How Much SSD Is Useful for Resilience in Supercomputers (slides)

Aiman Fang and Andrew Chien

The Path to Exascale: Code Optimizations and Hardening Solutions Reliability planning (slides)

Daniel Alfonso Gonçalves De Oliveira, Laercio Pilla, Caio Lunardi, Luigi Carro, Philippe Navaux and Paolo Rech
5:20pm Transient Fault Resilient QR Factorization on GPUs (slides)
Felix Loh, Parameswaran Ramanathan and Kewal Saluja
5:50pm Closing Remarks, Best Paper Award
FTXS 2014 Organizers

Submission Essential Information

Submissions are expected in the following categories:
  • Regular papers presenting innovative ideas improving the state of the art
  • Experience papers discussing the issues seen on existing extreme-scale systems, including some form of analysis and evaluation
  • Extended abstracts proposing disruptive ideas in the field, including some form of preliminary results
Authors are invited to submit unpublished, original work of at most eight (8) pages for normal papers and six (6) pages for position papers. Please follow the US Letter guidelines for ACM Proceedings Style.

Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the IEEE digital library. Submission implies the willingness of at least one of the authors to register and present the paper.

Important Dates

Submission of papers: February 17th, 2015 (EXTENDED from February 9th, 2015; NOW FIRM DEADLINE)
Author notification: March 9th, 2015 (this date may shift due to submission extension)
Camera ready papers: April (TBA), 2015
Workshop: June 15th, 2015

Workshop Topics

Topics include, but are not limited to:
  • Failure data analysis and field studies
  • Power, performance, resilience (PPR) assessments / tradeoffs
  • Novel fault-tolerance techniques and implementations
  • Emerging hardware and software technology for resilience
  • Silent data corruption (SDC) detection / correction techniques
  • Advances in reliability monitoring, analysis, and control of highly complex systems
  • Failure prediction, error preemption, and recovery techniques
  • Fault-tolerant programming models
  • Models for software and hardware reliability
  • Metrics and standards for measuring, improving, and enforcing effective fault-tolerance
  • Scalable Byzantine fault-tolerance and security from single-fault and fail-silent violations
  • Atmospheric evaluations relevant to HPC systems (terrestrial neutrons, temperature, voltage, etc.)
  • Near-threshold-voltage implications and evaluations for reliability
  • Benchmarks and experimental environments including fault injection
  • Frameworks and APIs for fault-tolerance and fault management

Workshop Organizers / Program Chairs

Nathan DeBardeleben - Los Alamos National Laboratory
Franck Cappello - Argonne National Laboratory and the University of Illinois at Urbana-Champaign
Robert Clay - Sandia National Laboratories

Program Committee

Leonardo Bautista Gomez – Argonne National Laboratory
Aurélien Bouteiller – University of Tennessee Knoxville
Greg Bronevetsky - Lawrence Livermore National Laboratory
John Daly - Department of Defense
Christian Engelmann – Oak Ridge National Laboratory
Kurt Ferreira – Sandia National Laboratories
Ana Gainaru – University of Illinois at Urbana-Champaign
Qiang Guan – Los Alamos National Laboratory
Saurabh Gupta – Oak Ridge National Laboratory
Saurabh Hukerikar – Information Sciences Institute/USC
Hideyuki Jitsumoto – Tokyo Institute of Technology
Zhiling Lan – Illinois Institute of Technology
Scott Levy – University of New Mexico
Naoya Maruyama – RIKEN AICS
Bogdan Nicolae – IBM Research – Ireland
Thomas Ropars - EPFL
Yves Robert - ENS Lyon
Anthony Skjellum - Auburn University
Vilas Sridharan – AMD, Inc.
Devesh Tiwari – Oak Ridge National Laboratory
Abhinav Vishnu - Pacific Northwest National Laboratory