FTXS 2010 - Chicago, IL

The 1st Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2010

SUBMISSION DEADLINE EXTENDED: March 23, 2010 - 11:59 PM EST (FINAL DEADLINE EXTENSION!)

Workshop Agenda

Links to (slides) and (REFerences) are provided below where available.

Objectives and Challenges

With the emergence of many-core processors, accelerators, and alternative/heterogeneous architectures, the HPC community faces a new challenge: scaling in the number of processing elements that supersedes the historical trend of scaling in processor frequency. The attendant increase in system complexity has first-order implications for fault tolerance. Mounting evidence invalidates traditional assumptions of HPC fault tolerance: faults are increasingly multiple-point instead of single-point and interdependent instead of independent; silent failures and silent data corruption are no longer rare enough to discount; stabilization time consumes a larger fraction of useful system lifetime, with failure rates projected to exceed one per hour on the largest systems; and application interrupt rates appear to be diverging from system failure rates.

The workshop will convene a diverse group of experts in HPC and fault tolerance to inaugurate a fault-tolerance research agenda for responding to the unique challenges that extreme scale and complexity present. Innovation is encouraged and discussion of non-traditional approaches is welcome.

Submission Essential Information

Submissions are expected in the following categories:

  • Regular papers presenting innovative ideas improving the state of the art

  • Experience papers discussing the issues seen on existing extreme-scale systems, including some form of analysis and evaluation

  • Extended abstracts proposing disruptive ideas in the field, including some form of preliminary results

SUBMISSIONS ARE CLOSED.

Important Dates

Submission of papers: March 23, 2010 - 11:59 PM EST (DEADLINE EXTENDED!)

Author notification: April 9, 2010

Camera ready papers: April 30, 2010

Workshop: June 28, 2010

Workshop Topics

Assuming hardware and software errors will be inescapable at extreme scale, this workshop will consider aspects of fault tolerance particular to extreme scale that include, but are not limited to:

      • Quantitative assessments of cost in terms of power, performance, and resource impacts of fault-tolerant techniques, such as checkpoint restart, that are redundant in space, time or information

      • Novel fault-tolerance techniques and implementations of emerging hardware and software technologies that guard against silent data corruption (SDC) in memory, logic, and storage and provide end-to-end data integrity for running applications

      • Studies of hardware / software tradeoffs in error detection, failure prediction, error preemption, and recovery

      • Advances in monitoring, analysis, and control of highly complex systems

      • Highly scalable fault-tolerant programming models

      • Metrics and standards for measuring, improving, and enforcing the effectiveness of fault tolerance, and for establishing the need for it

      • Failure modeling and scalable methods of reliability, availability, performability and failure prediction for fault-tolerant HPC systems

      • Scalable Byzantine fault tolerance and security from single-fault and fail-silent violations

      • Benchmarks and experimental environments, including fault-injection and accelerated lifetime testing, for evaluating performance of resilience techniques under stress

Workshop Organizers

John Daly, Center for Exceptional Computing / Department of Defense, USA (Co-Chair)

Nathan DeBardeleben, Center for Exceptional Computing / Department of Defense, USA (Co-Chair)

Program Committee

Greg Bronevetsky, Lawrence Livermore National Laboratory, USA

Franck Cappello, INRIA, France

Daniel Katz, University of Chicago, USA

Armando Fox, University of California, USA

Zbigniew Kalbarczyk, University of Illinois, USA

Yasunori Kimura, Fujitsu Laboratories, Japan

Sébastien Monnet, University of Pierre and Marie Curie, France

Takashi Nanya, University of Tokyo, Japan

Nuno Neves, University of Lisbon, Portugal

Stephen Scott, Oak Ridge National Laboratory, USA

Marc Snir, University of Illinois, USA

Jon Stearley, Sandia National Laboratories, USA

Kishor Trivedi, Duke University, USA

Questions?

Please address FTXS workshop questions to Nathan DeBardeleben, Los Alamos National Laboratory (ndebard@lanl.gov)