Program

HPCMASPA Workshop - May 27, 2016

The meeting will consist of a keynote, full paper presentations (30 min), short paper presentations (20 min), and an interactive panel discussion.

8:40 - 10:00 SESSION 1: KEYNOTE

8:40 - 9:00 Introduction Gentile

9:00 - 10:00 Keynote

William (Bill) T. C. Kramer

Director and PI of the NCSA Blue Waters Project, Director of the UIUC/NCSA @Scale Program office, and Research Professor in the UIUC Computer Science Department

Failure and Resiliency in the Shadow of Exascale – Will our Current Assumptions Take us in the Right Direction?

Abstract: Today’s complex, extreme-scale systems are prone to many modes of failure, but which modes have the greatest impact may not be as clear as we think. Using data from Blue Waters and other large-scale systems, this talk highlights the types of failures that have the greatest impact on the effectiveness of today’s systems and applications. As an example, we see that software-related failures have a greater impact on productivity than hardware failures. This talk will also discuss how performance issues become failures and what this means for a future where consistent performance may become less likely. Finally, the talk will touch on some current projects that are taking a fresh look at reliability and resiliency, such as the Holistic Measurement Driven Resiliency effort.

10:00 - 10:30 BREAK

10:30 - 12:00 SESSION 2: INSTRUMENTATION AND METRICS

Moderator: Lueninghoener

10:30-11:00 - Full paper - Call Tree Controlled Instrumentation for Low-Overhead Survey Measurements. C. Iwainsky and C. Bischof

11:00-11:30 - Full paper - Automatically Instrumenting Scientific Applications to Produce Heartbeat Events. M. Tanash, N. Ghazanfari, O. Aaziz, and J. Cook

11:30-11:50 - Short paper - Defining Metrics to Distill Large-scale HPC Platform and Application Performance Data into Actionable Quantities. A. Agelastos

11:50 - 12:00 Session group author Q&A

12:00 - 13:30 LUNCH

13:30 - 15:30 SESSION 3: MONITORING SYSTEMS

Moderator: Lueninghoener

14:00-14:30 - Full paper - Understanding Application and System Performance Through System-wide Monitoring. T. Evans, J. Browne, and W. Barth

14:30-15:00 - Full paper - Large-scale Persistent Monitoring System Experiences. J. Brandt, A. Gentile, M. Showerman, J. Enos, J. Fullop, and G. Bauer

15:00-15:20 - Short paper - Design and Implementation of a Scalable HPC Monitoring System. S. Sanchez, A. Bonnie, G. Van Huele, C. Robinson, A. DeConinck, K. Kelly, Q. Snead, and J. Brandt

15:20 - 15:30 Session group author Q&A

15:30 - 16:00 BREAK

16:00 - 17:00 SESSION 4: INTERACTIVE PANEL DISCUSSION

Moderator: Brandt

Accessible Analytics and Visualizations

Understanding and diagnosing system problems and performance issues relies on complex analytics. The interplay of system components and applications competing for the same resources in a dynamic environment is not well understood. Analyses that handle multivariate data with different timescales of change, and visualizations that capture high-dimensional data (e.g., network), are being developed, but their complexity makes it difficult for system administrators and users to extract actionable information. Assessing the validity of even simple statistical analysis results still requires significant domain knowledge. This panel will discuss experiences, impediments, and directions in the development of analytics and visualization techniques for enhancing HPC platform operations.

Panelists: Todd Evans (TACC), Todd Gamblin (LLNL), Cory Lueninghoener (LANL)