Program
HPCMASPA Workshop - May 27, 2016
The meeting will consist of a keynote talk, full paper presentations (30 min), short paper presentations (20 min), and an interactive panel discussion.
08:40 - 10:00 SESSION 1: KEYNOTE
8:40 - 9:00 Introduction Gentile
9:00 - 10:00 Keynote
Director and PI of the NCSA Blue Waters Project, Director of the UIUC/NCSA @Scale Program office, and Research Professor in the UIUC Computer Science Department
Failure and Resiliency in the Shadow of Exascale – Will our Current Assumptions Take us in the Right Direction?
Abstract: Today’s complex, extreme-scale systems are prone to many modes of failure, but which modes have the greatest impact may not be as clear as we think. Using data from Blue Waters and other large-scale systems, this talk highlights the types of failures that have the greatest impact on the effectiveness of today’s systems and applications. As an example, we see that software-related failures have a greater impact on productivity than hardware failures. This talk will also discuss how performance issues become failures and what this means for the future, where consistent performance may become less likely. Finally, the talk will touch on some current projects that are taking new looks at reliability and resiliency, such as the Holistic Measurement Driven Resiliency effort.
10:00 - 10:30 BREAK
10:30 - 12:00 SESSION 2: INSTRUMENTATION AND METRICS
Moderator: Lueninghoener
10:30-11:00 - Full paper - Call Tree Controlled Instrumentation for Low-Overhead Survey Measurements. C. Iwainsky and C. Bischof
11:00-11:30 - Full paper - Automatically Instrumenting Scientific Applications to Produce Heartbeat Events. M. Tanash, N. Ghazanfari, O. Aaziz, and J. Cook
11:30-11:50 - Short paper - Defining Metrics to Distill Large-scale HPC Platform and Application Performance Data into Actionable Quantities. A. Agelastos
11:50 - 12:00 Session group author Q&A
12:00 - 13:30 LUNCH
13:30 - 15:30 SESSION 3: MONITORING SYSTEMS
Moderator: Lueninghoener
14:00-14:30 - Full paper - Understanding Application and System Performance Through System-wide Monitoring. T. Evans, J. Browne, and W. Barth
14:30-15:00 - Full paper - Large-scale Persistent Monitoring System Experiences. J. Brandt, A. Gentile, M. Showerman, J. Enos, J. Fullop, and G. Bauer
15:00-15:20 - Short paper - Design and Implementation of a Scalable HPC Monitoring System. S. Sanchez, A. Bonnie, G. Van Huele, C. Robinson, A. DeConinck, K. Kelly, Q. Snead, and J. Brandt
15:20 - 15:30 Session group author Q&A
15:30 - 16:00 BREAK
16:00 - 17:00 SESSION 4: INTERACTIVE PANEL DISCUSSION
Moderator: Brandt
Accessible Analytics and Visualizations
Understanding and diagnosing system problems and performance issues relies on complex analytics. The interplay of system components and applications competing for the same resources in a dynamic environment is not well understood. While analyses that rely on multivariate data with different timescales of change, and visualizations that capture high-dimensional data (e.g., network), are being developed, their complexity makes it difficult for system administrators and users to extract actionable information. Assessing the validity of even simple statistical analysis results still requires significant domain knowledge. This panel will discuss experiences, impediments, and directions in the development of analytics and visualization techniques for enhancing HPC platform operations.
Panelists: Todd Evans (TACC), Todd Gamblin (LLNL), Cory Lueninghoener (LANL)