
SC14 BOF: Monitoring Large-Scale HPC Systems

NOTE: This event is past. Please do not modify this page

Abstract

This BOF addresses critical issues and approaches in large-scale HPC monitoring from the perspectives of system administrators, users, and vendors. In particular, we target capabilities, gaps, and roadblocks in monitoring as we move to extreme scales, including: a) desired information, b) vendor- and tool-enabled interfaces to data, c) integration of capabilities that provide and respond to data (e.g., integrated adaptive runtimes, application feedback), d) monitoring impact analysis methods for large-scale applications, and e) other hot topics (e.g., power, network congestion, reliability, high-density components). A panel of large-scale HPC stakeholders will interact with BOF attendees on topics of interest.


About
Interact with large-scale HPC stakeholders and discuss problems, solutions, and experiences in monitoring at scale!

Schedule
Wed Nov 19 @ 5:30pm-7pm at SC14

Panelists and Topics of Discussion:
  • Mike Showerman, NCSA. 
    • The impact of data collection on large-scale application performance
    • Flexible analysis of large datasets
  • Steven Martin, Cray Inc.
    • Steven is a member of the Cray Supercomputing Products Software Architecture group, with a focus on system monitoring and power management.
    • In-Band vs. Out-Of-Band data collection
    • Best practices for making collected data available to a site's preferred management dashboards and monitoring frameworks, given that many sites are already invested in an existing solution.
  • Larry Pezzaglia, NERSC/LBNL:
    • Larry Pezzaglia is an HPC Systems Analyst at NERSC at Lawrence Berkeley National Laboratory.  In addition to managing NERSC's production computational systems, he develops software to facilitate systems administration at scale.  Larry is the System Integration Lead for Cori, NERSC's next flagship supercomputer.
    • Integrating Multiple Monitoring Solutions: Modern HPC centers deploy multiple monitoring solutions to collect, store, and visualize monitoring data from many sources. The resulting trend, log, and event information differs in format, delivery mechanism, and precision. Handling the subtleties of each monitoring package and integrating the resulting information is difficult, time-consuming, and becomes increasingly laborious as system scale increases. A future, robust, center-wide data handling service, which would abstract away the details of the individual monitoring solutions and provide simple interfaces for storing and querying the data, could improve this situation.
  • Mike Mason, LANL: Mike is the Technical Lead for HPC Monitoring and a member of the File Systems Team for the HPC production group at LANL.
    • Event Correlation
      • With all of our logs from the clusters, networks, and file systems, how do we view everything as an interconnected system? How do we correlate seemingly separate events to understand our entire HPC infrastructure as a whole system?
    • Data Categorization
      • How do we categorize and recognize the data we are collecting, the rate of data we need, and what data we are missing? How do we put this information into a format that allows us to understand what questions we can answer with the given data?
    • Alternative Stack Monitoring
      • Our current HPC systems are clusters of RHEL nodes running single jobs. As our clusters move toward Big Data, Hadoop, or more cloud-like HPC systems, and our networks and file systems become more dynamic and software-defined, how can we incorporate all of this new and dynamic information into our existing monitoring infrastructure?
  • Bill Barth, TACC: Director of HPC. PI for the SUPReMM/TACC Stats HPC analytics project.
    • How do we identify problematic performance using coarse-grained, low-impact metrics for jobs on HPC systems?
    • How do we handle having more jobs/problems/correlations identified than we have staff to work on?
  • Devesh Tiwari, Oak Ridge Leadership Computing Facility, ORNL: Research Staff Member (Computer System Architect), with special focus on power-efficiency and resilience issues.
    • How can applications benefit from large-scale monitoring infrastructure? What are some of the challenges and opportunities?
    • What information should we make available to end users, in what format, and at what abstraction level?
    • What kind of lightweight analysis can be performed on the collected data to improve application performance, resilience and system throughput?
  • William (Bill) E. Allcock, ALCF: Bill is the Director of Operations and the Manager of the Advanced Integration Group for the Argonne Leadership Computing Facility. His primary responsibility is the overall operation of the systems, including the end-to-end integration of the software stack.
    • He has this crazy dream sometimes. In this dream, all the systems we buy use a common and flexible way of exposing information about the state of the system (monitoring and more). There are standard procedures for all the typical things you do to a system, and they are automatically recorded via this same mechanism. This data is persisted in some common place and way, so that you can trivially answer questions about the current state and/or history of the system. Bill would very much like to make this dream a reality, and wonders whether others agree or think he is some kind of raving lunatic. This boils down to the following discussion topics (a minimal sketch of such an interface follows this list):
      • The use of a common approach for exposing all system state that is flexible and robust
      • A common approach for persisting that state and querying current and past state of the system
      • Possible implications, good and bad, of such a system
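
Larry's and Bill's topics both point at the same idea: a common, pluggable layer between site-specific monitoring tools and a center-wide store that can be queried for current and historical state. Purely as a discussion aid, the following is a minimal Python sketch of what such an abstraction might look like. Everything in it (the MonitoringAdapter, CentralStore, and FakePowerAdapter names, the normalized Sample record, the node_power_w metric) is hypothetical and invented for illustration; it does not correspond to any existing tool or site deployment.

import time
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class Sample:
    """A normalized monitoring record, independent of the tool that produced it."""
    timestamp: float   # seconds since the epoch
    source: str        # e.g., "syslog", "power-daemon", "job-scheduler"
    host: str          # node or component the sample describes
    metric: str        # e.g., "cpu_temp_c", "node_power_w"
    value: float


class MonitoringAdapter(ABC):
    """Wraps one site-specific monitoring tool and emits normalized Samples."""

    @abstractmethod
    def poll(self) -> Iterable[Sample]:
        """Return whatever new samples the underlying tool has produced."""


class FakePowerAdapter(MonitoringAdapter):
    """Stand-in for an out-of-band power-measurement feed (illustrative only)."""

    def poll(self) -> Iterable[Sample]:
        now = time.time()
        return [Sample(now, "power-daemon", "node0001", "node_power_w", 312.5)]


@dataclass
class CentralStore:
    """Toy center-wide store: ingests from any adapter, answers simple queries."""
    samples: List[Sample] = field(default_factory=list)

    def ingest(self, adapter: MonitoringAdapter) -> None:
        self.samples.extend(adapter.poll())

    def query(self, metric: str, since: Optional[float] = None) -> List[Sample]:
        return [s for s in self.samples
                if s.metric == metric and (since is None or s.timestamp >= since)]


if __name__ == "__main__":
    store = CentralStore()
    store.ingest(FakePowerAdapter())
    for s in store.query("node_power_w"):
        print(f"{s.host}: {s.value} W at {s.timestamp:.0f}")

In practice the adapters would wrap whatever frameworks a site is already invested in, and the store would be a durable time-series or log database rather than an in-memory list; the only point of the sketch is that producers and consumers agree on one normalized record format and one query interface.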