
SC16 BOF

Abstract: This BOF addresses critical issues in large-scale monitoring from the perspectives of worldwide HPC center system administrators, users, and vendors. This year's session will consist entirely of facilitated, interactive audience discussion on tools, techniques, experiences, and gaps in understanding, diagnosing, and attributing the causes of performance variation and poor performance. Causes include contention for shared network and I/O resources and system component problems. Our goal is to facilitate enhancement of community monitoring and analysis capabilities by identifying useful tools and techniques and encouraging the development of quickstart guides for these tools to be posted at the community web site: https://sites.google.com/site/monitoringlargescalehpcsystems/



---> LIVE NOTES during the BoF google doc: Archival version 
---> Signup HERE to lead a follow up discussion or write quick start guides
---> MAILING LIST INFO: HERE

It would be helpful to the community to write QuickStart guides and/or recipes for getting started with the tools and technologies that each of us is using. This would lower the entry barrier for trying out a new capability. The tools themselves would in many cases be vendor-agnostic.

Please add such information to the list below. Links to docs are preferred over uploads.

Suggested info:

1) For capabilities/information people are requesting:

  • What do you want this tool for? Types of problems you are trying to diagnose? Types of situations you are trying to be informed about or understand?
  • Requirements/Constraints for installation and operation? (e.g., requires web access, requires X, system requirements)
2) For tools people are using (could write Quick Start guides on these!):
  • Tool name
  • Where to obtain
  • For what do you use this tool? Types of problems to diagnose? Types of situations to be informed about or understand?
  • Requirements/Constraints for installation and operation (e.g., requires web access, requires X, system requirements)
  • Actual use cases, including screenshots and output data.


ADD/EDIT HERE! FILES CAN ALSO BE ADDED BELOW!

Capabilities/Information requested:

1) Analysis capabilities (pandas, Spark)

  • There are a lot of new analysis capabilities, like the SciPy tools, Spark, etc. that I would like more info on:
    • When are these useful?
      • What features? What types of data? What types of analyses?
    • Do I need to have a lot of data or distributed data stores?
    • How hard is it to use them or to write new analyses?

2) Application Impact

  • Analyses and data to assess application impact due to system events, system load, and interfering applications:
    • What types of data should I collect?
    • How can I determine if an application has been impacted given variable production load? And by how much?
    • How do I do attribution of the source of events?

3) I/O

  • I want to know about existing tools/techniques for getting information on I/O, including:
    • Do I need more system resources to satisfy the demand?
    • How can I know what applications are hammering the system?
    • How can applications be modified to improve performance?

4) Network contention

  • I want to know about existing tools/techniques for understanding: 
    • Contention in the network
    • Whether network conditions (contention, available bandwidth) are impacting application performance
    • What applications are responsible for the contention?

5) Data stores

  • I want to know about options for storing text and numeric log data, including:
    • Performance for handling large amounts of data
    • When should I worry about in-memory vs. on-disk?
    • What tools exist that work with the stored data for visualization and analysis?
    • How easy or hard are they to set up?
    • Best practices for configuration

6) Visualization and dashboard tools

  • I want to know about tools with which I can easily make dashboards:
    • How much coding is involved to use them?
    • Are there access control features?
    • Can I share a dashboard I created with someone else?
    • What kinds of plotting does it support?
    • What is the performance for large numbers of nodes or long periods of time?


Tools people are using:

1) XD Metrics on Demand (XDMoD) -- Open source tool to measure system performance from the user perspective

  • XDMoD QuickStart guide here
2) TACC Stats -- Open source infrastructure for the low-overhead collection and aggregation of system-wide HPC performance data
  • TACC Stats QuickStart guide here
3) OVIS suite of HPC monitoring tools
  • Lightweight Distributed Metric Service (LDMS) -- Open source tool for lightweight collection, transport, analysis, and storage of data from HPC systems
    • LDMS QuickStart guide here
  • Baler -- Open source tools for log message classification, visualization, and association rule mining
    • Baler QuickStart guide here