Agenda

HPCMASPA will be held 6:00 am - 10:30 am Pacific time on Sept 14, 2020. Advance registration is required both for the conference (free) and for the Zoom link.

Session 1:

  • 6:00 - 6:15 Introduction

  • 6:15 - 6:45 (Full) - PIKA: Center-Wide and Job-Aware Cluster Monitoring, R. Dietrich, F. Winkler, A. Knuepfer, and W. Nagel. Technische Universität Dresden, Germany

  • 6:45 - 7:15 (Full) - HPC System Data Pipeline to Enable Meaningful Insights through Analytic-Driven Visualizations, B. Schwaller, N. Tucker, T. Tucker, B. Allan, J. Brandt. Sandia National Laboratories, USA and Open Grid Computing, USA

  • 7:15 - 7:45 (Full) - MAP: A Visual Analytics System for Job Monitoring and Analysis, A. Pal and P. Malakar. IIT Kanpur, India

Break: 7:45 - 8:00

Session 2:

  • 8:00 - 8:20 (Short) - Towards Workload-Adaptive Scheduling for HPC Clusters, A. Goponenko, R. Izadpanah, J. Brandt, and D. Dechev, University of Central Florida, USA and Sandia National Laboratories, USA

  • 8:20 - 8:40 (Short) - Democratizing Parallel Filesystem Monitoring, R. Evans, Texas Advanced Computing Center, USA

  • 8:40 - 9:00 (Short) - LDMS Monitoring of EDR InfiniBand Networks, B. Allan, M. Aguilar, B. Schwaller, and S. Langer, Sandia National Laboratories, USA and Lawrence Livermore National Laboratory, USA

Panel: 9:00 - 10:30

  • Enabling ML approaches to HPC Systems Operations - The goal of the panel is to advance the state of the field by discussing the current state of HPC system operational monitoring and how it can be improved by applying data science and machine learning techniques to find hidden anomalies, debug known system issues, and investigate interesting or surprising events. Our hope is that this panel can help both the HPC operations and the data science communities: on the HPC operations side, by giving attendees a better understanding of available algorithms and techniques; and on the data science side by making connections with the many sources of data that are available.

    • Abdullah Mueen (UNM) is an Associate Professor in Computer Science at the University of New Mexico. His interest is in data mining and machine learning algorithms for temporal data such as time series and event sequences. His research spans seismic data processing, social media analysis, and smart grid data analysis.

    • Prasanna Balaprakash (ANL) is a computer scientist in the Mathematics and Computer Science Division with a joint appointment in the Leadership Computing Facility at Argonne National Laboratory. Currently, his research focuses on the development of scalable, data-efficient machine learning methods for scientific applications. He is a recipient of the U.S. Department of Energy 2018 Early Career Award. He is the machine-learning team lead and data-understanding team co-lead in RAPIDS, the SciDAC Computer Science institute. Prior to Argonne, he worked as Chief Technology Officer at Mentis Sprl, a machine learning startup in Brussels, Belgium.

    • Jim Brandt (SNL) is a Distinguished Member of Technical Staff at Sandia National Laboratories. He is the lead of the R&D 100 Award-winning Lightweight Distributed Metric Service (LDMS), which is deployed for extreme-scale monitoring at large-scale HPC sites within NNSA, Office of Science, and NSF, as well as others internationally. Jim has extensive collaborations on the use of statistical and ML-based analytics on high-fidelity HPC monitoring data to enable more effective use of HPC systems via detection of performance-impacting competition for shared resources, discovery of abnormal system conditions, and intelligent response to conditions of interest.

    • Nick Brown (EPCC) is a research fellow at EPCC, University of Edinburgh, with research interests in HPC and data science. He is a work package leader in VESTEC, an EU FET HPC project that aims to fuse HPC with real-time data for urgent decision making. This project drives his interest in ML approaches for optimising machine efficiency in terms of resource utilization and application run-time. More specifically, the project's technology federates over numerous machines to run urgent workloads, so it is critically important to make accurate decisions about which codes to run where. Nick has also applied ML to other domains, including leading a project that successfully optimised oil & gas exploration using ML. He runs an MSc course in Edinburgh and supervises PhD and MSc students.

    • Cory Lueninghoener (LANL) leads the HPC Platforms Design team at Los Alamos National Laboratory. He splits his time between large-scale system software research and operation, looking for ways to increase automation and visibility into HPC systems and pushing those methods into production on large HPC resources.