Welcome!

The EE HPC WG Workshop 2020 took place on Monday, December 7th and Tuesday, December 8th, starting at 15:00 UTC each day.

==================================

Registration

Registration was free and open to anyone interested. Attendees registered in advance for both days of the event.

==================================

Slack Workspace Subscription Link

The workspace is intended for offline conversations before, during, and after the workshop. There are different channels for different topics covered during the event.

Please use your preferred email address and credentials to subscribe to this workspace: Link

==================================

Agenda

Dec 07, 7AM-9AM Pacific Time

· Introduction (10 minutes)

· Keynote: HPC Evolution of Power Distribution from Terascale to Exascale (20 minutes)

· Invited Speakers: Facility-Scale Efficiency (45 minutes)

· Panel: The “Hot Debate” of Liquid Cooling (45 minutes)

Dec 08, 7AM-9AM Pacific Time

· Introduction (5 minutes)

· Why Participate? EE HPC WG Overview (15 minutes)

· Invited Speakers: ARM Operational Experiences (45 minutes)

· Invited Speakers: Towards the Practical Use of Machine Learning in the HPC Data Center (45 minutes)

· Closing (10 minutes)

==================================

Keynote

"HPC Evolution of Power Distribution from Terascale to Exascale" given by Anna Maria Bailey of Lawrence Livermore National Laboratory

Recording: https://youtu.be/3u4DKnxNNU4?t=246

Anna Maria Bailey from Lawrence Livermore National Laboratory will tell us about her extensive experience with electrical systems for HPC facilities. She will discuss how HPC electrical systems have evolved from commercial to industrial and are now transitioning yet again to utility-class endeavors.

Anna Maria Bailey is Lawrence Livermore National Laboratory’s (LLNL) HPC Chief Engineer. She holds a B.S. in Electrical Engineering from Cal Poly, San Luis Obispo and is registered with the California Board for Professional Engineers and Land Surveyors. She has more than 30 years of experience in multiple engineering roles at LLNL, notably serving as the design/construction manager for LLNL’s HPC Center, which houses some of the world’s most powerful supercomputers. In addition, she led the effort to earn LEED certification for two HPC facilities and is currently overseeing the planning of an Exascale facility modernization project to prepare for unprecedented Exascale facility infrastructure challenges. She also serves as the co-chair of the Energy Efficient HPC Working Group (EE HPC WG). The group’s mission is to reduce expenditure and curb environmental impact through increased efficiency in HPC centers by encouraging the community to lead in energy efficiency as it does in computing performance.

==================================

Technical Sessions

Session 1: Facility-Scale Efficiency (Invited Speakers and Q&A)

Preview: Facility-Scale Efficiency

Slack Channel: Link to #wksp20-facility_scale_eff

Recording: https://youtu.be/3u4DKnxNNU4?t=1428

Description:

Presentation Titles and Speakers:

Moderator: Grant Stewart, Los Alamos National Laboratory

Organizers: Jason Hick, Los Alamos National Laboratory


Session 2: The “Hot Debate” of Liquid Cooling (Panel)

Preview: The “Hot Debate” of Liquid Cooling

Slack Channel: Link to #wksp20-liquid_cooling

Recording: https://youtu.be/3u4DKnxNNU4?t=4236

Description: Liquid cooling is commonplace in HPC these days. Operators of HPC systems love it because it facilitates high densities, free cooling, and heat reuse. To optimize their operations, they want to remove as much heat as feasible at the highest possible water temperatures. From a vendor’s point of view, there are technical and economic limitations to both targets. So where is liquid cooling headed in the future?

Panelists:

  • Bob Bolz, AQUILA Inc.

  • Rolf Brink, Asperitas

  • Dominik Dziarczykowski, CoolIT Systems

  • Steve Harrington, Chilldyne

  • Mani Prakash, Intel

  • Steven Dean, HPE

Moderator: Michael Ott from Leibniz Supercomputing Center

Organizers: Dale Sartor from Lawrence Berkeley National Laboratory, David Martinez from Sandia National Laboratories, and David Grant from Oak Ridge National Laboratory

Day 2 Opening: EE HPC WG Update

Recording: https://youtu.be/AAbtSF3CxbU?t=197

Session 3: ARM Operational Experiences (Invited Speakers and Q&A)

Preview: ARM Operational Experiences

Slack Channel: Link to #wksp20-arm_operational_exp

Recording: https://youtu.be/AAbtSF3CxbU?t=1156

Description: In recent years, systems based on the 64-bit ARMv8 architecture have come to be considered competitive, viable alternatives for HPC. New functionality such as the ARMv8-A Scalable Vector Extension (SVE) can deliver the required performance for HPC workloads, and the hardware features for power and energy efficiency make the ARM architecture an interesting choice in the data center. In addition, substantial growth in the usability of system software, ARM-tailored HPC workloads, and the commercial availability of ARM nodes from HPC vendors demonstrate its viability for HPC. Acknowledging this trend, we have invited a panel of three speakers from sites that have deployed ARM-based HPC systems: Isambard (GW4 alliance, UK), Astra (DOE/NNSA, US), and Fugaku (RIKEN, Japan). With this panel, we will explore the lessons learned throughout the design, deployment, and operation of these ARM HPC systems.

Presentation Titles, Speakers and Presentation Abstracts:

  • "ARM Operational Experiences - Isambard 1 & 2" by Simon McIntosh-Smith from the University of Bristol, U.K. Isambard is the world’s first production ARMv8-based supercomputer. Recently upgraded to Isambard 2, the system includes 21,000+ ARMv8 cores and 72 nodes of Fujitsu’s A64fx processor, the ARM CPU in Fugaku, the #1 system in the current Top500. Isambard 2 also provides access to other mainstream and emerging architectures, enabling ‘apples-to-apples’ comparisons using the same software stack, all within one system. Drawing on Isambard 2 ARM evaluation results that leverage this unique multi-architecture access, the talk will present an architectural comparison between the Fujitsu A64fx and Marvell ThunderX2 ARM-based CPUs and other architectures, highlighting the impact of the ARM architecture on high-performance computing.

  • "Chronicles of Astra: Challenges and Lessons from the First Petascale Arm Supercomputer" by Kevin Pedretti from Sandia National Laboratories. Astra is the first Petascale supercomputer based on 64-bit ARM processors. This talk will present the lessons learned from bringing up Astra and validating its ability to run HPC applications. In particular, the talk will highlight the experiences in deploying and tuning the Astra supercomputer for performance and energy efficiency.

  • "Early Operational Experiences on Fugaku" by Fumiyoshi Shoji from RIKEN. Fugaku is an ARM-based large-scale supercomputer developed by RIKEN and Fujitsu, consisting of more than 150k computing nodes. This talk will discuss the early operational experiences on Fugaku. In particular, the speaker will introduce the energy-saving functions of the Fugaku system and their effectiveness, and will report how RIKEN will take advantage of these functions to achieve energy-efficient operations.

Moderator: Woong Shin, Oak Ridge National Laboratory

Organizers: James Laros from Sandia National Laboratories, Oscar Hernandez from Oak Ridge National Laboratory, and Ross Miller from Oak Ridge National Laboratory


Session 4: Towards the Practical Use of Machine Learning in the HPC Data Center (Invited Speakers and Q&A)

Preview: Towards the Practical Use of Machine Learning in the HPC Data Center

Slack Channel: Link to #wksp20-machine_learning_uses

Recording: https://youtu.be/AAbtSF3CxbU?t=3965

Description: This session will discuss the use of ML in the data center. It will highlight different approaches and go into more detail by focusing on concrete use cases and ML-based solutions.

Presentation Titles and Speakers:

  • "Machine Learning for HPC Data Centers to Improve Energy Efficiency and Resiliency" by Sergey Serebryakov from Hewlett Packard Enterprise

  • "Learning Energy-Aware Management of Modern Data Centers" by Hayk Shoukourian from the Leibniz Supercomputing Center. Power consumption remains one of the major challenges for deploying and operating next-generation pre-Exascale and Exascale supercomputers. Apart from operational costs, next-generation High Performance Computing (HPC) systems, as major power consumers, are projected to affect the stability of the underlying power grid. That in turn requires control mechanisms that ensure the data center’s adherence to predefined power envelopes and continuously match the actual behavior of the target computing infrastructure to the current availability of power and cooling. The talk will present a machine-learning-based framework, referred to as Infrastructure Data Analyzer and Forecaster (IDAF), that forecasts various data-center-level Key Performance Indicators relevant to energy and power consumption. The talk will outline use cases of the proposed framework in real-time control mechanisms for energy-efficiency improvements in modern data centers.

  • "Machine Learning-Based Performance Analytics on High-Performance Computing Systems" by Burak Aksar from Boston University

Moderator: Torsten Wilde, HPE

Organizer: Natalie Bates, EE HPC WG

==================================

Speaker Photo and Bio