EE HPC WG Workshop - Workshop Program

Welcome!

The EE HPC WG Workshop 2020 took place on Monday, December 7th and Tuesday, December 8th, starting at 15:00 UTC each day.

==================================

Registration

Registration was free and open to anyone interested. Attendees registered in advance for both days of the event.

==================================

Slack Workspace Subscription Link

The workspace is intended for offline conversations before, during, and after the workshop. There are different channels for different topics covered during the event.

Please use your preferred email address and credentials to subscribe to this workspace: Link

==================================

Agenda

Dec 07, 7AM-9AM Pacific Time

· Introduction (10 minutes)

· Keynote: HPC Evolution of Power Distribution from Terascale to Exascale (20 minutes)

· Invited Speakers: Facility-Scale Efficiency (45 minutes)

· Panel: The “Hot Debate” of Liquid Cooling (45 minutes)

Dec 08, 7AM-9AM Pacific Time

· Introduction (5 minutes)

· Why Participate? EE HPC WG Overview (15 minutes)

· Invited Speakers: ARM Operational Experiences (45 minutes)

· Invited Speakers: Towards the Practical Use of Machine Learning in the HPC Data Center (45 minutes)

· Closing (10 minutes)

==================================

Keynote

"HPC Evolution of Power Distribution from Terascale to Exascale" given by Anna Maria Bailey of Lawrence Livermore National Laboratory

Recording: https://youtu.be/3u4DKnxNNU4?t=246

Anna Maria Bailey from Lawrence Livermore National Laboratory will tell us about her extensive experience with electrical systems for HPC facilities. She will discuss how HPC electrical systems have evolved from commercial to industrial and are now transitioning yet again to utility-class endeavors.

Anna Maria Bailey is Lawrence Livermore National Laboratory’s (LLNL) HPC Chief Engineer. She holds a B.S. in Electrical Engineering from Cal Poly, San Luis Obispo and is registered with the California Board for Professional Engineers and Land Surveyors. She has more than 30 years of experience in multiple engineering roles at LLNL, notably serving as the design/construction manager for LLNL’s HPC Center, which houses some of the world’s most powerful supercomputers. In addition, she led the effort to earn LEED certification for two HPC facilities and is currently overseeing the planning of an Exascale facility modernization project to prepare for unprecedented Exascale facility infrastructure challenges. She also serves as the co-chair of the Energy Efficient HPC Working Group (EE HPC WG). The group's mission is to reduce expenditure and curb environmental impact through increased efficiencies in HPC centers by encouraging the community to lead in energy efficiency as it does in computing performance.

==================================

Technical Sessions

Session 1: Facility-Scale Efficiency (Invited Speakers and Q&A)

Preview: Facility-Scale Efficiency

Slack Channel: Link to #wksp20-facility_scale_eff

Recording: https://youtu.be/3u4DKnxNNU4?t=1428

Description:

Presentation Titles and Speakers:

"Using SkySpark for Ongoing Commissioning at Lawrence Berkeley National Laboratory" by Sadie Joy from kW Engineering. After implementing multiple energy efficiency projects at NERSC, the Sustainable Berkeley Lab team used the analytics platform SkySpark to develop a protocol for ongoing commissioning. Using a variety of metrics and graphical views, the team has been able to identify setpoint adjustments, find energy anomalies, and look for new savings opportunities. This session will review the process for ongoing commissioning at NERSC and highlight the advantages of using an analytics tool like SkySpark in our work.
"Optimizing Data Center Operations with Intelligent Monitoring Systems" by Karl Kersey from Schneider Electric.
"Developments in Power Forecasting for HPC Operations" by Joshi Fullop from Los Alamos National Laboratory.

Moderator: Grant Stewart, Los Alamos National Laboratory

Organizer: Jason Hick, Los Alamos National Laboratory

Session 2: The “Hot Debate” of Liquid Cooling (Panel)

Preview: The “Hot Debate” of Liquid Cooling

Slack Channel: Link to #wksp20-liquid_cooling

Recording: https://youtu.be/3u4DKnxNNU4?t=4236

Description: Liquid cooling is commonplace in HPC these days. Operators of HPC systems love it because it facilitates high densities, free cooling, and heat reuse. To optimize their operations, they want to remove as much heat as feasible at the highest possible water temperatures. From a vendor’s point of view, there are technical and economic limitations to both targets. So where is liquid cooling headed in the future?

Panelists:

Bob Bolz, AQUILA Inc.
Rolf Brink, Asperitas
Dominik Dziarczykowski, CoolIT Systems
Steve Harrington, Chilldyne
Mani Prakash, Intel
Steven Dean, HPE

Moderator: Michael Ott from Leibniz Supercomputing Center

Organizers: Dale Sartor from Lawrence Berkeley National Laboratory, David Martinez from Sandia National Laboratories, and David Grant from Oak Ridge National Laboratory

Day 2 Opening: EE HPC WG Update

Recording: https://youtu.be/AAbtSF3CxbU?t=197

Session 3: ARM Operational Experiences (Invited Speakers and Q&A)

Preview: ARM Operational Experiences

Slack Channel: Link to #wksp20-arm_operational_exp

Recording: https://youtu.be/AAbtSF3CxbU?t=1156

Description: In recent years, systems based on the 64-bit ARMv8 architecture have come to be considered competitive, viable alternatives for HPC. New functionality such as the ARMv8-A Scalable Vector Extension (SVE) can deliver the performance required for HPC workloads, and the hardware features for power and energy efficiency make the ARM architecture an interesting choice in the data center. In addition, substantial growth in the usability of system software, ARM-tailored HPC workloads, and the commercial availability of ARM nodes from HPC vendors demonstrate the viability of ARM for HPC. Acknowledging this trend, we have invited a panel of three speakers from sites that have deployed ARM-based HPC systems: Isambard (GW4 Alliance, UK), Astra (DOE/NNSA, US), and Fugaku (RIKEN, Japan). With this panel, we will explore the lessons they learned throughout the design, deployment, and operation of these ARM HPC systems.

Presentation Titles, Speakers and Presentation Abstracts:

"ARM Operational Experiences - Isambard 1 & 2" by Simon McIntosh-Smith from the University of Bristol, U.K. Isambard is the world’s first production ARMv8-based supercomputer. Recently upgraded to Isambard 2, the system includes 21,000+ ARMv8 cores and 72 nodes of Fujitsu’s A64FX processor, the ARM CPU in Fugaku, the #1 system in the current Top500. Isambard 2 also provides access to other mainstream and emerging architectures, enabling ‘apples-to-apples’ comparisons using the same software stack all within one system. Drawing on Isambard 2 ARM evaluation results that leverage this unique opportunity of multi-architecture access, this talk will present an architectural comparison between the Fujitsu A64FX and Marvell ThunderX2 ARM-based CPUs and other architectures, highlighting the impact of the ARM architecture on high-performance computing.
"Chronicles of Astra: Challenges and Lessons from the First Petascale Arm Supercomputer" by Kevin Pedretti from Sandia National Laboratories. Astra is the first Petascale supercomputer based on 64-bit ARM processors. This talk will present the lessons learned from bringing up Astra and validating its ability to run HPC applications. In particular, the talk will highlight the experiences in deploying and tuning the Astra supercomputer for performance and energy efficiency.
"Early Operational Experiences on Fugaku" by Fumiyoshi Shoji from RIKEN. Fugaku is an ARM-based large-scale supercomputer developed by RIKEN and Fujitsu, consisting of more than 150k computing nodes. This talk will discuss the early operational experiences on Fugaku. In particular, the speaker will introduce the energy-saving functions of the Fugaku system and their effectiveness, and will report how RIKEN will take advantage of these functions to achieve energy-efficient operations.

Moderator: Woong Shin, Oak Ridge National Laboratory

Organizers: James Laros from Sandia National Laboratories, Oscar Hernandez from Oak Ridge National Laboratory, and Ross Miller from Oak Ridge National Laboratory

Session 4: Towards the Practical Use of Machine Learning in the HPC Data Center (Invited Speakers and Q&A)

Preview: Towards the Practical Use of Machine Learning in the HPC Data Center

Slack Channel: Link to #wksp20-machine_learning_uses

Recording: https://youtu.be/AAbtSF3CxbU?t=3965

Description: This session will discuss the use of ML in the data center. It will highlight different approaches and go into more detail on some concrete ML-based use cases and solutions.

Presentation Titles and Speakers:

"Machine Learning for HPC Data Centers to Improve Energy Efficiency and Resiliency" by Sergey Serebryakov from Hewlett Packard Enterprise
"Learning Energy-Aware Management of Modern Data Centers" by Hayk Shoukourian from the Leibniz Supercomputing Center. Power consumption remains one of the major challenges for deploying and operating next-generation pre-Exascale and Exascale supercomputers. Apart from operational costs, next-generation High Performance Computing (HPC) systems, as major power consumers, are projected to affect the stability of the underlying power grid. That in turn requires control mechanisms that ensure the data center’s adherence to predefined power envelopes and continuously match the actual behavior of the target computing infrastructure to the current availability of power and cooling. The talk will present a machine learning based framework, referred to as Infrastructure Data Analyzer and Forecaster (IDAF), that forecasts various data-center-level Key Performance Indicators relevant to energy and power consumption. The talk will outline use cases of the framework in real-time control mechanisms for energy-efficiency improvements in modern data centers.
"Machine Learning-Based Performance Analytics on High-Performance Computing Systems" by Burak Aksar from Boston University

Moderator: Torsten Wilde, HPE

Organizer: Natalie Bates, EE HPC WG

==================================

Speaker Photo and Bio

Robert ‘Bob’ Bolz, AQUILA Inc. (Albuquerque NM)

Bob has 30 years of experience in sales, marketing, and management spanning the computer industry as we know it today. Since the mid-90s he has been a proponent of the Open Source Software movement and of the Linux Operating System as an international standard bound to replace UNIX in scientific simulation, and he has spent the last 15 years focused on High Performance Computing. Most recently, AQUILA and Clustered Systems developed the patented AQuarius™ warm-water fixed-cold-plate cooling, providing a unique state-of-the-art DLC heat removal solution.

Rolf Brink, Founder and CEO, Asperitas

After nearly 20 years in IT and a successful circumnavigation with sailing yacht Helena, Rolf Brink founded the disruptive cleantech company Asperitas. His background in product development for cloud architectures and datacentre infrastructure, combined with a passion for cleantech innovations like liquid cooling, are the foundation of the development of Immersed Computing® within Asperitas.

Dominik Dziarczykowski, EMEA Regional Manager with CoolIT Systems

Direct Liquid Cooling and HPC Enthusiast with more than 20 years of experience in IT and Consulting business. Supporting Customers and Business Partners in increasing DC efficiency and optimizing TCO. Gained experience working previously for IBM, Dell, Huawei, Deloitte, and Capgemini. Keen on riding a bicycle and playing basketball.

Michael Ott, Leibniz Supercomputing Centre

Michael received his PhD in computer science from Technische Universität München in 2010 for his work in high performance bioinformatics. Before he joined LRZ in 2012 he was a postdoc with the Commonwealth Scientific and Industrial Research Organisation (CSIRO) in Canberra, Australia and an IT consultant in the automotive and the financial sector in Munich, Germany. He is now a senior researcher in the “High-Performance Systems” division of LRZ. Michael leads the Operational Data Analytics Team in the Energy Efficient HPC Working Group and the Energy Efficiency Working Group in the ETP4HPC. His research focuses on energy-efficiency and scalable monitoring, but he still keeps up an interest in bioinformatics, computer architecture, and parallel programming.

Natalie Bates has led the Energy Efficient High Performance Computing Working Group (EE HPC WG) since its inception in 2010. The purpose of the WG is to drive implementation of energy efficient design in HPC. Today, there are ~800 members from 25+ countries. Natalie has been the technical and executive leader for this ‘open source’ working group that disseminates best practices, shares information (peer to peer exchange), and takes collective action. The EE HPC WG has collaborated and negotiated with industry standards committees and major HPC organizations as well as influenced HPC system development. Prior to leading the EE HPC WG, Natalie's career spanned twenty years with Intel Corporation where she was a senior manager of highly complex programs taking new products to market, delivering multi-component and multi-partner platforms, and negotiating strategic technical industry initiatives.

Torsten Wilde (Ph.D.) is a system architect for Exascale monitoring and system power and energy management at Hewlett Packard Enterprise (HPE). His research activities are related to high volume, high frequency data collection and analytics for improved IT operations as well as dynamic power management. Torsten has published more than two dozen research papers mainly related to power and energy usage and improvement in High Performance Computing. Torsten is the lead architect for HPE's Exascale monitoring framework prototype developed as part of the ECP (Exascale Computing Project) funded PathForward project. Torsten received his MSc in parallel and scientific computation from the University of Liverpool, UK, and his MSc in Computer Engineering from the University of Applied Sciences in Berlin, Germany. He received his Ph.D. in computer science from the Technical University of Munich, Germany, in 2018.

Burak Aksar is a Ph.D. student at Boston University. He received his B.Sc. degree in Electronics Engineering with honors from Sabanci University, Istanbul, Turkey. His current research interests include applied machine learning, explainable AI, and the management and monitoring of HPC systems.

Sergey Serebryakov is a senior research engineer at Hewlett Packard Labs. His research interests include machine learning, deep learning and their applications. Sergey received a Ph.D. from the Saint-Petersburg Institute of Informatics and Automation. Contact him at sergey.serebryakov@hpe.com.

Hayk Shoukourian received his M.Sc. and Ph.D. with “summa cum laude” in Computer Science from the Technical University of Munich (TUM) in 2012 and 2015, respectively. He joined the Leibniz Supercomputing Centre (LRZ) in 2012, and his R&D activities mainly involve efficient energy/power consumption management of HPC data centers. In his current role as a lead scientist, Dr. Shoukourian is responsible for adaptive modeling of interoperability between the target HPC systems and the building infrastructure of the supercomputing site. He is also a team lead for the PRACE (Partnership for Advanced Computing in Europe) work package on "HPC Planning and Commissioning" and leads the development of best practice guides for new architectures and systems. Since August 2018, Dr. Shoukourian has been a lecturer in Computer Science at Ludwig-Maximilians Universität München (LMU). Hayk Shoukourian is a member of the German Informatics Society (Gesellschaft für Informatik) and the US DOE Energy Efficient High Performance Computing Working Group (EE HPC WG). He serves on the Program Committees of several international conferences, including the International Supercomputing Conference (ISC), the SCS High Performance Computing Symposium, and IEEE/ACM High Performance Computing & Simulation, and acts as a reviewer for the Elsevier “Advances in Engineering Software” and IEEE Access journals.

Woong Shin (Ph.D.) is an HPC systems engineer in the AI Analytics Methods at Scale (AIMS) Group at Oak Ridge National Laboratory (ORNL). He is involved in research and development activities around designing system software and architecture for scientific applications on HPC systems. Woong started his career as a software engineer in the enterprise area, working for Samsung and TmaxSoft (South Korea), but later pursued academic training in system software, distributed systems, and computer architecture. He joined ORNL in 2017. He received his Ph.D. degree in electrical engineering and computer science (M.S. and Ph.D. integrated course) in 2017 from Seoul National University, South Korea. He earned his B.S. in computer science from Korea University, South Korea.

Simon McIntosh-Smith is a full Professor of High Performance Computing at the University of Bristol in the UK. He began his career in industry as a microprocessor architect, first at Inmos and STMicroelectronics in the early 1990s, before co-designing the world's first fully programmable GPU at Pixelfusion in 1999. In 2002 he co-founded ClearSpeed Technology where, as Director of Architecture and Applications, he co-developed the first modern many-core HPC accelerators. He now leads the High Performance Computing Research Group at the University of Bristol, where his research focuses on advanced computer architectures and performance portability. He plays a key role in designing and procuring supercomputers at the local, regional and national level, including the UK’s national HPC service, Archer. In 2016 he led the successful bid by the GW4 Alliance along with the UK’s Met Office and Cray, to design and build ‘Isambard’, the world’s first production, ARMv8-based supercomputer.

Kevin Pedretti is a Principal Member of Technical Staff at Sandia National Laboratories. He has helped develop several large-scale parallel computers, including the Red Storm system that was productized as the Cray XT line of supercomputers and Astra, the first Petascale supercomputer based on Arm processors. Prior to joining Sandia in 2001, he studied engineering at the University of Iowa where he received a B.S.E. in Electrical Engineering in 1999 and an M.S. in Computer Engineering in 2001. His current research interests include operating systems for massively parallel supercomputers, high-performance and scalable networking, power management, and hardware virtualization in the context of high performance computing.

Fumiyoshi Shoji (Ph.D.) is a division director in the Operations and Computer Technologies Division at the RIKEN Center for Computational Science (R-CCS), responsible for the operation and enhancement of the HPC system and its facilities, including the substation, chillers, gas turbine power generators, air handlers, etc. His current technical interests include improving the energy and cost efficiency of operating large-scale supercomputers and their facilities. He was awarded the ACM Gordon Bell Prize in 2011.

Siddhartha Jana (Ph.D.) is a research scientist at Intel Corporation and the conferences co-lead within the EE HPC WG (Energy Efficient HPC Working Group). He holds a doctorate from the University of Houston in energy efficiency and distributed memory programming models. At Intel, his research projects are directed towards leveraging hardware features to explore energy efficiency within the HPC software stack. His other research interests include programming models, High Performance Computing, compiler design and analyses, runtime systems, communication libraries, and distributed computing. As part of his research, he has collaborated with a number of organizations across academia, government, and industry, including Total, Oak Ridge National Laboratory, Technische Universität Dresden, Intel, Los Alamos National Laboratory, and Cray Inc. Wearing his two hats, Intel and EE HPC WG, Sid is actively collaborating on HPC PowerStack, a community-wide effort to design a unified HPC system stack that will facilitate building system-wide power efficiency solutions for future large-scale machines.

Mani Prakash (Ph.D.) is a Senior Principal Engineer and Chief Technologist for HPC in the Data Center Products Group (DPG) at Intel Corporation. Mani has a Ph.D. in Mechanical Engineering and has been with Intel for 23+ years. He has contributed to many aspects of server product, memory, system, and data center development. His role for the past few years at Intel has been in the Power, Package & Cooling domain for HPC, tailoring products internally for HPC and working with OEMs and end customers. He is a Fellow of the American Society of Mechanical Engineers (ASME) and a Senior Member of IEEE, and is very active in the ASHRAE TC9.9 committee, working with industry peers in generating standards and guidelines for Power and Cooling in Data Centers.

Karl Kersey, P.E., is an Energy Engineer and Staff Power Monitoring Consultant with Schneider Electric. He is experienced in providing solutions guiding clients to greater profitability while simultaneously preserving the environment. During the past 20 years he has helped over 500 organizations improve profits with power management solutions. He has worked with clients such as Honda, Mercedes, Microsoft, Google, Dalhousie University, Faith Technologies, and many others. Karl has also worked as an Application Engineer, Design Engineer, and Product Manager. Most recently Karl has been the Power Quality Specialist for Power Management University. He is currently developing solutions training for Power Quality, Data Center Applications, and Utility Applications. Karl has been published in Control Engineering Magazine and Buildings Magazine.

Dr. Steve Harrington is the Chief Technology Officer for Chilldyne. Dr. Harrington has over 27 years of commercial experience in the fields of fluid dynamics, thermodynamics and electronics cooling. His unique experience runs the gamut from consumer product development to science. He has been responsible for numerous successful product development projects and IP generation. He holds a dozen U.S. patents and is named on many more. Dr. Harrington also teaches the Senior Aerospace Design class at UCSD.

Steven Dean is a Distinguished Technologist at HPE specializing in High Performance Computing (HPC) system hardware architectures and is well versed in all aspects of HPC system packaging, power, cooling and controls solutions. He is a resident of the United States and holds a BS degree in Mechanical Engineering from the University of Texas and an MS degree in the Management of Technology from the University of Minnesota. Steve has worked in the HPC industry for 35 years, working initially at Cray Research for 10 years, then at SGI for 21 years until the acquisition of SGI by HPE and has been at HPE since the acquisition in 2016. Steve started his career at Cray working on high powered refrigeration cooled (X-MP) and liquid cooled (Y-MP) systems before moving to air cooled solutions (Y-MP EL & J90). At SGI, he continued to lead the work on high end air cooled systems including the Origin product series, Altix product series and then for the ICE product series where power densities drove cooling approaches back to liquid cooling (ICE-X and ICX-EX series). At HPE he led the development of the Apollo 9000 liquid cooled system and currently contributes to evaluating next generation system architecture concepts. He has been working on developing liquid cooled system designs for the past decade and has been granted a number of patents for developments in this area.

Sadie Joy works as a consultant with kW Engineering. Her background in improving building control and analyzing building performance has led her to a great appreciation of building analytics. In her work with LBNL she has provided SkySpark integration support, developed views for ongoing analysis, and supported energy efficiency goals by providing ongoing commissioning and monitoring at the NERSC facility.