EE HPC SOP 2020 Technical Program

Welcome! The EE HPC SOP 2020 Workshop will occur on Monday, September 14th at 12:00 UTC.

This translates to 5AM PT, 6AM MT, 7AM CT, 8AM ET, 2PM CEST, and 9PM JST.

It will be a 3-hour workshop. Thank you, everyone, for accommodating multiple time zones for this virtual event.

==================================

REGISTRATION: https://clustercomp.org/2020/registration/

You must register for Cluster2020 and select EE HPC SOP Workshop as part of the registration in order to attend.

You will receive an email directly from the Conference Organizers with a secure Zoom meeting link.

==================================

Below, please find information on each paper including:

1. Pre-recorded video presentation for each paper.

2. PDF copy of each paper

3. A link to the EE HPC SOP Slack Workspace with a Slack channel designated for each paper.

Consider this a venue for 'hallway conversation' with other workshop participants. Q&A for each paper opens before the workshop and will remain open after the workshop ends. This gives participants a direct line to the authors, organizers, and other participants.

4. On the day of the workshop (September 14th), this page will be updated with presentation materials.

==================================

AGENDA (times are relative to the workshop start time)

00:00 to 00:15 Introduction and Keynote - David Martinez

00:15 to 01:30 Session 1, HPC System - Moderator Greg Koenig

01:30 to 01:40 Break

01:40 to 02:55 Session 2, HPC Facility - Moderator Jason Hick

02:55 to 03:00 Closing remarks

Keynote

------------------------------------------------------

David Martinez is an Engineering Program/Project Lead and has worked in the Sandia National Laboratories Corporate Computing Facilities (CCF) for more than 35 years. David is the subject matter expert for SNL’s data center operations and design, thanks to his in-depth understanding of and experience with HVAC, controls, and mechanical and electrical systems. He is frequently consulted by internal and external agencies for design reviews and for his innovative approach to data center management and energy-efficient operations and design; SNL has received numerous energy efficiency awards as a result of these efforts. During his tenure, David has seen data center operations grow from about 20,000 sq. ft. to over 77,000 sq. ft., comprising three unique data center environments.

View presentation here:

https://www.dropbox.com/s/g49yj3wptxv9kfp/0%20Keynote.pptx?dl=0

SESSION 1

------------------------------------------------------

Energy optimization and analysis with EAR

AUTHORS: Julita Corbalan, Lluis Alonso, Jordi Aneas, Luigi Brochard

KEYWORDS: Energy efficiency, System software, Energy Optimization, Application analysis, Data centers

ABSTRACT: EAR is an energy management framework which offers three main services: energy accounting, energy control, and energy optimization. The latter is done through the EAR runtime library (EARL). EARL is a dynamic, transparent, and lightweight runtime library that provides energy optimization and control. EARL optimizes energy by selecting the optimal CPU frequency, based on the selected energy policy and the application's runtime characteristics, without any application modification or user input. Currently EARL only works for MPI applications, but EAR itself can still operate for non-MPI applications. It automatically (and transparently) identifies iterative regions (loops) and computes a set of per-iteration metrics, the application signature; together with the system signature, it applies energy models to estimate the execution time and power for each available CPU frequency. The system signature is a set of per-node coefficients computed during EAR installation via a learning phase. Given the time and power projections, EARL selects the best frequency according to the policy settings.

This paper shows how to optimize energy using the EAR library with the min_time_to_solution energy policy, and how to analyze applications through the EAR framework. The evaluation includes eight applications with different sizes and application signatures. Results show how EARL computes each application signature on the fly and applies the CPU frequency selected by the min_time_to_solution policy.
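The selection step described in the abstract — project time and power for each available frequency, then choose according to the policy — can be sketched as follows. This is an illustrative sketch only: the function name, the dictionary-based input, and the 10% energy-penalty bound are assumptions for the example, not EAR's actual API or default settings.

```python
# Hypothetical sketch of policy-driven frequency selection: given
# per-frequency (predicted_time_s, predicted_power_w) projections,
# pick the frequency that minimizes predicted time-to-solution while
# keeping the projected energy penalty within a bound.

def select_frequency(projections, default_freq, max_energy_penalty=0.10):
    """projections: dict mapping frequency (GHz) -> (time_s, power_w)."""
    t0, p0 = projections[default_freq]
    baseline_energy = t0 * p0  # energy = time x power
    best_freq, best_time = default_freq, t0
    for freq, (t, p) in projections.items():
        # Accept a faster frequency only if its projected energy stays
        # within the allowed penalty over the baseline.
        if t < best_time and t * p <= baseline_energy * (1 + max_energy_penalty):
            best_freq, best_time = freq, t
    return best_freq

# Example with made-up projections (default frequency 2.0 GHz):
projections = {2.0: (100.0, 200.0), 2.4: (92.0, 230.0), 2.6: (88.0, 260.0)}
print(select_frequency(projections, 2.0))  # -> 2.4 (2.6 exceeds the energy bound)
```

Here 2.6 GHz is fastest but its projected energy (88 s × 260 W ≈ 22.9 kJ) exceeds the 10% penalty over the 2.0 GHz baseline (20 kJ), so 2.4 GHz wins.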

View presentation materials here:

https://www.dropbox.com/s/b4vteem8nrciqom/1%20Energy%20optimization%20and%20analysis%20with%20EAR.pdf?dl=0

Slack link:

https://join.slack.com/t/eehpcwg/shared_invite/zt-hakykqvl-34mkigfUdCOqv26~tgA4pg

Access the PDF of the paper here:

https://www.dropbox.com/s/vh4m6z16igtpdlb/Julita%20Corbalan%20-%20Corbalan.Energy%20Optimization%20and%20Analysis%20with%20EAR.pdf?dl=0

Toward an End-to-End Auto-tuning Framework in HPC PowerStack

AUTHORS: Xingfu Wu, Aniruddha Marathe, Siddhartha Jana, Ondrej Vysocky, Jophin John, Andrea Bartolini, Lubomir Riha, Michael Gerndt, Valerie Taylor, Sridutt Bhalachandra

KEYWORDS: Power, energy, end-to-end tuning, auto-tuning, HPC, PowerStack

ABSTRACT: Efficiently utilizing procured power and optimizing performance of scientific applications under power and energy constraints are challenging. The HPC PowerStack defines a software stack to manage power and energy of high-performance computing systems and standardizes the interfaces between different components of the stack.

This survey paper presents the findings of a working group focused on the end-to-end tuning of the PowerStack. First, we provide a background on the PowerStack layer-specific tuning efforts in terms of their high-level objectives, the constraints and optimization goals, layer-specific telemetry, and control parameters, and we list the existing software solutions that address those challenges. Second, we propose the PowerStack end-to-end auto-tuning framework, identify the opportunities in co-tuning different layers in the PowerStack, and present specific use cases and solutions. Third, we discuss the research opportunities and challenges for collective auto-tuning of two or more management layers (or domains) in the PowerStack. This paper takes the first steps in identifying and aggregating the important R&D challenges in streamlining the optimization efforts across the layers of the PowerStack.

View presentation materials here:

https://www.dropbox.com/s/xfkod2j3rq90x96/2%20Toward%20an%20End-to-End%20Auto-tuning%20Framework%20in%20HPC%20PowerStack.pptx?dl=0

Slack link:

https://join.slack.com/t/eehpcwg/shared_invite/zt-hakykqvl-34mkigfUdCOqv26~tgA4pg

Access the PDF of the paper here:

https://www.dropbox.com/s/fun3jqckzp84c7a/Toward%20an%20End-to-End%20Auto-tuning%20Framework%20in%20HPC%20PowerStack.pdf?dl=0

Evaluation of Power Controls on Supercomputer Fugaku

AUTHORS: Yuetsu Kodama, Tetsuya Odajima, Eishi Arima, Mitsuhisa Sato

KEYWORDS: supercomputer Fugaku, power controls, power-knobs, clock frequency scaling, low-power state, variation of power

ABSTRACT: The supercomputer "Fugaku", which recently ranked number one in multiple supercomputing lists including Top500 in June 2020, has various power control features such as (1) eco mode that utilizes only one of two floating-point pipelines while decreasing the power supply to the chip; (2) boost mode that increases clock frequency; and (3) core retention that turns unused cores into a low-power state. By orchestrating these power-performance features while considering the characteristics of running applications, we can potentially gain even better system-level energy efficiency. In this article, we report the effectiveness of these features by using the pre-evaluation environment for Fugaku. Consequently, we confirmed several prominent results useful for the Fugaku system operation, including: remarkable power reduction and energy-efficiency improvement by coordinating the eco mode and core retention in a memory-intensive case; 10% speed-up with 17% power increase from the boost mode in a CPU-intensive case; and considerable power variations across over 20K nodes.
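The boost-mode figures quoted in the abstract (10% speed-up at 17% higher power) imply a net energy cost that is easy to check by hand, since energy is power times time:

```python
# Back-of-the-envelope check of the boost-mode trade-off quoted above:
# a 10% speed-up at 17% higher power. Energy = power x time, so the
# relative energy is the power ratio divided by the speed-up.

speedup = 1.10          # 10% faster: relative time is 1/1.10
power_increase = 1.17   # 17% more power

relative_energy = power_increase / speedup
print(f"relative energy: {relative_energy:.3f}")  # -> relative energy: 1.064
```

So boost mode here trades roughly 6% more energy for the 10% reduction in time-to-solution, which may or may not be worthwhile depending on whether the site is optimizing for throughput or for energy.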

View presentation materials here:

https://www.dropbox.com/s/siazqkzpueqjfax/3%20Evaluation%20of%20Power%20Controls%20on%20Supercomputer%20Fugaku.pptx?dl=0

Slack link:

https://join.slack.com/t/eehpcwg/shared_invite/zt-hakykqvl-34mkigfUdCOqv26~tgA4pg

Access the PDF of the paper here:

https://www.dropbox.com/s/h3m48tqrygefqrs/Evaluation%20of%20Power%20Controls%20on%20Supercomputer%20Fugaku.pdf?dl=0

HUD-Oden: A Practical Evaluation Environment for Analyzing Hot-Water Cooled Processors

AUTHORS: Jorji Nonaka, Fumiyoshi Shoji

KEYWORDS: Energy efficiency, liquid cooling, hot-water cooling, power consumption, frequency throttling

ABSTRACT: Liquid cooling has rapidly become the de facto standard cooling method for high-performance/high-density racks in modern HPC/data centers. Semiconductor technology development has made it possible to operate processors (CPUs, GPUs, and accelerators) at higher temperature ranges without compromising reliability or static power consumption, which has contributed in part to the increased attention on “hot water cooling” as one of the main approaches to energy-efficient system design. The 2011 ASHRAE Class W4 allows water supply temperatures up to 45°C, and even higher temperatures for Class W5. A clear understanding of the temperature impact on the processors (CPUs, GPUs, and accelerators) would be valuable for assisting HPC operational staff in their strategic planning and decision making. In this short paper, we present our experience using a simple and cost-effective bench testing environment for analyzing the operational behavior of processors under such high-temperature conditions. Although it is far from ideal, since we are not using the same building blocks as the currently running HPC system, we consider it a valuable alternative for observing the operational behavior of processors in such a temperature environment, from which we may obtain supportive evidence for strategic planning and decision making.

View presentation materials here:

https://www.dropbox.com/s/62jntohspo66he9/4%20HUD-Oden%3A%20A%20Practical%20Evaluation%20Environment%20for%20Analyzing%20Hot-Water%20Cooled%20Processors.pptx?dl=0

Slack link:

https://join.slack.com/t/eehpcwg/shared_invite/zt-hakykqvl-34mkigfUdCOqv26~tgA4pg

Access the PDF of the paper here:

https://www.dropbox.com/s/a5aazjgal48tuxz/HUD-Oden%3A%20A%20Practical%20Evaluation%20Environment%20for%20Analyzing%20Hot-Water%20Cooled%20Processors.pdf?dl=0

SESSION 2

------------------------------------------------------

Global Experiences with HPC Operational Data Measurement, Collection and Analysis

AUTHORS: Michael Ott, Woong Shin, Norman Bourassa, Torsten Wilde, Stefan Ceballos, Melissa Romanus, Natalie Bates

KEYWORDS: exascale, Top500, HPC operations, energy efficiency, site survey, operational data, ODA

ABSTRACT: As we move into the exascale era, supercomputers grow larger, denser, more heterogeneous, and ever more complex. Operating such machines reliably and efficiently requires deep insight into the operational parameters of the machine itself as well as its supporting infrastructure. To fulfill this need, early adopter sites have started the development and deployment of Operational Data Analytics (ODA) frameworks allowing the continuous monitoring, archiving, and analysis of near real-time performance data from the machine and infrastructure levels, providing immediately actionable information for multiple operational uses.

To understand their ODA goals, requirements, and use cases, we have conducted a survey among eight early adopter sites from the US, Europe, and Japan that operate top 50 high-performance computing systems. We have assessed the technologies leveraged to build their ODA frameworks, identified use cases and other push and pull factors that drive the sites’ ODA activities, and report on their operational lessons.

View presentation materials here:

https://www.dropbox.com/s/g3bsqb60i666fnz/5%20Global%20Experiences%20with%20HPC%20Operational%20Data%20Measurement%2C%20Collection%20and%20Analysis5%09Global%20Experiences%20with%20HPC%20Operational%20Data%20Measurement%2C%20Collection%20and%20Analysis.pdf?dl=0

Slack link:

https://join.slack.com/t/eehpcwg/shared_invite/zt-hakykqvl-34mkigfUdCOqv26~tgA4pg

Access the PDF of the paper here:

https://www.dropbox.com/s/a3lffmo6cytg8iv/Global%20Experiences%20with%20HPC%20Operational%20Data%20Measurement%2C%20Collection%20and%20Analysis.pdf?dl=0

A Study of Operational Impact on Power Usage Effectiveness using Facility Metrics and Server Operation Logs in the K Computer

AUTHORS: Masaaki Terai, Fumiyoshi Shoji, Toshiyuki Tsukamoto, Yukihiro Yamochi

KEYWORDS: power usage effectiveness, energy efficiency, co-generation system, K computer

ABSTRACT: The official service of the K computer ended in 2019. Most of the equipment, except for the servers, has been enhanced and continues to be used as part of the infrastructure for the successor system, Fugaku. To ensure stable and energy-efficient operations in the next decade, understanding the facility's behavior during the period of the K computer is valuable.

The K computer was powered by two energy sources: electricity purchased from a utility company and energy generated by gas turbine power generators on the premises. To evaluate the energy efficiency of the entire center, we use a modified power usage effectiveness (PUE) metric that accounts for the different forms of energy purchased from utility companies, and we report the metric over the service period. To analyze the effect of operational events on PUE, we use both facility metrics and server operation metrics extracted from the logs of the K computer. Further, using three cases drawn from the metric data, we show that some maintenance operations degrade PUE; in particular, annual maintenance operations tend to affect the PUE metric more than emergency operations do. Finally, as a preliminary study, we show that there is an operational issue regarding the gas co-generation system.

View presentation materials here:

https://www.dropbox.com/s/6ztkqld9599e27w/6%09A%20Study%20of%20Operational%20Impact%20on%20Power%20Usage%20Effectiveness%20using%20Facility%20Metrics%20and%20Server%20Operation%20Logs%20in%20the%20K%20Computer.pdf?dl=0

Slack link:

https://join.slack.com/t/eehpcwg/shared_invite/zt-hakykqvl-34mkigfUdCOqv26~tgA4pg

Access the PDF of the paper here:

https://www.dropbox.com/s/k8pxj9wtxct840j/A%20Study%20of%20Operational%20Impact%20on%20Power%20Usage%20Effectiveness.pdf?dl=0

A Supercomputing Center Case Study on Cooling Control Design

AUTHORS: Michael Kercher, Gary New

KEYWORDS: data center, controls, cooling

ABSTRACT: After designing and implementing an automated control system for a new HPC center, the National Center for Atmospheric Research (NCAR) elected to use a simpler operator-based solution. The solution has proven successful, and this case study documents the reasons for both the decision and the process used to choose it. Additional refinements to the cooling system controls are also documented and their adoption explained.

View presentation materials here:

https://www.dropbox.com/s/ttvcu0ncfjeg70e/7%20A%20Supercomputing%20Center%20Case%20Study%20on%20Cooling%20Control%20Design.pptx?dl=0

Slack link:

https://join.slack.com/t/eehpcwg/shared_invite/zt-hakykqvl-34mkigfUdCOqv26~tgA4pg

Access the PDF of the paper here:

https://www.dropbox.com/s/o5ed2gr8fbzsc6b/A%20Supercomputing%20Center%20Case%20Study%20on%20Cooling%20Control%20Design.pdf?dl=0

Investigative Report on Electrical Commissioning in HPC Data Centers

AUTHORS: Joseph Prisco, Grant Stewart, Herbert Huber, Randy Rannow, Jason Hick, Dave Martinez, Brandon Hong, Aditya Deshpande

KEYWORDS: commissioning, electrical infrastructure, high performance computing

ABSTRACT: The Energy Efficient High Performance Computing Working Group (EE HPC WG) has assembled a small, diverse team to write a short investigative report on electrical commissioning. The purpose of the investigative report is to evaluate the need for electrical commissioning guidelines specific to High Performance Computing (HPC) data centers, given their unique IT equipment load densities and power profiles. It is the consensus of the team that special electrical commissioning guidelines are needed, and the EE HPC WG will author the initial guidelines. The scope of the guidelines will include the static and dynamic electrical aspects of commissioning practices that are specific to high performance computing and, more importantly, will cover the transient aspects of electrical commissioning. The fluctuating power draw of many compute nodes can dramatically influence the generation, transmission, and distribution of electrical power. HPC data center lessons learned and best practices will be examined and used to enhance the electrical commissioning guidelines. The primary audience for the guidelines is facility engineers and operators of HPC data centers. The guidelines will also be applicable to others who support HPC data centers, ranging from utilities and their electrical grid infrastructure to IT equipment manufacturers whose machines are being commissioned at the end of the process.

View presentation materials here:

https://www.dropbox.com/s/ga76lm86fsaf1vs/8%20Investigative%20Report%20on%20Electrical%20Commissioning%20in%20HPC%20Data%20Centers.pptx?dl=0

Slack link:

https://join.slack.com/t/eehpcwg/shared_invite/zt-hakykqvl-34mkigfUdCOqv26~tgA4pg

Access the PDF of the paper here:

https://www.dropbox.com/s/52cmvenegix8802/Investigative%20Report%20on%20Electrical%20Commissioning%20in%20HPC%20Data%20Centers.pdf?dl=0