Welcome to the ISC 2025 Sustainable Supercomputing workshop!
Providing a sustainable path for supercomputing is a pressing topic for our community, industry, and governments. Supercomputing has an insatiable appetite for computational cycles, yet delivering performance-per-Watt advances is becoming ever harder given silicon technology trends. All of this sits within the context of climate change, the drive towards net zero, and economic pressures driven by geopolitical challenges.
Improving the sustainability of supercomputing offers many opportunities when the end-to-end cycle is considered: from the design of computational circuits and systems, to the power and cooling used to operate them, to the suite of software tools used to administer, maintain, and raise the operational efficiency of HPC systems. All elements of the system must be considered, from compute nodes and interconnects to the IO and storage components.
This workshop will gather users, researchers, and hardware and software developers to address the opportunities and challenges of sustainability in the supercomputing context.
Agenda
Friday 13th June:
2.00pm-2.05pm – Introduction to workshop
2.05pm-2.50pm – Keynote: The journey from Fugaku to Fugaku-next from an energy efficiency/sustainability perspective
Keiji Yamamoto, Unit leader of Advanced Operation Technology Unit, Operations and Computer Technologies Division, R-CCS, RIKEN.
Keiji was a technical lead of the feasibility study of facility design for Fugaku-next, which is scheduled to be available in 2030.
The feasibility study includes renewable-energy-aware facility design, heat reuse, energy storage, and system operation for energy efficiency.
Keiji will also lead the unit responsible for facility construction in the Fugaku-next R&D project.
2.50pm-3.15pm – A European Supercomputing Center Perspective: Green Today, Greener Tomorrow
Maximilian Höb is a researcher at the Leibniz Supercomputing Centre (LRZ) of the Bavarian Academy of Sciences and Humanities, focusing on containerized HPC applications, energy efficiency, and quantum computing integration. As part of his research, he investigates resource usage and scalability in container-based HPC environments and contributes to projects on hybrid workflows. His work supports the development of sustainable, exascale-ready infrastructures for scientific computing.
3.15pm-3.30pm – Green HPC: a new Carpentries lesson
Raising awareness and improving carbon literacy of all stakeholders in the provision, operation and use of HPC systems is critical to making progress on reducing emissions from the sector, and complements continued technological improvements. We have developed a new introductory-level, open-source lesson introducing concepts and issues around emissions from HPC systems that is accessible to researchers using HPC systems, RSEs supporting researchers, operators of HPC systems, and those involved in procuring HPC systems. The lesson also introduces concrete actions that different stakeholders can take to reduce emissions. In this brief presentation I will give an overview of the lesson, the topics covered, and how people can get involved in teaching and improving the materials. I would also like to ask the wider group what other training materials are needed beyond this lesson to support the HPC sector in reducing emissions. The lesson is part of the Carpentries Incubator and can be found at: https://github.com/carpentries-incubator/green-software-hpc
Andy Turner is a senior RSE at EPCC at the University of Edinburgh. His current work is focused on supporting researchers to get the most out of UK national HPC resources such as the ARCHER2 and DiRAC services. For the past few years, he has been working to improve the understanding and quantification of emissions associated with HPC system use and how these can be reduced.
3.30pm-3.45pm – Picking Low Hanging Fruit of the Energy Efficiency Tree
Legislative pressure from carbon-neutrality goals, paired with economic incentives from rising energy costs, has increased the research focus on energy efficiency within the HPC community. New scientific discoveries push energy efficiency gains through increasingly complex models and system adjustment strategies, but in practice far less sophisticated methods are often sufficient to improve the energy efficiency of real-world production systems. This lightning talk illustrates the energy-saving potential of two system configuration changes at CLAIX-2023, the current compute cluster of RWTH Aachen University, composed of a 630-node CPU partition and a 50-node GPU partition. First, the clock frequency reduction of idle batch nodes is described. Second, the possibility of modifying chassis fan profiles is highlighted. For each of these adjustments, the potential energy savings are quantified. The latter example also highlights the need for continuous monitoring to avoid energy efficiency regressions going unnoticed. This talk aims to be intelligible to all audience groups while giving relevant starting points for any energy efficiency effort.
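To make the first of these adjustments concrete, here is a minimal Python sketch of how idle batch nodes could be identified from scheduler output and downclocked via the Linux cpufreq sysfs interface. This is not the CLAIX-2023 implementation; the node names, frequency cap, and `sinfo` format string are illustrative assumptions.

```python
# Sketch: pick out idle nodes from `sinfo`-style output and build a
# (hypothetical) remote command that caps their CPU frequency.
# Values and node names are illustrative only.

def parse_idle_nodes(sinfo_output: str) -> list[str]:
    """Parse output in the style of `sinfo -h -N -o '%n %t'`
    (one 'nodename state' pair per line) and return idle node names."""
    idle = []
    for line in sinfo_output.strip().splitlines():
        name, state = line.split()
        if state == "idle":
            idle.append(name)
    return idle

def downclock_command(node: str, khz: int = 1_500_000) -> str:
    """Build a remote command that caps the node's maximum CPU frequency
    via the Linux cpufreq sysfs interface (assumed frequency in kHz)."""
    return (f"ssh {node} 'echo {khz} | "
            f"tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq'")

if __name__ == "__main__":
    sample = "n001 idle\nn002 alloc\nn003 idle"
    for node in parse_idle_nodes(sample):
        print(downclock_command(node))
```

In practice a site would apply such a cap from a scheduler epilog or a monitoring daemon, and restore the full frequency range before a node receives its next job.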
Christian Wassermann is a PhD candidate in the HPC group at the IT Center of RWTH Aachen University. Since 2016, he has been actively involved in this research group while pursuing his Bachelor's degree in Scientific Programming and his Master's degree in Computer Science. Having worked on different HPC topics, including performance analysis and modeling of parallel applications, his current research focuses on the energy-aware optimization of HPC center operations. This involves tasks like operational data analytics, for which he operates the cluster performance and energy monitoring system of the RWTH compute cluster and its associated building infrastructure. In 2024, he started contributing to the Energy Efficiency HPC Working Group (EE HPC WG) by providing feedback on the usability of the Green500 methodology and by adding a new examples section to the Green500 methodology document.
3.45pm-4.00pm – The modification of a power fluctuation monitoring system for protecting cooling facilities in Fugaku
The A64FX processor used in Fugaku provides power-saving features including retention, controllable at the core and node level. For nodes assigned with jobs, "core retention" reduces idle cores' power consumption, while "node retention" saves power for unused nodes. Ideally, we would enable these retention features across the entire system immediately for energy savings. However, this causes larger power fluctuations that cooling facilities with mechanical operations cannot follow, creating failure risks that require protection mechanisms. Additionally, the previous power monitoring system detected exceedances too slowly. We updated it to monitor all racks' power consumption at a 1Hz frequency. After the update, when power falls below set limits, the system disables node retention for idle nodes to temporarily increase consumption, preventing sudden drops. Conversely, for cases exceeding upper limits, we implemented a job-cutting mechanism. In this lightning talk, we will introduce our project's approaches to these challenges.
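The decision logic described above can be sketched as a simple per-tick control function. This is an illustrative Python sketch only; the real Fugaku monitoring system, its telemetry interface, and its thresholds are not public, and the names here are assumptions.

```python
# Illustrative 1 Hz control logic: compare total rack power against lower
# and upper limits and choose a corrective action. Not the Fugaku code.

def control_action(total_power_kw: float, lower_kw: float, upper_kw: float) -> str:
    """Decide the corrective action for one 1 Hz monitoring tick."""
    if total_power_kw < lower_kw:
        # Power dropping below the limit: wake idle nodes (disable node
        # retention) to temporarily raise consumption and soften the drop.
        return "disable_node_retention"
    if total_power_kw > upper_kw:
        # Power exceeding the upper limit: shed load by cutting jobs.
        return "cut_jobs"
    return "none"

def monitor_tick(rack_powers_kw: list[float], lower_kw: float, upper_kw: float) -> str:
    """Aggregate per-rack readings for one tick and pick an action."""
    return control_action(sum(rack_powers_kw), lower_kw, upper_kw)
```

The point of the 1 Hz sampling is that both branches must trigger fast enough for mechanically operated cooling facilities to follow the resulting power trajectory.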
Masaaki Terai is a Technical Scientist at the Facility Operations and Development Unit, Operations and Computer Technologies Division, RIKEN Center for Computational Science (R-CCS), with research interests in operational data analysis, data center equipment optimization, and research support services. He has been with RIKEN since 2010, working on the K computer and Fugaku supercomputing projects. He earned his Ph.D. in Information Science from the Japan Advanced Institute of Science and Technology (JAIST) in 2006. His research focuses on energy efficiency and management of power and cooling systems for high-performance computing environments.
4.00pm-4.30pm – Coffee Break
4.30pm-4.45pm – Power and Energy Limitations and Solutions
Power and energy are becoming limiters of performance from the chip to the data center. In this talk I will describe how tuning for energy and power can enable better performance and data center throughput. I will also give a quick overview of a new software capability, Workload Power Profiles, released with the Blackwell GPU and aimed at reducing the effort of tuning code using power, for best performance or best performance per Watt.
Ian Karlin is a Principal Engineer in the Accelerated Product Group at NVIDIA, where he leads large RFP responses. He also works with engineering teams on benchmarking, power and energy, and other future-looking technologies. Prior to NVIDIA, he spent almost 10 years at Lawrence Livermore National Laboratory.
4.45pm-5.00pm – Early Research in Load-Following Management for HPC-Nuclear Integration
With the rising demand for high performance computing (HPC) and artificial intelligence (AI) systems, maintaining a stable and efficient power supply is increasingly critical. The HPC team at Idaho National Laboratory is spearheading efforts to seamlessly integrate HPC systems with nuclear reactors. This lightning talk explores one early strategy for managing power fluctuations using software-defined controls. To effectively harness nuclear reactors for power generation, control mechanisms are essential to address the slow load-following capabilities of reactors, which are typically around 5% per minute. While this rate is sufficient for many uses, large HPC systems can see power consumption change by tens of megawatts when jobs start or stop running. A reactor could overproduce power to match the peak power rating of the HPC system; however, when the system is not running a job, or a job unexpectedly stops, load-following would suffer, power would be wasted, and the likelihood of power transients would increase. Controlling the ramp rate of power consumption on these systems to match the load-following rate of reactors is one piece of the puzzle in properly utilizing nuclear reactors as a power source for HPC systems.
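The scale of the mismatch is easy to quantify. The following back-of-the-envelope Python sketch uses the ~5%-per-minute load-following figure from the abstract; the reactor capacity and job-swing numbers are illustrative assumptions, not INL data.

```python
# Back-of-the-envelope arithmetic for the reactor/HPC load mismatch.
# Reactor size and job power swing below are illustrative values only.

def ramp_time_minutes(delta_mw: float, reactor_mw: float,
                      rate_per_min: float = 0.05) -> float:
    """Minutes a reactor needs to follow a load change of delta_mw,
    given a load-following rate expressed as a fraction of capacity
    per minute (default: 5%/min, as cited in the abstract)."""
    return delta_mw / (reactor_mw * rate_per_min)

# A hypothetical 30 MW job-start swing against a 100 MW reactor:
# 30 / (100 * 0.05) = 6 minutes of ramping, while the HPC load
# itself can change in seconds, motivating software-side ramp control.
print(ramp_time_minutes(30, 100))
```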
Brandon Biggs is a High Performance Computing (HPC) Systems Administrator in Idaho National Laboratory’s Advanced Scientific Computing Department. He has experience with HPC system management, software integration, web development, and machine learning. He is pursuing a PhD at the University of Idaho in Idaho Falls, Idaho; his research interests focus on sustainable and energy-efficient HPC. He obtained an M.S. and a B.S. in computer science from Idaho State University.
5.00pm-5.15pm – Integrating Energy-Aware Scheduling with a Slurm Site Factor Priority Plugin
As energy demand from High Performance Computing (HPC) systems grows, it places increasing strain on electric grid infrastructure, highlighting the need for intelligent load balancing strategies. Energy-aware scheduling offers a promising solution by enabling HPC systems to act as responsive, grid-aware loads. While extensively studied in theory, real-world deployment remains limited, partly due to challenges integrating new strategies into trusted schedulers like Slurm. To address this, we propose a low-risk implementation pathway that leverages Slurm’s native site factor plugin to introduce energy-aware job prioritization without modifying the core scheduler. This approach requires only three inputs: predicted job power, system time, and an external energy signal. Power predictions are generated at job submission via a Lua hook that sends the job script to a reserved GPU node, where it is encoded with a Large Language Model, matched against a vector database, and the result is stored in the job’s comment field. An external service writes the energy signal to a shared file on the HPC filesystem. The site factor plugin accesses these inputs to dynamically reprioritize jobs. This architecture supports incremental deployment, aligns with existing workflows, and provides a practical path for integrating energy-awareness into HPC operations.
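The core prioritization idea can be sketched in a few lines. This is a minimal Python illustration of the concept, not NREL's implementation: Slurm's site factor plugin is a native plugin, and the field names, signal convention (0.0 = scarce power, 1.0 = abundant power), and scaling below are assumptions for illustration.

```python
# Conceptual sketch of an energy-aware site factor: combine a job's
# predicted power draw with an external energy signal to nudge its
# Slurm priority. Signal convention and scaling are assumptions.

def site_factor(predicted_power_kw: float, energy_signal: float,
                scale: int = 1000) -> int:
    """Return an integer site factor (higher = higher priority).
    When the energy signal indicates abundant power (near 1.0),
    power-hungry jobs are boosted; when power is scarce (near 0.0),
    they are penalized proportionally to their predicted draw."""
    # Center the signal at 0.5 so scarcity penalizes and abundance
    # rewards large jobs, leaving a neutral signal with no effect.
    return int(predicted_power_kw * (energy_signal - 0.5) * scale)
```

In the architecture described above, `predicted_power_kw` would come from the job's comment field (written by the submission-time prediction hook) and `energy_signal` from the shared file on the HPC filesystem, with the plugin recomputing factors periodically.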
Kevin Menear is a research scientist in the Digital Twins and Scalable Workflows group at the National Renewable Energy Laboratory (NREL). His research focuses on operational data analytics for high-performance computing systems, with an emphasis on developing predictive models using machine learning and advancing scheduling simulation methodologies. His broader interests lie in the modeling and simulation of dynamic, interconnected systems, particularly those involving tightly coupled computational and physical components, and in designing intelligent, data-driven control strategies to address complex operational challenges.
5.15pm-5.30pm – The case for year-round waste heat reuse
DKF (KIFU) installed its latest Komondor HPC hosting facility to enable waste heat reuse. The energy crisis in Europe created a business case for the nearby swimming pool to connect its pool heating to the warm-water loop of our HPE Cray EX system. The project has been running successfully for almost a year. This talk will cover practical lessons learned when implementing a heat reuse scheme, along with plans for our next HPC development and our commitment to more heat reuse.
Zoltan Kiss has headed Hungarian HPC activities since 2013 and has operated warm-water-cooled systems since 2015. He plans ICT projects with sustainability in mind, approaches ICT/software development with a DevOps mindset, and works on AI infrastructure planning.
5.30pm-6.00pm – Panel of speakers / Extended Q&A
For the lightning talk session, we solicit talks on all topics related to sustainable supercomputing, from data centres to hardware, and from system software to applications. The topics of interest include, but are not limited to, the following:
Application energy/power efficiency
System and data centre power monitoring
System software for managing/optimising power
Energy efficient data centres
Embodied carbon
Please submit a single-paragraph description (max 200 words) of your proposed lightning talk (~10 mins), and a short bio (max 200 words) of yourself, to woodacre@hpe.com with the subject "ISC25 lightning talk".
Deadline for submissions: April 18th 2025
Organisers:
Mike Woodacre, Hewlett Packard Enterprise, Stateline, Nevada, USA
Pekka Manninen, CSC, Helsinki, Finland
Fumiyoshi Shoji, RIKEN R-CCS, Japan
Michele Weiland, EPCC, The University of Edinburgh, UK
Jim Rogers, Oak Ridge National Lab, Tennessee, USA
Natalie Bates, Energy Efficient HPC Working Group, USA