EE HPC WG: Dashboard Team

Sustainably supporting science through committed community action

The Dashboard Team has published general recommendations for selecting the energy efficiency elements of HPC data center dashboards (see publications below). A dashboard is a display that provides critical feedback to its users. Carefully selecting the elements to be displayed on the energy dashboard is important, as energy management is a shared responsibility of all stakeholders: operations managers, facilities managers, and system administrators.

A survey of major HPC data centers in the US, Europe and Japan was done in 2017 to assess the current state of using dashboards for monitoring and managing energy efficiency. Three out of eleven sites were identified as having tightly integrated and very capable systems. All of the remaining eight sites reported having dashboards that were still partial and under construction.

Eleven sites participated in the survey, a response rate of over 50%. Six were from the United States (LLNL, LBNL/NERSC, NREL, LANL, ANL, PNNL), four from Europe (LRZ, HLRS, ECMWF, and CEA), and one from Japan (RIKEN). The survey results reflected facility-manager stakeholders more than IT managers.

Summary:

Several of the sites have very sophisticated capabilities that are tightly integrated with many data sources. Among the questionnaire respondents, they set the high bar.

Other sites have isolated (not tightly integrated) solutions.

There is a desire for more/better capabilities from every site.

Q1 - Do you have dashboards that you use to help with operational energy management of your HPC center and/or HPC system?

All sites have some sort of dashboard, but the capabilities are wide-ranging.

Some sites use custom solutions while others use off-the-shelf solutions.

Q2 - List what specific elements/energy data are being collected by the dashboard system and what elements are being displayed.

In almost all cases, dashboards display power and energy at various levels (PDU, chiller, rack), water and air temperatures, and a floor map.

Some show PUE and some calculate PUE offline.

In most cases collected data is stored forever.

With different systems being used, there did not seem to be any consistency in what is displayed or how.
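As a point of reference for the PUE item above: PUE is the ratio of total facility energy to IT equipment energy over the same period. The following is a minimal Python sketch of an offline calculation, assuming equally spaced power samples. The function names and the sample readings are hypothetical and are not taken from any surveyed site.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def pue_from_samples(samples):
    """PUE over a period from (facility_kw, it_kw) samples.

    Assumes the samples are taken at equal intervals, so the interval
    length cancels out of the ratio.
    """
    facility_total = sum(facility_kw for facility_kw, _ in samples)
    it_total = sum(it_kw for _, it_kw in samples)
    return pue(facility_total, it_total)


# Hypothetical hourly readings: (facility kW, IT kW)
readings = [(1300.0, 1000.0), (1250.0, 1000.0), (1350.0, 1000.0)]
print(round(pue_from_samples(readings), 3))
```

Summing over the whole period before dividing weights each interval by its energy, which is generally preferable to averaging per-interval PUE values.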

Q3 - How is the energy dashboard data being used now for operational energy management and energy efficiency?

All of the sites mention trend data.

Most of the sites also mention continuous improvement/optimization.

Several sites specifically mention calculating PUE.

Q4 - What were your expectations when the data center decided to implement an energy dashboard?

Most of the sites were interested in continuous improvement/optimization of facility systems, especially cooling and power.

A few of the sites mention calculating PUE.

Q5 - Does the current energy dashboard (and any supporting software) meet your expectations?

Eight sites said yes, although with room for improvement.

Q6 - Given the information and/or limitations of your current energy dashboard (and supporting software) and the advancement of technology in this area, list your new expectations (or features) for dashboard systems to improve operational energy management and efficiency within two years (or earlier).

The next step would be to integrate dashboard environmental data with job scheduling on HPC systems to improve overall operations. Studies are currently underway, and this is a 5-year outlook.

We have a number of data elements that we would like to add to our centralized data collection system (some mentioned above), and begin to correlate among different kinds of data. We are looking at collecting job data in the system and correlating it with energy usage, for example.
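One simple way to begin correlating job data with energy usage, as this response describes, is a Pearson correlation between per-interval job activity and per-interval energy. The sketch below is illustrative only: the `pearson` helper, the variable names, and the sample values are assumptions, not data from any surveyed site.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical hourly values: nodes busy with jobs, and metered energy (kWh)
busy_nodes = [100, 200, 300, 400]
energy_kwh = [150.0, 250.0, 350.0, 450.0]

r = pearson(busy_nodes, energy_kwh)
print(f"correlation between busy nodes and energy: {r:.2f}")
```

A coefficient near 1 would suggest energy tracks job load closely; a weak correlation may point to large fixed overheads or to the sensor inaccuracies some sites report.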

Most of the basic capabilities of the dashboards are well established, but as working prototypes. If time/interest is available, we would go back and consolidate into an overall cohesive approach. Most likely a new approach would use an Elasticsearch/Logstash/Kibana (ELK) stack or an Influx/Telegraf/Grafana stack. Investment would need to be made in Logstash (ELK) or Telegraf to enable interfacing with energy monitoring systems. NREL currently uses SNMP, Modbus, BACnet, screen-scraping from webpages, custom interfaces to vendor-proprietary implementations such as iLO (HPE), XML-RPC, and a few other odds and ends to acquire data from the various sources. A unified approach would be extremely useful.
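To illustrate what a unified approach to heterogeneous sources could look like, the sketch below normalizes readings from two kinds of sources into one common record and renders it as a text line in the style of InfluxDB line protocol. It is a simplified sketch, not NREL's implementation: the `Reading` record, the adapter functions, the unit conversions, and the device names are all hypothetical, and the escaping covers only commas, equals signs, and spaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """Common record for a measurement, whatever protocol produced it."""
    source: str        # e.g. "snmp", "modbus", "bacnet"
    device: str
    metric: str
    value: float
    timestamp_ns: int


def from_snmp_watts(device, watts, timestamp_ns):
    # Hypothetical adapter: an SNMP poll that reports power in watts.
    return Reading("snmp", device, "power_kw", watts / 1000.0, timestamp_ns)


def from_modbus_register(device, raw, scale, timestamp_ns):
    # Hypothetical adapter: a Modbus register holding a scaled integer.
    return Reading("modbus", device, "power_kw", raw * scale, timestamp_ns)


def _escape(text, chars):
    """Backslash-escape each of the given characters in text."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line(reading):
    """Render a reading as 'measurement,tags fields timestamp'."""
    measurement = _escape(reading.metric, ", ")
    source = _escape(reading.source, ",= ")
    device = _escape(reading.device, ",= ")
    return (f"{measurement},source={source},device={device} "
            f"value={float(reading.value)} {reading.timestamp_ns}")


ts = 1500000000000000000  # hypothetical timestamp in nanoseconds
for reading in (from_snmp_watts("pdu 7", 4200, ts),
                from_modbus_register("chiller1", 85, 0.5, ts)):
    print(to_line(reading))
```

Keeping the source-specific logic in small adapters means a new protocol only needs one new function, while storage and display code sees a single record type.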

Our goal is to optimize total facility operation costs against system load, outdoor temperature, gas/electricity rates, etc. To realize it, we need some new functions for the dashboard system, as follows:

Real time monitoring for raw data of equipment (e.g. power consumption of chillers, inlet/outlet water temp. of chillers, etc.)

Smart energy management which can indicate best operation mode against the situation (e.g. system load, outdoor temp., gas/electricity rates, etc.)

Automatic operation of equipment (e.g. chillers, power generators, cooling towers, water pumps, air handlers, etc.) controlled by energy management system.
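The "best operation mode" idea in the items above can be sketched as a cost comparison across the modes that are feasible under current conditions. The following Python example is a deliberately simplified illustration: the `Mode` record, the mode names, the consumption figures, and the rates are all hypothetical, and a real energy management system would also have to account for system load, equipment constraints, and switching costs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    """A hypothetical cooling operation mode."""
    name: str
    electricity_kw: float   # electrical draw in this mode
    gas_kw: float           # gas consumption in this mode
    max_outdoor_c: float    # mode is usable only at or below this temperature


def hourly_cost(mode, electricity_rate, gas_rate):
    """Cost per hour, with rates given in currency per kWh."""
    return mode.electricity_kw * electricity_rate + mode.gas_kw * gas_rate


def best_mode(modes, outdoor_c, electricity_rate, gas_rate):
    """Return the cheapest mode that is feasible at the outdoor temperature."""
    feasible = [m for m in modes if outdoor_c <= m.max_outdoor_c]
    if not feasible:
        raise ValueError("no feasible mode for these conditions")
    return min(feasible, key=lambda m: hourly_cost(m, electricity_rate, gas_rate))


# Hypothetical modes and figures, for illustration only
MODES = [
    Mode("electric chiller", 300.0, 0.0, float("inf")),
    Mode("gas-fired chiller", 50.0, 400.0, float("inf")),
    Mode("free cooling", 60.0, 0.0, 15.0),
]

choice = best_mode(MODES, outdoor_c=10.0, electricity_rate=0.20, gas_rate=0.04)
print(choice.name)
```

The same comparison re-run as rates or outdoor temperature change would give the indication of the best mode that the response asks for; automatic operation would then act on that result.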

I expect a real dashboard on a single screen that would allow me to drill down through layers. The layout would include best practices that I have not even considered, drawn from ideas from many other data centers. There should be “canned” data searches based on those best practices to present to me relevant graphs and charts that I don’t even know I need at this moment. I don’t want to support the software; I want to farm that out to the vendor and just bask in the compiled data.

The dashboard we expect should show all the data needed in a single view and should allow direct drill-down to all the technical information needed. It should allow for flexible trends of pre-defined data analytics using best practices developed in-house or at other sites.

Currently we are seeking to interface with the Johnson Controls Metasys N2 Gateway, which is managed by the building engineering department, and import SNMP data into our Schneider Niagara BMS and OpenDCIM where applicable. The ultimate goal is to compute the PUE of the data center while using monitoring and trend analysis to implement energy saving projects.

Schneider is continuously updating PME to keep up with customers’ demands. Version 8.1 is expected to be released within the next few months.

MAPE V2.0 software will soon be obsolete, so we will have to choose whether to keep MAPE with a new software generation or change to another product.

What I would like to see is an easier method of integration to allow information to be shared from BMS systems to EMS systems. Greater BMS integration would allow more of the sensor points to be pulled to a central location where they can be monitored and analytics performed using a wider variety of data from the site’s infrastructure systems.

A finer granularity on electricity usage (e.g. per rack, cooling tower fan power etc.) would be helpful. However, we currently have no plans for a new dashboard.

Q7 - Feel free to enter your other general thoughts, observations, lessons learned, etc., on the dashboard system(s) you currently use.

Our primary lesson learned is how inaccurate various data sources are. Temperature and relative humidity sensors tend to be very imprecise. Likewise, current sensors tend to be imprecise at low utilization. We are going to move to higher precision sensors when installing new equipment.

Our dashboards have evolved over time. First iterations were primitive graphics to ensure that we understood the systems: the end-to-end flow of electrical and thermal energy at a point in time. Figure 12 is in that type of primitive graphic format. Quite a bit of work was involved in acquiring data from the many data sources. Next iterations added the ability to store data points displayed on the consolidated system graphics, with basic graph capabilities. Figure 9 shows the primitive graphic capabilities. While primitive, this capability is still available and in daily use on most NREL dashboards.

When we need to extend or add some functions for the dashboard system, there is no vendor choice other than the system installation vendor. We have to consider a better way to avoid such vendor lock-in.

Nobody is happy with the current system. But everybody is used to it. Replacing a not so good system with another not so good system makes no sense. Implementing a dashboard that is a major improvement is a big project. No funding/staff is available.

Lessons learned: while software manufacturers say that their software works with other third-party suppliers, in reality not all of the points are made available to you as a user, and you only find this out when you are a long way into the configuration of a system. PME is not as intuitive to set up and use as some of the other Schneider products and requires a fair amount of engineering time to be set up by Schneider, whereas the DCE system was set up by ourselves very easily.

Publications:

Bates N, Hsu CH, Imam N, Wilde T, Sartor D, "Re-examining HPC Energy Efficiency Dashboard Elements," Proceedings of the 12th Workshop on High-Performance Power-Aware Computing, held in conjunction with the International Parallel and Distributed Processing Symposium. Chicago, Illinois. 2016.
Sartor D, Mahdavi R, Radhakrishnan B, Bates N, et al., "General Recommendations for High Performance Computing Data Center Energy Management Dashboard Display," 9th Workshop on High-Performance Power-Aware Computing, held in conjunction with the International Parallel and Distributed Processing Symposium. Boston, Massachusetts. 2013.
