EE HPC State of the Practice Workshop Presentations 2020

Event: EE HPC SOP 2020 Technical Program

Date: Monday, September 14th

Time: 12:00 UTC, 5AM PT, 6AM MT, 7AM CT, 8AM ET, 2PM CEST, and 9PM JST.

Duration: 3 Hours




Energy optimization and analysis with EAR


AUTHORS: Julita Corbalan, Lluis Alonso, Jordi Aneas, Luigi Brochard


KEYWORDS: Energy efficiency, System software, Energy Optimization, Application analysis, Data centers


ABSTRACT:  EAR is an energy management framework that offers three main services: energy accounting, energy control, and energy optimization. The last is provided by the EAR runtime library (EARL), a dynamic, transparent, and lightweight runtime library that provides energy optimization and control. EARL optimizes energy by selecting the optimal CPU frequency based on the selected energy policy and the application's runtime characteristics, without any application modification or user input. Currently EARL works only for MPI applications, although EAR itself can still operate with non-MPI applications. EARL automatically and transparently identifies iterative regions (loops) and computes a set of metrics per iteration, the application signature. Together with the system signature, it then applies energy models to estimate the execution time and power for each available CPU frequency. The system signature is a set of per-node coefficients computed during EAR installation via a learning phase. Given the time and power projections, EARL selects the best frequency based on the policy settings.


This paper shows how to optimize energy using the EAR library with the min_time_to_solution energy policy and how to analyze applications through the EAR framework. The evaluation includes eight applications with different sizes and application signatures. Results show how EARL computes each application signature on the fly and applies the CPU frequency selected by the min_time_to_solution policy.
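The selection step described in the abstract (project time and power per frequency, then pick a frequency according to a policy) can be sketched as follows. This is a minimal illustration only: the `Projection` type, the `select_frequency` function, the `min_gain` threshold, and the rule "raise the frequency while the projected time reduction keeps pace with the frequency increase" are all assumptions made for this example. The abstract does not describe EAR's actual models, thresholds, or policy logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Projection:
    """Projected behaviour of one loop iteration at a given CPU frequency.

    Hypothetical type: in the abstract these projections come from energy
    models applied to the application and system signatures.
    """
    freq_ghz: float
    time_s: float
    power_w: float

    @property
    def energy_j(self) -> float:
        # Energy is average power multiplied by execution time.
        return self.time_s * self.power_w


def select_frequency(projections, default_ghz, min_gain=0.7):
    """Pick a frequency with a time-oriented rule (an assumed policy).

    Start at the lowest projected frequency that is >= default_ghz and step
    up while the relative time reduction is at least `min_gain` times the
    relative frequency increase. Stop at the first step that does not pay off.
    """
    ordered = sorted(projections, key=lambda p: p.freq_ghz)
    candidates = [p for p in ordered if p.freq_ghz >= default_ghz]
    if not candidates:
        raise ValueError("no projection at or above the default frequency")

    best = candidates[0]
    for nxt in candidates[1:]:
        freq_gain = (nxt.freq_ghz - best.freq_ghz) / best.freq_ghz
        time_gain = (best.time_s - nxt.time_s) / best.time_s
        if time_gain < min_gain * freq_gain:
            break  # the extra frequency no longer buys enough speed
        best = nxt
    return best
```

With made-up projections, a loop whose time scales almost inversely with frequency would be moved to the highest frequency, while a loop whose time barely changes (for example, a memory-bound one) would stay at the default frequency.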




Toward an End-to-End Auto-tuning Framework in HPC PowerStack


AUTHORS:  Xingfu Wu, Aniruddha Marathe, Siddhartha Jana, Ondrej Vysocky, Jophin John, Andrea Bartolini, Lubomir Riha, Michael Gerndt, Valerie Taylor, Sridutt Bhalachandra


KEYWORDS:  Power, energy, end-to-end tuning, auto-tuning, HPC, PowerStack


ABSTRACT:  Efficiently utilizing procured power and optimizing performance of scientific applications under power and energy constraints are challenging. The HPC PowerStack defines a software stack to manage power and energy of high-performance computing systems and standardizes the interfaces between different components of the stack.


This survey paper presents the findings of a working group focused on the end-to-end tuning of the PowerStack. First, we provide a background on the PowerStack layer-specific tuning efforts in terms of their high-level objectives, the constraints and optimization goals, layer-specific telemetry, and control parameters, and we list the existing software solutions that address those challenges. Second, we propose the PowerStack end-to-end auto-tuning framework, identify the opportunities in co-tuning different layers in the PowerStack, and present specific use cases and solutions. Third, we discuss the research opportunities and challenges for collective auto-tuning of two or more management layers (or domains) in the PowerStack. This paper takes the first steps in identifying and aggregating the important R&D challenges in streamlining the optimization efforts across the layers of the PowerStack.




Evaluation of Power Controls on Supercomputer Fugaku


AUTHORS:  Yuetsu Kodama, Tetsuya Odajima, Eishi Arima, Mitsuhisa Sato


KEYWORDS:  supercomputer Fugaku, power controls, power-knobs, clock frequency scaling, low-power state, variation of power


ABSTRACT:  The supercomputer "Fugaku", which recently ranked number one in multiple supercomputing lists including the Top500 in June 2020, has various power control features such as (1) eco mode, which utilizes only one of the two floating-point pipelines while decreasing the power supply to the chip; (2) boost mode, which increases the clock frequency; and (3) core retention, which puts unused cores into a low-power state. By orchestrating these power-performance features while considering the characteristics of running applications, we can potentially gain even better system-level energy efficiency. In this article, we report the effectiveness of these features using the pre-evaluation environment for Fugaku. As a result, we confirmed several prominent results useful for Fugaku system operation, including: remarkable power reduction and energy-efficiency improvement by coordinating the eco mode and core retention in the memory-intensive case; a 10% speed-up with a 17% power increase by the boost mode in the CPU-intensive case; and considerable power variations across over 20K nodes.
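The boost-mode figures above can be turned into a rough energy estimate. The sketch below assumes that "10% speed-up" means the runtime is divided by 1.10 and that the 17% increase applies to average power over the whole run; the function name `relative_energy` is made up for this example, and the abstract itself does not state an energy figure.

```python
def relative_energy(speedup: float, power_ratio: float) -> float:
    """Energy relative to the baseline run.

    Energy = average power x runtime, so if runtime shrinks by `speedup`
    and average power grows by `power_ratio`, energy scales by
    power_ratio / speedup.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return power_ratio / speedup


# Boost mode in the CPU-intensive case, under the assumptions above:
boost_energy = relative_energy(speedup=1.10, power_ratio=1.17)  # about 1.064
```

Under these assumptions the run finishes sooner but uses roughly 6% more energy, which is why such a mode is a trade-off between time to solution and energy to solution rather than a pure efficiency gain.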




HUD-Oden: A Practical Evaluation Environment for Analyzing Hot-Water Cooled Processors


AUTHORS:  Jorji Nonaka, Fumiyoshi Shoji


KEYWORDS:  Energy efficiency, liquid cooling, hot-water cooling, power consumption, frequency throttling


ABSTRACT: Liquid cooling is rapidly becoming the de facto standard cooling method for high-performance, high-density racks in modern HPC/data centers. Semiconductor technology development has made it possible to operate processors (CPU, GPU, and accelerators) at higher temperature ranges without compromising reliability or static power consumption, and this has contributed in part to increased attention to “hot water cooling” as one of the main approaches to energy-efficient system design. The 2011 ASHRAE Class W4 allows water supply temperatures up to 45 °C, and even higher temperatures for Class W5. A clear understanding of the temperature impact on the processors (CPU, GPU, and accelerators) would be valuable in assisting HPC operational staff with their strategic planning and decision making. In this short paper, we present our experience using a simple and cost-effective bench testing environment for analyzing the operational behavior of processors in such high-temperature conditions. Although it is far from ideal, since we are not using the same building blocks as the currently running HPC system, we consider it a valuable alternative for observing the operational behavior of processors in such a temperature environment, and it may yield supporting evidence to assist strategic planning and decision making.




Global Experiences with HPC Operational Data Measurement, Collection and Analysis


AUTHORS:  Michael Ott, Woong Shin, Norman Bourassa, Torsten Wilde, Stefan Ceballos, Melissa Romanus, Natalie Bates


KEYWORDS: exascale, Top500, HPC operations, energy efficiency, site survey, operational data, ODA


ABSTRACT:  As we move into the exascale era, supercomputers grow larger, denser, more heterogeneous, and ever more complex. Operating such machines reliably and efficiently requires deep insight into the operational parameters of the machine itself as well as its supporting infrastructure. To fulfill this need, early adopter sites have started the development and deployment of Operational Data Analytics (ODA) frameworks allowing the continuous monitoring, archiving, and analysis of near real-time performance data from the machine and infrastructure levels, providing immediately actionable information for multiple operational uses.


To understand their ODA goals, requirements, and use cases, we have conducted a survey among eight early adopter sites from the US, Europe, and Japan that operate top 50 high-performance computing systems. We have assessed the technologies leveraged to build their ODA frameworks, identified use cases and other push and pull factors that drive the sites’ ODA activities, and report on their operational lessons.




A Study of Operational Impact on Power Usage Effectiveness using Facility Metrics and Server Operation Logs in the K Computer


AUTHORS:  Masaaki Terai, Fumiyoshi Shoji, Toshiyuki Tsukamoto, Yukihiro Yamochi


KEYWORDS:  power usage effectiveness, energy efficiency, co-generation system, K-computer


ABSTRACT:  The official service of the K computer ended in 2019. Most of the equipment, except for the servers, has been enhanced and continues to be used as part of the infrastructure for the successor system, named Fugaku. To ensure stable and energy-efficient operations in the next decade, understanding the facility behavior during the period of the K computer is valuable.


The K computer was powered by two energy sources: electricity purchased from a utility company and energy generated by gas turbine power generators on the premises. To evaluate the energy efficiency of the entire center, we use a modified power usage effectiveness (PUE) metric that considers the different forms of energy purchased from utility companies, and we show the metric over the service period. To analyze the operational impact on PUE, we use both the facility metrics and the server operation metrics extracted from the logs of the K computer. Further, using three cases with the metric data, we reveal that some maintenance operations degrade PUE. In particular, annual maintenance operations tend to affect the PUE metric more than emergency operations. Finally, as a preliminary study, we show that there is an operational issue regarding the gas co-generation system.
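For orientation, the standard PUE ratio (total facility energy divided by IT equipment energy) with two supply sources can be sketched as below. This is not the modified metric from the paper, whose definition is not given in the abstract; summing purchased and on-site generated electricity to obtain the facility total is an assumption made for this example, and the function names and sample values are hypothetical.

```python
def facility_energy_kwh(purchased_kwh: float, generated_kwh: float) -> float:
    """Total electrical energy delivered to the facility.

    Assumption for this sketch: the facility total is simply purchased
    electricity plus electricity generated on the premises.
    """
    return purchased_kwh + generated_kwh


def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """Standard PUE: total facility energy divided by IT equipment energy."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return total_facility_kwh / it_kwh


# Hypothetical values for one reporting period (not taken from the paper).
total = facility_energy_kwh(purchased_kwh=9_000.0, generated_kwh=4_000.0)
example_pue = pue(total, it_kwh=10_000.0)  # 13000 / 10000 = 1.3
```

A value closer to 1.0 means less overhead beyond the IT load. When the IT load drops while cooling and other facility equipment keep running, as can happen during maintenance, the ratio rises, which is one way maintenance operations can degrade PUE.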




A Supercomputing Center Case Study on Cooling Control Design


AUTHORS:  Michael Kercher, Gary New


KEYWORDS:  data center, controls, cooling


ABSTRACT:  After designing and implementing an automated control system for a new HPC center, the National Center for Atmospheric Research (NCAR) elected to use a simpler operator-based solution. The solution has proven successful, and this case study documents the reasons for both the decision and the process used to choose it. Additional refinements to the cooling system controls are also documented and their adoption explained.






AUTHORS:  Joseph Prisco, Grant Stewart, Herbert Huber, Randy Rannow, Jason Hick, Dave Martinez, Brandon Hong, Aditya Deshpande


KEYWORDS:  commissioning, electrical infrastructure, high performance computing


ABSTRACT:  The Energy Efficient High Performance Computing Working Group (EE HPC WG) has assembled a small diverse team to write a short investigative report on electrical commissioning. The purpose of the investigative report is to evaluate the need for electrical commissioning guidelines specific to High Performance Computing (HPC) data centers given their unique IT equipment load densities and power profiles. It is the consensus of the team that special electrical commissioning guidelines are needed and the EE HPC WG will author the initial guidelines. The scope of the guidelines will include the static and dynamic electrical aspects of commissioning practices that are specific to high performance computing and more importantly, cover the transient aspects of electrical commissioning. The fluctuating nature of many compute nodes can dramatically influence generation, transmission, and distribution of electrical power. HPC data center lessons learned and best practices will be examined and used to enhance the electrical commissioning guidelines. The primary audience for the guidelines is facility engineers and operators of HPC data centers. The guidelines will also be applicable to others that support HPC data centers, ranging from utilities and their electrical grid infrastructure to IT equipment manufacturers whose machines are being commissioned at the end of the process.



