"EE HPC WG Liquid Cooling Controls Team Whitepaper. June 11, 2017" This paper defines data inputs for dynamic controls to manage high performance computing (HPC) facility and IT control systems. Each input includes parameters for measurement frequency and accuracy; these are rough order-of-magnitude values, not absolute limits. Each input also indicates whether it would typically be provided by the facility or by the HPC system, or whether its provision would have to be negotiated. This document is intended as a guideline for data inputs to consider when designing a liquid cooling control system. It is not a design specification. Each site will develop its own design based on its specific situation.
Lessons learned and best practices are evolving from implementing and operating supercomputer centers, with their complex infrastructure systems and the highly variable demands that today's supercomputers place upon them. The Liquid Cooling Control Team initially focused on sharing designs, challenges and best practices for integrated control systems.
The team transitioned from this initial charter and started generating a list of data elements required for dynamic, integrated liquid cooling controls. The team is also collecting information on use cases to test and build support for the initial list of data elements. The results of this work will be captured in a whitepaper. It is also expected that the results of the whitepaper will be included in the EE HPC WG Energy Efficiency Considerations for HPC Procurement Documents.
CONTROLS DESIGN, CHALLENGES AND BEST PRACTICES REVIEW:
The team initially shared designs, challenges and best practices (see below).
After sharing this information, the team compiled a list of all of the liquid cooling controls challenges, concerns, issues and opportunities that were identified both in the presentations and as a result of the review discussion. The team then synthesized this information and identified top problems and recommended next steps (see below).
Top problems with control systems:
We don't have direct interfaces to cooling systems as they are managed by another organization (building management).
We don't have access to the data or integration of different systems.
We don't have direct access to command and control systems.
Financial constraints; the opportunity to modify the control systems is not there (for organizational or political reasons).
Integration of IT and infrastructure.
Recommended next steps:
A whitepaper on a minimum set of data that should be tiered across systems and why that is important.
What is the ideal scenario that you’d like to get to?
Where everyone should be today and where we should be as a long-term vision: information between IT and facilities. Or maybe good, better, best.
How to integrate control systems for liquid cooling and IT systems.
Control architecture: 3 tiers (servers on the floor, supervisory controls, alarms)
CONTROLS HIGH-LEVEL GUIDELINE OUTLINE:
Transitioning to the generation of a whitepaper, the team has pursued two slightly differing approaches.
First, the team wrote an outline for the whitepaper. The outline was for a high-level Guideline of HPC and Data Center Controls Systems as well as an addendum for Sequence of Operations. The outline ended up being a 5-page document. It was considered too broad for the team to embrace immediately and was tabled for future consideration.
Secondly, the team created a list of data elements deemed important for liquid cooling controls. These data elements are from both the IT systems and the data center building. This work is exploratory, as there are few implementations of dynamic integrated liquid cooling controls.
DOCUMENT: Please request latest copy from Natalie Bates
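As an illustration of what one entry in such a list of data elements might look like, the sketch below models a data input with the attributes this paper describes: measurement frequency, accuracy, and whether the facility or the HPC system would typically provide it (or whether provision is negotiated). The class names, field names and the two sample entries are hypothetical placeholders, not items taken from the team's actual list.

```python
from dataclasses import dataclass
from enum import Enum


class Provider(Enum):
    """Who would typically supply the input (the three cases named in the paper)."""
    FACILITY = "facility"
    HPC_SYSTEM = "hpc_system"
    NEGOTIATED = "negotiated"


@dataclass(frozen=True)
class DataInput:
    name: str               # hypothetical label, not from the team's list
    unit: str
    sample_period_s: float  # rough order of magnitude, not a hard limit
    accuracy: float         # +/- tolerance in `unit`, also order of magnitude
    provider: Provider


def within_order_of_magnitude(actual: float, guideline: float) -> bool:
    """True if `actual` is within a factor of 10 of the guideline value."""
    return guideline / 10 <= actual <= guideline * 10


# Illustrative entries only; the values are placeholders.
inputs = [
    DataInput("facility_supply_temperature", "degC", 60.0, 0.5, Provider.FACILITY),
    DataInput("rack_power", "kW", 1.0, 0.1, Provider.HPC_SYSTEM),
]

# Group input names by the party expected to provide them.
by_provider = {}
for item in inputs:
    by_provider.setdefault(item.provider, []).append(item.name)
```

Treating the frequency and accuracy values as order-of-magnitude guidelines, as the paper does, means a site sampling every 30 seconds against a 60-second guideline would still be considered consistent with the list.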
SYSTEM INTEGRATOR CONTROLS VISION AND ROAD MAP:
In order to test this list of Controls Data Elements, the team decided both to ask the system integrator vendor community to present their vision and roadmaps and to write case studies on the few sites that have implemented dynamic liquid cooling controls. Below is an excerpt from the invitation sent to the system integrators for a webinar where they would make their presentations.
Although there aren't any specific presentation format requirements, the expectation is that each presentation will address some or all of the following areas of interest for the HPC centers. It is recommended that you start with a block diagram describing your liquid cooling and control technology.
Background: Many HPC liquid cooling systems are operated with fixed parameters, like temperature and flow rate. There may be energy savings opportunities for these parameters to vary, based on real-time load changes.
What are your new and upcoming technologies that would provide liquid cooling controls to promote energy efficiencies? What controls systems would you like to see built into the HPC center building infrastructure?
Another opportunity may exist to manage the loads with resource and job scheduling capabilities. What are your plans, if any, to provide load management capabilities?
Presentations were made by HP, Cray, Lenovo, IBM and RSC. Below is a summary of these presentations.
Today’s state of the practice is to use commercial Coolant Distribution Units (CDUs) to manage the delivery of liquid to the HPC system. Most of the CDUs deployed with today’s HPC systems operate at constant flow rate and temperature. The customer can set (and change) inlet flow rate and temperature as long as they stay within a specified envelope. This envelope is set based on a maximum specified heat removal. Some customers reset these set points on a seasonal basis and could reset them as the dew point changes. CDUs are deployed at different levels of the system, with the lowest level being the rack rather than the node. These CDUs range in their intelligence, but at least one vendor claims to have intelligent CDUs with integrated controls (specifics on the controls were limited).
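Resetting a supply temperature set point as the dew point changes can be sketched as follows. This is a minimal illustration, not a vendor procedure: it estimates the dew point with the Magnus approximation and keeps a requested supply temperature at least a margin above it while staying inside the envelope. The function names, the 2 degC margin and the envelope limits are assumptions made for the example.

```python
import math


def dew_point_c(air_temp_c: float, rel_humidity_pct: float) -> float:
    """Estimate dew point (degC) with the Magnus approximation."""
    a, b = 17.62, 243.12  # commonly used Magnus coefficients over water
    gamma = math.log(rel_humidity_pct / 100.0) + a * air_temp_c / (b + air_temp_c)
    return b * gamma / (a - gamma)


def supply_setpoint_c(requested_c: float, air_temp_c: float, rel_humidity_pct: float,
                      envelope_min_c: float, envelope_max_c: float,
                      margin_c: float = 2.0) -> float:
    """Clamp a requested supply temperature to the envelope and above dew point.

    `margin_c` and the envelope limits are assumed, site-specific values.
    """
    lowest_safe_c = max(envelope_min_c,
                        dew_point_c(air_temp_c, rel_humidity_pct) + margin_c)
    if lowest_safe_c > envelope_max_c:
        raise ValueError("no set point satisfies both the envelope and the dew-point margin")
    return min(max(requested_c, lowest_safe_c), envelope_max_c)


# Example: room air at 25 degC and 50% relative humidity, envelope 10 to 30 degC.
raised = supply_setpoint_c(12.0, 25.0, 50.0, 10.0, 30.0)     # lifted above dew point
unchanged = supply_setpoint_c(20.0, 25.0, 50.0, 10.0, 30.0)  # already safe
```

With these inputs the dew point is a little under 14 degC, so a 12 degC request is raised to avoid condensation while a 20 degC request passes through unchanged.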
Tomorrow’s products could be designed to allow inlet flow rate and temperature to vary based on actual, rather than maximum specified, heat removal. This would require more, and finer-grained, telemetry and controls.
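To make the difference between sizing flow for maximum specified heat removal and for actual heat removal concrete, the sketch below applies the standard heat balance Q = m * cp * dT to a water loop. It is an illustration under stated assumptions: approximate water properties, a fixed 10 K temperature rise across the system, and made-up heat loads and envelope limits. The function names are hypothetical.

```python
WATER_DENSITY_KG_PER_L = 1.0   # approximate density of water
WATER_CP_KJ_PER_KG_K = 4.18    # approximate specific heat of water


def required_flow_lpm(heat_kw: float, delta_t_k: float) -> float:
    """Flow (litres/minute) needed to remove `heat_kw` at a `delta_t_k` rise."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    kg_per_s = heat_kw / (WATER_CP_KJ_PER_KG_K * delta_t_k)  # kW = kJ/s
    return kg_per_s / WATER_DENSITY_KG_PER_L * 60.0


def clamp_to_envelope(flow_lpm: float, min_lpm: float, max_lpm: float) -> float:
    """Keep a computed flow inside the envelope the CDU allows."""
    return min(max(flow_lpm, min_lpm), max_lpm)


# Assumed numbers: 100 kW maximum specified load, 40 kW actual load, 10 K rise,
# and a minimum allowed flow of 30 L/min.
max_flow = required_flow_lpm(100.0, 10.0)
actual_flow = clamp_to_envelope(required_flow_lpm(40.0, 10.0), 30.0, max_flow)
low_load_flow = clamp_to_envelope(required_flow_lpm(10.0, 10.0), 30.0, max_flow)
```

Under these assumptions, a rack running at 40% of its maximum specified load needs about 40% of the flow, while very low loads are held at the envelope minimum rather than reduced further.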
Many questions were raised, some business and others technical.
What is the value proposition? How much can be saved in energy costs compared to the incremental capital and additional operational costs (e.g., more components decrease reliability and add to maintenance costs)?
Application differences cause racks to vary in power and, hence, heat removal requirements, but the nodes within a rack can also vary. Where is the sweet spot for implementation, at the rack, node or even component level? What is the sweet spot for response time, which may be different for the system and the facility?
Energy- and power-aware scheduling for fine-grained (component) application tuning as well as coarse-grained (system) power capping is currently in the prototype stage for at least one vendor and in future plans for others. Could these capabilities be used to balance heat removal requirements within the system?
These are some of the outstanding questions that we are hoping will be answered with more data and analysis moving forward.
This forum will be followed by another webinar where some supercomputing center members of the EE HPC WG will disclose their thoughts, plans and expectations for HPC liquid cooling controls requirements.
The team is hosting this webinar to encourage communication between system integrators and users regarding liquid cooling controls. The team is also trying to encourage participation and is especially looking for more examples of liquid cooling controls. Below is the plan for this webinar.
Introduction and Motivation:
Can we reduce operational expenditures and improve energy efficiency by optimizing liquid cooling systems with more dynamic controls?
What is the liquid cooling energy cost? How much power is going into liquid cooling?
HPC loads can vary by multiple megawatts, with both intra- and inter-hour variation. What are the sites’ experiences? Is there a wide difference in site experiences?
Environmental conditions also vary, which allows for water to be cooled with greater or lesser amounts of energy and/or for the water to be cooled to varying temperatures. What are the sites’ experiences? Is there a wide difference in site experiences?
What other factors might drive the need for more dynamic controls, e.g., water conservation?
Where are liquid cooling controls best implemented: in the HPC system, in the building, or both?
What other strategies could be implemented to minimize intra- and inter-hour power load variation, such as power-managed job scheduling?
Can we develop recommendations for implementation to reduce control system complexity for sites with multiple HPC systems?
David Martinez, SNL’s Sky Bridge
Improved reliability/stability while maintaining energy efficiencies. Components & control systems would work in unison to prevent wasted energy (hunting, etc.) and reliability improvement would be seen at the compute level (running codes, etc.)
The current state of facilities controls operating in a reactive state would diminish with integration of the facilities system. This will create proactive control of the environment and a transition to a predictive state where there will be no delay in response time.
Tom Durbin, NCSA’s Blue Waters
This study provides verification of savings resulting from implementation of controls that utilize variable frequency drives to minimize energy use in chilled water pumps.
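The reason variable frequency drives can reduce pump energy is usually explained with the pump affinity laws, under which shaft power scales roughly with the cube of speed. The sketch below applies that idealized relationship to a made-up operating profile; it ignores static head, motor and drive efficiency changes, and control constraints, and none of the numbers come from the Blue Waters study. The function names are hypothetical.

```python
def pump_power_fraction(speed_fraction: float) -> float:
    """Idealized affinity law: power scales with the cube of pump speed."""
    if not 0.0 <= speed_fraction <= 1.0:
        raise ValueError("speed_fraction must be between 0 and 1")
    return speed_fraction ** 3


def annual_energy_kwh(rated_kw: float, profile: list) -> float:
    """Energy for a profile of (hours, speed_fraction) pairs."""
    return sum(rated_kw * hours * pump_power_fraction(speed)
               for hours, speed in profile)


# Assumed example: a 50 kW pump, full speed for 4000 h and 70% speed for 4760 h.
rated_kw = 50.0
constant_speed_kwh = annual_energy_kwh(rated_kw, [(8760, 1.0)])
variable_speed_kwh = annual_energy_kwh(rated_kw, [(4000, 1.0), (4760, 0.7)])
savings_fraction = 1.0 - variable_speed_kwh / constant_speed_kwh
```

Because of the cubic relationship, even a modest speed reduction yields a disproportionate drop in power: running at 80% speed needs only about half of full-speed power in this idealized model.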
Greg Rottman, ERDC
Four modes of operation: normal, pre-cooling, comprehensive cooling, and load balance