This questionnaire gathers information from sites that have implemented data collection, aggregation and analysis for operational management of both facilities and HPC systems (including energy and power management) in a production environment on at least one large-scale system (Top500 sized system) with integration (or plans for integration) that extends from the HPC data center down through the platform to the CPU.
The Operational Data Analytics Team is focused on data collection, aggregation and analysis for operational management of both facilities and HPC systems (including energy and power management).
It has deployed an initial very short survey to identify sites with capabilities in this area who would like to share their experiences.
We will disseminate this information through a whitepaper and through other venues, like an SC18 Birds of a Feather session.
Ghaleb Abdulla from LLNL, with help from colleagues from other sites (LANL, LBNL, NREL, PNNL, ANL), has led the development of this questionnaire. Some of the answers from these sites are provided for you to better understand what we are looking for from you and your site. Their responses are displayed directly under the question posed and marked by the site name e.g., “LLNL”, “LANL”.
LBNL: We have built a data collection system using the Elasticsearch database, Kibana, Grafana and other open source tools and a fairly extensive sensor network.
LLNL: We are using a commercial system called Pi. This is a time-series database system.
NREL: We evolved our system over time and some of the earlier remnants are still in place, but we mostly use InfluxDB as a time series data source and Grafana to look at it.
LLNL: 6 buildings (HPC facilities), plus 10s of other lab buildings (order ~100). Currently we monitor power from almost all buildings at the lab. We also collect data from equipment (mechanical and electrical) including utility infrastructure and weather data.
LANL: 2 buildings with HPC facilities. We collect reclaimed water plant information, as well as city water, flow and gallons per minute, but only for the HPC facilities and not the entire site.
LBNL: We are responding only for the HPC building which has its own instrumentation.
LLNL: We had large voltage dips from our electricity service provider. We measured 3-phase voltage from phasor measurement units (PMUs) into one of the HPC center buildings. You could see a big dip in frequency and voltage at a specific time of the day. Then we lost power from one of our feeders (we have redundant power that comes to the lab). We noticed this dip and we wanted to know why it happened. We also collect weather data, including wind speed, so we looked at the wind data. It was very clear that, particularly for the location where we lost power, the wind was over 70 mph and coming from a specific direction. The interesting thing is that this happened again at a later time. In both events, the wind was coming from the same direction and it was over 70 mph. We shared the information with our electricity service provider.
Our energy company is asking us to give them a report every time consumption changes in steps of more than 750 kW over a specific period of time. So now we have a dashboard that one of our engineers can use to monitor the site-wide loads and variability. We do the calculation with another tag and we send an email every time the change exceeds the amount that they are interested in. We want to see the data over a year so we can see it over various conditions, like the summer when the mechanicals are working harder to cool the machine. In the summer, we might have to shut some of the nodes down. This is why we have this system: to see why things are changing over time.
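The step-change alert described above can be sketched roughly as follows. This is a minimal illustration, not LLNL's implementation: the function name `find_load_steps`, the fixed sample window, and the sample values are all assumptions made for the example; only the 750 kW threshold comes from the response.

```python
from typing import List, NamedTuple, Sequence


class StepEvent(NamedTuple):
    start_index: int   # index of the first sample in the comparison
    end_index: int     # index of the sample `window` steps later
    delta_kw: float    # signed change in site load over the window


def find_load_steps(samples_kw: Sequence[float],
                    window: int,
                    threshold_kw: float = 750.0) -> List[StepEvent]:
    """Flag every pair of samples `window` steps apart whose load
    differs by more than `threshold_kw` (rise or drop)."""
    if window < 1:
        raise ValueError("window must be at least one sample")
    events = []
    for i in range(len(samples_kw) - window):
        delta = samples_kw[i + window] - samples_kw[i]
        if abs(delta) > threshold_kw:
            events.append(StepEvent(i, i + window, delta))
    return events


# Hypothetical site-wide load readings in kW, evenly spaced in time.
load_kw = [5000.0, 5050.0, 5100.0, 5900.0, 5950.0, 5100.0]
for event in find_load_steps(load_kw, window=2):
    # In production this is where an email notification would be sent.
    print(event)
```

Overlapping windows can report the same physical step more than once; a real alerting rule would likely merge adjacent events before notifying anyone.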
LBNL: This system is able to pull information from the BMS (e.g., pump speeds, flows) and from sensors throughout the building (e.g., temperature and humidity on pretty much every rack in the building). We have at least 4000 power sensors on the floor, 3000 temperature sensors and 2000 BACnet sensors. We have particle sensors and two sets of seismic sensors. We also have a weather station on our building from which we collect data.
We collect many things from the electrical system: power, current, phase and alarms. We can see in real time how the power is being fed into substations and each medium voltage feed into the building. We access the revenue meters as well as meters built into panels and PDUs.
LLNL: One of the chillers to Building 453 was shutting down and we didn’t know why. It was really simple to find out because we have this system in place. We had data collected over a long period of time and visualized the data over a year. We started zooming out and tried to see exactly what was going on. Once we looked at the data over a year, we noticed that the chiller was going up and down very frequently. It was a repetitive behavior. We realized that this chiller was a small one between two big chillers. This small chiller was designed for staging a bigger one in a planned step. But, especially in the summer, it was doing this over and over. It was a mechanical load that was on and off, on and off. We realized that the staging function wasn't working as designed. So, we eliminated it.
NREL: We look at PUE every day. When that is looking funny, there is probably something going on. The way that the data center is architected, all of the heat from the devices in the data center ends up in something called the energy recovery loop. So the water temperature coming out of the energy recovery loop is usually pretty constant. When this looks funny, we investigate.
When we were looking at the cooling loop, we noticed a cyclic behavior of the cooling towers. We were trying to figure out what was going on. It takes ~7 minutes for water to make a loop all the way around the energy recovery system. We have very large diameter pipes. We are moving about 250 gallons/minute right now. When they built the building (hopefully you can see with the pointer), the temperature sensor that sensed the water coming out of the cooling towers was after these tanks. We have remote sump tanks; when we are not using a tower, the water is inside the building in these tanks. They put the temperature sensors on the output of those tanks, so we have this pretty incredibly damped system. You'd make a change to the tuning of the cooling towers and several minutes later that water would make it to a sensor. There was double or triple damping, as the algorithms were trying to figure out what was going on while water was also mixing in those tanks. We moved the sensor to the other side of the tanks and the system suddenly behaved as we expected. Gone were the 1-2 degree water swings that we were seeing pretty constantly. The system behaved much better.
LANL: We used the data to work with our department (we manage our own power) on power consumption forecasting, etc. We also give feedback to Architectural and Engineering firms for design.
LBNL: The most interesting thing we have right now is power data, which we try to keep online as long as possible. We are focusing attention on power right now because we are in the process of planning an upgrade for our next generation Cray, and need to coordinate our usage with the rest of the site and our power utility. Compulsively tracking usage has also allowed us to determine the number of substations actually needed in the upgrade, and has resulted in substantial savings.
Shortly after we moved into the building, we had just installed Edison. Half a dozen cabinets tripped out on Edison. What we determined by going back and examining our meters, instead of relying on the ones in the substations, was that the power to the lab was running above the high-end voltage cut-off on the Edison power supplies. As a result, we were able to retrim the transformers to actually drop the voltage into the building.
We have also used the data to perform incremental improvements in plant operations, having saved ~2 MWh/year so far. This has resulted from several different improvements including (1) arbitraging cooling tower fan power versus pump power (it is more economical to run fans faster and pumps slower), and (2) optimizing settings of pipe bypasses to reduce parasitic pressure loss.
LLNL: For established buildings the challenges are: 1) maintaining the current metering, 2) updating the DB when a new machine arrives and an old one retires, and 3) adding the new data points.
LBNL: When the building was originally designed we didn't have a perfect set of meters to completely calculate fully accurate level 1 and level 2 PUE. We are in the process of putting these in place.
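For readers less familiar with the metric, PUE is the ratio of total facility energy to IT equipment energy over the same period. The sketch below only illustrates that ratio; the function name and the meter totals are hypothetical, and it does not model how any site here meters its load. The reporting levels are commonly described as differing mainly in where the IT energy is metered, which is why a complete set of meters matters.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy divided by
    IT equipment energy, both measured over the same interval."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical daily totals in kWh, for illustration only.
facility_kwh = 1200.0   # everything the building drew
it_kwh = 1000.0         # what reached the IT equipment
print(f"PUE = {pue(facility_kwh, it_kwh):.2f}")
```

A value close to 1.0 means almost all of the energy reaches the IT equipment; a sudden change in the daily value is the kind of signal that prompts an investigation.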
LANL: Adding neutron monitors, new air quality monitoring. Maybe add vibration sensors. We do have a D-Wave quantum computing machine and it has a vibration sensor. In the future, if we get a big quantum computer, we will definitely be monitoring vibration. We also have plans to expand to a third data center.
LLNL: Three at this time, but we are adding more.
LBNL: Two Cray supercomputers. Cori, a Cray XC40, that has a peak performance of about 30 petaflops and Edison, a Cray XC30 with a peak performance of 2 petaflops.
NREL: One HPE supercomputer. Peregrine has a peak performance of 2 petaflops.
LBNL: We have the ability to reach into the two Cray supercomputers and pull out data about how Cray sees the environment (e.g., temperatures on CPUs), and we are starting to bring in application-level information (such as syslogs and application usage). We hope to get to job information. From the Cray system, we can see everything from processor temperatures to airflow within the cabinet, water pressure going into the cooling radiators and so forth as a real-time stream.
Originally, we were pulling data from Cray's internal management system. We worked carefully with Cray to allow us to look at the data directly and we built an interface that allowed us to actually stream it out. We get the Cray internal information with a Docker-based microservice system that goes out and gets information from SNMP devices off the network. We wrote a plug-in that actually diverts it before it hits the database on the Cray systems.
We try not to scrape out of a database because past experience has shown that you always lose when you do that. It is very hard to make a reliable scraper.
We let Grafana handle time synchronization, which is a problem because sensors give different values at different times. We timestamp the data when it comes in and we also timestamp it when it goes through the system. Kibana and Grafana will bucket the times; average, max or min can be defined. For the last hour, the bucket might be a 60-second slot. If there are 5 data points in that 60-second slot, you have to tell it which one to pick: max, min or average.
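The bucketing behavior described above can be illustrated with a short sketch. This is not how Grafana or Kibana implement it internally; the function `bucket_points` and the sample readings are assumptions made for the example. It only shows why a reducer (max, min or average) has to be chosen when several points land in one slot.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable, List, Tuple


def bucket_points(points: Iterable[Tuple[float, float]],
                  bucket_seconds: int = 60,
                  reducer: Callable[[List[float]], float] = mean
                  ) -> Dict[int, float]:
    """Group (epoch_seconds, value) points into fixed-width time
    buckets and collapse each bucket to one value with `reducer`."""
    groups: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in points:
        # Key each point by the start time of the bucket it falls in.
        start = int(timestamp // bucket_seconds) * bucket_seconds
        groups[start].append(value)
    return {start: reducer(values) for start, values in sorted(groups.items())}


# Five hypothetical readings in the first 60-second slot, one in the next.
readings = [(0, 10.0), (12, 14.0), (30, 12.0), (45, 20.0), (59, 9.0), (61, 11.0)]
print(bucket_points(readings, reducer=mean))  # {0: 13.0, 60: 11.0}
print(bucket_points(readings, reducer=max))   # {0: 20.0, 60: 11.0}
print(bucket_points(readings, reducer=min))   # {0: 9.0, 60: 11.0}
```

The same five points give 13.0, 20.0 or 9.0 depending on the reducer, so the choice should match the question being asked: max for tripping thresholds, average for trends.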
LLNL: We collect power at the rack level, CDU data (flow and temperature of cooling liquid), aggregate cluster power (one of our machines had a dedicated meter), and environmental data (temperature, humidity, etc.; 3 sensors per rack).
We don’t collect HPC platform software data into this database. We are working on another data aggregation system that collects application, scheduler, file system data. One problem with the data from the HPC system (e.g., platform, rack, node, CPU, components) is that the sources of the data are enormous. Integrating HPC platform with facility data is really challenging because of the lack of a standard schema to link those two together. That's why I was very interested in hooking up with PowerAPI.
LANL: The data collection is very different for our Penguin CTS systems and our Cray systems. We use IPMI to get data from Penguin CTS and Cray gives it to us through a Cray-specific protocol. With PMDB and SEDC data there is a schema set by Cray, but it is very complicated and not easily usable.
We collect everything we can from the cluster, syslog, job logs etc., but this is not yet combined with the facilities data. All of the job data and syslog data is sent to two places: Splunk and our data analytics system. We do lots of alerts and correlation of the data there. The LDMS (Lightweight Distributed Metric Service) data, which is node level application data, is sent to another LDMS specific server and to our data analytics system. We've been trying to pull some of the facilities side data to Splunk and the data analytics system. We have little pieces of it everywhere, but our main two places are Splunk and our open-source based data analytics system. We use those to try and do correlations between the data.
A challenge is that the size of platform-level data is very large compared to facility data. The facilities data is collected at intervals on the order of 1 to 15 minutes, whereas once you get into the rack and pull the data from the HPC system, there is a much larger amount of data because the frequency is much higher. So platform-level data is from 20 Gigabits to 4 Terabits per day, whereas the below-the-floor data is about 1 Gigabit per day.
LLNL: In general we use it for benchmarking, research projects and energy efficiency studies. We also use platform level data for forecasting power for the utility provider (machine load, maintenance schedules, sudden large power consumption increase or decrease).
We did research with respect to liquid cooling vs. air cooling studies and used some of the IPMI and lower level data. We collected temperature, retired instructions, and other metrics from these tools and then we did some comparisons between the liquid cooled systems vs the air cooled systems. We tried to show the differences in performance and temperature and things like that. This is more on the research side and has to do with power-aware schedulers.
LANL: We’re trying to collect data, especially during peak runs like HPL, and predict future usages of even larger platforms such as Crossroads at 30 megawatts. We’re trying to look at what the future may bring.
LBNL: Cray has built an infrastructure for monitoring the performance of the Lustre file system and identifying rogue agents within the file system. This generates graphs that can be added to the dashboard for reads/writes, etc. This can be used to see which jobs are producing which loads on the systems for the compute, network and the storage system. We hope to be running that here soon.
LLNL: We have another system that collects application data and we are trying to link the two with an automatic feed. This is still experimental, but the analysis will be easier.
LANL: We are currently working with two co-teams to get their application data into this monitoring infrastructure. We see when they build an application, when they compile it and build it. When they finally run the job, we get additional logs directly from their application. We haven't yet made a schema for that, so although they are giving us this information, we might not actually understand what all of it means.
LLNL: We make sure that the sensors are calibrated and the readings are accurate (according to the manufacturer data sheets). For old meters we always try to validate the values using appropriate techniques. Some data are loaded to the DB monthly although they are collected every 30 minutes. We can have missing data, because of power outage, equipment failure, etc.
We have done a controlled experiment where we collected all the power data we could from the rack, node and rack level, and we added all this data together. We tried to compare it to a dedicated meter that is measuring power for the same system. We even did the calculations for the power loss, because there is an efficiency loss due to the conversion from high voltage to low voltage. So we did all of these calculations. We still ended up with a 20% difference between the aggregated data from the clusters and the external meter that reads the data. We couldn't actually resolve the two numbers. We don't know if this is an accuracy problem, or if there is something missing that we haven't been able to include in our calculations.
Characterizing the accuracy is hard, and we do that as soon as we get a chance to validate the measurement using another data source or a tool.
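The comparison in the controlled experiment can be sketched as a small calculation. This is an illustration under stated assumptions, not the analysis LLNL ran: the function name `reconcile`, the single lumped conversion efficiency, and every number below are hypothetical, chosen only so that the unexplained gap comes out at the 20% reported above.

```python
from typing import Sequence, Tuple


def reconcile(rack_kw: Sequence[float],
              meter_kw: float,
              conversion_efficiency: float) -> Tuple[float, float]:
    """Scale summed rack-level power up by the conversion loss and
    compare it with the upstream dedicated meter.

    Returns (expected power at the meter, unexplained fraction of
    the meter reading)."""
    if not 0.0 < conversion_efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    if meter_kw <= 0:
        raise ValueError("meter reading must be positive")
    # Power drawn upstream must cover the load plus the conversion loss.
    expected_at_meter = sum(rack_kw) / conversion_efficiency
    gap_fraction = (meter_kw - expected_at_meter) / meter_kw
    return expected_at_meter, gap_fraction


# Hypothetical readings: three racks, 90% conversion efficiency.
expected, gap = reconcile([100.0, 120.0, 140.0], meter_kw=500.0,
                          conversion_efficiency=0.9)
print(f"expected at meter: {expected:.1f} kW, unexplained: {gap:.0%}")
```

A gap that stays large after the loss correction points either to sensor accuracy or to loads that are not covered by the rack-level sensors, which are the two possibilities raised in the response.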
LANL: We will start yearly calibrations of flow meters. We use a test and balance firm to verify per platform (cluster) specific rather than building wide.
We're constantly looking at the sensor values. If we find a bad value, we go out and physically check if the battery on these wireless sensors needs to be changed.
We had a machine that was over-heating and the sensors inside the node varied by as much as 100%. So, they really weren't good at telling you the temperature; they were good at giving you a baseline and knowing if things change.
Our power meter vendors do not put really sensitive or pricey sensors in their components, but some revenue grade external power meters are very accurate. The sensors in the rack are ok, but we have some facility-based monitoring equipment and when we really care, we use the facilities sensors because they're really accurate.
We have multiple sensors and we try to correlate looking at the rack sensor, the room sensor, the node sensors and see 'what's in the ball park?'
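One simple way to express the "ball park" check is to compare each reading with the median of all readings for the same location. This is a sketch of that idea only; it is not LANL's procedure, and the function name, the sensor labels, the tolerance and the readings are all hypothetical.

```python
from statistics import median
from typing import Dict, List


def out_of_ballpark(readings: Dict[str, float], tolerance: float) -> List[str]:
    """Return the names of sensors whose reading differs from the
    median of all readings by more than `tolerance`."""
    center = median(readings.values())
    return sorted(name for name, value in readings.items()
                  if abs(value - center) > tolerance)


# Hypothetical temperatures (degrees C) around one rack.
temps = {"rack": 24.5, "room": 23.0, "node_a": 25.0, "node_b": 48.0}
print(out_of_ballpark(temps, tolerance=5.0))  # ['node_b']
```

The median is used here because a single bad sensor barely moves it, whereas it would pull an average toward the faulty value.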
LBNL: For building power, we have two separate graphs because of meter inaccuracies. The first is from an ION meter that is revenue grade and very accurate. Where you see the power aggregate of substation main breaker over Modbus, you'll find out that the accuracy of the breaker depends highly on the load. It is only really accurate when it is at 80-90% load. Aside from accuracy, there should be differences due to IR losses between the switchgear and the substation.
The weather station picks up the air going into the cooling towers, which is why we placed it where we did. We did not place it above the building because the cooling towers would affect the weather station above the building.
LLNL: We make sure that we follow the security rules.
LBNL: We are not running classified systems, so we don't have to maintain a firewall. In this way, we are quite different from some other HPC centers. Also, organizationally, the facilities organization is considered a sub-contractor and we have a very good relationship. Without that, this wouldn't have been possible.
LANL: There has to be a system on that IPMI network that we're actually pulling information from. The admins would be worried about giving us access to those networks and actually pulling information off of these systems. The SEDC data flows internally, through internal Cray networks, so trying to find a way to get that information out of the Cray is difficult. We have to tie into their network somehow, or have them poke holes out of that network to send us data. Our team wants this data, but the data is controlled by other teams. We have to find ways to make them happy and not afraid we're going to break something by getting all of that data.
The temperature sensors that we use are wireless and we're not allowed to have wireless anything in any of these buildings, but we got a special exemption for the temperature sensors.
NREL: The good news is that I am deep in the dashboards. A lot of the code that is behind it I can speak deeply of. The bad news is that it is kind of a 9th hour and weekend project. It isn't as organized as it could be if it were a delivered effort.
LLNL: It is a challenge to get everyone to use this system. This system has been in place for over five years now. Before that, we had other systems that were fragmented. For example, one engineer has data that he collects from all the AC units. In fact, he still uses that system. He feels very comfortable using that system. We are getting his data into our system. Once we feel comfortable with his use-case, we'll try building a dashboard for him.
LBNL: Build it and they will come.
LLNL: The problem with this kind of work is that we really want to analyze, notify and visualize, but we spend lots and lots of time working on the left with data acquisition, managing interfaces, managing data and figuring out whether sensors are working... It is a continuous struggle. In our case, it is a very dynamic system. So, for example, in a year or two, some of the clusters will be retired and new ones will be introduced. This means that the power will be reassigned to new clusters and transformers will be reshuffled. The old data continues to be interesting and we want to keep it, but we have to create new tags and look at the new data in a different way.
LBNL: We are averaging 15-20K data points per second from the various systems. The database is currently 140 TBytes. It grows by around 500 GBytes per day. It is all based on the message size. We went with Elasticsearch because we can scale it up linearly.
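The figures above imply a rough storage cost per data point, which is what "based on the message size" comes down to. The arithmetic below is only a back-of-the-envelope check: it assumes decimal units (1 GByte = 10^9 bytes) and that the daily volume and the point rate describe the same stream, neither of which is stated in the response.

```python
SECONDS_PER_DAY = 86_400


def bytes_per_point(bytes_per_day: float, points_per_second: float) -> float:
    """Average stored bytes per data point at a given ingest rate."""
    return bytes_per_day / (points_per_second * SECONDS_PER_DAY)


def days_at_rate(total_bytes: float, bytes_per_day: float) -> float:
    """How many days of ingest at the current rate equal the total size."""
    return total_bytes / bytes_per_day


DAILY_BYTES = 500e9    # ~500 GBytes per day (decimal units assumed)
TOTAL_BYTES = 140e12   # ~140 TBytes

for rate in (15_000, 20_000):
    print(rate, "points/s ->", round(bytes_per_point(DAILY_BYTES, rate)), "bytes/point")
print(days_at_rate(TOTAL_BYTES, DAILY_BYTES), "days of ingest at the current rate")
```

Under these assumptions each point costs roughly 290 to 390 bytes once stored, and the current database size corresponds to about 280 days of ingest at today's rate.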
LLNL: We have 100K tags or sensor values that come into the system. Each one is a timestamp, a tag and a value. The sensor values come through the interface to the database and we can compress them. So, for example, we have been doing this for over four years, connecting data from about 100K sensors, some of which come in at around 60 Hz, and we still have not filled up 4 TBytes of disk. It has very efficient compression algorithms. On top of that, because it is optimized for time series, it is really easy to reuse some of the system. It is not like a relational database management system where you have to have tables and indexing. You only have time; time is the only index. It is really lightweight in terms of the data.
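The "timestamp, tag, value" model with time as the only index can be illustrated with a toy in-memory store. This sketch says nothing about how the commercial system is built or how it compresses data; the class `TagStore`, its methods and the tag name are hypothetical, and it assumes samples for each tag arrive in time order.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Tuple


class TagStore:
    """Toy time-series store: per tag, parallel lists of timestamps
    and values kept in time order, searched by timestamp only."""

    def __init__(self) -> None:
        self._times: Dict[str, List[float]] = defaultdict(list)
        self._values: Dict[str, List[float]] = defaultdict(list)

    def append(self, tag: str, timestamp: float, value: float) -> None:
        times = self._times[tag]
        if times and timestamp < times[-1]:
            raise ValueError("samples must arrive in time order")
        times.append(timestamp)
        self._values[tag].append(value)

    def query(self, tag: str, start: float, end: float) -> List[Tuple[float, float]]:
        """Return (timestamp, value) pairs with start <= timestamp <= end."""
        times = self._times.get(tag, [])
        values = self._values.get(tag, [])
        lo = bisect.bisect_left(times, start)
        hi = bisect.bisect_right(times, end)
        return list(zip(times[lo:hi], values[lo:hi]))


store = TagStore()
for t, kw in [(0, 410.0), (60, 415.0), (120, 1180.0), (180, 1175.0)]:
    store.append("bldg.example.power_kw", t, kw)  # hypothetical tag name
print(store.query("bldg.example.power_kw", 60, 120))
```

Because the timestamps are already sorted, a range query is two binary searches and a slice, with no table schema or secondary index to maintain.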
LBNL: Our next goal is to put machine learning in the system so we can process the data and work toward automatic optimization of the plant.