Research Highlight: A Machine Learning Cooldown for the Data Center

Posted by Rob Mitchum on August 25, 2016

In recent years, the race to build the fastest computers has been joined by a parallel competition to design the most energy-efficient machines. The colossal data centers supporting cloud computing and web applications consume massive amounts of energy, using electricity to both run and cool their tens of thousands of servers. As engineers look for new CPU designs that reduce energy usage, scientists from Northwestern University and Argonne National Laboratory are seeking an AI-based solution, using the cloud computing testbed Chameleon to reduce power through smarter task traffic.

The collaboration, COOLR, was created to discover new methods of reducing the energy and cooling costs of high-performance computing. New research from Northwestern PhD student Kaicheng Zhang tested whether smarter task placement -- the assignment of computing jobs to specific servers in a cluster or data center -- could successfully reduce system temperature and the need for cooling. Using Chameleon as a test environment, Zhang and colleagues Gokhan Memik, Seda Ogrenci Memik, and Kazutomo Yoshii explored how machine learning can help make decisions about task placement that conserve energy without sacrificing performance.

Not all computer programs are created equal, in energy terms. While running some code is computationally intensive and generates lots of heat, other applications run efficiently and keep computers cool. Individual nodes can also vary in their heat production, particularly in servers that use heterogeneous architectures. Yet while engineers increasingly consider total power consumption when tasks are spread out amongst multiple CPU nodes, the use of temperature to assign tasks is a new concept.

“We found that for different machines, even with same power setup, they have variations in power consumption and temperature,” said Zhang, a graduate student studying Computer Engineering at Northwestern. “So this difference and variation can be exploited intuitively by putting high-demand applications on cooler nodes, and low-demand applications on hotter nodes, to balance the peak temperature and improve performance.”

At first, the COOLR team tried out an algorithm for this temperature “load-sharing” on small two-node systems, where the decisions were relatively simple: predict the temperature generated by app A and app B if run on node A or node B, then select the lowest temperature result. But as the system scales up to 4, 8, 16 nodes or beyond, the problem becomes far more complex, requiring a more sophisticated approach.
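The two-node decision described above can be sketched in a few lines of Python. This is an illustration only, not the COOLR team's code: the function name `best_placement`, the `predicted_temp` table, and the temperature values are all hypothetical stand-ins for a real model's predictions, and the sketch assumes that "lowest temperature result" means the lowest predicted peak temperature across the nodes.

```python
from itertools import permutations

def best_placement(apps, nodes, predicted_temp):
    """Try every one-app-per-node placement and keep the one with the
    lowest predicted peak temperature.

    predicted_temp maps (app, node) -> predicted temperature in Celsius.
    The table is a hypothetical stand-in for a prediction model.
    """
    best, best_peak = None, float("inf")
    for order in permutations(nodes, len(apps)):
        placement = dict(zip(apps, order))
        # The hottest node determines how hard the cooling has to work.
        peak = max(predicted_temp[(app, node)] for app, node in placement.items())
        if peak < best_peak:
            best, best_peak = placement, peak
    return best, best_peak

# Made-up predictions for two apps on two nodes.
predicted_temp = {
    ("app_a", "node_a"): 71.0,
    ("app_a", "node_b"): 66.0,
    ("app_b", "node_a"): 58.0,
    ("app_b", "node_b"): 61.0,
}

placement, peak = best_placement(
    ["app_a", "app_b"], ["node_a", "node_b"], predicted_temp
)
print(placement, peak)
```

With two apps and two nodes there are only two placements to compare. Checking every ordering stops being practical quickly: with one app per node, 16 nodes already allow more than 20 trillion orderings.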

“To be able to show that this is practical, we needed a larger-scale system,” said Memik, Professor in the Computer Engineering Division at Northwestern. “In this case, the machine learning algorithm shines when the system goes large scale and it’s harder to use human knowledge.”

To test these algorithms, the researchers needed to compare the performance of different assignment schemes for randomly selected groups of applications. First, they ran a group of “benchmark” applications on a real system to measure the power consumption and temperature across the nodes. They could then use this information to drive simulations of how much energy each task-assignment scheme would consume in running the same applications. Those schemes predicted to be most efficient in simulation could then be tested in practice, to confirm the algorithm’s recommendations.

However, commercial cloud frameworks do not allow users to access the hardware-level statistics on power consumption and temperature that the researchers required to train their model. Enter Chameleon, a large-scale experimental testbed, which grants users the bare-metal access the COOLR team needed.

“What they are doing requires access to resources deep in the operating system to pull out power consumption information, as well as isolation so that you know that your experiments are not affected by what other users are doing,” said Kate Keahey, CI Senior Fellow and principal investigator on the Chameleon project. “In a typical resource, that’s something that you can’t do, but Chameleon has been specifically designed to support it.”

Chameleon also allowed COOLR to configure their own custom framework for experimentation. In the most recent experiments, they built a 16-node system with two processors on each node, allowing them to build and test their algorithms on a real-world scale. Unlike a commercial data center, the researchers could also ensure that nobody else was using the same resources simultaneously -- protecting the accuracy of their energy measurements -- and could return to the same nodes repeatedly for re-testing and reproducibility purposes.

After using Chameleon to test hundreds of different task placement schemes for 32 different workloads of randomly selected applications, the team found that the best possible assignments reduced power consumption by nearly 5 percent, compared to the average of all schemes together. When they looked solely at fan-power consumption (the amount of energy needed to cool the system), they saw an even larger reduction of 17 percent in the optimal scheme.

This data was then used to train their machine-learning model, which tests one task switch at a time until it determines the most energy-efficient assignment schedule. Re-testing the random workloads using their algorithm to determine the best task placement scheme, they found an average of 12.3 percent fan power reduction with the “machine-recommended” strategy. Even at the level of a 16-node cluster, those energy savings would be significant over long-term use, and could be boosted further with downstream effects.
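The idea of testing one task switch at a time can be illustrated with a simple greedy swap search. This is a sketch under stated assumptions, not the team's method: the article does not describe the model's internals, so `improve_by_swaps`, `toy_cost`, and all of the numbers below are hypothetical, and the toy cost function merely stands in for a trained predictor of fan power.

```python
from itertools import combinations

def improve_by_swaps(assignment, predict_cost):
    """Greedy local search over task placements.

    assignment[i] is the task placed on node i. The search swaps two
    tasks at a time, keeps any swap that lowers the predicted cost, and
    stops when no single swap helps.
    """
    current = list(assignment)
    current_cost = predict_cost(current)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(current)), 2):
            candidate = list(current)
            candidate[i], candidate[j] = candidate[j], candidate[i]
            cost = predict_cost(candidate)
            if cost < current_cost:
                current, current_cost = candidate, cost
                improved = True
                break  # restart the scan from the improved placement
    return current, current_cost

# Hypothetical inputs: how readily each node heats up, and how
# demanding each task is. Integers keep the arithmetic exact.
NODE_HEAT_FACTOR = [13, 10, 11, 9]
TASK_DEMAND = {"t0": 40, "t1": 90, "t2": 60, "t3": 75}

def toy_cost(assignment):
    """Stand-in for a learned model: total demand weighted by node heat."""
    return sum(
        NODE_HEAT_FACTOR[node] * TASK_DEMAND[task]
        for node, task in enumerate(assignment)
    )

final, final_cost = improve_by_swaps(["t1", "t0", "t3", "t2"], toy_cost)
print(final, final_cost)
```

For this toy cost, the search ends with the most demanding task on the node that heats least and the least demanding task on the node that heats most, which matches the intuition Zhang describes. A real cost model need not be this well behaved, and a greedy search can then settle on a placement that is good but not the best possible.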

“If we can reduce the power we not only reduce the budget on the cooling costs, we can also reduce the future costs when a data center needs to purchase the overall cooling system,” Zhang said.

The work runs parallel to experiments publicized recently by Google, which applied its DeepMind artificial intelligence to find ways to reduce energy consumption at its data centers. The COOLR team said their work could potentially complement the recommendations of Google’s model, creating a node-up strategy for reducing energy and heat to accompany Google’s factory-scale insights. But the COOLR team’s open science approach, in contrast to Google’s proprietary research, will allow other scientists to use and improve the machine learning methodology.

“In its essence, their goal is very similar -- to try to come up with smart intelligent mechanisms that are not human-controlled to reduce energy consumption,” Memik said. “They can co-exist very easily.”