More Powerful Chips Need More Cooling – Innovations in Data Center Cooling, Part 1.
Introduction
As an operator of colocation data centers, AQ Compute is committed to providing its customers with housing for their IT equipment backed by the most efficient infrastructure possible. The rapid growth of the data center market and the increasing performance of IT hardware will challenge our data center infrastructure, especially the cooling systems, so we keep track of the latest innovations in data center cooling in order to implement them as early as possible. In this article, we provide an overview of innovative products, development projects and concepts for data center cooling systems.
To keep IT equipment (ITE) functioning properly, cooling is one of the main tasks of data center facilities. Most of the heat is generated by the main processing units, CPUs and GPUs, whose power densities have been rising for decades [1], posing challenges to hardware manufacturers and data center facilities alike. Since the rise of AI applications correlates with this increase in power densities, predictions of sustained growth of up to 65% by 2028 make this an important challenge for the data center industry [2]. The Research Institutes of Sweden (RISE) argue in a recent article titled “Generative AI must run using liquid cooling!” that liquid cooling will be necessary for GPUs in the future [3], and suggest that forced convection combined with optimized heat sinks and targeted liquid flows will become the norm. “‘Thin air’ isn’t sufficient”, they state.
Air-cooled systems can no longer be operated efficiently once processor heat fluxes exceed roughly 37 W/cm² [4], so for higher power densities the use of liquid cooling in the white space appears indispensable. Given the “power race” that is defying Moore’s law, not only will a jump to liquid cooling be needed; innovations across the ITE cooling arena will also be required to cope with these increasing cooling demands while remaining energy efficient, avoiding a decrease in inlet temperatures and keeping PUEs low.
When improving the cooling system of a data center, the entire path of the heat flow, from processor cooling to the heat exchange with an external heat sink, must be considered. In this article we differentiate between innovations at ITE level and at facility level. In Part 1 of this series, we look at ITE innovations from the rack level down to the cooling of the chips themselves at their heat sinks. Facility-level innovations will be covered in Part 2 of this series.
The Thermodynamics
Servers can only be cooled up to a certain heat flow Q, which depends on the heat exchange surface and the cooling medium and is limited by the maximum chip temperature T_die and the air inlet temperature of the server T_fluid,in (equation [1]).
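For reference, the relation referred to as equation [1] can be written as a simple convective heat transfer balance (a reconstruction from the quantities named in the text, with A denoting the heat exchange surface and α the heat transfer coefficient):

Q_die = α · A · (T_die - T_fluid,in)        [1]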
With T_die determined by the processor, the heat flow Q_die can be improved by decreasing the inlet temperature T_fluid,in, increasing the heat exchange surface or increasing the heat transfer coefficient α. Decreasing the inlet temperature leads to higher mechanical cooling ratios and hence to lower efficiencies. Increasing the heat exchange surface is limited by the space available within servers and racks, and by the fact that the dimensions of chips, processors and hardware are highly standardized (the machinery for chip manufacturing is supplied by essentially a single company [5]). The heat transfer coefficient can be increased by raising the heat capacity of the fluid or by increasing the fluid flow as much as possible. Moreover, the hotter the inlet coolant (T_fluid,in), the hotter the outlet can be, and therefore excess heat at higher temperatures can be harvested for further reuse [6].
Air-cooled servers
For air-cooled servers, simply improving airflow management can already make a difference. A higher air mass flow increases the energy consumption of the cooling system (fans need to run faster), but by less than lowering the inlet temperatures would, making it a way to handle increased power densities without decreasing operating temperatures.
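As a back-of-the-envelope illustration of this trade-off (a sketch with assumed, hypothetical numbers, not figures from the article), the airflow needed to carry away a given rack heat load at a chosen air temperature rise follows from the sensible heat balance Q = ṁ·cp·ΔT: a larger allowed temperature rise reduces the required flow, while a higher flow removes more heat at the same inlet temperature.

```python
# Illustrative sketch: how much airflow a given rack heat load requires.
# Assumed, hypothetical numbers -- not figures from the article.

RHO_AIR = 1.2     # kg/m^3, approximate air density at ~20 degC
CP_AIR = 1006.0   # J/(kg*K), specific heat capacity of air

def airflow_for_rack(heat_load_w: float, delta_t_k: float) -> float:
    """Volumetric airflow (m^3/h) needed to absorb heat_load_w watts
    with an air temperature rise of delta_t_k kelvin, from the
    sensible heat balance Q = m_dot * cp * dT."""
    m_dot = heat_load_w / (CP_AIR * delta_t_k)  # kg/s
    return m_dot / RHO_AIR * 3600.0             # m^3/h

# Example: a 30 kW rack with a 12 K air temperature rise needs ~7,500 m^3/h;
# halving the temperature rise would double the required airflow.
print(f"{airflow_for_rack(30_000, 12):.0f} m^3/h")
```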
At rack level, this principle is used in products in which air-cooled servers are attached to powerful rear-door heat exchangers or housed in a closed cabinet, where the heated air at the server outlet is drawn through a cooling coil with coolant inside. These systems can achieve higher power densities than conventional ones and extend the free-cooling ratio.
In the case of the enclosed cabinet, a recooler (fan + coil) can be mounted directly above it, extracting the hot exhaust air from the servers and cooling it down at a cold coil. Each rack has a closed air circuit with short air paths, enabling a higher volume flow than conventional systems. With these closed cooling systems, a heat output of up to 85 kW per rack is possible [7] and a data center can be operated with an average Power Usage Effectiveness (PUE) of 1.11 or less [8].
DLC: Direct-to-chip Liquid Cooling
Turbulator
With direct-to-chip cooling, the coolant is fed directly to a cold plate that is attached to the chip in the server. Assuming a laminar flow of the coolant with little or no turbulence in the cold plate, the coolant particles closest to the chip heat up first and a temperature stratification forms in the flowing coolant. The coolant is colder further away from the chip and therefore absorbs only a small amount of heat. When, or shortly after, the coolant exits the cold plate, flow deflections, e.g. in piping connectors, cause turbulence; coolant particles at different temperatures begin to mix, equalizing the temperature stratification and lowering the temperature level at the server outlet.
A better heat transfer can be achieved with a turbulent flow. In a turbulent flow, the particles that absorb heat from the chip are constantly moving and no temperature stratification forms: a lower temperature prevails at the heat-generating surface and the heat flow from the chip is increased. The more turbulent the flow, the more heat the coolant can absorb, as the heat transfer coefficient increases [9]. Some innovative cold plate products rely on turbulators (see Figure 1) that deliberately generate a flow within the cold plate that is as turbulent as possible, enabling cold plates that can handle a TDP of up to 2,000 W [10].
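To illustrate why turbulence helps, a textbook-level sketch (with assumed channel dimensions and water properties, not data from the cited products) compares the heat transfer coefficient of a fully developed laminar channel flow with a turbulent one estimated via the Dittus-Boelter correlation:

```python
# Illustrative comparison of laminar vs. turbulent convective heat transfer
# in a small circular coolant channel. Property values and dimensions are
# assumed for the sketch -- not data from the cited cold plate products.

K_WATER = 0.63     # W/(m*K), thermal conductivity of water near 40 degC
NU_WATER = 6.6e-7  # m^2/s, kinematic viscosity
PR_WATER = 4.3     # Prandtl number
D = 0.003          # m, channel diameter (assumed)

def h_laminar() -> float:
    """Fully developed laminar flow, constant wall temperature: Nu ~ 3.66."""
    return 3.66 * K_WATER / D

def h_turbulent(velocity: float) -> float:
    """Dittus-Boelter correlation for turbulent flow (heating):
    Nu = 0.023 * Re^0.8 * Pr^0.4."""
    re = velocity * D / NU_WATER
    return 0.023 * re**0.8 * PR_WATER**0.4 * K_WATER / D

print(f"laminar:   {h_laminar():,.0f} W/(m^2*K)")       # ~770
print(f"turbulent: {h_turbulent(2.5):,.0f} W/(m^2*K)")  # ~15,000 at 2.5 m/s
```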
Figure 1 – Example of turbulators. Source: Chilldyne [11]
Microconvective Cooling
Another approach to increasing the heat absorption of liquids on the chip is to direct the cooling liquid head-on against the hot surface to be cooled. Innovative products direct the cooling liquid through an array of nozzles with diameters of less than one millimeter, creating an array of fluid jets that impinge directly on the hardware to be cooled (similar to a pressurized water jet used to clean the facade of a building). The direct impingement results in enhanced convection and hence a high heat transfer coefficient. After the fluid jets have hit the hot surface and cooled it, the heated liquid is pushed to the sides of the heat sink housing, collected at one point and directed outwards [12] (see Figure 2). By aligning the nozzles appropriately, the jets can be aimed directly at thermal hotspots on the chip and cooling can be optimized [13].
Figure 2 – Jetcool’s technology explanatory render (left) and thermal resistivities compared to a “traditional” cold plate cooling [14] (right).
Another advantage of microconvective liquid cooling technology is that no closed cold plate with heat-conducting paste is attached to the chip. Instead, the cooling module is open at the bottom, as shown in the picture above, so that the cooling liquid is in direct contact with the chip, eliminating the heat transfer resistances of the chip housing, the heat-conducting paste and the cold plate. This means that a much greater heat flow can be dissipated from the chip for the same temperature difference (an increased α in equation [1]).
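The benefit can be illustrated with a simple series thermal-resistance model (assumed, order-of-magnitude resistance values rather than measurements from any specific product): removing the interface and cold plate layers lowers the total chip-to-coolant resistance, so the same temperature difference drives a much larger heat flow.

```python
# Illustrative series thermal-resistance model from chip junction to coolant.
# The resistance values are assumed order-of-magnitude figures (K/W),
# not measurements from any specific product.

R_LID = 0.020    # chip housing / lid
R_TIM = 0.030    # heat-conducting paste (thermal interface material)
R_PLATE = 0.010  # cold plate base conduction
R_CONV = 0.040   # convection into the coolant

DELTA_T = 40.0   # K, allowed chip-to-coolant temperature difference

def max_heat_flow(*resistances: float) -> float:
    """Heat flow Q = dT / R_total for thermal resistances in series."""
    return DELTA_T / sum(resistances)

q_cold_plate = max_heat_flow(R_LID, R_TIM, R_PLATE, R_CONV)  # full stack
q_direct_jet = max_heat_flow(R_CONV)                         # coolant touches the chip

print(f"cold plate stack: {q_cold_plate:.0f} W")  # 400 W
print(f"direct contact:   {q_direct_jet:.0f} W")  # 1000 W
```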
Microfluidic Cooling
Microfluidic chip cooling goes one step further. Numerous innovative projects are working on this technology, in which the cooling liquid is not only led to the chip but also through channels (also referred to as “microfins”) that are etched or added into the silicon chip with dimensions of less than 100 micrometers (see Figure 3). The channels either increase the heat transfer surface (and thus the heat flow from the chip to the cooling liquid), or they position the heat transfer point even closer to the heat-generating transistors in the chip [15].
Figure 3 – Microsoft’s Integrated Silicon Microfluidic Cooling of a High-Power Overclocked CPU for Efficient Thermal Management. Conceptual image showing traditional cold plate-based cooling (a) compared with microfluidic cooling (b); switching to a microfluidic heat sink brings a considerable reduction in thermal resistance as well as in form factor. (c) Microfins [16].
Conceptually, the creation of the channels in the silicon is already planned during chip production, so innovative concepts with liquid cooling can be implemented through tunnels in the middle of the chip, which is why the technology is also referred to as embedded cooling. Prototypical innovative projects range from microfluidic channels on and under the chip, with cooling on both sides [17], to chips with co-designed microfluidics and electronics. The latter provides chips with a monolithically integrated manifold microchannel cooling structure, whereby the cooling channels are routed close to each individual transistor; experimentally, a cooling capacity of 1.7 kW/cm² has been achieved [18]. Other concepts envisage the development of tower chips, which consist of stacked chips connected by “through chip vias” (TCVs). For this purpose, copper connections are routed through the chips, connected via the shortest route with very low latency. Due to the short distance between the coolant and the heat-generating transistors (resulting in a low thermal resistance), this technology can reduce flow rates and operate at higher temperatures (reaching coolant temperatures as high as 80°C, very attractive for excess heat recovery [19]). Furthermore, patents can be found for microfluidic channels used as a boiler unit for two-phase fluids, for a further increase in efficiency [20].
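To put this heat flux into perspective with equation [1] (an illustrative calculation with an assumed temperature difference, not a figure from the cited work): removing 1.7 kW/cm², i.e. 1.7·10⁷ W/m², across an assumed 40 K difference between chip and coolant implies an effective heat transfer coefficient of

α ≈ (1.7·10⁷ W/m²) / 40 K ≈ 4·10⁵ W/(m²·K),

several orders of magnitude above typical forced-air convection coefficients.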
Immersion Cooling
Single-phase Immersion
With single-phase immersion cooling, the servers are immersed in a tank filled with a dielectric coolant, which cools them. Conventional immersion systems are driven by natural convection: the immersion fluid heats up, expands and rises due to its lower density, is then cooled, its density increases again and the cooled fluid sinks. Innovative products supplement natural convection with additional forced circulation of the coolant; experiments have shown that an optimally controlled volume flow yields a more efficient system [21]. To enable chips to be cooled optimally, heat sinks are in some cases mounted on them, through which the heat flows from the chip to the immersion fluid. Several projects deal with the optimization of these heat sinks in terms of their design, including fin height, fin spacing, thermal conductivity, baffle height ratio and outlet area, as well as their process parameters such as inlet flow rate and inlet temperature, to increase heat dissipation [22].
Other innovations in the single-phase immersion field revolve around the creation of new fluids with better dielectric, heat transfer and material compatibility properties while remaining environmentally friendly and keeping costs under control. The significant number of fluids currently on the market pushes some manufacturers to become fluid agnostic (Submer [23]).
Another innovation area concerns the serviceability of immersion cooling. While some manufacturers have bet on robotic immersion cooling systems (such as the TGC Otto suggested some years ago [24]), more manufacturers have bet on enclosed chassis structures, in which the immersion fluid is enclosed within separate server units. This allows the operator to take out a server (which can potentially remain a classic 19” unit, coexisting with differently cooled servers within the same rack) and service it outside of the white space in a designated area (such as Liquid Cool Solutions [25]). This approach is also suitable for hybrid cooling, combining cold plate and immersion within the same server unit: direct-to-chip cooling is used for the processors and accelerators, while the server is additionally filled with immersion fluid, which takes care of all other heat-emitting components [26].
Spray Cooling
Innovative products that use spray cooling have recently become available on the market. As with immersion cooling, dielectric liquids are used; they are sprayed through nozzles onto the hot areas of the server (CPU, GPU, etc.). The liquid contacts the heat sink and achieves higher heat transfer coefficients than air and some conventional immersion systems (Airsys/Advanced Liquid Cooled Technologies (ALCT)).
By mounting the servers at a downward angle, the coolant drains downwards and is collected in a tray (see Figure 4). The collected coolant is then re-cooled in a closed circuit and sprayed back onto the hot areas of the servers. Racks with spray cooling enable a power density of 100 kW per rack, and data centers can be operated with a cooling PUE lower than 1.02 during year-round free cooling (as stated by Airsys/ALCT [27], [28]).
Figure 4 – Spray-based immersion cooling. Source: Airsys
Two-phase Immersion
Two-phase immersion cooling works in a similar way to single-phase immersion cooling, with the difference that the immersion fluid evaporates on the chip, so more heat can be transported thanks to the phase change. Innovative projects are working on achieving cooling capacities of up to 2 kW TDP by improving the heat transfer coefficients for boiling liquids, for example through a coral-like heat sink design with internal groove-like features. In addition, coatings are used to facilitate the nucleation of bubbles, increasing the boiling of the coolant and improving heat transfer [29].
Innovative research projects go one step further and use a two-phase spray for cooling. Following the same principle, the cooling liquid is sprayed onto the hot spots of the servers. The two-phase fluid, such as PF-5060 with a boiling point of 56°C, evaporates on the heat sink of the chip and is then condensed again by a re-cooler, which should enable year-round free cooling, even in the tropics. Prototype laboratory tests have shown that this technology can be used to operate data centers with a cooling PUE of about 1.08 [30].
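As a rough illustration of why the phase change helps (using approximate textbook property values for a perfluorocarbon such as PF-5060; the numbers are indicative only), compare the heat absorbed per kilogram of fluid:

q_latent ≈ h_fg ≈ 88 kJ/kg        q_sensible ≈ c_p · ΔT ≈ 1.1 kJ/(kg·K) · 10 K = 11 kJ/kg

so in this example the phase change carries roughly eight times as much heat per kilogram of circulated coolant.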
Water Immersion Cooling?
Further optimization potential for two-phase immersion cooling lies in the selection of the immersion fluid. One of the problems with current fluids is their potential toxicity and global warming potential (GWP); most of them are affected by the upcoming regulations against F-gases [31]. A natural fluid that has been used in innovative prototype projects for immersion cooling of power electronics [32] is deionized water, which has none of these negative properties. Due to the high corrosiveness of deionized water, the electronics are covered with a coating. Future research projects could look at what a water immersion solution for servers might look like.
Other Innovations and Trends
As computing hardware evolves to meet the needs of high-performance computing, innovations in cooling technologies become essential. One approach is the use of superconductors, which offer zero electrical resistance at extremely low temperatures and hence drastically reduce heat dissipation [33]. Cryogenics, using substances like liquid nitrogen or helium, is the cooling technology that enables these low temperatures. For quantum computing, temperatures of around 20-100 millikelvin are required [34].
This trend towards lower temperatures can also be found in current general-purpose hardware. When cooled with air, the increasing TDPs are mostly countered with lower temperatures, at least judging by the requirements of the hardware manufacturers, which prescribe lower admissible maximum temperatures [35]. As shown in equation [1], unless the heat exchange area or the heat transfer coefficient can be increased, reducing temperatures remains the only way to cope with the increase in TDPs, leading to higher energy consumption [36]. To avoid this, newer innovations such as the ones shown in this article will need to emerge and mature, in the cooling arena but also in the hardware field.
In Conclusion
Air-cooling systems are running out of capacity for the ever-increasing power densities of processors, and we will most likely witness the moment when liquid-cooled systems are installed at scale. The examples we have shown in this article are mostly at the forefront of high-density cooling and will have to be further tested before they become the norm. Regulation around F-gases [37] will influence the adoption of new fluids and, consequently, of newly adapted cooling systems (refer to our future articles to understand the challenges of this regulation). The evolution of semiconductor manufacturing, and hence of computer architectures and packaging, may well see current innovations overtaken by a next wave that creates new computing paradigms, which will most likely shuffle the ITE cooling deck again. Finally, hardware and ITE cooling innovations have consequences for how the cooling facility is designed. The development of more efficient cooling techniques, targeting longer free-cooling periods, better cooling controls and lower water consumption, is one aspect of these innovations. In the end, ITE is the reason why data center facilities exist, and some are putting pressure on manufacturers to design “recommended servers” that can bear high temperatures, allowing “global free cooling” [38] (see Figure 5), especially given that data centers will be built massively in non-temperate regions with higher ambient temperatures in the coming years.
Figure 5 – Global maps of annual free-cooling ratio at different space temperatures [38]
[1] Emergence and Expansion of Liquid Cooling in Mainstream Data Centers, ASHRAE Technical Committee 9.9 Mission Critical Facilities, Data Centers, Technology Spaces, and Electronic Equipment
[2] https://www.soprasteria.se/aktuellt/generative-ai–a-$100bn-market-by-2028-according-to-sopra-steria-next/
[3] https://www.ri.se/en/news/blog/generative-ai-must-run-using-liquid-cooling
[4] https://www.sciencedirect.com/science/article/abs/pii/S0360544220324804
[5] https://www.firstpost.com/world/asml-holdings-dutch-company-that-has-monopoly-over-global-semiconductor-industry-12030422.html
[6] https://www.opencompute.org/documents/20230623-data-centers-heatreuse-101-3-2-docx-pdf
[7] https://ddc-cabtech.com/tierpoint-selects-ddc-cabinet-technology-to-propel-ai-workloads-in-ultra-high-density-data-center/
[8] https://www.scalematrix.com/data-center-efficiency
[9] Patil, Pranit M. et al. “Comparative Study between Heat Transfer through Laminar Flow and Turbulent Flow.” (2015).
[10] https://chilldyne.com/2024/01/22/chilldyne-secures-arpa-e-coolerchips-award-to-improve-data-center-efficiency/
[11] https://arpa-e.energy.gov/sites/default/files/2023-11/Day1_06a_Chilldyne.pdf
[12] https://jetcool.com/technology/
[13] https://www.intel.com/content/www/us/en/newsroom/news/intel-dives-into-future-of-cooling.html
[14] https://eepower.com/news/microconvective-cooling-produces-very-high-heat-transfer-coefficients/
[15] https://www.datacenterdynamics.com/en/analysis/microfluidics-cooling-inside-the-chip/
[16] S. Kochupurackal Rajan, B. Ramakrishnan, H. Alissa, W. Kim, C. Belady and M. S. Bakir, “Integrated Silicon Microfluidic Cooling of a High-Power Overclocked CPU for Efficient Thermal Management,” in IEEE Access, vol. 10, pp. 59259-59269, 2022, doi: 10.1109/ACCESS.2022.3179387.
[17] https://www.izm.fraunhofer.de/en/abteilungen/wafer-level-system-integration/exhibits_posters/microfluidic_cooling.html
[18] https://www.nature.com/articles/s41586-020-2666-1
[19] https://www.datacenterdynamics.com/en/analysis/microfluidics-cooling-inside-the-chip/
[20] https://www.tno.nl/en/digital/semicon-quantum/microfluidics-high-performance-thermal/
[21] https://www.sciencedirect.com/science/article/abs/pii/S001793102300176X?dgcid=raven_sd_recommender_email
[22] https://www.sciencedirect.com/science/article/abs/pii/S1359431123001096
[23] https://submer.com/immersion-fluids/
[24] https://www.youtube.com/watch?v=qWOBznmtdT8
[25] https://liquidcoolsolutions.com/immersion-cooling-technology/
[26] https://www.forbes.com/sites/stevemcdowell/2023/06/01/the-innovative-cooling-approach-behind-nvidias-5m-coolerchip-grant/?sh=203222bf157a
[27] https://advancedliquidcooling.com/liquidracktm-spray-immersion-racking-system/
[28] https://airsysnorthamerica.com/liquidrack/
[29] https://www.intel.com/content/www/us/en/newsroom/news/intel-dives-into-future-of-cooling.html
[30] https://www.sciencedirect.com/science/article/abs/pii/S0306261921011466
[31] https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32024R0573
[32] https://www.sciencedirect.com/science/article/abs/pii/S0017931019336002
[33] https://www.imec-int.com/en/articles/superconducting-digital-technology-revolutionize-ai-and-machine-learning-roadmap
[34] https://www.azoquantum.com/Article.aspx?ArticleID=472
[35] https://www.motivaircorp.com/news/what-is-the-data-center-thermal-squeeze-/
[36] ASHRAE Technical Committee 9.9, Mission Critical Facilities, Data Centers, Technology Spaces, and Electronic Equipment. “Emergence and Expansion of Liquid Cooling in Mainstream Data Centers”. https://www.ashrae.org/file%20library/technical%20resources/bookstore/emergence-and-expansion-of-liquid-cooling-in-mainstream-data-centers_wp.pdf
[37] https://echa.europa.eu/de/-/echa-publishes-pfas-restriction-proposal
[38] Yingbo Zhang, Hangxin Li, Shengwei Wang, “The global energy impact of raising the space temperature for high-temperature data centers”, Cell Reports Physical Science, Volume 4, Issue 10, 2023, 101624, ISSN 2666-3864, https://doi.org/10.1016/j.xcrp.2023.101624