Analysis Report on the Chip Cooling Industry Chain

01

Overview of Chip Cooling

Origin of Chip Cooling: Electronic devices heat up because electrical work is converted into thermal energy

Electronic devices heat up because the electrical work they consume is ultimately converted into thermal energy. Chips are the core components of electronic devices; they process, store, and transmit data by converting electrical signals into various functional signals. In doing so they generate a large amount of heat, because the transmission of electrical signals is accompanied by energy losses associated with resistance, capacitance, and inductance, and this lost energy is dissipated as heat.

Excessive temperature can affect the performance of electronic devices and even lead to damage. According to the “Research Status and Development Prospects of Electronic Chip Cooling Technology”, for electronic chips that require stable and continuous operation, the maximum temperature should not exceed 85 °C, as high temperatures can cause chip damage.

Cooling technology must be upgraded continuously to keep the operating temperature of electronic devices under control. As chip performance improves, power consumption rises and places higher demands on cooling technology. In addition, training and inference for large AI models require more computing power from AI chips, which is expected to create further room for growth in advanced cooling technologies.

Principles of Cooling Technology: From one-dimensional heat pipes to 3D VC and liquid cooling

Cooling addresses thermal management in high-performance computing devices: by removing heat directly from the chip or processor surface, it optimizes device performance and extends lifespan. As chip power consumption has increased, the technology has progressed from one-dimensional heat pipes with linear temperature equalization, to two-dimensional vapor chambers (VC) with planar temperature equalization, to three-dimensional integrated temperature equalization (3D VC technology), and ultimately to liquid cooling.

Chip Cooling Innovations: Immersion cooling is effective, while cold plate cooling is more mature

According to the ODCC “Cold Plate Liquid Cooling Server Design White Paper”, considering factors such as initial investment cost, maintainability, PUE effect, and industry maturity, cold plate and single-phase immersion cooling technologies have advantages over other liquid cooling technologies and are currently the mainstream solutions in the industry.

02

Main Cooling Technologies

Heat Pipes: Efficient heat transfer devices suitable for high power and compact spaces

Heat pipes are efficient heat transfer devices. They move heat quickly from one end to the other through the phase change of an internal working fluid. Their structure is simple, consisting of a sealed container, a capillary structure, and the working fluid. Heat pipes offer high thermal conductivity and good temperature uniformity, with near-isothermal behavior along their length. They are used for high-power chips and in products with limited cooling space, such as laptops, servers, gaming consoles, VR/AR devices, and communication equipment.

Vapor Chambers (VC): Higher thermal conductivity and flexibility compared to heat pipes

The vapor chamber (VC) is a more advanced and efficient thermal conduction element than the heat pipe, and it excels at managing heat in high-density electronic devices. Compared to heat pipes, VCs offer greater thermal conductivity and flexibility. The thermal conductivity of copper is 401 W/(m·K); heat pipes can reach 5,000-8,000 W/(m·K), while vapor chambers can achieve 10,000-20,000 W/(m·K) or even higher. Heat pipes conduct heat in one dimension and are limited by their shape, whereas vapor chambers can be designed in any shape to match the chip layout, even accommodating multiple heat sources at different heights.
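
To give a feel for what these conductivity figures mean, the following is a minimal sketch that applies steady-state one-dimensional conduction (Fourier's law, ΔT = Q·L / (k·A)) to the copper and heat pipe values quoted above. Treating a heat pipe as a solid bar with an "effective" conductivity is a simplification, and the heat load and geometry are hypothetical values chosen only for illustration; they do not come from the report.

```python
def temperature_drop(heat_w, length_m, area_m2, conductivity_w_mk):
    """Steady-state 1D conduction: delta_T = Q * L / (k * A), in kelvin."""
    return heat_w * length_m / (conductivity_w_mk * area_m2)


# Hypothetical case (not from the report): 50 W carried 0.1 m
# through a 1 cm^2 cross-section.
HEAT_W = 50.0
LENGTH_M = 0.1
AREA_M2 = 1e-4

# Conductivities quoted in the text, in W/(m*K). The heat pipe values are
# effective conductivities, used here as if the pipe were a solid bar.
conductors = {
    "copper": 401.0,
    "heat pipe (low end)": 5000.0,
    "heat pipe (high end)": 8000.0,
}

for name, k in conductors.items():
    drop = temperature_drop(HEAT_W, LENGTH_M, AREA_M2, k)
    print(f"{name}: temperature drop of about {drop:.1f} K")
```

Under these assumptions the same heat load produces a temperature drop more than ten times smaller across the heat pipe than across solid copper, which is the practical meaning of the higher conductivity figures.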

Room Air Conditioning: Water-cooled air conditioning is more effective than air-cooled systems

Air-cooled direct expansion systems are air conditioning systems used mainly for cooling and heating small to medium-sized buildings or individual rooms. The refrigerant is generally Freon, and the cooling capacity is 10-120 kW per unit. Water-cooled systems are central air conditioning systems that use water as the medium for transferring heat. They typically consist of chillers, cooling towers, water pumps, and pipelines, and are widely used in large buildings.

Liquid Cooling: Cold plate and immersion cooling are the main types

Server liquid cooling is divided into direct cooling and indirect cooling, with immersion cooling being the primary method for direct cooling and cold plate cooling for indirect cooling. In cold plate liquid cooling, the cooling liquid does not directly contact the server components but exchanges heat through the cold plate, hence it is called indirect liquid cooling. Based on whether the cooling liquid undergoes a phase change in the cold plate, it can be classified into single-phase cold plate liquid cooling and two-phase cold plate liquid cooling. Immersion cooling involves directly immersing the entire server or its components in a liquid coolant.

Cold Plate Liquid Cooling: Requires server modifications, penetration rate gradually increases

Cold plate servers require modifications to the server’s piping and structure. For example, Inspur’s 2U four-node high-performance computing server i24 adds multiple cold plates in contact with heat-generating units such as CPUs, I/O, and memory. Multiple pipelines connect internally to the cold plates and externally to the cabinet-level distribution pipelines. As a result, approximately 95% of the system’s heat is removed directly by the liquid through contact with the cold plates, while the remaining 5% is removed by the cooling water in the liquid heat exchanger behind the PSU.

Because cold plate liquid cooling modifies the original server structure, factors such as responsibility allocation and assembly methods come into play. Major players believe that the original server manufacturers will take the lead: they will procure components such as cold plates and pipes and then assemble them in production. The average price of cold plate liquid cooling servers may be higher than that of air-cooled servers, and as penetration rates increase, server manufacturers are expected to achieve growth in both volume and price, along with profitability.

Immersion Cooling: Entire server immersed in liquid, high technical requirements

Immersion cooling directly immerses the entire server or its components in a liquid coolant. The liquid completely surrounds the server components and therefore absorbs and dissipates heat more efficiently. Depending on whether the engineered coolant undergoes a phase change during cooling, it is divided into single-phase immersion cooling and two-phase immersion cooling. Immersion cooling servers require multiple design modifications, including shell design, motherboard modifications, cooling system upgrades, and sealing. These have high technical requirements, and such servers are mainly produced by server manufacturers.

03

Market Space

Driver 1: Chip protection; temperature control helps maximize chip performance

Excessive chip temperature can affect device performance and even lead to damage. According to “Carbontech Magazine”, performance can degrade significantly when the temperature of electronic devices is too high. When the operating temperature of a chip approaches 70-80 °C, every further 10 °C increase can reduce the chip’s performance by about 50%. More than 55% of electronic device failures are caused by excessive temperature. We believe that with the development of large AI models and the improvement of chip performance, chip power consumption and operating temperature are on the rise, which may affect the efficiency of processors and other components. This places higher demands on chip-level cooling technologies, which are expected to gain room for growth and achieve simultaneous increases in volume and price.

Driver 2: Development of AI large models + growth in chip performance, continuous increase in chip power consumption

In servers, CPU and GPU chips account for a relatively large share of power consumption. According to the “Research Progress on Power Consumption Models of Data Center Servers”, the CPU, memory, and storage in general-purpose servers account for 32%, 14%, and 5% of power consumption, respectively. AI servers have a heterogeneous “CPU + GPU” architecture, and high GPU power consumption drives up overall server power consumption. For example, the power consumption of NVIDIA’s H100 GPU reaches 700W, and the maximum power consumption of the DGX H100 server is 10.2kW, so GPUs are expected to account for about 55% of total server power consumption. Chip power consumption continues to rise: Intel’s Ice Lake CPU has a maximum power consumption of 270W, and the Granite Rapids CPU expected to launch in 2024 is anticipated to consume even more. NVIDIA’s B200 GPU, announced in 2024, has a power consumption of 1000W. As chip performance improves and large AI models develop, the power consumption of CPU and GPU chips is expected to keep rising, creating significant demand for advanced cooling devices.
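
The roughly 55% GPU share can be checked with simple arithmetic. The sketch below assumes eight H100 GPUs per DGX H100 server (the standard DGX H100 configuration, though the report does not state the count) and that each GPU draws its full 700W while the server draws its 10.2kW maximum.

```python
GPU_POWER_W = 700            # H100 GPU power, from the text
SERVER_MAX_POWER_W = 10_200  # DGX H100 maximum power, from the text
GPUS_PER_SERVER = 8          # assumption: GPU count is not stated in the text


def gpu_power_share(gpu_power_w, gpu_count, server_power_w):
    """Fraction of server power drawn by the GPUs."""
    return gpu_power_w * gpu_count / server_power_w


share = gpu_power_share(GPU_POWER_W, GPUS_PER_SERVER, SERVER_MAX_POWER_W)
print(f"GPU share of server power: {share:.1%}")
```

With these inputs the GPUs draw 5.6kW of the 10.2kW total, which is consistent with the figure of about 55% quoted above.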

Driver 3: Policies such as “dual carbon” and “East Data West Computing” require reducing data center PUE

PUE = Total energy consumption of the data center / Energy consumption of IT equipment. PUE is the core indicator for evaluating the energy efficiency of data centers; the closer the value is to 1, the higher the energy efficiency. In a data center, the air conditioning system is the second-largest energy consumer after the IT equipment. When the IT systems cannot be upgraded, reducing the energy consumption of the air conditioning system is therefore the crucial step. When the air conditioning system’s share of energy consumption falls from 38% to 18%, the PUE of the data center drops from 1.92 to 1.3.
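
The PUE definition can be written as a small function. The sketch below also backs out what the quoted figures imply, under the assumption that the 38% and 18% cooling figures are shares of total data center energy (the report does not say so explicitly). The function names and the example loads are illustrative only.

```python
def pue(it_energy, cooling_energy, other_energy=0.0):
    """PUE = total data center energy / IT equipment energy."""
    if it_energy <= 0:
        raise ValueError("IT energy must be positive")
    return (it_energy + cooling_energy + other_energy) / it_energy


def implied_other_share(pue_value, cooling_share):
    """Share of total energy left for non-IT, non-cooling loads.

    Assumes cooling_share is a fraction of TOTAL data center energy.
    The IT share of total energy is 1 / PUE by definition.
    """
    it_share = 1.0 / pue_value
    return 1.0 - it_share - cooling_share


# Hypothetical loads: 100 units of IT energy plus 30 units of cooling.
print(f"PUE = {pue(100.0, 30.0):.2f}")

# The two operating points quoted in the text.
for pue_value, cooling_share in [(1.92, 0.38), (1.3, 0.18)]:
    other = implied_other_share(pue_value, cooling_share)
    print(f"PUE {pue_value}: IT {1 / pue_value:.1%}, "
          f"cooling {cooling_share:.0%}, other {other:.1%}")
```

Under this reading, the quoted figures leave roughly 10% of total energy for other overhead (such as power distribution and lighting) at a PUE of 1.92 and roughly 5% at a PUE of 1.3, so the cooling system accounts for most of the improvement.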

Policies such as “dual carbon” and “East Data West Computing” require reducing data center PUE. According to the Uptime Institute, as of 2022, the average PUE of large and medium-sized data centers worldwide is 1.55. According to the “China Data Center Industry (Ningxia) Development White Paper (2022)”, the average PUE of IDC nationwide in 2021 was 1.49. Under the dual policies of “dual carbon” and “East Data West Computing”, the average PUE of newly built large and super-large data centers nationwide is expected to drop below 1.3, with cluster PUE requirements of ≤1.25 in the east and ≤1.2 in the west, and ≤1.15 for advanced demonstration projects.

According to CDCC and Inspur, the PUE of data centers using air cooling is generally around 1.4-1.5, while liquid-cooled data centers can reduce PUE to below 1.2, meeting the relevant policy requirements. We believe the trend is toward more energy-efficient cooling technologies, and liquid cooling technology may gain further room for growth.

Chip Cooling Market: High-end processor shipments increase significantly + power consumption rises, driving simultaneous increases in volume and price

As the market for AI chips and AI servers expands and rising chip power consumption raises cooling requirements, we believe that growth in the chip-level cooling market is expected to accelerate. The AI chip and AI server market is growing rapidly. According to Precedence, the global AI chip market is expected to reach $47.7 billion by 2026, with a CAGR of 29.72% from 2024 to 2026. In FY2024 Q4, NVIDIA’s revenue reached $22.1 billion, a quarter-on-quarter increase of 22% and a year-on-year increase of 265%, marking the third consecutive quarter in which revenue more than doubled year-on-year. According to Statistics, global AI server shipments are expected to reach 2.369 million units by 2026, with a projected CAGR of 25.50% from 2024 to 2026. The rising power consumption of AI chips is expected to drive growth in the cooling market. NVIDIA’s B200, announced in 2024, uses the N4P process and packs 208 billion transistors, compared with 80 billion transistors on the N4 process for the H100. The higher packaging density brings power consumption to 1000W, placing higher demands on cooling technology.
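
The forecasts above give only a 2026 endpoint and a growth rate. As a rough illustration, the sketch below backs out the 2024 starting values they imply, assuming that "CAGR from 2024 to 2026" means two annual compounding periods; that reading is an assumption, and the derived 2024 values are not figures from the report.

```python
def implied_base(final_value, cagr, periods):
    """Starting value implied by a final value and a compound annual growth rate."""
    return final_value / (1.0 + cagr) ** periods


PERIODS = 2  # assumption: 2024 -> 2026 is two compounding years

# 2026 forecasts and CAGRs quoted in the text.
ai_chip_2024 = implied_base(47.7, 0.2972, PERIODS)     # USD billions
ai_server_2024 = implied_base(2.369, 0.2550, PERIODS)  # million units

print(f"Implied 2024 AI chip market: ${ai_chip_2024:.1f} billion")
print(f"Implied 2024 AI server shipments: {ai_server_2024:.2f} million units")
```

Under this assumption, the forecasts correspond to a 2024 base of roughly $28 billion for AI chips and roughly 1.5 million units for AI servers.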

Telecom Operators: Liquid cooling is expected to reach a 50% penetration rate by 2025

Telecom operators may advance liquid cooling technology step by step through technical verification and large-scale trials. In 2023, the three major operators jointly released a white paper on liquid cooling technology, proposing a “three-year vision”:

1) 2023: Conduct technical verification of liquid cooling technology, fully validate its performance, reduce PUE, and build up technical capabilities for planning, construction, and maintenance.

2) 2024: Conduct large-scale testing, with 10% of new data center projects applying liquid cooling technology as pilots. Promote the decoupling of liquid cooling cabinets from servers to enhance competition, promote the maturity of the industrial ecosystem, and reduce total lifecycle cost.

3) 2025: Move to large-scale application, with over 50% of data center projects applying liquid cooling technology, jointly promoting a high-quality development pattern with unified standards, a complete ecosystem, optimal costs, and large-scale deployment.

The above materials are reprinted from the “Thermal Management Expert” online platform, and the article is for communication and learning purposes only. Copyright belongs to the original author. If there is any infringement, please inform us so that it can be deleted.
