This article belongs to our AIGC content generation and processing series; you can find more articles in the series through the service menu of our official account.
AI, represented by large models, is driving a new wave of technological revolution and industrial transformation, bringing enormous opportunities as well as disruptive challenges to many industries. In content production, AIGC is a key force driving the transformation of new media, and the adoption of industry-specific large models will give media companies new paths for exploring the new-media technology revolution.
In this round of the AI revolution, AI accelerator hardware is the core of AI computing power. Without hardware computing power behind them, even the largest models are of little use; even the simplest model parameter tuning requires AI accelerator cards, or else public cloud resources, which themselves ultimately run on professional AI accelerator hardware.
At the same time, we also see a shortage of AI computing infrastructure in China. Data centers and cloud computing centers are no longer built around CPUs; they are expanding around GPUs, which excel at parallel processing. This requires us to fully recognize the importance of GPUs.
This series of articles discusses the rise of the GPU as the AI accelerator of choice, NVIDIA as the leader in AI computing hardware, and the state of domestic AI computing hardware, including a detailed look at which company might become the domestic alternative to NVIDIA.
In previous articles, we covered the development history of NVIDIA's GPUs, the logic of CUDA cores and CUDA programming, and NVLink and NVSwitch. NVIDIA has become the absolute leader in AI hardware not only because of CUDA and NVLink, but also because of InfiniBand, the high-bandwidth interconnect that is another key to its dominance in the hardware field. Today, we begin our introduction to InfiniBand.
-
01. Early History of InfiniBand
InfiniBand is a high-speed communication protocol. Its name literally means "infinite bandwidth."
The digital computers we use today have followed the von Neumann architecture since their inception. In this architecture there is a CPU (the arithmetic and control units), memory (RAM and hard disks), and I/O (input/output) devices.
In the early 1990s, to support an increasing number of external devices, Intel was the first to introduce the PCI (Peripheral Component Interconnect) bus design into standard PC architecture.
The digital economy has developed vigorously in the 21st century, and rapid technological progress demands equally rapid upgrades in computer hardware performance. We all know the famous Moore's Law, by which the number of transistors on a chip (and, roughly, hardware performance) doubles about every 18 to 24 months. However, while processors and memory advanced rapidly, the PCI bus inside the computer was updated far more slowly, greatly limiting the overall I/O performance of computers.
To solve this internal high-speed communication problem, Intel, Microsoft, and SUN led the development of the “Next Generation I/O (NGIO)” technology standard. Meanwhile, IBM, Compaq, and HP led the development of “Future I/O (FIO).”
In 1999, the FIO Developers Forum and the NGIO Forum merged to establish the InfiniBand Trade Association (IBTA).
In 2000, version 1.0 of the InfiniBand Architecture Specification was officially released. InfiniBand was originally intended to replace the PCI bus. It introduced RDMA (Remote Direct Memory Access), offering lower latency, higher bandwidth, and greater reliability, and therefore much stronger I/O performance.
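To make the RDMA idea more concrete, here is a minimal sketch in C using the user-space verbs API from libibverbs. It assumes an RDMA-capable adapter and the libibverbs development package are installed; the buffer size and device choice are purely illustrative. It shows the step that gives RDMA its performance edge: registering an ordinary buffer with the network adapter so the adapter can move data in and out of it directly, bypassing the operating-system kernel on the data path.

```c
/* Sketch: registering a buffer for RDMA with libibverbs.
 * Assumes an RDMA-capable NIC is present; compile with: gcc rdma_reg.c -libverbs
 * Error handling is kept minimal for brevity. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (4 * 1024)

int main(void)
{
    int num_devices = 0;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    /* Open the first RDMA device (an InfiniBand HCA, for example). */
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    if (!ctx) { perror("ibv_open_device"); return 1; }

    /* A protection domain groups resources that are allowed to work together. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) { perror("ibv_alloc_pd"); return 1; }

    /* Register an ordinary user-space buffer with the adapter. After this,
     * the adapter can read/write the buffer directly (zero-copy), and a
     * remote peer that knows the rkey can target it with RDMA READ/WRITE
     * without involving the local CPU or kernel on the data path. */
    char *buf = malloc(BUF_SIZE);
    memset(buf, 0, BUF_SIZE);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    /* Completions for posted work requests are reported here; the application
     * polls for results instead of taking per-packet interrupts. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    if (!cq) { perror("ibv_create_cq"); return 1; }

    printf("registered %d bytes, lkey=0x%x rkey=0x%x\n",
           BUF_SIZE, mr->lkey, mr->rkey);

    /* Cleanup. */
    ibv_destroy_cq(cq);
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}
```

In a real application, the two sides would then create queue pairs, exchange the buffer address and rkey out of band, and post RDMA read/write requests that complete without interrupting the remote CPU, which is where the low latency comes from.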
So what does NVIDIA have to do with InfiniBand? This brings us to a company called Mellanox, which NVIDIA later acquired and with which it went on to reach a new peak of development.
In May 1999, several employees who left Intel and Galileo Technology founded a chip company in Israel and named it Mellanox.
After its establishment, Mellanox joined NGIO. Later, when NGIO and FIO merged, Mellanox also joined the InfiniBand camp. In 2001, they launched their first InfiniBand product.
In 2002, Intel chose to withdraw from the InfiniBand camp and decided to develop PCI Express (PCIe). Another giant, Microsoft, also exited InfiniBand development. Although companies like SUN and Hitachi chose to stick with it, InfiniBand's momentum had clearly dimmed.
Starting in 2003, InfiniBand turned to a new application field, namely computer cluster interconnection. That year, Virginia Tech created a cluster based on InfiniBand technology, ranking third in the TOP500 (the world’s top 500 supercomputers) test at that time.
In 2004, another important non-profit organization for InfiniBand was born: the OFA (OpenFabrics Alliance).
OFA and IBTA work in cooperation. IBTA is mainly responsible for developing, maintaining, and enhancing the InfiniBand protocol standard, while OFA is responsible for developing and maintaining the open-source InfiniBand software stack and its upper-layer application APIs.
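As a small illustration of what that software stack exposes, the sketch below (again assuming libibverbs from the OpenFabrics stack is installed; error handling is kept minimal) simply enumerates the RDMA devices on a host and prints basic port information through the verbs API.

```c
/* Sketch: listing RDMA devices and port state through the verbs API.
 * Compile with: gcc ib_list.c -libverbs */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int n = 0;
    struct ibv_device **list = ibv_get_device_list(&n);
    if (!list) { perror("ibv_get_device_list"); return 1; }

    for (int i = 0; i < n; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        if (!ctx)
            continue;

        struct ibv_device_attr dev_attr;
        if (ibv_query_device(ctx, &dev_attr)) {
            ibv_close_device(ctx);
            continue;
        }
        printf("%s: fw %s, %d physical port(s)\n",
               ibv_get_device_name(list[i]),
               dev_attr.fw_ver, dev_attr.phys_port_cnt);

        /* Port numbers start at 1 in the verbs API. */
        for (int p = 1; p <= dev_attr.phys_port_cnt; p++) {
            struct ibv_port_attr port_attr;
            if (ibv_query_port(ctx, p, &port_attr) == 0)
                printf("  port %d: state=%d lid=%d active_mtu=%d\n",
                       p, port_attr.state, port_attr.lid, port_attr.active_mtu);
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    return 0;
}
```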
-
02. Rapid Development of InfiniBand
In 2005, InfiniBand found a new scenario: connecting storage devices. That year, InfiniBand and FC (Fibre Channel) were widely used in SAN (Storage Area Network) storage systems.
InfiniBand gradually gained popularity, its user base grew, and its market share kept rising. By 2009, more and more high-speed interconnects were built on InfiniBand.
During the rise of InfiniBand, Mellanox also grew stronger, gradually becoming the leader in the InfiniBand market.
After 2012, as demand for high-performance computing (HPC) kept growing, InfiniBand continued to thrive and its market share kept increasing. In 2015, InfiniBand's share of the TOP500 list surpassed 50% for the first time, reaching 51.4% (257 systems). This marked InfiniBand's first win over Ethernet on the list, and InfiniBand became the preferred internal interconnect for supercomputers.
In 2013, Mellanox acquired the silicon photonics company Kotura and the parallel optical interconnect chip maker IPtronics, further rounding out its industry footprint. By 2015, Mellanox held 80% of the global InfiniBand market. Its business gradually extended from chips to network cards, switches/gateways, long-distance communication systems, and cables and modules, making it a world-class networking provider.
In 2019, NVIDIA paid $6.9 billion to acquire Mellanox, outbidding rivals Intel and Microsoft (who reportedly offered $6 billion and $5.5 billion respectively).
Regarding the reason for the acquisition, NVIDIA CEO Jensen Huang explained:
“This is a combination of two global leaders in high-performance computing; we focus on accelerated computing, while Mellanox focuses on interconnect and storage.”
In the years that followed, we witnessed the rise of AIGC and large models, and demand for high-performance computing and intelligent computing exploded worldwide.
Supporting such massive computing demand requires high-performance computing clusters, and in terms of performance, InfiniBand is the top choice for interconnecting them.
By combining its GPU computing advantages with Mellanox's networking strengths, NVIDIA has built a powerful "computing power engine." In computing infrastructure, NVIDIA undoubtedly holds a leading position.
Today, the competition in high-performance networking is a battle between InfiniBand and high-speed Ethernet, with the two sides roughly evenly matched. Well-funded operators are more likely to choose InfiniBand, while those focused on cost-effectiveness tend to prefer high-speed Ethernet.
There are also other technologies, such as IBM's BlueGene interconnect, Cray's interconnects, and Intel's Omni-Path, but these largely sit in the second tier.
Follow us and give us a thumbs up, as it is our motivation to keep updating.