In recent years, companies such as Apple and NVIDIA have introduced ARM-based processors such as the Apple M1 and Grace, raising expectations for ARM in the server domain. ARM processors are regarded as an important architecture for next-generation servers because of their excellent energy efficiency and support for cloud computing. Fountainbridge Capital has consistently focused on the next generation of computing platforms and their core hardware and software, continuously investing in areas such as CPU, GPU, quantum computing, foundational software, and data centers.

This article, by Dr. Winnie Shao, summarizes the development of ARM in servers. Dr. Shao has extensive experience in the ARM server field, having served as Marketing Director and Senior Product Manager at ARM China and Pingtouge Semiconductor, and has witnessed the development of ARM servers firsthand. Reflecting on history helps us understand and seize the immense opportunities that may arise.

Author: Winnie Shao

Source: Enterprise Storage Technology

1. The First Wave: Starting in 2008

In 2008, ARM began planning its server strategy. Putting thought into action, ARM invested in a startup called Smooth Stone, later renamed Calxeda. The initial round of investment was $48 million.

Calxeda’s initial goal was to reduce energy consumption in data centers and increase computing density in the same space. Remember these two goals; to this day, the original intention remains unchanged.

At that time, the market was dominated by Cortex-A8 products, and the first multi-core Cortex-A9 products would not be launched for another three years. Intel’s Xeon at that time had only four cores, although the frequency was already at 3.x GHz, while AMD’s 45nm Opteron CPU had just been released. That year, IBM announced its Power product line, starting with a staggering 64 cores. Apple released the iPhone 3G, which was essentially the iPhone 2. TSMC’s mainstream process was 40nm, with an annual revenue of $10 billion.

1.1 Calxeda in 2011

In 2010, Smooth Stone officially changed its name to Calxeda and moved its headquarters to Austin.

In 2011, Calxeda released its EnergyCore ECX-1000 chip based on the A9 architecture.

The Evolution of ARM Servers: Three Waves and a History

This design is worth examining. The processor module, made up of four Cortex-A9 cores, is quite conventional, and the I/O controllers are standard interfaces (standard interfaces are not easy to get right; the core of a good product is to excel in the conventional parts). The management engine and the fabric switch, however, are very innovative.

The EnergyCore Fabric is an integrated L2 switch supporting mesh, butterfly tree, and 2D torus topologies. Bandwidth between virtual ports can be allocated at 1 Gb/s, 2.5 Gb/s, 5 Gb/s, or up to 10 Gb/s per core. Through it, server nodes can form networks on their own without going through a top-of-rack switch. Thus, a Calxeda board with four chips provides 16 cores, and server systems with up to 480 cores are supported.

This design philosophy makes sense; if you design a very low-cost server chip but the supporting network is still expensive, high-density designs will only increase costs. This fabric can connect 1024 system boards, equating to 4096 chips using a 10G network interface.
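The scaling figures quoted above follow from simple multiplication. The sketch below is only an illustration, assuming the counts given in the text (four Cortex-A9 cores per chip, four chips per board, 1024 boards per fabric); the constant and function names are hypothetical and not part of any Calxeda tooling.

```python
# Counts taken from the text; all names here are illustrative assumptions.
CORES_PER_CHIP = 4    # quad-core Cortex-A9 per ECX-1000
CHIPS_PER_BOARD = 4   # one Calxeda board carries four chips
MAX_BOARDS = 1024     # boards a single fabric can connect


def fabric_totals(boards: int) -> dict:
    """Return chip and core counts for a fabric made of `boards` boards."""
    chips = boards * CHIPS_PER_BOARD
    return {"boards": boards, "chips": chips, "cores": chips * CORES_PER_CHIP}


def boards_for_cores(cores: int) -> int:
    """Number of boards needed to reach `cores` cores, rounded up."""
    per_board = CORES_PER_CHIP * CHIPS_PER_BOARD
    return -(-cores // per_board)  # ceiling division


if __name__ == "__main__":
    print(fabric_totals(1))            # a single board
    print(fabric_totals(MAX_BOARDS))   # the largest fabric described
    print(boards_for_cores(480))       # boards behind a 480-core system
```

Under these assumptions one board yields 16 cores, 1024 boards yield 4096 chips, and a 480-core system corresponds to 30 boards.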

The EnergyCore Management Engine is an integrated BMC that supports IPMI 2.0 and DCMI, and also supports remote debugging via the Serial over LAN (SoL) protocol. The management engine’s strongest feature is power management, allowing Calxeda’s server chip to dynamically adjust power consumption between 4W and 1W. The manufacturing cost per node is approximately $28.

1.2 Computex 2012

In 2012, Ian Ferguson publicly spoke at Computex in Taipei, marking ARM’s first introduction of its server efforts to the public. He shared the stage with Mark from Ubuntu, who quoted Frank from Facebook regarding the value of performance per watt per dollar.

1.3 Marvell Armada XP 2013

During the first wave of server developments, the Marvell Armada XP quad-core series is also noteworthy. However, the core here is neither A9 nor A15, but a custom core from Marvell.

This highly integrated, low-power SoC is very suitable for storage applications. Dell used it as the core for its “Copper” ARM server system. Baidu had also used it. This was the first case of ARM servers in internet companies.

1.4 Calxeda’s Closure 2013

If Calxeda could have raised a third round of investment, according to the roadmap below, an A15 chip would have been mass-produced soon, with two Armv8 chips planned. Unfortunately, the fundraising efforts were unsuccessful. Calxeda, from its establishment in 2008 to its closure in 2013, had a total investment of $103 million ($48 million in 2010 and $55 million in 2012), with a total of 130 employees.

In its closure email, it stated that the emergence of ARM servers would “transform the industry forever.” Looking back now, it certainly has.

Analysts from Insight 64 remarked that Calxeda had spent too much on 32-bit ARM servers. Indeed, in 2011, ARM announced the 64-bit Armv8, and Applied Micro unveiled the X-gene plan, indicating that the second wave of Armv8 server developments had already begun. Calxeda’s closure marked the end of the first wave.

2. The Second Wave: ARMv8 2011

The three keywords of the second wave are custom cores, mainstream performance, and standard design. In the early years of ARM servers, chip design companies from various fields submitted their products based on their understanding of server CPU chips. I focus on APM’s X-gene, Cavium’s ThunderX, and Qualcomm’s Centriq 2400, while also providing clues for other chips for interested readers to explore further.

2.1 AppliedMicro – X-gene 2011

In October 2011, when ARM first announced the ARMv8 architecture, Applied Micro unveiled their custom architecture X-gene plan.

The first generation of X-gene features eight custom cores, named Storm, with each pair of cores sharing a 256KB L2 cache, unlike ARM’s usual clusters of four cores. The AMD Opteron A1100 processor, codenamed Seattle and discussed in the next section, likewise does not put four Cortex-A57 cores in a single cluster but uses four clusters of two A57 cores each. In the A1100, each pair of A57 cores shares 1MB of L2 cache, four times as much as in X-gene. However, X-gene’s Storm is a 4-issue core, while the A57 is a 3-issue design, which sits at a sweet spot for efficiency.

X-gene features eight cores and is equipped with four memory channels, a CPU-to-memory ratio that is rare even in the x86 camp. It also integrates two 10G NICs with RoCE support, showcasing its SoC advantages.

According to the power consumption parameters provided by Applied Micro, under full load, one core consumes 2 watts, while in idle state, it consumes only 0.5 watts.

The part of the X-Gene design that impressed me the most is MSLIM, which is a small processor cluster composed of four A5 cores that provides acceleration. I am unsure if any customers used this processor group or the design philosophy of that year.

Anandtech had a detailed and somewhat negative review of X-gene. The main point was its immaturity; the performance and energy efficiency advantages were not evident. It tested HPE’s Moonshot system, and HPE’s official documents rated X-gene highly, as it was the first mass-produced ARM 64-bit server chip, and early software partners used its systems.

2.2 AMD A1100 2012

A year after the ARMv8 architecture was launched, ARM released two products from the Cortex-A5x series, the A57 and the A53. As is customary at such launches, a heavyweight partner made a dazzling appearance alongside ARM: AMD.

This chip, codenamed Seattle, belongs to the Opteron series, and the official product name is A1100, which is now absent from AMD’s mainline product history.

At that time, AMD spent a long time explaining why it was venturing into ARM servers and how it positioned its x86 and ARM product lines internally. To quell external doubts, it even announced the K12, which only ever lived in the news (2015).

Looking back at 2012, one term cannot be overlooked: “microserver.” At that time, AMD had just acquired SeaMicro, a company focused on building high-density, low-power systems around Freedom Fabric. A fabric, very high density, and low power: sounds familiar, right? It follows Calxeda’s approach. The following image shows a 10U chassis housing 768 CPUs, along with four GE switches and a load balancer.

In such a system design, equipping it with a super low-power ARM processor is quite reasonable. Therefore, choosing the standard core Cortex-A57 shortens development time and saves costs, which is a logical step.

Information on Cortex-A57 is ubiquitous, so I won’t list it here. As mentioned earlier, AMD chose a 2 core 4 cluster configuration instead of the 4 core 2 cluster common in mobile APs. The list price of this chip is $150, which is quite competitive.

In a sense, although AMD’s Seattle is counted in the second wave, its design philosophy comes entirely from the first wave. The K12 is the true second-wave design. However, judging by K12’s design goals within AMD’s framework, the motivation for pursuing ARM was clearly tied to x86. Jim Keller himself was originally connected with K12.

Intel’s response to this wave was the 14nm “Xeon-D.”

2.3 Cavium ThunderX 2014

To some extent, Cavium’s 48-core ThunderX is the product that truly initiated the second wave of ARM server developments. It encompassed all the characteristics that a mainstream server chip should have, such as dual-socket support and performance.

Cavium, being a company only one-tenth the size of AMD, had early capabilities in designing high-core processors, although previously it was focused on MIPS network application processing.

Because it had only a 2-issue custom core, its single-core performance was relatively weak. However, the overall SoC design, particularly in multi-socket configurations, was exceptional. Moreover, thanks to Cavium’s experience in network processors, this chip had a rich set of acceleration engines and I/O interfaces.

To reduce power consumption, it could selectively turn off acceleration engines, resulting in four different configurations: cloud computing version, storage version, carrier version, and security version.

Anandtech has a very good performance test that helps to understand Cavium ThunderX.

2.4 Broadcom Vulcan 2016

This section is quite complicated. If we talk about Broadcom Vulcan, it pertains to around 2016. If we mention Cavium’s ThunderX2, that is a product from 2018. Then it quickly became Marvell’s ThunderX2. Originally, these were concurrently planned products, but various twists led to a merger. Sometimes, I can’t believe that our industry has such dramatic stories.

Broadcom’s CPU design team, originating from RMI, shares many common points with Cavium’s CPU design team; both are from the MIPS lineage and have a background in networking. However, unlike Cavium’s focus on small 2-issue cores, Broadcom’s team has always excelled in multi-threading. Thus, Vulcan was planned to support an incredible four threads per core. At this time, there were no multi-threaded processors in the ARM camp.

Broadcom’s original design goal was a 16nm process and a die size of 600 mm², featuring 32 cores, each supporting four threads, with dual-socket (2P) system support. After Cavium’s acquisition, the die size was not disclosed.

The highest configuration, CN9980, features 32 cores at 2.5GHz with a TDP of 200W. The CN9980 at 2.2GHz and 180W is priced at $1795, while the 16-core CN9960 at 1.6GHz and 75W is priced at $800.

The target market for this chip, or the visible design wins, concentrated in the HPC market.

2.5 Qualcomm 2017

In the same week of 2017, Qualcomm introduced the 48-core 10nm Centriq 2400, originally codenamed “Amberwing,” and received a $130 billion acquisition offer from Broadcom.

It is estimated that the Centriq 2400, which took four years to develop, cost between $100 million and $125 million and involved hundreds of engineers. During this time, Qualcomm also created a 24-core Centriq 1200 as a test prototype.

The Centriq 2400 features 18 billion transistors, a die size of 398 mm², and is manufactured using Samsung’s 10nm process, making it much smaller than ThunderX2. Although it is a single-socket (1P) processor, this does not pose a problem given the long-term development pattern of servers.

This chip, born with a silver spoon, progressed smoothly to tape-out, until a black swan named Hock Tan appeared.

From the price-power consumption chart, the pricing of Centriq 2400 is quite similar to that of ThunderX2.

The CPU core of Centriq 2400 is a custom core named “Falkor,” with a maximum frequency of 2.6GHz, representing Qualcomm’s fifth-generation custom core. If there was to be a next-generation core, it would be “Saphira,” with the chip named “Firetail.” However, there was no follow-up; Qualcomm canceled its server chip project, marking the end of the second wave of ARM server developments.

2.6 Samsung 2012-2014

With the main storyline complete, the side stories deserve a mention too.

Samsung’s ARM server story is less known domestically but has made it to the Wall Street Journal. Samsung has never officially announced it; when the project began, it was all speculation, and when it ended, it was all rumors.

In 2007, Samsung invested $3.5 billion to build a factory in Austin, and in 2010 established the Samsung Austin Research Center, beginning to recruit chip design engineers, including a former AMD VP as the VP of Austin. Speculation arose that server chip development was also part of this Austin research center’s plans.

In fact, Samsung’s rationale for entering server SoC design was probably close to Qualcomm’s. Yet even Qualcomm, which had CEO support at the time, met a dramatic end. It’s easy to imagine how difficult it would be for a Korean company’s U.S. branch to sustain a large server chip design.

2.7 Nvidia Project Denver 2011-2014

Nvidia is a company I greatly respect, and it is one of the few companies in Silicon Valley still led by its founders as CEO. However, I have struggled to write this chapter several times. Perhaps it is because Nvidia remains primarily a GPU company, and its CPU development logic is more application-oriented.

This is a path from Tegra to Carmel, integrating ARM CPUs into complex functional chips. It looks more like a system company’s chip planning path. Since this article focuses on general-purpose server chips, Nvidia’s product line deserves a separate chapter of its own.

2.8 Baikal

Russia’s first 28nm chip, the BE-M1000, should not be classified as a server chip, although it does target workstations. This chip company, like Fujitsu in Japan and Phytium (Feiteng) in China, emerged from supercomputing projects and operates independently, focusing more on commercial success.

I had seen their ambitious roadmap back in the day. However, the journey from roadmap to product realization can be fraught with variables, leading to many plans falling by the wayside.

Chip development has traditionally been a strength of the supercomputing community. The server SoCs mentioned above also originated from the push of the supercomputing market. Later, I will mention that our European colleagues are also starting to develop their own chips.

2.9 Socionext

Socionext’s “SynQuacer™” SC2A11 is probably the only 24-core Cortex-A53 chip.

Socionext SC2A11 Block Diagram

This chip should not only be viewed in isolation but also in terms of system design.

Socionext SC2A11 Server System

This type of small-core, high-density system has a familiar feel, reminiscent of designs from that era.

3. The Third Wave: Neoverse 2018

While the second wave was driven by semiconductor industry and system manufacturers, the third wave saw end users diving into the tide themselves.

Drew Henry (I recommend reading his profile on LinkedIn, which serves as a model for executive resumes) is a name that will also be remembered in ARM server history. In October 2018, a year after he joined ARM, he announced at Arm TechCon that ARM had established its own brand for the infrastructure market, Neoverse, and unveiled a roadmap promising a new generation every year, each improving performance by 30%.

This marks the beginning of the third wave, although it was a quiet undercurrent at the time.

3.1 Huawei Kunpeng 920

On January 7, 2019, Xu Wenwei, known as Big Xu, announced the Kunpeng 920. I am only including what I consider important public images; how to interpret them is up to each of you. This is a world-class product in every respect, including the attention it has garnered.

3.2 AWS Graviton2

Even when AWS released Graviton in November 2018 and referred to this self-developed chip with 16 Cortex-A72 cores as Neoverse, the world was not shocked. In hindsight, the 16-core A72 chip did seem more like a test, especially considering AWS’s earlier release in 2017, which was developed by Annapurna Labs, the Israeli startup AWS acquired in 2015.

However, the AWS Graviton2 released in 2019 was a stunning product. It has 64 cores based on Neoverse N1 and 30 billion transistors on a 7nm process, so the die size is estimated at around 300-350mm². AWS officially claims a performance improvement of over 40% compared with fifth-generation instances based on Intel Xeon, as well as up to 25 Gbps of network bandwidth and 18 Gbps of optimized EBS bandwidth.
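The article does not say how the 300-350mm² estimate was reached, but one way to sanity-check it is to look at the transistor density it implies and compare that with a chip whose die size is known, such as the Centriq 2400 figures quoted earlier. The following is a minimal sketch under that assumption; the function name is hypothetical.

```python
def density_mtr_per_mm2(transistors: float, die_area_mm2: float) -> float:
    """Transistor density in millions of transistors per square millimetre."""
    return transistors / die_area_mm2 / 1e6


if __name__ == "__main__":
    # Graviton2: 30 billion transistors, estimated die size 300-350 mm^2 (from the text).
    low = density_mtr_per_mm2(30e9, 350)
    high = density_mtr_per_mm2(30e9, 300)
    # Centriq 2400: 18 billion transistors, 398 mm^2 on a 10nm process (from the text).
    centriq = density_mtr_per_mm2(18e9, 398)
    print(f"Graviton2 (7nm, estimated): {low:.1f} to {high:.1f} MTr/mm^2")
    print(f"Centriq 2400 (10nm): {centriq:.1f} MTr/mm^2")
```

With the numbers in the text, the estimate implies roughly 86 to 100 million transistors per mm² for Graviton2, about twice the Centriq 2400’s roughly 45 million per mm². Whether that ratio is realistic for a 10nm to 7nm transition is a judgment the reader has to make; the calculation only shows what the estimate assumes.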

3.3 Ampere QuickSilver 2019

Ampere rode the wave of AWS’s Graviton, revealing its next-generation plan for a 7nm 80-core N1 chip codenamed QuickSilver. The most striking feature is that the new chip supports dual-socket configurations, thanks to the CCIX support in Arm’s mesh IP (CMN-600).

In addition to the striking 80-core N1 design, QuickSilver boasts a luxurious configuration with 128 PCIe 4.0 lanes. Ampere is also a core partner in Nvidia’s CUDA-on-ARM effort.

This is the chip I look forward to the most in 2020, as it can actually be purchased, while AWS’s Graviton is only available as a cloud service.

3.4 Marvell ThunderX3

Following the momentum of AWS, not only did Ampere reveal its next-generation plans, but our veteran company Marvell also informed us that the custom core of the ThunderX3 processor is named “Triton.” We also saw a strong product roadmap promising a new generation every two years, with performance doubling each time.

3.5 Fujitsu A64FX 2016

My favorite is saved for last. A colleague of mine, while discussing memory selection with a partner, said, “You can only choose two of high throughput, large capacity, and low cost,” which is a very philosophical statement. If there are solutions that can balance all three, no one would be troubled. The existence of troubles indicates that there are difficult choices. Personally, I prefer the kind of solution that is “perfect except for being expensive,” but rest assured, when recommending to partners, I will not reveal this personal bias.

Fujitsu’s A64FX is not a server chip; it is designed for supercomputing, precisely that kind of “perfect except for being expensive” product.

In 2016, ARM announced SVE (Scalable Vector Extension), an extension to the Armv8 instruction set, and, as is customary, a major customer appeared to support it: Fujitsu. Its Post-K supercomputing project, rumored to have received $1.24 billion in funding from the Japanese government, would adopt the ARM architecture instead of the SPARC architecture used previously. A64FX is the first Arm processor to support SVE.

In 2018, Fujitsu publicly introduced the A64FX at Hot Chips. First, let’s look at the hard parameters: 8.8 billion transistors (this is not many; AWS Graviton2 has 30 billion) on a 7nm process. It features 48 custom cores plus 4 homogeneous management cores, effectively forming four processor clusters of 13 cores each. The interconnect between cores uses Fujitsu’s second-generation TOFU 6D mesh/torus on-chip network (the first-generation TOFU is highly praised). The chip is equipped with 32GB of HBM2 (an ultra-luxurious configuration) and 16 PCIe 3.0 lanes (not many, likely not intended for external devices), has a memory bandwidth of 1024GB/s, and delivers 2.7 TFLOPS at 64-bit and 21.6 TFLOPS at 8-bit. Nvidia’s Tesla P4 and P40 achieve 22 and 47 TFLOPS respectively when processing 8-bit integers, making for a competitive comparison.
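Several of these figures are related by simple arithmetic. The sketch below reproduces them from the numbers quoted in the text, under one stated assumption: that throughput on a fixed-width vector unit scales inversely with element width, so 8-bit elements give eight times the 64-bit rate. The constant and function names are illustrative, not taken from Fujitsu documentation.

```python
# Figures quoted in the text; names are illustrative assumptions.
COMPUTE_CORES = 48
MANAGEMENT_CORES = 4
CLUSTERS = 4
FP64_TFLOPS = 2.7
MEMORY_BANDWIDTH_GB_S = 1024


def cores_per_cluster() -> int:
    """All cores (compute plus management) divided evenly across clusters."""
    return (COMPUTE_CORES + MANAGEMENT_CORES) // CLUSTERS


def scaled_throughput(base_tflops: float, base_bits: int, target_bits: int) -> float:
    """Assumes a fixed vector width: narrower elements mean proportionally more lanes."""
    return base_tflops * base_bits / target_bits


def bytes_per_flop(bandwidth_gb_s: float, tflops: float) -> float:
    """Memory bytes available per floating-point operation at peak rates."""
    return bandwidth_gb_s / (tflops * 1000)


if __name__ == "__main__":
    print(cores_per_cluster())
    print(round(scaled_throughput(FP64_TFLOPS, 64, 8), 1))
    print(round(bytes_per_flop(MEMORY_BANDWIDTH_GB_S, FP64_TFLOPS), 2))
```

Under these assumptions the sketch gives 13 cores per cluster, 21.6 for the 8-bit figure, and about 0.38 bytes of memory bandwidth per 64-bit FLOP. The last number is one way to see why the HBM2 configuration is described as luxurious.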

The A64FX’s cache hierarchy, throughput, execution pipeline, power management, and RAS features are all distinctive; those interested can read the Hot Chips documentation for more details.

The A64FX delivers this performance without needing to be paired with GPUs, which is why Cray is collaborating with Fujitsu to integrate the A64FX into CS500 clusters and future Shasta systems.

3.6 Other New Entrants

In November 2019, a startup called Nuvia emerged during the SC conference. The backgrounds of the founders and the lawsuits involving Apple quickly made headlines. Before any products were released, let us remember their slogan: “deliver industry-leading performance and energy efficiency for the data center.”

The European Processor Initiative (EPI) is also an effort aimed at designing server-grade CPUs. Without further ado, let’s look at the roadmap.

When trying to answer why ARM is pursuing server development, the phrase that comes to mind is “advanced productivity.” What is advanced productivity? Frank Frankovsky, Facebook’s VP of Hardware Design and Supply Chain Operations, is also a name to remember. He proposed the concept of the most useful work per watt per dollar. The actual usable computing power divided by the cost of buying and operating servers represents the productivity of that server, marking its advancement.

Extending this to the whole industry chain, the metric becomes total useful work per total investment: the useful computing power delivered divided by the total investment (money, time, and engineers’ wisdom and effort). It indicates whether a technology, solution, ISA, or product represents advanced productivity.
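To make the metric concrete, here is a minimal sketch of both forms described above: useful work per watt per dollar, and useful work divided by the cost of buying and operating a server. The `Server` class, its fields, and every number in the example are hypothetical and chosen only to illustrate the calculation; they do not describe any real product.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical server description; all fields are illustrative."""
    name: str
    useful_work: float       # e.g. requests served per second
    power_watts: float       # average power draw under that load
    purchase_cost: float     # purchase price in dollars
    energy_price: float      # dollars per kWh
    lifetime_hours: float    # hours in service

    def operating_cost(self) -> float:
        """Energy cost over the service lifetime (other costs ignored here)."""
        return self.power_watts / 1000 * self.lifetime_hours * self.energy_price

    def work_per_watt_per_dollar(self) -> float:
        """Useful work per watt per dollar of purchase price."""
        return self.useful_work / self.power_watts / self.purchase_cost

    def productivity(self) -> float:
        """Useful work divided by the cost of buying and operating the server."""
        return self.useful_work / (self.purchase_cost + self.operating_cost())


if __name__ == "__main__":
    # Made-up numbers: B does less work but draws less power and costs less.
    a = Server("A", useful_work=1000, power_watts=200, purchase_cost=4000,
               energy_price=0.10, lifetime_hours=30000)
    b = Server("B", useful_work=800, power_watts=100, purchase_cost=3000,
               energy_price=0.10, lifetime_hours=30000)
    for s in (a, b):
        print(s.name, s.work_per_watt_per_dollar(), s.productivity())
```

In this made-up comparison the lower-performance server B comes out ahead on both measures, which is the kind of result the metric is meant to surface. A fuller model would also count cooling, space, and engineering effort in the denominator.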

True advanced productivity belongs to the entire world.