The Development Path of Arm Servers from v8 to v9

01 ARM: The 3A Major Work

Separating CPU design from manufacturing through the foundry model gives AMD a high degree of flexibility. The second- and third-generation EPYC processors can choose different process nodes fairly freely to match the specific needs of the chip design, which has helped AMD “compete with giants” and steadily take market share from Intel.

However, the main beneficiary of this flexibility is AMD itself. Large-scale users such as AWS and Alibaba Cloud are no longer satisfied with traditional customization that mostly adjusts core counts, operating frequencies, and TDP; they want more autonomy in CPU design. Emerging CPU suppliers such as Ampere, for their part, need to choose a suitable technology route. For all of them, Arm is almost the only answer in the server CPU market.

If TSMC solves the manufacturing problem for CPUs, then Arm solves the design problem.

Cortex Incubates Neoverse

For “3A” customers with sufficient chip design capabilities, namely Amazon (Annapurna Labs), Alibaba (Pingtouge), and Ampere, Arm’s Neoverse platform provides the foundation for designing a server CPU, including the CPU core microarchitecture and the supporting process technology.

Arm’s direct push into the server CPU market can be traced back to October 2011, when Arm announced ARMv8-A with its optional 64-bit architecture (AArch64). A year later, Arm released the Cortex-A53 and Cortex-A57 microarchitectures, which implement the ARMv8-A 64-bit instruction set, and AMD announced plans to launch corresponding server products. AMD’s years of experience in the server market were precisely what the Arm camp lacked at the time.

In the following years, chip suppliers such as Cavium, Qualcomm, and China’s Huaxintong, as well as large-scale users such as Microsoft, actively promoted the entry of 64-bit Arm into the data center market.
However, truly large-scale deployment began in November 2018, when AWS previewed its first Arm server CPU, Graviton.

Graviton is based on the Cortex-A72 launched in 2015 (the successor to the A57), built on a 16nm process with 16 cores and 16 threads. Compared with contemporary x86 server CPUs it is somewhat “unremarkable”, and it relied heavily on Amazon optimizing fully for its “own child”.

The Cortex-A family is already the most performance-oriented of the Cortex trio, but it is not designed for server platforms and cannot relax its power limits to boost performance. Therefore, a month before Graviton was made public, Arm released the Neoverse platform aimed at cloud computing and edge infrastructure, starting with the 16nm generation codenamed Cosmos, based on the A72 and A75.

△ Neoverse Scalable Computing Platform

Just four months later, in February 2019, Arm updated the Neoverse roadmap and launched the 7nm Neoverse N1, which delivers over 30% more performance than the previous target.

Codenamed Ares, the Neoverse N1 is based on the Cortex-A76 launched in 2018. The two share the same pipeline structure: a short 11-stage pipeline with a 4-wide fetch/decode front end. Arm calls it an “accordion” pipeline because it can overlap the second prediction stage with the first fetch stage, and the scheduling stage with the first issue stage, which can shorten the pipeline to 9 stages depending on differences in instruction length. The L2 cache also gains an optional 1MiB capacity, twice that of the A76.

△ Neoverse N1 integer performance improvement relative to Cortex-A72 in a 4 vCPU configuration

Compared with the previous-generation A72 platform, the Neoverse N1 platform brings significant performance improvements. Many benchmarks double their scores, and in the headline machine learning tests the scores are close to five times those of the previous generation.
Although the A72 is somewhat older, the size of this gap shows that Neoverse N1 has indeed made a qualitative leap.

Graviton2 and Altra Series

The Neoverse N1 platform has had a significant impact on the data center market: its potential and value, and the opportunities behind it, are plain for everyone to see. If the earlier A72 had only begun to make its mark in the data center market, Neoverse N1 has convinced more people that Arm can claim a share of this field.

Two 7nm CPUs, one from a cloud service provider and one from an independent CPU supplier, are both based on Neoverse N1.

In November 2019, AWS announced the Graviton2 processor:

The core count surged to 64, four times that of the first generation;
The transistor count increased sixfold, reaching 30 billion;
64MiB L2 Cache, eight times that of the first generation;
DDR4-3200 memory interface (frequency) is twice that of the first generation;
Operating frequency of 2.5GHz, slightly higher than the first generation’s 2.3GHz.

△ Graviton2 accounts for a significant share of the new EC2 instances AWS added in 2020; the ratio of Intel to AMD is also noteworthy.

The number of EC2 (Elastic Compute Cloud) instance types based on Graviton2 grew rapidly, including but not limited to general-purpose (M6g, T4g), compute-optimized (C6g), and memory-optimized (R6g, X2gd). Deployment regions and instance counts have grown steadily since mid-2020: statistics show that 49% of the new AWS EC2 instances in 2020 were based on AWS Graviton2.

Armv9: A New Beginning

Armv8 was announced in October 2011, bringing Arm into the 64-bit era. Through the joint efforts of Arm and its ecosystem partners, and after several product iterations, the Arm camp has established a foothold in the server market over the past decade.

At the end of March 2021, Armv9 was released. It builds on Armv8 and focuses on upgraded capabilities in security, machine learning (ML), and digital signal processing (DSP).

Of the three major features the new architecture brings, machine learning is probably the one the public knows best and cares most about. With the rise of heterogeneous applications, AI technologies represented by machine learning have penetrated many aspects of our lives, and machine learning has great potential both in back-end data centers and at the device and edge.

To better deliver the computing power that AI and DSP require, Armv9 upgrades the previously supported Scalable Vector Extension (SVE) to version 2 (SVE2). This technology improves the performance of machine learning and digital signal processing applications, and helps with workloads such as 5G systems, VR/AR, and machine learning.

SVE2 supports vector sizes from 128b to 2048b in 128b increments, and code does not depend on the vector size that a particular hardware platform implements.
This means that software developers only need to compile their code once, and it can run on Armv9 and subsequent products, achieving “write once, run anywhere”. Likewise, the same code can run on more conservative designs with narrower hardware execution widths, which is crucial for Arm designs spanning IoT, mobile, and data center CPUs.

The SVE2 extensions also add the ability to compress and decompress code and data within CPU cores. Moving data on and off the chip consumes a lot of power, so making the most of on-chip data reduces this data movement and lowers energy consumption.

More importantly, the Confidential Compute Architecture (CCA) is the most significant part of this architecture update. Security problems have grown increasingly severe in recent years, with ransomware and hacker attacks occurring constantly. Dealing with ever-increasing network attacks takes effort not only from network service providers and software companies but also from hardware infrastructure providers like Arm, who can block potential vulnerabilities at the source. CCA is the result. It is an architecture-level security capability: it creates a hardware-based secure execution environment in which to perform computations, protecting selected code and data from being accessed or modified, even by privileged software.

△ Arm Confidential Compute Architecture (left); Memory Tagging Extension technology introduced by Android 11 and OpenSUSE (right)

To this end, CCA introduces the concept of dynamically created confidential realms. A realm is a secure, containerized execution environment that supports secure data operations and isolates data from the hypervisor and the operating system. Realms are managed by a “realm manager”, while the hypervisor itself is only responsible for scheduling and resource allocation.
The advantage of realms is that they greatly shorten the chain of trust needed to run a given application on a device, make the operating system largely transparent to security issues, and allow mission-critical applications that require supervisory control to run on any device.

In practice, memory is a very easy target for attacks, and memory safety has long been a focus of the industry. Detecting problems before memory safety vulnerabilities are exploited is an important step in improving software security globally. To this end, the Memory Tagging Extension (MTE), developed through ongoing cooperation between Arm and Google, is also part of Armv9. It identifies spatial and temporal memory safety issues in software by associating a tag with each pointer to memory and checking that the tag is correct when the pointer is used. If an access goes out of range, the tag check fails, so the memory safety violation can be detected and blocked immediately.

What are the differences between Arm architecture upgrades, v9 and v8 versions?

Over the past few years, Arm has made improvements to the ISA and released various updates and extensions to the architecture. Some of these are very important, while others merit only a passing glance.

Recently, as part of Arm’s Vision Day event, the company officially released the first details of its next-generation Armv9 architecture, laying the foundation for what Arm expects to be the computing platform for the next 300 billion chips over the coming decade.

A major question readers may ask is what exactly distinguishes Armv9 from Armv8 to justify such a large step in architecture numbering. Indeed, from a purely ISA perspective, v9 is probably not as fundamental a leap over v8 as v8 was over v7. Armv8 introduced AArch64, a completely different execution mode and instruction set with larger microarchitectural ramifications than AArch32, such as extended registers, a 64-bit virtual address space, and many other improvements.

Armv9 continues to use AArch64 as the baseline instruction set, but adds some very important extensions that warrant the increment in architecture numbering. This also allows Arm to achieve a kind of software re-baselining that covers not only the new v9 features but also the extensions v8 has gained over the years.

Arm sees the new Armv9 architecture as having three main pillars: security, AI, and improved vector and DSP capabilities. Security is a very big theme for v9, and we will go into the details of the new extensions and features below, but first, the DSP and AI capabilities should be straightforward.

The biggest new feature promised by Armv9-compatible CPUs, and the one most immediately visible to developers and users, is SVE2 as the successor to NEON.

The Scalable Vector Extension (SVE) debuted in 2016 and was first implemented in Fujitsu’s A64FX CPU cores, which power Fugaku, Japan’s top supercomputer. The problem with SVE was that the first iteration of the new variable-vector-length SIMD instruction set was rather limited in scope and aimed more at HPC workloads, lacking many of the more general-purpose instructions that NEON still covered.

SVE2 was announced in April 2019 and addresses this by complementing the new scalable SIMD instruction set with the instructions needed to serve DSP-like workloads that still use NEON.

In addition to the various modern SIMD features they add, the advantage of SVE and SVE2 lies in their variable vector size, which ranges from 128b to 2048b in 128b increments, regardless of the hardware the code runs on. From a vector processing and programming perspective, this means that software developers only need to compile their code once, and if a future CPU has native 512b SIMD execution pipelines, that code will be able to use the full width of the units. Likewise, the same code can run on more conservative designs with narrower hardware execution widths, which is crucial for Arm designs spanning IoT, mobile, and data center CPUs. All of this is accomplished within the 32b encoding space of the Arm architecture, whereas architectures like x86 need to add new instructions and extensions for each vector size.
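The “compile once, run at any vector width” idea can be illustrated with a small simulation. The sketch below is plain Python, not real SVE code: `vla_add` and its `vector_bits` parameter are hypothetical names introduced here, and the per-lane predicate only imitates the way a vector-length-agnostic loop masks off lanes that fall past the end of the data. Real code would query the hardware vector length at run time instead of taking it as an argument.

```python
from typing import List


def vla_add(a: List[float], b: List[float], vector_bits: int) -> List[float]:
    """Add two arrays with a vector-length-agnostic, predicated loop.

    `vector_bits` stands in for the hardware vector length (an assumption
    of this sketch; real SVE code reads it from the hardware).
    """
    if vector_bits % 128 != 0 or not 128 <= vector_bits <= 2048:
        raise ValueError("vector length must be a multiple of 128 bits, up to 2048")
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")

    lanes = vector_bits // 32  # number of 32-bit elements per vector
    out = [0.0] * len(a)
    i = 0
    while i < len(a):
        # Predicate: True for lanes that still map to valid elements.
        active = [i + lane < len(a) for lane in range(lanes)]
        for lane, on in enumerate(active):
            if on:  # inactive lanes are skipped, so no scalar tail loop is needed
                out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes  # advance by one hardware vector
    return out
```

The loop body never mentions a fixed width, so calling `vla_add` with 128, 256, or 512 bits returns the same result; only the number of iterations changes. That is the property the paragraph above attributes to SVE and SVE2.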

Machine learning is also seen as an important part of Armv9, as Arm expects more and more ML workloads to become commonplace in the coming years, including in scenarios with critical performance or power-efficiency requirements. Running ML workloads on dedicated accelerators will therefore remain a long-term need, while smaller-scale ML workloads will continue to run on CPUs.

Matrix multiplication instructions are key here, and having them as a baseline feature of v9 CPUs is an important step towards broader adoption in the ecosystem.

In general, I believe SVE2 is probably the most important factor justifying the step up to v9, as it is a more definitive ISA feature that distinguishes v9 from v8 CPUs in everyday use, and it warrants the software ecosystem actually diverging from the existing v8 stack. This has become quite a significant problem for Arm in the server space, because the software ecosystem still bases its software packages on v8.0, which unfortunately lacks the all-important v8.1 Large System Extensions.

Having the whole software ecosystem move forward and assume that new v9 hardware has the new architectural extensions will help push things along and may resolve some of the current situation.

However, v9 is not only about SVE2 and new instructions; it also places a strong emphasis on security, where we will see some more fundamental changes.

Introducing the Confidential Compute Architecture

In recent years, security and hardware security vulnerabilities have become a top priority in the chip industry. Vulnerabilities such as Spectre and Meltdown, along with all their sibling side-channel attacks, show that rethinking how security is ensured has become a fundamental requirement. Arm’s approach to this overarching problem is to redesign how secure applications work by introducing the Arm Confidential Compute Architecture (CCA).

Before continuing, I want to point out that today’s disclosure is merely a high-level explanation of how the new CCA operates. Arm states that more details about the exact workings of the new security mechanism will be announced later this summer.

The goal of CCA is to move away from the current software stack situation, in which applications running on a device must inherently trust the operating system and hypervisor they run on. The traditional security model is built on the premise that more privileged software layers are allowed to see into the execution of lower layers, which becomes a problem when the operating system or hypervisor is compromised in any way.

CCA introduces the new concept of dynamically created “realms”, which can be seen as secure containerized execution environments that are completely opaque to the OS or hypervisor. The hypervisor will still exist but will only be responsible for scheduling and resource allocation. Realms are instead managed by a new entity called the “realm manager”, a new piece of code roughly one-tenth the size of a hypervisor.

Applications within a realm will be able to “attest” the realm manager to determine whether it is trustworthy, which is not possible with a traditional hypervisor.

Arm has not gone into detail on what exactly creates the isolation between realms and the non-secure world of the operating system and hypervisor, but it sounds like hardware-backed address spaces that cannot interact with each other.

The advantage of realms is that they greatly shorten the chain of trust needed to run a given application on a device, with the OS becoming largely transparent to security issues. Mission-critical applications that require supervisory control will be able to run on any device, in contrast to today, where enterprises and businesses need dedicated devices with authorized software stacks.
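The division of labor described above can be pictured with a toy model. The sketch below is not Arm’s mechanism, which had not been disclosed in detail at the time of the announcement. Every name in it (`RealmManager`, `Hypervisor`, `measure`, `attest`) is hypothetical, and a SHA-256 hash of the manager’s code merely stands in for whatever measurement a real attestation flow would use. A real design would rely on hardware-backed isolation and signed reports, not on a self-reported hash and caller names.

```python
import hashlib
import hmac
from typing import Dict, List


def measure(code: bytes) -> str:
    """Toy 'measurement': a SHA-256 digest of the manager's code."""
    return hashlib.sha256(code).hexdigest()


class RealmManager:
    """Hypothetical realm manager that owns realm memory."""

    def __init__(self, code: bytes) -> None:
        self._code = code
        self._realms: Dict[str, Dict[str, str]] = {}

    def attestation_report(self) -> str:
        return measure(self._code)

    def create_realm(self, realm_id: str) -> None:
        self._realms[realm_id] = {}

    def _check(self, caller: str, realm_id: str) -> None:
        # Only the realm itself may touch its memory, whatever the
        # caller's privilege level elsewhere in the system.
        if caller != realm_id:
            raise PermissionError(f"{caller} cannot access realm {realm_id}")

    def write(self, caller: str, realm_id: str, key: str, value: str) -> None:
        self._check(caller, realm_id)
        self._realms[realm_id][key] = value

    def read(self, caller: str, realm_id: str, key: str) -> str:
        self._check(caller, realm_id)
        return self._realms[realm_id][key]


class Hypervisor:
    """Hypothetical hypervisor: it schedules realms but cannot see inside them."""

    def __init__(self) -> None:
        self.run_queue: List[str] = []

    def schedule(self, realm_id: str) -> None:
        self.run_queue.append(realm_id)


def attest(manager: RealmManager, expected_measurement: str) -> bool:
    """An application decides whether to trust the realm manager."""
    return hmac.compare_digest(manager.attestation_report(), expected_measurement)
```

In this model an application first calls `attest` against a measurement it already trusts, then stores data through the manager. A `read` issued with the caller name `"hypervisor"` raises `PermissionError`, while `Hypervisor.schedule` still works, which mirrors the “scheduling and resource allocation only” role described in the text.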

MTE (Memory Tagging Extensions) is not a new feature of v9; it was introduced with v8.5. It aims to help address two of the most persistent security issues in the software world. Buffer overflows and use-after-free bugs have been part of software design for the past 50 years, and individual instances can take years to identify or resolve. MTE helps identify such issues by tagging pointers at allocation and checking them at use.
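A minimal sketch of “tag at allocation, check at use” follows, as a simulation in Python. The 4-bit tag and 16-byte granule match MTE’s published parameters but do not come from this article, and the class and method names (`TaggedHeap`, `malloc`, `free`, `load`) are hypothetical. One deliberate simplification: tags here cycle through 1..15 so that neighbouring allocations always differ, whereas real allocators typically pick tags at random.

```python
from typing import NamedTuple

GRANULE = 16    # bytes of memory covered by one tag
TAG_COUNT = 16  # 4-bit tags; tag 0 is reserved here for never-allocated memory


class TagCheckFault(Exception):
    """Raised when a pointer's tag does not match the memory's tag."""


class Pointer(NamedTuple):
    tag: int
    addr: int


class TaggedHeap:
    def __init__(self, size: int) -> None:
        self.mem = bytearray(size)
        self.tags = [0] * (size // GRANULE)  # one tag per granule
        self.next_free = 0
        self.last_tag = 0

    def _fresh_tag(self) -> int:
        # Cycle through 1..15 so consecutive taggings always differ.
        self.last_tag = self.last_tag % (TAG_COUNT - 1) + 1
        return self.last_tag

    def _set_tags(self, addr: int, size: int, tag: int) -> None:
        for granule in range(addr // GRANULE, (addr + size) // GRANULE):
            self.tags[granule] = tag

    @staticmethod
    def _round_up(size: int) -> int:
        return -(-size // GRANULE) * GRANULE

    def malloc(self, size: int) -> Pointer:
        if size <= 0:
            raise ValueError("size must be positive")
        rounded = self._round_up(size)
        if self.next_free + rounded > len(self.mem):
            raise MemoryError("toy heap exhausted")
        addr = self.next_free
        self.next_free += rounded
        tag = self._fresh_tag()
        self._set_tags(addr, rounded, tag)  # tag the memory at allocation
        return Pointer(tag, addr)

    def free(self, ptr: Pointer, size: int) -> None:
        # Retag on free so stale pointers no longer match.
        self._set_tags(ptr.addr, self._round_up(size), self._fresh_tag())

    def load(self, ptr: Pointer, offset: int = 0) -> int:
        addr = ptr.addr + offset
        if not 0 <= addr < len(self.mem) or self.tags[addr // GRANULE] != ptr.tag:
            raise TagCheckFault(f"tag mismatch at address {addr}")
        return self.mem[addr]
```

With two adjacent 16-byte allocations `a` and `b`, `heap.load(a, 16)` runs past `a` into `b`’s granule and raises `TagCheckFault` (a buffer overflow), and any `heap.load(a)` after `heap.free(a, 16)` raises it as well (a use-after-free). Overflows that stay inside the same 16-byte granule go undetected in this model, since the tag is per granule.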

Future Arm CPU Roadmap

This is not directly related to v9 itself, but it is closely tied to the technology roadmap of the upcoming v9 designs. Arm also discussed some of its performance expectations for v9 designs over the next two years.

Arm described how the mobile market has seen a 2.4x performance improvement with this year’s X1-based devices (this refers only to IPC for ISO-process designs), which is twice that of the Cortex-A73 launched a few years ago.

Interestingly, Arm also discussed the Neoverse V1 design and how it achieves 2.4 times the performance of A72-like designs, revealing that it expects the first V1 devices to be released later this year.

For the next-generation mobile IP cores codenamed “Matterhorn” and “Makalu”, the company publicly disclosed that the expected combined IPC gain across these two generations is 30%, excluding any frequency or other performance gains that SoC designers can achieve. This works out to roughly a 14% increase per generation for these two new designs, and as the performance curve in the slides shows, it indicates that the pace of improvement is slowing compared with what Arm has managed over the past few years since the A76. However, the company noted that the pace of progress is still far ahead of the industry average, while also admitting that this average is being dragged down by some industry players.
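The step from “30% over two generations” to “about 14% per generation” is compound growth: if each generation improves IPC by a factor of (1 + g), two generations give (1 + g)² = 1.30. A minimal sketch of that arithmetic follows; the function name `per_generation_gain` is introduced here purely for illustration.

```python
def per_generation_gain(total_gain: float, generations: int) -> float:
    """Return the compound per-generation gain implied by a total gain.

    Gains are fractions, so 0.30 means +30%.
    """
    if generations < 1:
        raise ValueError("generations must be at least 1")
    if total_gain <= -1.0:
        raise ValueError("total_gain must be greater than -100%")
    return (1.0 + total_gain) ** (1.0 / generations) - 1.0


# 30% across two generations is about 14% per generation, not 15%,
# because the second generation's gain compounds on the first.
gain = per_generation_gain(0.30, 2)
print(f"{gain:.1%}")  # prints 14.0%
```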

Arm continues to view CPUs as the most universal computing block for the future. While dedicated accelerators and GPUs will have their place, they struggle with some important issues such as programmability, protection, universality (essentially the ability to run them on any device), and the proven ability to work correctly.

Currently, the computing ecosystem is extremely fragmented in how it operates, with not only different device types but also different vendors and operating systems.

SVE2 and matrix multiplication can greatly simplify the software ecosystem and let computing workloads move forward in a more unified way, so that in the future they can run on any device.

Finally, Arm also shared new information about the future of Mali GPUs, revealing that the company is developing new technologies such as variable rate shading (VRS) and, in particular, ray tracing (RT). This is quite surprising, and it suggests that the introduction of RT by AMD and Nvidia in the desktop and console ecosystems is expected to push the mobile GPU ecosystem towards RT as well.

Armv9 Design Set to Debut in Early 2022

Today’s announcement was made at a very high level, and we hope Arm will discuss more details and new features of Armv9, such as CCA, at the company’s usual annual technology disclosures in the coming months.

Overall, Armv9 appears to combine a more fundamental ISA shift (which is how SVE2 can be regarded) with a general re-baselining of the software ecosystem that consolidates the last decade of v8 extensions and lays the groundwork for the next decade of the Arm architecture.

Arm already discussed Neoverse V1 and N2 in the second half of last year, and I do hope that N2 at least will eventually turn out to be a v9-based design. Arm further revealed that more Armv9-based CPU designs (probably the mobile successors to the Cortex-A78 and X1) will be unveiled this year, that the new CPUs have probably already been taken up by the usual SoC suppliers, and that they are expected to appear in commercial devices in early 2022.

Source: E-Enterprise Research Institute, Semiconductor Industry Observation