ARM has refreshed its processor architecture almost every year. From the early Cortex-A15 to Cortex-A57, then Cortex-A72, Cortex-A73, and Cortex-A75, ARM has continuously pushed mobile computing performance forward with each new architecture. In May of this year, ARM released the brand new Cortex-A76 architecture, which targets the new 7nm process and is expected to reach new performance heights.
At ARM Tech Day in May, ARM announced a brand new architecture: Cortex-A76. Like all recent architectures named “7x”, the Cortex-A76 is a high-performance big core product. In reality, this new high-performance architecture is not that simple; it might bring ARM into a whole new market and challenge competitors it has never faced directly before.
High-Performance Mobile Architecture from the Austin Family
In this publication’s article “The New King of the Mobile World – In-Depth Analysis of Cortex-A73,” we detailed the products and research history of ARM’s major R&D centers. ARM has three architecture R&D teams, located in Austin, Texas; Sophia, France; and Cambridge, UK. In recent years the Austin team and its products have been less visible; the well-known Cortex-A73, Cortex-A75, and others come from the Sophia team. The quiet period was preparation for something bigger: the Austin team began researching future micro-architectures as early as 2016, especially in FP/SIMD, and the Cortex-A75 already borrowed heavily from the Austin team’s new architecture.
The latest achievement of the Austin team is the Cortex-A76. For ARM, the Cortex-A76 is a newly developed architecture and a new starting point. Typically, a company focused on selling IP would avoid developing a new architecture from scratch, as it means high investment. However, ARM did so and billed it as the fifth-generation “annual node” product. Looking at ARM’s development over the past five years, launching a new micro-architecture annually is similar to Intel’s “Tick-Tock,” but in ARM’s case, it is actually “Tock-Tock-Tock.” ARM claims a CAGR (compound annual growth rate) of as much as 25% per generation, all derived from improvements in micro-architecture.
▲Cortex-A76 is a “laptop” level processor product.
So why does Cortex-A76 deserve so much investment from ARM, prompting the important architecture R&D team to start from scratch and go all out? This is because the advantage of Cortex-A76 lies in its design, which balances high performance and high efficiency. If it were improved from a traditional architecture, researchers would inevitably encounter many constraints, but designing a completely new architecture allows them to eliminate bottlenecks throughout the system and break previous micro-architecture limitations.
The focus of Cortex-A76 is high performance combined with very high energy efficiency, which lets it adapt to a wide range of scenarios, including power-sensitive mobile devices. In ARM’s words, the Cortex-A76 is a “laptop”-class high-performance processor architecture that is also efficient. This idea ran throughout ARM Tech Day: ARM hopes to use the significant performance gains of Cortex-A76 to compete more strongly in emerging markets. Qualcomm’s promotion of Snapdragon-based “Always Connected PCs” follows the same reasoning.
In some broad metrics, ARM’s specific expectations for Cortex-A76 are as follows: a 35% performance increase, a 40% increase in power efficiency (energy efficiency ratio), and support for machine learning, with performance increasing up to four times that of the original architecture. ARM also provided performance comparison benchmarks: the Cortex-A75 architecture using a 10nm process running at 2.8GHz compared to the Cortex-A76 architecture using a 7nm process running at 3GHz.
▲ARM’s performance expectations for Cortex-A76.
In addition, Cortex-A76 is compatible with the latest DynamIQ technology, allowing it to be paired with Cortex-A55 to form a processor cluster that balances high performance and high power efficiency. Manufacturers can launch Cortex-A76 paired with Cortex-A55 in a “1+6” or “2+6” processor cluster, similar to the current pairing of Cortex-A75 with Cortex-A55. It is worth mentioning that some trade-offs were made in the design of Cortex-A76. For instance, ARM still points out that the Cortex-A75 has the best PPA (power, performance, and area), so the Cortex-A75 is not outdated; which core to use in a specific product will depend on the manufacturer’s needs.
▲Cortex-A76 supports DynamIQ, enabling different core configurations.
Next, this article will delve into the core of Cortex-A76, taking you through this brand new architecture.
▲Key points in the design of Cortex-A76 architecture
Cortex-A76 Frontend Architecture
Overall, Cortex-A76 is a superscalar out-of-order design with a 4-wide decode frontend (4-issue), 8 execution ports, a total of 13 pipeline stages, and an execution latency of 11 stages. On the frontend, ARM designed a new prediction/fetch unit it calls “predictive fetching,” meaning the branch prediction unit steers the work of the instruction fetch unit. This differs from all previous ARM micro-architectures and enables higher performance and lower power consumption.
▲Overview of Cortex-A76 architecture
In terms of the branch prediction unit, ARM has adopted a hybrid indirect predictor for the first time. The predictor is separated from the fetch unit, and its large structures operate independently of the rest of the machine. This independence allows clock gating to be used to control power consumption, which improves the energy efficiency of the branch prediction unit. For branch targets, ARM designed a 3-level branch target buffer: a 16-entry nanoBTB, a 64-entry microBTB, and a 6000-entry main BTB.
In contrast, while ARM claimed that the branch predictors in Cortex-A73 and Cortex-A75 could predict all branches, the branch predictor in Cortex-A76 is evidently more powerful, providing stronger branch prediction results than its predecessors to enhance efficiency.
▲The frontend of Cortex-A76, enhanced prediction functionality.
The branch prediction unit of Cortex-A76 runs at twice the bandwidth of the fetch unit: it can process up to 8 32-bit instructions (32 bytes) per cycle, and its output feeds a fetch queue of 12 “blocks” ahead of the instruction fetch unit. The fetch unit, in contrast, operates at 16 bytes per cycle, or 4 32-bit instructions. Because the branch prediction unit runs at double the rate of the instruction fetch unit, it can hide branch “bubbles” in the pipeline after a misprediction, preventing them from stalling the instruction fetch unit and the rest of the core. Some reports indicate that the core can handle up to 8 outstanding misses on the instruction fetch side, which improves its tolerance of fetch stalls.
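The widths quoted above can be checked with a little arithmetic. The sketch below is only an illustration: the function name `bytes_per_cycle` is made up here, and it assumes fixed-length 32-bit instructions, as the text does.

```python
# Illustrative arithmetic only; the names are hypothetical, not ARM's.
INSTRUCTION_BITS = 32  # fixed-length 32-bit instructions, as in the text


def bytes_per_cycle(instructions_per_cycle: int) -> int:
    """Convert an instruction rate into bytes moved per cycle."""
    return instructions_per_cycle * INSTRUCTION_BITS // 8


predict_width = bytes_per_cycle(8)  # branch prediction unit: 8 instructions/cycle
fetch_width = bytes_per_cycle(4)    # fetch unit: 4 instructions/cycle
ratio = predict_width / fetch_width

print(f"prediction: {predict_width} bytes/cycle")
print(f"fetch:      {fetch_width} bytes/cycle")
print(f"prediction runs {ratio:.0f}x as wide as fetch")
```

Four 32-bit instructions are 128 bits, which is 16 bytes rather than 16 bits, and the prediction side is exactly twice as wide.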
The so-called “bubbles” refer to potential hazards that may cause the pipeline to stall or delay. In previous micro-architectures, even if predictions were correct and the instruction side could send a large number of instructions to the decode side, once the instructions entered the decode side and were broken down into micro-operations, there was a high probability of encountering “bubbles.”
In terms of pipeline, Cortex-A76 has 13 pipeline stages and an 11-stage core latency. Stages on latency-critical paths can overlap, for example the second cycle of the branch prediction process and the first cycle of the instruction fetch process. In the ideal case, core latency can be reduced by 3 cycles.
In the decode and rename stages, Cortex-A76 has a throughput of 4 instructions per cycle, which is a 4-issue scheme. In contrast, the instruction throughput capabilities of Cortex-A73 and Cortex-A75 in this stage are 2 and 3, respectively, so Cortex-A76 brings about a 33% increase in instruction width compared to Cortex-A75. The instruction throughput capability of Cortex-A72 is 3, but it becomes 2 on Cortex-A73, mainly considering the need for further optimization of efficiency and power, maximizing the utilization of the frontend. With Cortex-A76 entering 4-issue, ARM has introduced its widest micro-architecture, though compared to Samsung or Apple, Cortex-A76 still appears relatively “thin.”
▲Design of the decode and rename stages of Cortex-A76.
The fetch unit of Cortex-A76 feeds a decode queue that holds up to 16 32-bit instructions. The pipeline then spends 2 cycles on instruction alignment and decode. At this step, ARM decided to use 2-cycle units instead of the 1-cycle units from previous architectures. Additionally, when processing ASIMD/FP pipeline instructions, the previous Sophia cores still required an auxiliary cycle during the decode stage, but ARM seems to have found other optimization methods, allowing the micro-architecture of Cortex-A76 to meet design requirements.
The decode stage takes in 4 instructions per cycle and outputs macro-operations (Mops) at an average rate of 1.06 Mops per instruction. ARM also optimized power consumption in the register renaming stage, as it did for the branch prediction unit, by adding clock gating to functional blocks. The renaming unit in Cortex-A76 is split into separate blocks for integer, ASIMD, and flag operations, each controlled by its own clock gating.
Moreover, renaming and scheduling in Cortex-A76 take only 1 cycle, compared to the previous 2 cycles. Macro-operations are then split into micro-operations (µops) at an average ratio of 1.2 µops per instruction, and up to 8 µops can be scheduled per cycle, a significant increase over the 6 µops of Cortex-A75 and the 4 µops of Cortex-A73.
In terms of out-of-order execution, Cortex-A76 has a reorder window size of 128, with the buffer divided into two parts, one responsible for instruction management and one for register recovery, which ARM calls a hybrid commit system. It is important to emphasize that ARM has not focused on enlarging these structures, as it found the performance return on investment here to be very poor. Some data indicate that a 7% increase in the reorder buffer yields just a 1% performance increase, so this design only needs to be sufficient. Additionally, ARM indicated that it attempts to optimize the minimum latency in frontend management program activities and system calls, but there is no further news on that.
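To see how these ratios fit together, here is a rough back-of-the-envelope sketch. It assumes the decoder sustains its peak of 4 instructions every cycle, which real code will not do, and all variable names are invented for illustration.

```python
# Rough arithmetic from the figures quoted in the text; names are hypothetical.
DECODE_WIDTH = 4             # instructions decoded per cycle (peak)
MOPS_PER_INSTRUCTION = 1.06  # average macro-ops per instruction
UOPS_PER_INSTRUCTION = 1.2   # average micro-ops per instruction
SCHEDULE_WIDTH = 8           # micro-ops that can be scheduled per cycle

avg_mops = DECODE_WIDTH * MOPS_PER_INSTRUCTION   # macro-ops leaving decode
avg_uops = DECODE_WIDTH * UOPS_PER_INSTRUCTION   # micro-ops reaching the backend
headroom = SCHEDULE_WIDTH - avg_uops             # spare scheduling slots

print(f"average macro-ops per cycle: {avg_mops:.2f}")
print(f"average micro-ops per cycle: {avg_uops:.2f}")
print(f"spare scheduling slots:      {headroom:.2f}")
```

Under this assumption the backend receives fewer than 5 micro-ops per cycle on average, so the 8-wide scheduling stage mainly provides headroom for bursts.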
Cortex-A76 Backend Architecture
Now let’s take a look at the execution part of the backend. The integer core of Cortex-A76 includes 6 execution units; the figure shows 4 of them (1 branch unit, 2 ALUs, and 1 ALU/MAC/DIV unit) plus the 2 load/store units. Of the 3 integer execution pipelines, the 2 ALUs perform simple arithmetic, while the 1 complex pipeline handles multiplication, division, and CRC operations. The 3 integer pipelines are served by a 16-deep issue queue, while the 2 load/store units are served by a 12-deep issue queue.
▲Backend design of Cortex-A76
In terms of floating point, ARM designed 2 execution units, one of which executes FMUL/FADD/FDIV/ALU/IMAC, while the other is simpler and only executes FMUL/FADD/ALU. The ASIMD floating point core is served by 2 queues of depth 16.
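As a quick consistency check, the execution units listed for the integer and floating-point sides can be tallied against the 8 execution ports mentioned in the frontend overview. The dictionary layout below is just one way to write the list down, not ARM's own naming.

```python
# Tally of the execution units described in the text; the labels are informal.
integer_units = {
    "branch": 1,
    "simple ALU": 2,
    "complex ALU/MAC/DIV": 1,
    "load/store": 2,
}
vector_units = {
    "FMUL/FADD/FDIV/ALU/IMAC": 1,
    "FMUL/FADD/ALU": 1,
}

integer_total = sum(integer_units.values())
vector_total = sum(vector_units.values())
total_ports = integer_total + vector_total

print(f"integer side: {integer_total} units")
print(f"vector side:  {vector_total} units")
print(f"total:        {total_ports} execution ports")
```

The 6 integer-side units and 2 vector units add up to the 8 ports quoted earlier.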
When people discuss backend architecture, they often mention instruction throughput and latency. Cortex-A76 has made significant progress in instruction latency, thanks to a design that shaves cycles off very important instructions. Cortex-A76 has reduced the latency of multiplication and multiply-accumulate from the previous 3 cycles to 2 cycles, while keeping the same per-pipeline throughput as Cortex-A75. Because Cortex-A76 has 3 integer pipelines, its aggregate simple-arithmetic throughput is 50% higher than that of Cortex-A75, along with the lower latency.
The “VX” vector execution pipelines, responsible for FP and ASIMD operations, see more significant improvements. ARM refers to them as a “state-of-the-art” design, although this label has been hyped for many years. From a design perspective, the latency of floating-point arithmetic operations has been reduced from 3 cycles to 2 cycles, and the latency of multiply-accumulate from 5 cycles to 4 cycles. ARM describes the execution bandwidth as “dual 128-bit ASIMD”: on Cortex-A75 and earlier processors, only one vector pipeline was 128 bits wide while the other was 64 bits; on Cortex-A76, both vector pipelines are 128 bits wide, so the throughput of full-width 128-bit operations has doubled compared to previous products.
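The doubling claim follows from the pipeline widths alone. The sketch below uses a deliberately simplified model in which each pipeline can start one operation per cycle if it is wide enough; the function name and the model itself are assumptions for illustration only.

```python
# Simplified model: one operation per pipeline per cycle, if the pipe is wide enough.
A75_VECTOR_PIPES = [128, 64]   # widths in bits, as described in the text
A76_VECTOR_PIPES = [128, 128]


def full_width_ops_per_cycle(pipe_widths, op_bits=128):
    """Count pipelines able to execute an operation of op_bits in one pass."""
    return sum(1 for width in pipe_widths if width >= op_bits)


a75_ops = full_width_ops_per_cycle(A75_VECTOR_PIPES)
a76_ops = full_width_ops_per_cycle(A76_VECTOR_PIPES)

print(f"Cortex-A75: {a75_ops} x 128-bit op per cycle, {sum(A75_VECTOR_PIPES)} bits total")
print(f"Cortex-A76: {a76_ops} x 128-bit ops per cycle, {sum(A76_VECTOR_PIPES)} bits total")
```

In this model, 128-bit throughput goes from one operation per cycle to two, while total datapath width grows from 192 to 256 bits.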
Cortex-A76’s data cache is fixed at 64KB and is designed with 4-way associativity, maintaining a load latency of 4 cycles, with the data tags and lookups running on a separate pipeline. ARM’s design goal is to maximize MLP/memory-level parallelism to support more cores. Additionally, Cortex-A76 has designed 4 different prefetch engines, which can run in parallel to examine various data patterns and load data into the cache.
Regarding the cache hierarchy of Cortex-A76, ARM’s design is very well thought out, achieving a balance between bandwidth and data latency. The 64KB L1 instruction cache and 64KB L1 data cache have read bandwidths of up to 32 bytes/cycle. The L2 cache can be configured as 256KB or 512KB and is paired with the second-generation DSU design, with a 2x 32 bytes/cycle read and write interface. The L3 cache adopts an exclusive design. Overall, the cache-related improvements in the core micro-architecture can reportedly increase memory bandwidth by up to 90%.
▲Design of the L1 data cache in Cortex-A76
▲Cortex-A76’s cache incorporates second-generation DSU
The advantage of Cortex-A76’s storage micro-architecture design lies in optimizing the operation of each cycle to maximize the memory performance of the entire core. During the design phase, engineers studied features that could yield a 0.25% difference in performance or power consumption; if achieved, it was considered a valuable design for the core. Don’t underestimate these percentages; after many small data optimizations, they can lead to significant performance improvements.
▲Cortex-A76’s cache performance significantly improved over previous products
In terms of latency, ARM believes Cortex-A76 excels. ARM says customers who follow its design specifications in their SoCs can extract the core’s full performance. For instance, every nanosecond shaved off main-memory latency is said to bring a 0.25% performance increase. We saw the flip side in the Snapdragon 845, whose high-latency L4 cache caused final performance to fall short of ARM’s expectations. In the future, ARM’s customers will need to pay more attention to the latency of the memory subsystem; otherwise, the performance and power consumption of the processor will vary greatly, by even more than the differences between architectures.
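The 0.25%-per-nanosecond figure can be turned into a quick estimate. The sketch below assumes the relationship stays linear over the range shown, which is a simplification, and the latency savings are hypothetical values chosen for illustration rather than measurements of any SoC.

```python
# Linear rule of thumb from the text: about 0.25% performance per nanosecond saved.
GAIN_PER_NS = 0.0025


def estimated_gain(latency_saved_ns: float) -> float:
    """Estimated fractional performance gain for a given memory-latency saving."""
    return latency_saved_ns * GAIN_PER_NS


for saved_ns in (5, 10, 20):  # hypothetical savings, not measured values
    print(f"{saved_ns:>2} ns saved -> roughly {estimated_gain(saved_ns):.2%} more performance")
```

Under this assumption, a memory subsystem that is 20 ns slower costs about as much as several of the core's micro-architectural tweaks combined.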
Performance and Power Consumption Predictions
ARM has made predictions regarding the performance and power consumption of Cortex-A76, including differences in micro-architecture design, memory subsystem differences, frequency, and systems.
In terms of general IPC, compared to Cortex-A75, ARM promises a 25% improvement in integer performance, a 35% improvement in ASIMD/floating-point performance, and a 90% improvement in memory performance. This translates into a 25% increase in GeekBench4 and a 35% increase in JavaScript performance. In AI computing, Cortex-A76’s dual 128-bit ASIMD units allow half-precision matrix multiplication performance to reach 3.9 times that of previous products. Considering the improvements in micro-architecture, these figures are credible.
▲ARM’s performance predictions for Cortex-A76.
▲Cortex-A76’s overall performance improved by 35% compared to the previous Cortex-A75 under given processes and frequencies.
It should be noted that the Cortex-A76 in the comparison uses the updated TSMC 7nm process, with clock frequencies also being slightly higher. In this part, ARM predicts that Cortex-A76 can reach 3GHz under the 7nm process, with GeekBench4 test performance scores increasing by up to 35%.
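One plausible way to reconcile the 25% same-frequency GeekBench4 gain with the "up to 35%" figure is to multiply the IPC gain by the clock ratio of the two configurations ARM compared (2.8GHz versus 3GHz). The sketch below assumes performance scales linearly with frequency, which is optimistic for memory-bound workloads; the variable names are illustrative.

```python
# Combine the quoted IPC gain with the clock difference between the two test setups.
IPC_GAIN = 1.25        # Cortex-A76 vs Cortex-A75 at equal frequency (GeekBench4)
A75_CLOCK_GHZ = 2.8    # 10nm Cortex-A75 reference
A76_CLOCK_GHZ = 3.0    # 7nm Cortex-A76 projection

clock_ratio = A76_CLOCK_GHZ / A75_CLOCK_GHZ
combined = IPC_GAIN * clock_ratio  # assumes linear scaling with clock

print(f"clock ratio:    {clock_ratio:.3f}")
print(f"combined gain:  {combined - 1:.1%}")
```

The result lands at roughly a one-third improvement, close to ARM's headline 35% number.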
How about the single-core performance of Cortex-A76? Taking currently common processors as an example, the Cortex-A76 at 3GHz will have single-core performance close to that of Exynos 9810 and Apple A10, and at 2.5GHz, it will already surpass Snapdragon 845 by a considerable margin. From this perspective, the performance of Cortex-A76 ultimately depends on frequency; in past releases, ARM has often been overly optimistic in this regard, such as Cortex-A73 initially estimated to reach 2.8GHz, and Cortex-A75 even up to 3GHz, but the actual products did not exceed 2.4GHz and 2.8GHz.
Due to different processes and designs, even with the same core architecture, there will be differences in frequency and power consumption. Since mobile performance chips are categorized by performance and power consumption, it is possible to reduce frequency to better balance power consumption and performance, which will lower both. For the first batch of Cortex-A76 products on the market, it may be difficult to reach 3GHz, with estimates mostly around 2.5GHz.
Here, ARM’s predictions are somewhat more aggressive, leaning towards the frequencies achievable on high TDP platforms. ARM also presented a slide showing the processor’s peak performance at 3.3GHz, at which point Cortex-A76’s performance could nearly reach double that of Cortex-A73. It is important to note that the power consumption here exceeds 5W, so the usage scenario may not be for battery-powered small devices.
Next, let’s look at improvements in power consumption. ARM’s data indicates that at 750mW of per-core power, comparing the 10nm Cortex-A75 and the 7nm Cortex-A76, the latter delivers 40% more performance; put the other way, when delivering the same SPECint2006 performance, Cortex-A76 consumes only half the power of the comparison product. In all these tests and comparisons, we have not yet seen more detailed performance data for the processor, including detailed figures for Cortex-A76 at 3GHz.
From the current process situation, TSMC’s promise is a 40% power reduction for 10nm FF compared to its 16nm FF, but so far TSMC has not shipped any Cortex-A75 products in actual production; in fact, only Samsung’s 10nm LPP has produced Snapdragon 845 processors related to Cortex-A75, and some data suggest that Samsung’s process may slightly outperform TSMC’s 10nm FF.
In terms of energy consumption, ARM cites SPECint2006 performance metrics, and we speculate that ARM used the 2.8GHz Cortex-A75 as the reference in this comparison. If ARM were to compare against the Snapdragon 845 instead, that chip would likely be comparable to a 2.4GHz Cortex-A76; after accounting for the process improvements, this leaves about a 15% architectural advantage for Cortex-A76. However, Cortex-A76’s goal is a 35% performance increase and, as we have seen, raising frequency to gain performance makes power consumption grow faster than linearly, so the advantage in power consumption and efficiency may quickly diminish at peak performance.
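The non-linear relationship between frequency and power can be illustrated with the standard first-order CMOS dynamic power model, P ≈ C·V²·f. The voltage and frequency points below are invented for illustration and are not measurements of any Cortex-A76 silicon; higher clocks generally require higher voltage, which is what makes power rise faster than frequency.

```python
# First-order dynamic power model: P is proportional to C * V^2 * f.
def dynamic_power(capacitance: float, voltage: float, freq_ghz: float) -> float:
    """Relative dynamic power in arbitrary units."""
    return capacitance * voltage ** 2 * freq_ghz


# Hypothetical operating points (voltage in volts, frequency in GHz).
base = dynamic_power(1.0, 0.80, 2.5)
peak = dynamic_power(1.0, 0.95, 3.0)

freq_gain = 3.0 / 2.5 - 1      # extra frequency at the peak point
power_gain = peak / base - 1   # extra power needed to get there

print(f"frequency increase: {freq_gain:.0%}")
print(f"power increase:     {power_gain:.0%}")
```

With these assumed numbers, a 20% higher clock costs far more than 20% extra power, which is why efficiency advantages tend to shrink near peak frequency.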
Considering all factors, we believe that the 7nm Cortex-A76 exhibits a slight advantage in energy efficiency at peak performance, which is a very important metric. If we take a more conservative view, at 2.5GHz, the energy efficiency advantage of Cortex-A76 compared to Cortex-A73 and Cortex-A75 will expand to 30%.
Overall, the energy efficiency of Cortex-A76 is very high, but it is also a design whose operating point is set by TDP. At peak performance its TDP is high, and such configurations are rarely used in small devices like phones, which need lower frequencies to keep heat under control. In devices like laptops, Cortex-A76 can run at higher frequencies for better performance, since larger devices offer better cooling and power delivery.
Foundation for the Next Two Generations of Processors
Cortex-A76 is not a huge architecture but is balanced in various aspects. In addition to performance improvements, Cortex-A76 pays great attention to power efficiency in almost every design step, and ARM hopes to achieve an architecture that can have the best of both worlds.
Currently, two manufacturers are collaborating with ARM on products related to Cortex-A76, and relevant products are expected to be released by the end of this year. Among them, Huawei HiSilicon is one of the most important partners, and Qualcomm may also adopt Cortex-A76 in its next-generation products. As for Samsung, since Cortex-A76 does not significantly surpass M3, after improving M3’s power efficiency, Samsung may further launch M4 and may not use ARM’s public version or modified version.
According to ARM’s planning, Cortex-A76 will serve as the foundation for the next two generations of processors, meaning that future new architectures will be developed based on Cortex-A76 to improve performance or enhance energy efficiency. According to ARM’s data, they hope their product’s annual compound growth rate will be 25%, which means that in the coming years, mobile SoCs are expected to catch up with the performance of PC processors, making the market more interesting.
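To get a feel for what a 25% compound annual growth rate means, the sketch below simply compounds the figure over several generations. It assumes the rate holds every year, which is ARM's stated goal rather than a guaranteed outcome, and the function name is made up for illustration.

```python
# Compound a fixed annual growth rate over several generations.
ANNUAL_GROWTH = 0.25  # ARM's stated 25% target


def cumulative_performance(generations: int, rate: float = ANNUAL_GROWTH) -> float:
    """Relative performance after a number of yearly generations (1.0 = today)."""
    return (1 + rate) ** generations


for years in range(1, 6):
    print(f"after {years} generation(s): {cumulative_performance(years):.2f}x")
```

If the target were met every year, performance would roughly double in three generations and triple in five.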