Written by | Ren Ran
Reported by Leifeng Network (leiphone-sz)
Reference source: Anandtech
https://www.anandtech.com/show/12785/arm-cortex-a76-cpu-unveiled-7nm-powerhouse
It’s June again, and Arm has unveiled the brand new Cortex A76 architecture in San Francisco.
Digital enthusiasts are likely familiar with Arm’s architecture codenames, but may not know who designs them. In fact, Arm has three design teams globally: the Austin team in Texas, USA, the Sophia team in southern France, and the Cambridge team in the UK.
These three teams have their own roles; the Austin team is responsible for designing high-performance architectures, represented by Cortex A57 and Cortex A72; the Cambridge team specializes in low-power architectures like Cortex A53 and Cortex A55; while the Sophia team focuses on balance, producing Cortex A73 and Cortex A75.
However, since the slowdown of Moore’s Law at the 28nm node, the Austin team has encountered bottlenecks twice with the Cortex A57 and Cortex A72 architectures. While performance is strong, power consumption and heat generation are also significant. For several years after that, the Austin team had little activity. Just when people had almost forgotten about this group of Americans, the Austin team returned with the brand new Cortex A76.
From a design perspective, the Cortex A76 is crucial for Arm; it is a completely re-engineered microarchitecture, the leader of the “second generation Austin family,” representing a new beginning. Arm describes it as “a processor with laptop-level performance.”
Built on the latest 7nm process, the Cortex A76 is expected to reach a frequency of 3GHz. Compared with a Cortex A75 manufactured on a 10nm process and running at 2.8GHz, Arm claims a 40% reduction in power consumption, a 35% increase in performance, and a fourfold improvement in machine learning capability.
Analysis of Cortex A76 Architecture
The Cortex A76 is an out-of-order superscalar core with a 4-wide decode front end and a 13-stage pipeline whose execution latency matches that of an 11-stage design. Arm has designed a “predict-directed fetch” unit, meaning the branch prediction unit feeds the instruction fetch unit. Arm is also using a “hybrid indirect predictor” for the first time and has decoupled the prediction unit from the instruction fetch unit, so that the modules within the core can operate independently, which makes it easier to apply clock gating during operation to save power.
The branch prediction unit of the Cortex A76 is supported by a 3-level BTB (branch target buffer), including a 16-entry nanoBTB, a 64-entry microBTB, and a 6000-entry main BTB. In the Cortex A73 and Cortex A75 generations, Arm claimed that its branch prediction unit could predict nearly all branches, and this new unit in the Cortex A76 seems even stronger.
The instruction fetch unit operates at 16 bytes per clock cycle, while the branch prediction unit runs at twice that rate, 32 bytes per cycle, and feeds a fetch queue of 12 “blocks” that sits in front of the instruction fetch unit. This design aims to hide branch bubbles in the pipeline when branches are mispredicted, preventing the instruction fetch unit and the rest of the core from stalling. Arm claims that the Cortex A76 can handle up to 8 branch prediction errors per cycle.
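To make the bandwidth figures above concrete, the following minimal sketch converts bytes per cycle into instructions per cycle. It assumes fixed-width 32-bit instructions (4 bytes each); the function and constant names are illustrative and do not come from Arm.

```python
# Assumption: every instruction is a fixed-width 32-bit (4-byte) encoding.
INSTRUCTION_BYTES = 4

def instructions_per_cycle(bytes_per_cycle: int,
                           instruction_bytes: int = INSTRUCTION_BYTES) -> int:
    """Number of whole instructions covered by a given per-cycle bandwidth."""
    return bytes_per_cycle // instruction_bytes

fetch_width = instructions_per_cycle(16)    # instruction fetch unit, 16 bytes/cycle
predict_width = instructions_per_cycle(32)  # branch prediction unit, 32 bytes/cycle

print(f"fetch:   {fetch_width} instructions/cycle")
print(f"predict: {predict_width} instructions/cycle")
print(f"predictor runs {predict_width // fetch_width}x ahead of fetch")
```

Under this assumption the fetch rate lines up with the 4-wide decode stage, while the predictor covers twice as many instructions per cycle, which is what lets it run ahead and fill the queue.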
The instruction fetch unit of the Cortex A76 can provide up to 16 32-bit instructions, and the fetch pipeline includes 2 cycles of instruction alignment and decoding. In the instruction decoding and renaming stage, the Cortex A76 can process 4 instructions per cycle and emits macro-ops (Mops) at an average ratio of 1.06 Mops per instruction.
Previously, the Cortex A72 and Cortex A75 could decode 3 instructions per cycle, while the Cortex A73 could decode only 2. According to information obtained by Leifeng Network, the Cortex A73 reduced decoding bandwidth relative to the Cortex A72 in order to optimize energy efficiency, while the Cortex A75 returned to a 3-wide design because of the growing demand for performance in mobile processors. The Cortex A76 goes a step further and has the highest decoding bandwidth of any of Arm’s stock (public) architectures, though it is still narrower than the custom architectures of Samsung and Apple (Samsung M3 at 6 instructions per cycle, Apple A11 at 7 instructions per cycle).
In the instruction renaming stage, Arm has separated the renaming unit and applied clock gating for integer/ASIMD/flag operations, reducing renaming and scheduling from 2 cycles in the A73 and A75 to 1 cycle. Macro-ops are expanded into micro-operations at a ratio of 1.2 μops per instruction, and up to 8 μops can be dispatched per cycle, a significant improvement over 6 μops/cycle in the Cortex A75 and 4 μops/cycle in the Cortex A73.
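The expansion ratios quoted above can be combined into a rough per-cycle estimate. The sketch below is only back-of-the-envelope arithmetic on the article's figures (4 instructions decoded per cycle, about 1.06 macro-ops and 1.2 μops per instruction, 8 μops per cycle at the back end); it assumes the decoder is always full, which real code will not sustain, and the variable names are illustrative.

```python
# Figures quoted in the article for the Cortex A76.
DECODE_WIDTH = 4            # instructions decoded per cycle
MACRO_OPS_PER_INSTR = 1.06  # average macro-ops per instruction
UOPS_PER_INSTR = 1.2        # average micro-ops per instruction
PEAK_UOPS_PER_CYCLE = 8     # peak micro-ops per cycle in the back end

def average_ops_per_cycle(decode_width: int, ops_per_instr: float) -> float:
    """Average operations produced per cycle if the decoder is always full."""
    return decode_width * ops_per_instr

macro_ops = average_ops_per_cycle(DECODE_WIDTH, MACRO_OPS_PER_INSTR)
uops = average_ops_per_cycle(DECODE_WIDTH, UOPS_PER_INSTR)
headroom = PEAK_UOPS_PER_CYCLE / uops

print(f"average macro-ops per cycle: {macro_ops:.2f}")
print(f"average micro-ops per cycle: {uops:.2f}")
print(f"back-end peak is {headroom:.2f}x the average front-end supply")
```

On these numbers the back end's peak width comfortably exceeds what a full decoder supplies on average, which suggests the extra width is there to absorb bursts rather than to be saturated continuously.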
The out-of-order commit window of the Cortex A76 holds 128 entries, and the buffer is split into two structures responsible for instruction management and register reclamation, an arrangement known as the hybrid commit system. Because performance scales with buffer size at a ratio of only about 1/7, meaning a 7% increase in buffer size improves performance by only about 1%, Arm did not focus on enlarging this part of the design.
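The 1/7 scaling ratio can be turned into a quick estimate of why a larger window was not a priority. The sketch below assumes the ratio holds linearly across the whole range, which is a simplification introduced here and not something Arm has stated; the function name is hypothetical.

```python
WINDOW_ENTRIES = 128    # commit window size quoted in the article
SCALING_RATIO = 1 / 7   # 7% more buffer gives roughly 1% more performance

def estimated_gain_percent(new_entries: int,
                           base_entries: int = WINDOW_ENTRIES,
                           ratio: float = SCALING_RATIO) -> float:
    """Estimated performance gain in percent, assuming linear 1/7 scaling."""
    growth_percent = (new_entries - base_entries) / base_entries * 100
    return growth_percent * ratio

for entries in (160, 192, 256):
    gain = estimated_gain_percent(entries)
    print(f"{entries} entries: about {gain:.1f}% faster")
```

Even doubling the window to 256 entries would, under this linear assumption, buy only about 14% more performance, at a considerable cost in area and power.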
On the execution side, the integer part has 6 issue queues and execution ports. There are 3 integer execution pipelines in total, served by 16-deep issue queues: 2 of them can execute simple arithmetic operations, and 1 can execute complex operations such as multiplication, division, and CRC. The ASIMD/floating-point part has 2 pipelines, served by two 16-deep issue queues.
In integer operations, the Cortex A76 reduces multiplication and multiply-accumulate latency from 3 cycles in the Cortex A75 to 2 cycles while maintaining the same throughput. With 3 integer pipelines, the throughput for simple arithmetic operations is 50% higher than with the 2 pipelines of the Cortex A75.
In the “VX” (Vector Execution) pipeline responsible for floating-point and ASIMD operations, Arm has also made significant improvements. Floating-point arithmetic latency in the Cortex A76 has been reduced from 3 cycles to 2 cycles, and multiply-accumulate latency from 5 cycles to 4 cycles. Arm states that the dual 128-bit ASIMD pipelines in the Cortex A76 provide double the execution bandwidth, and that the execution throughput for quadruple precision operations has doubled compared to the Cortex A75.
Arm has also introduced the fourth-generation prefetch unit in Cortex A76, with 4 different prefetch engines running in parallel for each core, examining various data patterns and loading data into caches to approach the goal of perfect cache hit operations. Arm has made no compromises in the cache system design of the Cortex A76, achieving near-perfect levels in both bandwidth and latency, reportedly increasing cache bandwidth by up to 90%.
Performance and Power Consumption Predictions
Summing up these architectural improvements, Arm claims that the Cortex A76 offers 25% higher integer performance and 35% higher floating-point performance per cycle, along with up to 90% more cache bandwidth, resulting in a 28% improvement in GeekBench4 scores and approximately a 35% increase in JavaScript performance (Octane, JetStream).
Arm provided performance comparisons for running the SPECint2006 test, where the Cortex A76 at 2.4GHz outperformed the Snapdragon 845, with a 15% performance increase at the same frequency. Of course, the frequency benefits brought by semiconductor processes are also crucial for SoC performance improvements. If TSMC’s 7nm process is successfully put into production, running the Cortex A76 at frequencies above 3GHz will match the performance of the new Exynos 9810 using Samsung’s self-developed M3 architecture.
In addition to performance enhancements, the energy efficiency ratio of the Cortex A76 has also improved. Under a 750mW core power budget, the 7nm Cortex A76 can provide a 40% performance increase compared to the 10nm Cortex A75. Arm states that the Cortex A76 can maintain full-speed operation without frequency drop when all four cores are fully loaded.
However, previous frequency targets set by Arm have often been overly optimistic. For example, the initial expectation for Cortex A73 was 2.8GHz, and for Cortex A75 it was 3GHz, while their actual maximum operating frequencies were only 2.45GHz and 2.7GHz, respectively. For semiconductor suppliers, process maturity and differences between pipelines can affect chip operating frequency, and lowering the frequency limit is a necessary compromise to ensure supply.
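The gap between Arm's targets and shipping silicon can be quantified from the figures above. The sketch below computes the achieved-to-target ratio for each core and, purely as an illustration and not as a forecast from the article, applies the same ratios to a 3GHz target; the dictionary and function names are hypothetical.

```python
# (target GHz, actual maximum GHz) as quoted in the article.
HISTORY = {
    "Cortex A73": (2.8, 2.45),
    "Cortex A75": (3.0, 2.7),
}

def achieved_ratio(target_ghz: float, actual_ghz: float) -> float:
    """Fraction of the announced target frequency that shipping chips reached."""
    return actual_ghz / target_ghz

ratios = {core: achieved_ratio(*freqs) for core, freqs in HISTORY.items()}

for core, ratio in ratios.items():
    print(f"{core}: reached {ratio:.1%} of its target")

# Illustration only: apply the historical ratios to a 3GHz target.
A76_TARGET_GHZ = 3.0
low = A76_TARGET_GHZ * min(ratios.values())
high = A76_TARGET_GHZ * max(ratios.values())
print(f"3GHz target scaled by past ratios: {low:.2f} to {high:.2f} GHz")
```

Both earlier cores landed at roughly 88% to 90% of their announced targets, so a similar shortfall for a 3GHz target would put first-generation parts well below that figure.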
Furthermore, according to Leifeng Network, each core architecture has an optimal frequency range for energy efficiency on a given process. For example, the big-core CPU cluster of the new Exynos 9810, which uses Samsung’s self-developed M3 architecture, runs at 2.7GHz, 2.3GHz, and 1.8GHz under single-core, dual-core, and quad-core full load, respectively, with power consumption of around 3.5 watts in each case. Working backwards, per-core power roughly doubles when the frequency rises from 1.8GHz to 2.3GHz, and doubles again from 2.3GHz to 2.7GHz for a gain of only 400MHz.
From 1.8GHz to 2.7GHz, even if performance scales linearly with frequency, the gain is only 50%, while power consumption quadruples. This shows that leaving the optimal energy-efficiency range carries a significant power cost. The Snapdragon 845’s Kryo 385 Gold core behaves similarly; beyond a threshold of around 2.1GHz, its power consumption rises even more steeply than that of Samsung’s M3 core.
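The reverse engineering above can be reproduced in a few lines. The sketch assumes that the roughly 3.5W figure is the total for the big-core cluster in each load scenario and that it is shared equally among the active cores; both are simplifications, and the names are illustrative.

```python
# Figures quoted in the article for the Exynos 9810 big-core cluster.
CLUSTER_POWER_W = 3.5  # approximate power under full load in each scenario
FREQ_GHZ_BY_ACTIVE_CORES = {4: 1.8, 2: 2.3, 1: 2.7}

def per_core_power(active_cores: int,
                   cluster_power_w: float = CLUSTER_POWER_W) -> float:
    """Assumption: cluster power is split equally among the active cores."""
    return cluster_power_w / active_cores

def frequency_power_points():
    """(frequency in GHz, per-core power in W), sorted by rising frequency."""
    points = [(freq, per_core_power(cores))
              for cores, freq in FREQ_GHZ_BY_ACTIVE_CORES.items()]
    return sorted(points)

points = frequency_power_points()
base_freq, base_power = points[0]
for freq, power in points:
    print(f"{freq:.1f} GHz: {power:.3f} W per core "
          f"({freq / base_freq:.2f}x frequency, {power / base_power:.0f}x power)")
```

The output makes the trade-off explicit: the top operating point delivers 1.5 times the frequency of the lowest one at 4 times the per-core power.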
Therefore, the first SoCs using the Cortex A76 architecture are likely to still not reach 3GHz. Leifeng Network believes that considering changes in core architecture and growth in scale, the actual frequency will be around 2.5GHz, but it is not ruled out that as later processes mature or when applied to devices like laptops that have more relaxed power consumption requirements, they may reach frequencies above 3GHz.
Conclusion and Thoughts
In recent years, people have been eagerly awaiting a powerful architecture that can compete with Apple. Although Samsung’s recently launched self-developed M3 architecture has approached Apple A11 in performance, it comes at a terrifying power consumption of 3.5W per core. In this context, Arm still chooses to proceed steadily with generational replacements. The Cortex A76 from the Austin team is not a performance monster; it fully demonstrates how important a balanced microarchitecture is.
It is reported that Qualcomm and Huawei HiSilicon are already preparing for the development and production of Cortex A76 SoCs, and we are likely to see them in commercial products by the end of this year. Samsung’s situation is more nuanced; the performance of the Cortex A76 does not surpass that of M3, so theoretically, Samsung only needs to focus on improving the energy efficiency of M4 (if it exists).
If all goes well, the Cortex A76 architecture will undergo at least two iterations in the coming years. Arm has achieved its annual planning goals for five consecutive years, with a compound annual growth rate of 20-25%. As mobile processors rapidly approach the performance of x86 processors, the processor market will become even more interesting in the coming years.
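To see what a 20-25% compound annual rate adds up to, the sketch below compounds it over five years. It assumes the rate refers to year-over-year performance growth and that it applies uniformly each year, which the article does not spell out; the function name is hypothetical.

```python
def compound_growth(annual_rate: float, years: int) -> float:
    """Cumulative multiplier after compounding annual_rate for the given years."""
    return (1 + annual_rate) ** years

YEARS = 5
for rate in (0.20, 0.25):
    total = compound_growth(rate, YEARS)
    print(f"{rate:.0%} per year over {YEARS} years: {total:.2f}x")
```

At these rates, five years of compounding works out to roughly 2.5 to 3 times the starting level.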