The Cortex-A77, launched by ARM in May 2019, is built on TSMC’s 7nm process, reaches a peak frequency of 3GHz, and delivers roughly a 20% performance improvement over its predecessor. In a previous article, the architecture of the strongest X86 processor, ZEN, was introduced; see that article for details. This article analyzes ARM’s strongest processor, the A77, along the same lines, delving into its design and its similarities to and differences from the X86 architecture.
First, a brief introduction to the ARM instruction set architecture. General-purpose processors today fall broadly into two camps: the CISC (Complex Instruction Set Computer) camp led by INTEL and AMD, and the RISC (Reduced Instruction Set Computer) camp led by ARM. The main difference between the two lies in how much each instruction does. A RISC ISA typically completes one simple, independent operation or control step per instruction, with a fixed instruction length and a relatively uniform format. CISC instructions are far more complex, with variable lengths and intricate formats. Because RISC-style encodings and instruction semantics map well onto hardware, processors have evolved to the point where both X86 and ARM build their hardware execution cores around RISC-style operations: X86 adds a layer that translates CISC instructions into RISC-like microinstructions, which increases the complexity of the decode section and adds extra pipeline stages. This is also a significant reason why X86 processors consume more power than comparable RISC processors.
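To make the decode-complexity point concrete, here is a toy Python sketch, not a real ISA decoder; the 2-byte/4-byte length rule is invented purely for illustration. It shows why a fixed-length encoding lets a wide front end find all instruction boundaries independently, while a variable-length encoding must determine each instruction’s length before the next boundary is known.

```python
# Toy illustration (not a real ISA decoder): fixed-length encodings make wide,
# parallel decode cheap, while variable-length encodings force a sequential
# length-determination step before instructions can be split across decoders.

def fixed_length_boundaries(code: bytes, width: int = 4):
    """Every boundary is known up front: instruction i starts at i * width."""
    return list(range(0, len(code), width))

def variable_length_boundaries(code: bytes, length_of):
    """Each boundary depends on decoding the previous instruction's length."""
    boundaries, pc = [], 0
    while pc < len(code):
        boundaries.append(pc)
        pc += length_of(code, pc)   # must inspect bytes before moving on
    return boundaries

# Hypothetical length rule standing in for a CISC-style encoding:
# an opcode byte below 0x80 means a 2-byte instruction, otherwise 4 bytes.
toy_length = lambda code, pc: 2 if code[pc] < 0x80 else 4

stream = bytes([0x10, 0x00, 0x90, 0x01, 0x02, 0x03, 0x20, 0x00])
print(fixed_length_boundaries(stream))                  # [0, 4]
print(variable_length_boundaries(stream, toy_length))   # [0, 2, 6]
```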
The Cortex-A77 targets the high-performance mobile sector and adopts the ARMv8.2 64-bit instruction set architecture. In terms of hardware design it inherits from the A76, using the same 7nm process, and the peak frequency is unchanged. This suggests the pipeline structure of the A77 is consistent with the A76, and that the 20% improvement comes mainly from microarchitectural details aimed at raising IPC and parallel execution capability. As the process advances to 7nm, the power density of a single chip greatly restricts further frequency increases. ARM primarily targets the mobile market and does not chase INTEL on frequency, pursuing performance per unit of power instead. Many design choices therefore do not prioritize frequency; for example, the L1 cache reaches 64KB, exceeding ZEN2’s 32KB. Other features such as DynamIQ and big.LITTLE are by now basically standard for ARM.
The pipeline structure of the A77 has not changed much; it is still a standard physical-register out-of-order machine, but several points are noteworthy. The first is the 1.5K-entry Mop cache. This structure, long present in X86, finally appears in an ARM processor. X86, being a complex instruction set, introduced the Mop cache to store decoded microinstructions so that the fetch and decode pipelines can be bypassed entirely, enabling a wider dispatch. Here the Mop cache sits at the fetch stage, after the Icache result mux, and feeds the decode module along with the normal path. If that placement is correct, its primary purpose is power control and reducing the branch penalty. The decode stage has widened to 6 instructions, the issue width has grown accordingly, and 1 ALU and 1 BRU have been added, so the A77’s execution units comprise 4 ALUs, 2 BRUs, and 2 load-store pipes. Apple’s processor designs have clearly had a significant influence on ARM: with everyone pushing concurrency and single-core peak performance, ARM cannot stay isolated and continue its extreme energy-efficiency approach. Another reason may be the resurgence of ARM servers. With Huawei and Amazon launching self-designed ARM server chips, the ARM server market, which had cooled for several years, seems to be bustling again. In this context, ARM also needs a processor whose single-core performance can compete with the X86 camp, which in turn can draw more manufacturers in to challenge X86. The A77’s configuration can be said to serve both the high-end mobile market and the entry-level server market.
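As a rough illustration of what the wider machine buys, below is a back-of-envelope Python model of per-cycle throughput limited by the port counts described above (4 ALUs, 2 BRUs, 2 load-store pipes) and the 6-wide dispatch. This is my own simplification: it ignores dependencies, cache misses, and the real scheduler, so it only gives a lower bound on cycles.

```python
# Back-of-envelope port model (my own simplification, not ARM's scheduler):
# per cycle the core described above can execute at most 4 ALU ops, 2 branches
# and 2 load/store ops, capped overall by the 6-wide dispatch.

PORTS = {"alu": 4, "bru": 2, "ls": 2}
DISPATCH_WIDTH = 6

def cycles_lower_bound(mix: dict) -> float:
    """Minimum cycles for an instruction mix, ignoring dependencies and misses."""
    per_port = [mix.get(kind, 0) / count for kind, count in PORTS.items()]
    total = sum(mix.values()) / DISPATCH_WIDTH
    return max(per_port + [total])

# Example: a loop body with 8 ALU ops, 2 branches and 4 memory ops.
mix = {"alu": 8, "bru": 2, "ls": 4}
print(cycles_lower_bound(mix))   # ~2.33: here the 6-wide dispatch is the limiter
```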
The most important change in the front-end pipeline is the addition of the Mop cache. Assuming its width matches the decode width, the 1.5K entries can hold roughly 9,000 32-bit instructions, which should cover most application scenarios in the mobile space. First, once the Mop cache is warm, instructions can be sent directly to the decode stage without going through the Icache path, allowing the entire fetch unit to drop into a low-power state. Second, it stores information about the decoded instructions, including branch and loop prediction results, enabling zero-cycle hardware loops and further improving loop execution efficiency. Third, fetching from the Mop cache shortens the fetch pipeline, so if a branch mispredicts and the new target also hits in the Mop cache, the pipeline-flush penalty is significantly reduced. ARM also mentions a dynamic code optimization mechanism, which appears to learn characteristics of code sequences to help the later execution stages, but it is unclear how this is done.
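The capacity estimate above is easy to reproduce. The sketch below just spells out the arithmetic under the same assumptions the text makes: 1.5K entries, each as wide as the 6-wide decode, and roughly one macro-op per 32-bit AArch64 instruction.

```python
# Capacity estimate under the assumptions in the text: 1.5K entries, each as
# wide as the 6-wide decode, and roughly one macro-op per 32-bit instruction.
entries = 1536               # "1.5K"
mops_per_entry = 6           # assumed equal to the decode width
bytes_per_instr = 4          # AArch64 fixed 32-bit encoding

instructions = entries * mops_per_entry
print(instructions)                    # 9216 -> "roughly 9,000 instructions"
print(instructions * bytes_per_instr)  # ~36 KB of code covered when it is hot
```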
Branch prediction is another key optimization, in line with the direction of ZEN2, and it remains the most important performance lever in the front-end pipeline. The A77 widens the branch prediction bandwidth to 64B, theoretically allowing it to predict across 16 32-bit instructions at once. It also significantly enlarges the BTB, similar to ZEN. Compared with ZEN, however, the A77 lacks an L0 BTB and has only 2 BTB levels. Without concrete data it is hard to say which scheme is better; each is presumably tuned against its own target workloads. The common trend is toward larger predictors and wider prediction.
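To make the 2-level BTB arrangement concrete, here is a minimal Python model of the structure: a small, fast first level backed by a larger second level. The sizes, the promotion policy, and the 1-cycle bubble are placeholders chosen for illustration, not figures from ARM.

```python
# Minimal two-level BTB model (structure only; sizes and latencies are
# placeholders, not ARM's figures). A small, fast first level is backed by a
# larger, slower second level, as in the 2-level arrangement discussed above.

class TwoLevelBTB:
    def __init__(self, l1_size=64, l2_size=8192):
        self.l1, self.l2 = {}, {}
        self.l1_size, self.l2_size = l1_size, l2_size

    def install(self, branch_pc: int, target: int):
        if len(self.l2) >= self.l2_size:
            self.l2.pop(next(iter(self.l2)))          # crude FIFO eviction
        self.l2[branch_pc] = target

    def predict(self, branch_pc: int):
        """Return (predicted_target, extra_bubble_cycles), or (None, None) on a miss."""
        if branch_pc in self.l1:
            return self.l1[branch_pc], 0              # fast path, no bubble
        if branch_pc in self.l2:
            if len(self.l1) >= self.l1_size:
                self.l1.pop(next(iter(self.l1)))
            self.l1[branch_pc] = self.l2[branch_pc]   # promote into level 1
            return self.l2[branch_pc], 1              # assumed 1-cycle bubble
        return None, None

btb = TwoLevelBTB()
btb.install(0x1000, 0x2000)
print(btb.predict(0x1000))   # level-2 hit: returns the target with a 1-cycle bubble
print(btb.predict(0x1000))   # level-1 hit this time: same target, no bubble
```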
The decode stage mainly increases the dispatch width by 50%, theoretically allowing 6 instructions to proceed concurrently and providing greater parallel execution capability. Accordingly, the ROB grows to 160 entries. ARM also mentions speeding up the restoration of the renaming table after a branch misprediction. Typically, the renaming table is restored to the correct state only when the corresponding branch retires; the acceleration mentioned here may provide multiple branch recovery points, allowing direct recovery to the nearest checkpoint without waiting for retirement. That would increase both hardware complexity and area. Lacking more detailed information, we can only make reasonable speculations.
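The checkpointing idea speculated about above can be sketched as follows; this is purely an illustration of the guessed mechanism, not ARM’s actual implementation. Snapshotting the rename map at each predicted branch lets a misprediction restore it immediately instead of waiting for the branch to retire, at the cost of storage and management for the checkpoints.

```python
# Sketch of the speculated checkpointing mechanism (my illustration, not
# ARM's design): snapshot the register rename map at every predicted branch
# so a misprediction can restore it immediately rather than at retirement.

class RenameCheckpoints:
    def __init__(self):
        self.rename_map = {}          # architectural reg -> physical reg
        self.checkpoints = {}         # branch tag -> saved copy of the map

    def rename(self, arch_reg: str, phys_reg: int):
        self.rename_map[arch_reg] = phys_reg

    def on_branch(self, branch_tag: int):
        self.checkpoints[branch_tag] = dict(self.rename_map)   # O(regs) copy

    def on_mispredict(self, branch_tag: int):
        self.rename_map = self.checkpoints.pop(branch_tag)     # instant recovery
        # a real design would also discard younger checkpoints here

rc = RenameCheckpoints()
rc.rename("x0", 17)
rc.on_branch(branch_tag=1)            # checkpoint before speculating past it
rc.rename("x0", 42)                   # speculative rename on the wrong path
rc.on_mispredict(branch_tag=1)
print(rc.rename_map)                  # {'x0': 17}: state from before the branch
```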
The execution units mainly grow in number, moving toward Apple’s approach. Note that ARM has kept single-cycle ALUs, which are crucial for single-core IPC, and ZEN2 has done the same. The A77 adopts a unified issue queue, unlike ZEN’s split design. A unified IQ can produce better scheduling decisions but significantly restricts frequency scaling, which again reflects that ARM weighs overall power consumption above frequency. Another change is the addition of a crypto pipeline to improve AES encryption and decryption throughput. On this point the author remains reserved. Crypto acceleration is usually handled by dedicated accelerators, since the algorithms are fixed and an ASIC can achieve very high speedups at relatively low hardware cost. Moreover, ASIC-based crypto accelerators can be completely isolated from the processor, achieving pure hardware encryption and decryption and thus a high level of security. Implementing crypto on the processor, by contrast, leaves a significant performance and power gap, and the heavier software involvement in the instruction-based approach makes security harder to guarantee.
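The scheduling advantage of a unified issue queue over split queues can be shown with a toy comparison; the model below is my own simplification and not either vendor’s design. With split, per-port queues, ready work cannot borrow an idle port that belongs to another queue, while a unified queue can steer any ready entry to any idle port.

```python
# Toy comparison of unified vs. split issue queues (an illustration of the
# trade-off discussed above, not either vendor's design).

def issue_split(queues, ports_per_queue=1):
    issued = 0
    for q in queues:                              # one private queue per port
        ready = [op for op in q if op["ready"]]
        issued += min(len(ready), ports_per_queue)
    return issued

def issue_unified(queues, total_ports):
    ready = [op for q in queues for op in q if op["ready"]]
    return min(len(ready), total_ports)           # any ready op can use any port

# Example: two ALU "queues", 1 port each; all ready work happens to sit in queue 0.
q0 = [{"ready": True}, {"ready": True}]
q1 = [{"ready": False}]
print(issue_split([q0, q1]))          # 1: queue 1's port sits idle this cycle
print(issue_unified([q0, q1], 2))     # 2: the unified picker fills both ports
```

The flip side, as noted above, is that a single wide picker scanning one large queue is harder to close timing on, which is why split designs tend to favor frequency.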
The load-store pipeline also adopts a unified issue queue. The A77 has 2 address-generation paths and 2 store-data paths, allowing it to execute 2 store instructions simultaneously; the per-cycle combination may be 2 loads, 2 stores, or 1 load plus 1 store. Adopting 2 store pipes is quite aggressive and is likely aimed at higher memory-transfer performance. For typical applications, the additional store-data path may not make much difference.
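To give a rough sense of why a second store-data pipe targets memory-transfer workloads, the back-of-envelope calculation below compares store-streaming (memset-like) peak bandwidth with one versus two store pipes. The 16-byte NEON access size and the 2.6GHz clock are placeholder assumptions of mine, and in practice DRAM bandwidth rather than the core is usually the limit, consistent with the point that typical applications may see little benefit.

```python
# Back-of-envelope store-streaming (memset-like) bandwidth with one vs. two
# store data pipes, assuming 16-byte (128-bit NEON) stores and a placeholder
# 2.6 GHz clock -- illustrative assumptions, not quoted figures.

access_bytes = 16          # one 128-bit vector store
freq_ghz = 2.6             # placeholder clock for a mobile big core

def store_stream_gbps(stores_per_cycle):
    return stores_per_cycle * access_bytes * freq_ghz   # ignores cache/DRAM limits

print(store_stream_gbps(1))   # 41.6 GB/s with a single store data pipe
print(store_stream_gbps(2))   # 83.2 GB/s peak with two -- usually DRAM-bound anyway
```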
Here ARM emphasizes its prefetching mechanism. Data prefetching can effectively hide system memory access latency, which is crucial for a high-performance processor. Typical prefetchers key off the data access pattern, supporting one-dimensional or multi-dimensional stride-based prefetching. The A77 proposes a system-aware prefetching mechanism, claimed to prefetch more effectively by adapting to the characteristics of the memory subsystem. With limited information it is unclear how this system-awareness is implemented; it may dynamically adjust prefetch depth and strategy based on differing memory latencies and the utilization of the shared L3 cache across cores.
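For contrast with the system-aware scheme, whose details ARM has not disclosed, here is a minimal sketch of the "typical" one-dimensional stride prefetcher the text refers to: per load PC it tracks the last address and stride, and two matching strides in a row trigger prefetches a fixed distance ahead. The table organization and prefetch distance are illustrative choices, not anyone’s actual parameters.

```python
# Minimal stride prefetcher sketch: the "typical" one-dimensional scheme the
# text contrasts with ARM's system-aware approach. Per load PC, track the last
# address and stride; two matching strides in a row trigger a prefetch.

class StridePrefetcher:
    def __init__(self, distance=4):
        self.table = {}            # load PC -> (last_addr, last_stride, confident)
        self.distance = distance   # how many strides ahead to fetch

    def access(self, pc: int, addr: int):
        last_addr, last_stride, confident = self.table.get(pc, (None, None, False))
        prefetch = None
        if last_addr is not None:
            stride = addr - last_addr
            if stride == last_stride and stride != 0:
                confident = True
                prefetch = addr + stride * self.distance
            else:
                confident = False
            last_stride = stride
        self.table[pc] = (addr, last_stride, confident)
        return prefetch

pf = StridePrefetcher()
for a in [0x1000, 0x1040, 0x1080, 0x10C0]:
    p = pf.access(pc=0x400, addr=a)
    print(hex(a), "->", hex(p) if p is not None else None)
# From the third access the 0x40 stride is confirmed: 0x1180, then 0x11c0, are prefetched.
```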
According to the performance figures ARM provides, the A77 achieves a high energy-efficiency ratio, and power consumption in particular remains ARM’s core competitive advantage.
From the A77’s technical specifications, general-purpose processor design shows the following trends. First, as explained in the article on ZEN2, the microarchitecture of general-purpose processors has stabilized; there are no new tricks at the macro level, and improvements come from the details, such as wider issue paths, more execution units, and larger predictors. This gives latecomers a great opportunity to catch up. Second, the design of ARM’s mobile processors is gradually converging with INTEL’s and AMD’s desktop structures: apart from lower core frequencies, the other metrics are not significantly different, and many technologies once unique to X86 are appearing in ARM’s designs. As the process approaches its physical limits, the gap between ARM and INTEL is narrowing rapidly, and in the near future ARM processors may well have the capital to compete with X86 in the server market. Third, shrinking process nodes have sharply increased power density, which has become a key factor limiting core frequency. The A77 does not push frequency higher, and judging from previous SOC designs the final chips will likely settle around 2.5GHz. Given the tighter cooling constraints of mobile devices, typical applications may spread work across multiple cores at lower frequencies, while heavier workloads such as games may boost a single core while slowing the other cores or putting them into low-power states, keeping total chip power within budget. INTEL’s Turbo Boost technology operates in a similar manner. Power consumption will remain a bottleneck for processor performance for a long time, challenging designers to provide flexible execution strategies tailored to specific application scenarios.
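The cores-versus-frequency reasoning follows from the usual dynamic-power relation P_dyn ≈ C·V²·f, with supply voltage rising roughly with frequency near the top of the curve, so power grows close to cubically with frequency. The numbers below are normalized illustrations of that scaling, not measurements of any particular chip.

```python
# Rough dynamic-power arithmetic behind the "more cores at lower frequency"
# point: P_dyn ~ C * V^2 * f, with V assumed to rise roughly linearly with f
# in this regime, so power scales close to cubically. Normalized illustration,
# not measurements of any chip.

def relative_power(freq_ratio, voltage_exponent=2.0):
    # voltage assumed to scale ~linearly with frequency near the top of the curve
    return freq_ratio * freq_ratio ** voltage_exponent   # f * V^2 with V ~ f

one_core_at_3ghz   = 1 * relative_power(3.0 / 2.5)
four_cores_at_2ghz = 4 * relative_power(2.0 / 2.5)

print(round(one_core_at_3ghz, 2))    # ~1.73x the power of one core at 2.5 GHz
print(round(four_cores_at_2ghz, 2))  # ~2.05x, but with four cores doing work
```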
Thus, leaders face numerous constraints while latecomers can strive to catch up, and now may be a great opportunity for the development of domestic processors. For example, a few days ago Alibaba’s Pingtouge (T-Head) announced that its XuanTie 910 offers overall performance comparable to the ARM A72, marking the smallest gap yet between domestic processors and world-class levels. It may not be long before domestic designs can keep pace with ARM in the mobile market. However, ARM’s real advantage lies in its ecosystem; whether the emerging RISC-V can break through that barrier and build its own ecosystem around open source will be the key factor on the road to commercialization for domestic processors. We look forward to that day arriving soon.
[1] Images in this article are sourced from ARM Tech Day Presentation, May 26, 2019.