ARM’s position in mobile computing is unrivaled, but that doesn’t mean ARM can afford to stand still. Technology never stops advancing, and for the past few years ARM has released a new architecture every year. Last year, ARM launched the Cortex-A73 core, which focuses on energy efficiency, along with the new Mali-G71 GPU. This year, ARM has introduced the Cortex-A75 and the new Cortex-A55 cores. The former’s high performance is beyond doubt, while the latter, as the successor to the popular Cortex-A53, is set to become the most important mobile CPU architecture in the market. So, what are the features of these two new architectures, and what new technologies has ARM paired with them? Let’s take a look.
As a global company, ARM has three major design teams that develop products in different directions. The Austin team in Texas, USA, primarily develops high-performance products, including Cortex-A15, Cortex-A57, and Cortex-A72. The Cambridge team in the UK is known for its small cores, including Cortex-A5, Cortex-A7, and Cortex-A53. The European Technology Center in Sophia Antipolis, France, has developed the Cortex-A12, Cortex-A17, and Cortex-A73 of the Sophia family, which balance high performance with power efficiency. The two products launched this time are the Cortex-A75 and Cortex-A55: the former is a high-performance, power-efficient core, and the latter is a small core.
▲ARM has launched the new Cortex-A75 and Cortex-A55 architectures
DynamIQ—The Successor to big.LITTLE Technology
Before describing the two new CPUs, let’s talk about the new DynamIQ technology. First, clusters have grown and power control has become more flexible. DynamIQ lets processors be configured freely: a cluster can contain cores of different architectures and models, with a maximum of 8 cores per cluster (big.LITTLE previously allowed a maximum of 4). Up to 32 clusters can coexist, for a total of 256 cores, and this can be expanded to thousands in the future. Previously, big.LITTLE allowed only a single voltage and frequency setting per cluster, whereas a DynamIQ cluster can have up to eight voltage and frequency domains (each referred to as a control domain), so different cores can run at different voltages and frequencies and individual cores can be turned off.
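To make the limits quoted above concrete, here is a minimal sketch of a configuration check. The `Cluster` class and `validate` function are hypothetical names made up for illustration, not part of any ARM tool; only the numeric limits come from the text.

```python
from dataclasses import dataclass

# Limits quoted in the text above.
MAX_CORES_PER_CLUSTER = 8
MAX_CLUSTERS = 32


@dataclass
class Cluster:
    """Hypothetical description of one cluster (not an ARM data structure)."""
    cores: list[str]    # core type of each core, e.g. "A75" or "A55"
    domains: list[int]  # voltage/frequency domain index assigned to each core


def validate(clusters: list[Cluster]) -> int:
    """Check a system against the quoted limits and return the total core count."""
    if len(clusters) > MAX_CLUSTERS:
        raise ValueError("at most 32 clusters are allowed")
    for c in clusters:
        if not 1 <= len(c.cores) <= MAX_CORES_PER_CLUSTER:
            raise ValueError("a cluster holds 1 to 8 cores")
        # One domain index per core; with at most 8 cores, a cluster can
        # never use more than the 8 domains the text allows.
        if len(c.domains) != len(c.cores):
            raise ValueError("each core needs a domain assignment")
    return sum(len(c.cores) for c in clusters)


# One big core in its own domain, seven little cores sharing a second domain.
big_little = Cluster(cores=["A75"] + ["A55"] * 7, domains=[0] + [1] * 7)
print(validate([big_little]))                # 8
print(MAX_CORES_PER_CLUSTER * MAX_CLUSTERS)  # 256
```

The second print simply reproduces the 256-core ceiling as 8 cores times 32 clusters.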
▲DynamIQ introduces a new cluster design scheme.
Secondly, in terms of cluster configuration, ARM believes that future processors will still mostly use eight cores, but all eight can now sit in one cluster in any combination of Cortex-A75 and Cortex-A55, whether “1+7”, “2+6”, “3+5”, or “4+4”. Thanks to the upgraded power management, cores in different control domains can run at different frequencies and voltages, and individual cores can be turned off. For example, in the “1+7” scheme, the “1” is a single high-frequency Cortex-A75 big core that provides the strongest single-thread performance, while the remaining seven Cortex-A55 cores handle multi-threaded work and low-power operation. This scheme offers up to 2.41 times the single-thread performance and 1.42 times the multi-thread performance of a traditional eight-core Cortex-A53 design, while the core area is only 1.13 times as large, a modest increase. In contrast, the currently popular “4+4” scheme raises single-thread performance to only 1.95 times while the core area grows to 1.55 times, which raises cost significantly.
To deliver these functions, DynamIQ relies on a fairly complex control block, the DynamIQ Shared Unit (DSU), which controls and manages voltage and frequency for the processor cores. The DSU also serves as the communication hub between the CPUs in the cluster and the rest of the system and controls data transfers. It contains asynchronous bridges, snoop filters, the L3 cache, bus interfaces, a power manager, the ACP (Accelerator Coherency Port), and peripheral interfaces, among other components, which together manage power, synchronize the cores, and connect the processors to external devices.
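The trade-off between the two schemes is easier to see as performance per unit of area. The short calculation below uses only the ratios quoted above; the dictionary layout and function name are illustrative choices.

```python
# Ratios quoted above, all relative to an eight-core Cortex-A53 design (= 1.0).
configs = {
    "1+7": {"single_thread": 2.41, "area": 1.13},
    "4+4": {"single_thread": 1.95, "area": 1.55},
}


def perf_per_area(cfg: dict) -> float:
    """Single-thread performance gained per unit of relative core area."""
    return cfg["single_thread"] / cfg["area"]


for name, cfg in configs.items():
    print(f"{name}: {perf_per_area(cfg):.2f}x single-thread performance per unit area")
# 1+7: 2.13x single-thread performance per unit area
# 4+4: 1.26x single-thread performance per unit area
```

By this measure, the “1+7” layout delivers roughly 1.7 times as much single-thread performance per unit of area as “4+4”, although the comparison says nothing about sustained multi-threaded load.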
▲DynamIQ can provide core configurations and performance that meet market demands.
Improvements in caching are another highlight of DynamIQ. ARM has made the L1 and L2 caches private to each core, reducing L2 cache latency by over 50%. All cores in the cluster share an optional L3 cache of 1MB, 2MB, or 4MB. The new L3 cache is 16-way set associative and is technically pseudo-exclusive, although ARM says that in practice its contents do not appear in the other caches. The L3 cache can also be partitioned, which may matter for networking or embedded systems that run fixed workloads with a large amount of deterministic data. It can be divided into up to 4 groups, and partitions can be assigned to tasks or cores. Unallocated L3 capacity is shared among all processors, and partitions are dynamic: they can be created and adjusted at run time by the operating system or the hypervisor.
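The partitioning rules described above (up to four groups, owners per task or core, unallocated capacity shared by all) can be sketched as a small bookkeeping model. This is a toy illustration only: the class and method names are hypothetical, and it does not reflect the real register interface an operating system or hypervisor would use.

```python
L3_GROUPS = 4  # the text says the L3 can be divided into up to four groups


class L3Partitioner:
    """Toy bookkeeping model; names are hypothetical, not an ARM API."""

    def __init__(self):
        # None means the group is unallocated and therefore shared by everyone.
        self.owner = [None] * L3_GROUPS

    def assign(self, group: int, owner) -> None:
        """Give a group to a task or core (or pass None to share it again)."""
        if not 0 <= group < L3_GROUPS:
            raise ValueError("group out of range")
        self.owner[group] = owner

    def release(self, group: int) -> None:
        """Return a group to the shared pool; partitions are dynamic."""
        self.assign(group, None)

    def usable_groups(self, requester) -> list[int]:
        """Groups a requester may allocate into: its own plus the shared ones."""
        return [g for g, o in enumerate(self.owner) if o is None or o == requester]


p = L3Partitioner()
p.assign(0, "network-task")
print(p.usable_groups("network-task"))  # [0, 1, 2, 3]
print(p.usable_groups("ui-task"))       # [1, 2, 3]
```

Here a fixed networking workload keeps one group to itself, while every other task continues to share the remaining three.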
▲Improvements in various aspects of DynamIQ, especially in caching.
In addition, to further improve multi-core efficiency and system performance, ARM has added error reporting to DynamIQ, so detected errors can be reported to software. Cache stashing is another new feature. It allows GPUs or other accelerators to write data into the L3 cache via the ACP or AMBA 5 CHI port, or even directly into a specific core’s L2 cache. ARM cites the example of a TCP/IP offload engine in a network device writing data directly into a CPU’s L2. Data then does not need to be moved several times inside the processor, which significantly improves performance while reducing power consumption and reliance on cache coherence mechanisms.
DynamIQ fundamentally changes the overall operation of the system from an architectural level, bringing about more efficient energy utilization and significantly improving the system’s energy efficiency ratio.
Cortex-A75—Scaling Up and Improving IPC
As previously mentioned, ARM’s focus has shifted from absolute performance to efficiency and performance per watt, and Cortex-A73 is a product of this shift. At a high level, Cortex-A75 is a three-issue, out-of-order processor with an 11- to 13-stage pipeline. Compared to Cortex-A73, Cortex-A75 widens the decode stage from dual-issue to tri-issue to further improve performance and also expands the back-end resources.
▲Architecture diagram of Cortex-A75.
From an architectural standpoint, the decoders of Cortex-A75 are fundamentally the same as those of Cortex-A73 and can decode the vast majority of instructions in a single cycle. Cortex-A75 can decode up to 3 instructions per cycle, and its dispatch rate rises from the 4 uops/cycle of Cortex-A73 to 6 uops/cycle, a 50% improvement. On the integer side, each issue queue of Cortex-A75 can supply 2 uops, and the ALUs and AGUs each have dedicated resources to improve efficiency, which gives Cortex-A75 an advantage in speculative execution. The peak issue rate of Cortex-A75 rises to 8 uops, well beyond the previous generation.
Additionally, like Cortex-A73, Cortex-A75 lets some simple branch uops bypass the renaming and scheduling stages, removing the latency of those two stages. More complex branch instructions need register access and generate additional operations, including ALU and AGU operations; these pass through renaming and scheduling, which hides some of the extra complexity and improves efficiency.
▲Cortex-A75 is designed for high performance.
In the NEON/FP section, neither Cortex-A73 nor Cortex-A75 has a scheduling stage, although uops are still queued and load is balanced between the queues. However, because they are processed differently, the floating-point path is one or two stages longer than the integer path. In the NEON/FP section, Cortex-A75 can now send over 3 uops/cycle, with each queue capable of “sinking” 2 uops.
Further analysis starts from the instruction side. Cortex-A75 is still a “slot-based microarchitecture,” a design successfully applied in the previous Cortex-A73. However, ARM has not disclosed many details to date. Generally speaking, ARM has designed 8 “slots” to eliminate redundant instruction resource consumption and reduce system power consumption.
In terms of instruction fetch, both Cortex-A73 and Cortex-A75 use a very simple instruction prefetcher and a 64KB, 4-way set-associative L1 instruction cache with a real index scheme, which helps reduce latency. In terms of branch prediction, Cortex-A73 introduced a new branch predictor and a 64-entry micro-BTAC to improve prediction efficiency, along with a static branch predictor and a return stack that holds the return addresses of nested subroutines. Given the excellent performance and power efficiency of Cortex-A73’s branch prediction, ARM has carried it over in full to Cortex-A75, with only minor adjustments in areas such as micro-loop prediction, which may improve IPC slightly.
▲Cortex-A75 uses an improved branch prediction unit.
In fact, ARM has been seeking further improvements in IPC, and the biggest change in Cortex-A75 is the move from the dual-issue design of Cortex-A73 to tri-issue. According to ARM’s data, the IPC of Cortex-A73 is approximately 1.2; it can rise to 1.6 to 1.8 in specific tests and drop to 0.4 to 0.6 in others. A dual-issue design would therefore seem adequate, but greater throughput still pays off. For instance, after a branch misprediction (typically 2 to 4 mispredictions per 1000 branches), the CPU needs to refill the pipeline quickly, and a wider front end does so faster. Of course, the choice between tri-issue and dual-issue depends on multiple factors, including power consumption, area, and performance, and it has knock-on effects elsewhere in the design, but ARM has clearly weighed the trade-offs carefully.
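The refill argument can be illustrated with a back-of-the-envelope model: after a flush, a front end that delivers `width` instructions per cycle needs about `instructions / width` cycles to bring a given number of instructions back in flight. The window size of 24 below is a hypothetical figure chosen for illustration, not an ARM number, and the model ignores cache misses and dependencies.

```python
import math


def refill_cycles(instructions: int, width: int) -> int:
    """Cycles needed to deliver `instructions` at `width` instructions per cycle."""
    return math.ceil(instructions / width)


WINDOW = 24  # hypothetical number of instructions to bring back in flight

for width in (2, 3):
    print(f"{width}-wide front end: {refill_cycles(WINDOW, width)} cycles")
# 2-wide front end: 12 cycles
# 3-wide front end: 8 cycles
```

Under these assumptions, the three-wide design recovers from each misprediction in two thirds of the time, even if the average IPC of the workload is well below 2.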
In the subsequent renaming and scheduling stages, Cortex-A75 and Cortex-A73 are fundamentally similar. Cortex-A75 does not have a reorder buffer or architectural registers; it uses a physical register file to store uop operands, which allows Cortex-A75 to reduce power consumption by limiting the amount of data moved around the CPU and mitigate instruction window bottlenecks caused by using a reorder buffer. Additionally, ARM has strengthened aspects like bypass writing, enhancing core execution order, and optimizing L2 cache misses.
In terms of data and cache, Cortex-A75 makes several adjustments compared to Cortex-A73, such as reworked L1 and L2 data prefetchers, with the stride prefetcher in particular tuned for Cortex-A75. The L1 caches of Cortex-A73 and Cortex-A75 are largely the same, and the main changes are in L2. As previously mentioned, with DynamIQ the traditionally shared L2 cache becomes private to each core. According to ARM’s data, a private L2 cache has approximately 50% lower latency than a shared one; for example, instruction fetch latency drops from the previous 20 to 25 cycles to 11 cycles (for an L1 miss that hits in L2).
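As a quick check on the “approximately 50%” figure, the reduction can be computed directly from the cycle counts quoted above. The function name is an illustrative choice.

```python
def latency_reduction(old_cycles: float, new_cycles: float) -> float:
    """Fractional latency reduction when going from old_cycles to new_cycles."""
    return (old_cycles - new_cycles) / old_cycles


NEW_L2_HIT = 11  # cycles, L1 miss that hits in the private L2

for old in (20, 25):
    print(f"{old} -> {NEW_L2_HIT} cycles: {latency_reduction(old, NEW_L2_HIT):.0%} lower")
# 20 -> 11 cycles: 45% lower
# 25 -> 11 cycles: 56% lower
```

The two ends of the quoted range bracket the 50% claim, so the numbers in the text are consistent with each other.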
▲Cortex-A75’s cache design offers higher bandwidth.
In terms of capacity, L2 can be configured as either 256KB or 512KB. In single-core configurations, 512KB provides about a 2% performance boost over 256KB, and in four-core configurations the boost can reach 4% to 5%. Moreover, the L1 data cache and L2 are fully exclusive, so no data is duplicated between the two, which improves space utilization. Overall, the private L2 cache of Cortex-A75 significantly reduces latency and improves hit rates, and it lets Cortex-A75 use a simpler instruction prefetcher to save power and area. This is the choice ARM arrived at after reevaluating the design.
▲Cortex-A75’s fully exclusive L2 cache design achieves approximately 50% performance improvement over the previous generation.
Finally, let’s look at the important pipeline and execution sections. The ALU/INT pipelines of Cortex-A75 are the same as those of Cortex-A73. Both ALUs can perform basic operations such as addition and shifting, but only one ALU can execute integer multiplication and multiply-accumulate, while the other uses a Radix-16 divider to perform integer division. This means that Cortex-A75 cannot complete two integer multiplications or two divisions in one cycle, but it can execute one integer multiplication or division together with one addition or shift. Most operations complete in one or two cycles, although more complex operations need additional cycles.
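The pairing rule described above can be written down as a small lookup. This is a simplified sketch of the structural constraint only: the operation names are made up for illustration, data dependencies are ignored, and the model also permits a multiply alongside a divide because they sit on different ALUs, a case the text does not state explicitly.

```python
# Capabilities of the two integer ALUs as described in the text.
ALU0_OPS = {"add", "shift", "mul", "mac"}  # the ALU with the multiplier
ALU1_OPS = {"add", "shift", "div"}         # the ALU with the Radix-16 divider


def can_issue_together(op_a: str, op_b: str) -> bool:
    """True if the two operations can be placed on different ALUs in one cycle."""
    return (op_a in ALU0_OPS and op_b in ALU1_OPS) or (
        op_a in ALU1_OPS and op_b in ALU0_OPS
    )


print(can_issue_together("mul", "add"))    # True
print(can_issue_together("div", "shift"))  # True
print(can_issue_together("mul", "mul"))    # False
print(can_issue_together("div", "div"))    # False
```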
▲Cortex-A75’s computing units support more data formats.
On the floating-point side, the two 64-bit NEON/FP pipelines in Cortex-A75 and Cortex-A73 have their own dedicated renaming and 128-bit register files. Each SIMD pipeline can execute 8 8-bit or 4 16-bit integer operations, 2 32-bit integer or single-precision floating-point operations, or 1 64-bit integer or double-precision operation per cycle. It is worth mentioning that since Cortex-A75 has been updated to the ARMv8.2 architecture version, it can also support half-precision FP16.
In summary, the Cortex-A75 architecture achieves ARM’s goals of improving IPC and efficiency through tri-issue decode and internal improvements, while the exclusive L2 cache and newly supported data formats significantly expand the application range of the new core. Furthermore, with the addition of DynamIQ, ARM can offer better-matched core configurations. The enhancements in Cortex-A75 are timely and well judged.
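The per-cycle element counts listed above follow directly from dividing the 64-bit pipeline width by the element size, as this small sketch shows. The function name is illustrative.

```python
PIPE_WIDTH_BITS = 64  # width of each SIMD pipeline, as stated in the text


def lanes(element_bits: int) -> int:
    """Number of elements of the given size processed per pipeline per cycle."""
    if element_bits <= 0 or PIPE_WIDTH_BITS % element_bits:
        raise ValueError("element size must divide the pipeline width")
    return PIPE_WIDTH_BITS // element_bits


for bits in (8, 16, 32, 64):
    print(f"{bits}-bit elements: {lanes(bits)} per cycle")
# 8-bit elements: 8 per cycle
# 16-bit elements: 4 per cycle
# 32-bit elements: 2 per cycle
# 64-bit elements: 1 per cycle
```

By the same arithmetic, half-precision FP16 values would occupy four lanes per pipeline, assuming they are processed at full rate.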
Cortex-A55—More New Features
After looking at Cortex-A75, let’s turn our attention to Cortex-A55. At a high level, Cortex-A55 is still a dual-issue, in-order CPU core with an 8-stage pipeline. According to ARM’s data, an 8-stage pipeline is the optimal depth for products at the Cortex-A55 level, as no significant frequency improvements have been observed in the transitions from 14nm/16nm to 10nm and then to 7nm (most process gains go into reduced area and lower dynamic and leakage power). Keeping the 8-stage pipeline therefore makes sense, and it also means that the frequency of Cortex-A55 will be similar to that of Cortex-A53. Generally speaking, a shallower pipeline lowers the achievable frequency without significantly improving power consumption or area, while a deeper pipeline allows higher frequencies but increases power consumption as frequency rises.
▲Architecture diagram of Cortex-A55.
From the architecture diagram of Cortex-A55, it still employs a dual-issue architecture, capable of decoding most instructions in each cycle. However, the change in Cortex-A55 lies in the shift to independent load and store units that can execute in parallel, rather than the single combined AGU of Cortex-A53. Furthermore, Cortex-A53 already provided impressive throughput, but the problem is that if the instructions or data to be processed are not ready, or if there are incorrect branch predictions, or if the cache misses, the efficiency of Cortex-A53 will significantly decline, and the core may even stall. Thus, maintaining a steady supply of instructions and data is crucial for in-order cores, which is one of the reasons Cortex-A55 has introduced an improved version of the memory subsystem.
On the instruction side, the L1 instruction cache of Cortex-A55 has now been upgraded to a 4-way associative design, rather than the 2-way of Cortex-A53, but still employs a VIPT (Virtual Indexed, Physical Tagged) design, which is often used for L1 caches as it can significantly reduce latency. Additionally, Cortex-A55 has a 15-entry L1 TLB that supports multiple page sizes. The L1 instruction cache can be configured at sizes of 16KB, 32KB, or 64KB, which is similar to Cortex-A53.
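For the three configurable sizes mentioned above, the cache geometry can be worked out as follows. The 64-byte line size is an assumption for illustration, since the text does not state it, and the address split uses a single address for both index and tag, whereas a real VIPT cache takes the index from the virtual address and the tag from the physical address.

```python
LINE_BYTES = 64  # assumption: cache line size, not stated in the text
WAYS = 4         # 4-way set associative, as described above


def geometry(size_kb: int) -> tuple[int, int, int]:
    """Return (number of sets, offset bits, index bits) for a cache size in KB."""
    sets = size_kb * 1024 // (WAYS * LINE_BYTES)
    offset_bits = LINE_BYTES.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr: int, size_kb: int) -> tuple[int, int, int]:
    """Split an address into (tag, set index, byte offset)."""
    sets, offset_bits, index_bits = geometry(size_kb)
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


for size_kb in (16, 32, 64):
    sets, offset_bits, index_bits = geometry(size_kb)
    print(f"{size_kb}KB: {sets} sets, {offset_bits} offset bits, {index_bits} index bits")
# 16KB: 64 sets, 6 offset bits, 6 index bits
# 32KB: 128 sets, 6 offset bits, 7 index bits
# 64KB: 256 sets, 6 offset bits, 8 index bits
```

The latency benefit of VIPT comes from starting the set lookup with the index bits while the TLB translates the address that supplies the tag.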
▲Cortex-A55’s L1 cache significantly enhanced.
Next, regarding branch prediction, typically, a new CPU architecture will employ a new branch predictor, or at least a recalibrated one. The new branch predictor of Cortex-A55 adopts a neural network-based algorithm to improve prediction accuracy and includes loop termination prediction to avoid errors at the end of loops. Additionally, Cortex-A55 has introduced a 0-cycle micro-predictor and an indirect predictor before the major conditional predictors to enhance efficiency in special circumstances.
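ARM has not published the details of this predictor, so the following is only a sketch of the textbook perceptron branch predictor, the best-known “neural” prediction scheme, and should not be read as the Cortex-A55 design. The table size, history length, and threshold formula are illustrative choices taken from the academic literature.

```python
class PerceptronPredictor:
    """Textbook perceptron branch predictor; not ARM's actual implementation."""

    def __init__(self, history_len: int = 8, table_size: int = 64):
        self.table_size = table_size
        # One weight vector per entry: a bias weight plus one weight per history bit.
        self.weights = [[0] * (history_len + 1) for _ in range(table_size)]
        # Global history of recent outcomes: +1 for taken, -1 for not taken.
        self.history = [-1] * history_len
        # Training threshold commonly used for perceptron predictors.
        self.threshold = int(1.93 * history_len + 14)

    def _output(self, pc: int) -> int:
        w = self.weights[pc % self.table_size]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc: int) -> bool:
        """Predict taken when the weighted sum is non-negative."""
        return self._output(pc) >= 0

    def update(self, pc: int, taken: bool) -> None:
        """Train on the real outcome, then shift it into the history."""
        y = self._output(pc)
        t = 1 if taken else -1
        # Adjust weights on a misprediction or when confidence is still low.
        if (y >= 0) != taken or abs(y) <= self.threshold:
            w = self.weights[pc % self.table_size]
            w[0] += t
            for i, hi in enumerate(self.history):
                w[i + 1] += t * hi
        self.history = self.history[1:] + [t]


# A loop branch that is taken seven times and then falls through, repeated.
predictor = PerceptronPredictor()
pattern = [True] * 7 + [False]
mispredictions = 0
for iteration in range(300):
    for taken in pattern:
        if iteration >= 200 and predictor.predict(0x40) != taken:
            mispredictions += 1
        predictor.update(0x40, taken)
print("mispredictions after warm-up:", mispredictions)
```

In this example the loop exit correlates with the oldest bit of the history, so once the weights have settled the predictor also gets the final, not-taken iteration right, which is the kind of loop-termination case mentioned above.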
On the data side, Cortex-A55 uses a brand-new data prefetcher to provide higher bandwidth. The new prefetcher can detect more complex cache-miss patterns and can directly prefetch data from the L1, L2, or L3 caches. ARM believes that such a design can improve the UI performance of mobile devices. As for caching, the L1 data cache remains 4-way set associative, but it is now fully exclusive, meaning its data is not duplicated in the L2 cache. The L1 data cache can be configured as 16KB, 32KB, or 64KB. It also changes from PIPT (Physically Indexed, Physically Tagged) to VIPT (Virtually Indexed, Physically Tagged), which reduces cache latency in many scenarios. Moreover, the L1 data TLB now has 16 entries, up from the previous 10, and ARM has reduced the L1 pointer-chasing latency from 3 cycles to 2 cycles, further lowering latency.
As for the L2 cache, Cortex-A55 is designed to work with DynamIQ, so it also gets a private L2 cache that runs at the same frequency as the core. As with Cortex-A75, latency is reduced by about 50%, from a maximum of 12 cycles to 6 cycles. The L2 cache can be configured as 0KB, 64KB, 128KB, or 256KB; ARM estimates that most users will opt for 128KB, while some applications may choose 256KB. The L2 TLB in Cortex-A55 has grown from 512 entries in Cortex-A53 to 1024. Furthermore, L2 uses a PIPT design, which is simple and power-efficient, and it is 4-way set associative, which is a common choice.
▲Cortex-A55’s L2 cache has also been changed to an exclusive design.
In terms of execution units, the design of Cortex-A55 is fundamentally the same as that of Cortex-A53 (and similar to Cortex-A75). On the integer side, Cortex-A55 has 2 ALUs that can execute addition and shift operations, but only one of them can handle integer multiplication and multiply-accumulate, while the other uses a Radix-16 divider to execute integer division. On the floating-point side, because some users need only integer calculations, the two 64-bit NEON/FP pipelines of Cortex-A55 are optional. They are fundamentally similar to the floating-point section of Cortex-A75, including the supported formats and ARMv8.2 support, which adds FP16 half-precision and INT8 integer calculations.
▲Cortex-A55 significantly strengthens AGU.
▲Cortex-A55’s NEON/FP features an optional design, supporting ARMv8.2.
▲Cortex-A55 also adds a lot of new feature support.
Overall, the improvements in Cortex-A55 mainly focus on branch prediction, data reading and writing (AGU), and caching, with fewer improvements in the execution section. This is mainly due to the small core area and market demand for Cortex-A55. In fact, the specifications of Cortex-A53 are already excellent, and Cortex-A55 only needs simple enhancements.
Performance Preview—Stronger, Faster
As is customary, after each new architecture release, ARM provides some performance previews, and after the release of Cortex-A75 and Cortex-A55, it is no exception. Let’s take a look.
First, let’s compare Cortex-A75 with Cortex-A73. Due to significant differences in microarchitecture, such as the tri-issue and dual-issue differences, it is expected that Cortex-A75 will far outperform Cortex-A73 in many metrics. According to ARM’s data, Cortex-A75 offers approximately a 22% improvement in integer performance, a 33% improvement in floating-point performance, a 16% improvement in memory performance, a 48% improvement in rendering performance, and a 34% improvement in GeekBench overall performance compared to Cortex-A73, essentially winning across the board.
▲Performance comparison of Cortex-A75 against various cores.
In addition to pure performance, ARM has also provided performance data for Cortex-A73 and Cortex-A75 at different power levels. At 750mW, Cortex-A75 outperforms Cortex-A73 by about 20%; at 1W, it is about 25% higher; and at 2W, the lead is 30%. These figures indicate that the advantage of Cortex-A75 grows with the power budget, with larger gains at higher power and frequency. Furthermore, comparing across process nodes, ARM believes that a Cortex-A75 running at 3GHz on a 10nm process can deliver approximately 2.5 times the performance of a 2.1GHz Cortex-A57 on 20nm.
▲Cortex-A75’s performance comparison against Cortex-A73.
▲Cortex-A75’s performance under the same power consumption compared to Cortex-A73.
Next, let’s look at the performance of Cortex-A55. Although the execution section of Cortex-A55 has not changed significantly, its memory performance has been greatly improved, which bears out ARM’s judgment that the bottleneck of Cortex-A53 lay in memory. ARM claims that the memory performance of Cortex-A55 is double that of Cortex-A53, and that this leads to improvements of 18% in integer performance, 38% in floating point, 14% in rendering, and 21% overall. Such gains are not free: compared to Cortex-A53, the power consumption of Cortex-A55 increases by 3%, but taking the performance increase into account, overall energy efficiency improves by 15%. Cortex-A53 already had an excellent performance-to-power ratio, so a further 15% is no small feat.
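The efficiency figure can be sanity-checked by dividing the performance ratio by the power ratio. Which benchmark ARM used for the 15% claim is not stated, so pairing it with the integer result below is an assumption.

```python
def efficiency_gain(perf_gain: float, power_gain: float) -> float:
    """Change in performance per watt, given fractional perf and power increases."""
    return (1 + perf_gain) / (1 + power_gain) - 1


POWER_INCREASE = 0.03  # 3% more power than Cortex-A53, as quoted above

print(f"integer (+18%): {efficiency_gain(0.18, POWER_INCREASE):.1%}")
print(f"overall (+21%): {efficiency_gain(0.21, POWER_INCREASE):.1%}")
# integer (+18%): 14.6%
# overall (+21%): 17.5%
```

The integer result lands almost exactly on the quoted 15%, which suggests that the efficiency claim was based on an integer workload.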
▲Cortex-A55’s performance comparison against Cortex-A53.
▲Cortex-A55’s power consumption slightly rises, but the performance-to-power ratio also significantly improves by 15%.
Cortex-A75 and Cortex-A55—Laying the Foundation for the Future
Having walked through these two new cores, readers should now have a basic understanding of their architectures and performance. From a product perspective, Cortex-A75 and Cortex-A55 are ARM’s most important products for the 10nm era, and their high performance, high performance-to-power ratio, and comprehensive feature support should help ARM succeed across the mobile computing market.
▲ARM aims to seize opportunities in the upcoming AI era by continuously launching new architectures and technologies.
From the perspective of ARM’s product planning, the Sophia family will become the core product line of mobile computing in the 10nm era, with Cortex-A75 replacing the current Cortex-A72 from the Austin family. To optimize the performance-to-power ratio for mobile, ARM has removed some features from Cortex-A75 and Cortex-A55; these are expected to return in future Austin family processors aimed at markets such as automotive and safety-critical devices. The return of Austin family products in the 7nm era should further enhance the performance of mobile computing devices.
As for specific products, it is estimated that companies such as Huawei, Samsung, and Qualcomm have already begun to deeply engage and propose their demands during the architecture development phase. Following the release, it is expected that manufacturers will launch products using Cortex-A75 and Cortex-A55 architectures and DynamIQ as early as the first quarter of 2018 or even the fourth quarter of this year, with actual applications in mobile phones likely not later than the second quarter of next year. By then, we will see what kind of energy the new generation of SoCs with new GPUs and CPU architectures will unleash.