Cortex-X1: The Bigger Arm

Arm targets both bigger and more efficient designs with the Cortex-X1 and Cortex-A78
Linley Gwennap (June 1, 2020)

To help mobile-processor designers build three-tier CPU clusters, Arm has introduced two new high-end cores: the big Cortex-A78 and the even bigger Cortex-X1. The X1 is the first member of a new, more powerful mobile CPU product line; it targets a 30% increase in instructions per clock (IPC) relative to the previous Cortex-A77. To deliver this boost, the microarchitecture adds fetch bandwidth, a larger reorder window, and increases to other key parameters. Two additional Neon units (four in total) double the throughput for AI and other math-intensive computations.

Having delivered a large IPC gain in the previous generation, the A78 focuses on area and power efficiency. Even while trimming area and power, it still achieves a 7% IPC improvement over the A77. The biggest changes are a third load unit and doubled branch-prediction bandwidth; the rest are minor incremental upgrades. The efficiency work extends battery life and improves sustained performance, whereas the X1 targets applications that need short bursts of peak performance.

As Figure 1 shows, Huawei, Qualcomm, and Samsung take different approaches to boosting the performance of the top one or two CPUs in their high-end mobile processors. These chips typically add several standard big cores and four little cores, such as Cortex-A55, extending Arm's big.LITTLE approach to three tiers. We expect the X1 to become the high-performance core in such flagship products, forming the so-called bigger-big-little cluster. Production RTL for the A78 and X1 is already available, and the first processors based on it should enter mass production later this year.

Figure 1 Three-tier CPU design
(Clock frequency, cache size, and microarchitecture can vary across tiers to build the bigger-big-little cluster. Next year's high-end phones are expected to pair Cortex-X1 with Cortex-A78.)
Double Execution

Releasing two new CPUs at once is an achievement, but the close relationship between the two designs simplifies the task. Every Cortex-A78 improvement also appears in Cortex-X1; the X1 then maximizes performance by turning up several knobs, while the A78 stays tuned for efficiency. For example, the X1 enlarges the reorder buffer and the Mop cache and duplicates the existing Neon units. Although these RTL modifications are relatively small, they require significant changes to the schedulers and other control logic, and most validation must be performed separately on each design. The Austin design team put in tremendous effort to develop both products in parallel.

For Cortex-A78, Arm sought opportunities to boost performance with little or no extra area; it even found ways to shrink the die at minimal performance cost. As a result, the overall IPC gain is smaller than in the previous generation, averaging 7% across several benchmarks, as Figure 2 shows. Arm estimates, however, that the die area (including the L1 and L2 caches) is 5% smaller than Cortex-A77's, yielding a 13% gain in performance per unit of area. Part of the area reduction comes from halving the L1 caches from 64KB to 32KB, although customers can choose to retain the larger size.
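
The per-area figure follows directly from those two numbers. Below is a quick back-of-envelope check using Arm's 7% IPC and 5% area values; the division is ours, not Arm's methodology.

    # Back-of-envelope check of Arm's performance-per-area claim for Cortex-A78.
    ipc_gain = 1.07   # A78 IPC relative to A77 at the same clock (Arm's figure)
    rel_area = 0.95   # A78 die area relative to A77, including L1/L2 (Arm's figure)

    perf_per_area = ipc_gain / rel_area
    print(f"Performance per unit area vs A77: {perf_per_area:.2f}x")   # ~1.13x, i.e. 13%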

The leaner design also cuts power by 4% at the same clock frequency, improving energy efficiency by 11%; given the limited thermal capacity of a smartphone, that translates to 11% more performance at the same power. The advantage grows when the two CPUs run a fixed task: to deliver the same performance, the A78 can operate at a lower clock frequency and voltage than the A77, reducing power consumption by 36%.
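
The 11% efficiency figure is simply the ratio of those two numbers, as the sketch below shows. The 36% figure for a fixed task additionally depends on how far voltage drops at the lower clock, which Arm hasn't published, so it can't be reproduced from the disclosed data alone.

    # Iso-frequency efficiency check for Cortex-A78 vs A77 (Arm's figures).
    ipc_gain  = 1.07   # performance at the same clock
    rel_power = 0.96   # power at the same clock (4% lower)

    perf_per_watt = ipc_gain / rel_power
    print(f"Performance per watt vs A77: {perf_per_watt:.2f}x")   # ~1.11x, i.e. 11%
    # The 36% saving for a fixed task further assumes a lower clock and supply
    # voltage; the exact value depends on the unpublished V/f curve.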

The X1 achieves its 30% gain on SPECint2006 and SPECfp2006, although the improvements on Stream (a memory test) and Octane are somewhat smaller, as Figure 2 shows. For this design, the company's goal was to maximize performance with little regard for die area or power, since customers will typically implement only one or two X1 cores per cluster, and those cores will usually run for only short periods, minimizing the impact on cooling and battery life.

Figure 2 Performance, Power, and Area (PPA) Comparison
(Cortex-X1 delivers the largest performance gain, while Cortex-A78 is more efficient in power and die area. The comparison assumes the same clock frequency and IC process. The A77 configuration integrates 64KB L1, 512KB L2, and 4MB L3 caches; the A78 integrates a smaller 32KB L1 cache; and the X1 integrates a larger 8MB L3. *SPEC2006. (Data source: Arm, †The Linley Group estimates))

The performance comes at a significant cost, however. Arm declined to provide die-area and power figures for the X1, although it typically offers such data for its other cores. Given that the A78 omits the new X1 features, those additions clearly hurt rather than help efficiency. We expect the X1's 30% performance gain over the A77 to cost roughly 50% more area and power, mainly because of the two extra Neon units and the other microarchitectural extensions. As a result, the new core's efficiency (performance per watt and performance per square millimeter) is lower than that of the A78 or other mainstream Cortex-A designs.
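
The efficiency conclusion follows from those figures. The arithmetic below combines Arm's 30% performance claim with our 50% area-and-power estimate; if the true overhead is smaller, the efficiency penalty shrinks accordingly.

    # Rough efficiency estimate for Cortex-X1 vs Cortex-A77.
    perf_gain = 1.30   # Arm's claimed SPEC2006 gain over A77
    cost_gain = 1.50   # our estimated increase in both area and power

    efficiency = perf_gain / cost_gain
    print(f"Perf per watt (or per mm2) vs A77: {efficiency:.2f}x")   # ~0.87x
    # By this estimate the X1 is about 13% less efficient than the A77 and
    # trails the A78, which improves on the A77 by roughly 11%.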

Both cores use the same pipeline as the A77, which operates at up to 2.8GHz in a 7nm process. Arm expects the A78 and X1 to reach the same frequencies, hitting 3GHz in a 5nm process. At 2.1GHz, a 5nm A78 consumes 1W while achieving a SPECint2006 score of 30.

Cache Savings

At a high level, Cortex-A78 is similar to its predecessor, offering the same fetch and decode widths as well as the same function units (see MPR 5/27/19, "Cortex-A77 Improves IPC"). Whereas the A77 always has 64KB instruction and data caches, however, the A78 can instead implement 32KB caches. These caches consume a sizable portion of the CPU area, and Arm determined that shrinking them to 32KB incurs a modest performance loss in exchange for a worthwhile area saving. Processors that combine Cortex-X1 and A78 cores can use the 32KB configuration, since the X1 handles the heaviest tasks; when the A78 is the system's highest-performance core, the 64KB caches may be more appropriate. For maximum performance, the X1 uses 64KB L1 caches and can have up to a 1MB L2 cache.

The X1 fetches 20 bytes per cycle from the instruction cache, four bytes more than the A78 or A77, and as Figure 3 shows, it adds a fifth instruction decoder so it can fetch and decode five instructions per cycle. The decoded instructions (macro-operations, or Mops) can issue directly but also flow into the Mop cache, a feature that first appeared in the A77. The X1 doubles the Mop cache to 3,072 entries and raises its output rate from six to eight Mops per cycle. The larger cache achieves hit rates above 90% on SPECint2006 and many other programs, exceeding the A77's 85%, so the CPU can issue eight Mops per cycle most of the time. The Mop cache improves performance by raising the issue rate, and it saves power by allowing the multicycle decoders to shut down.
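
To illustrate why the hit rate matters, the crude blended-rate model below mixes the Mop-cache and decoder widths by hit rate. The hit rates and widths come from the figures above, but the averaging itself is our simplification; it ignores branch redirects, refill latency, and back-end stalls.

    # Simple blended-rate illustration of average front-end Mop supply (our model).
    def avg_mop_supply(hit_rate, mop_cache_width, decode_width):
        # Weight the two supply rates by the Mop-cache hit rate.
        return hit_rate * mop_cache_width + (1 - hit_rate) * decode_width

    print(f"X1  (~90% hits, 8-wide cache, 5-wide decode): {avg_mop_supply(0.90, 8, 5):.1f} Mops/cycle")
    print(f"A77 (~85% hits, 6-wide cache, 4-wide decode): {avg_mop_supply(0.85, 6, 4):.1f} Mops/cycle")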

Figure 3 Cortex-X1 Microarchitecture
(To boost performance, Arm made extensive changes relative to Cortex-A77 (red sections), including a larger macro-operation (Mop) cache, greater fetch and decode bandwidth, a new address-generation unit (AGU), and twice as many Neon units. *Includes FDIV and IMUL units.)

The new branch predictor can process two branches per cycle, double the capability of previous designs. It can make a prediction for the first fetch group and, if that group contains a predicted-taken branch, make another prediction at the new target address, all within one cycle. The change is particularly important for the X1, which can pull eight instructions from the Mop cache per cycle: because typical code averages about one branch per four instructions, an eight-instruction group often contains two branches. Even the A78 encounters this situation when fetching six Mops from its Mop cache.
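
The arithmetic is straightforward, as the sketch below shows; the one-branch-in-four density is only an average, and real code varies widely around it.

    # Expected branches per fetch group, assuming an average of one branch per
    # four instructions (a typical figure for integer code; workload dependent).
    branch_density = 1 / 4

    for group_width in (8, 6):   # X1 Mop-cache group; A78 Mop-cache group
        print(f"{group_width}-wide group: ~{group_width * branch_density:.1f} branches")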

For this generation, Arm also made subtle changes to the branch-prediction algorithms to improve accuracy. The X1 additionally enlarges the L0 branch-target buffer by 50% to eliminate certain branch bubbles. The A78, by contrast, shrinks some branch-prediction structures, accepting a small performance loss in exchange for the area savings; the algorithmic improvements help offset the lost prediction accuracy.

The Benefits of Packing

To boost throughput on ordinary integer code, Cortex-X1 adds a third address-generation unit (AGU), allowing three load operations per cycle; the design retains two store ports, so the new AGU handles only loads. Two additional Neon units handle SIMD and floating-point instructions, so the CPU can execute four 128-bit operations per cycle, equivalent to a 512-bit SIMD engine.

Both new CPUs implement a feature called register packing. Although Armv8 is a 64-bit architecture, it retains many 32-bit instructions that programs still use, and these instructions ignore the upper 32 bits of each register, so storing their results in full 64-bit physical registers wastes space. The new designs can therefore divide entries of the physical register file (PRF) into 32-bit or even 16-bit slices. For example, when executing a 32-bit load (LDR) to W3, the rename table can map W3 to half of a 64-bit physical register, leaving the other 32 bits available for another value. The PRF can thus hold more values without growing in size.
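
To make the idea concrete, here is a toy sketch of how a rename table might hand out half-entries. It is purely illustrative: Arm hasn't disclosed its actual rename structures, and the class and field names below are our invention.

    # Toy model of register packing: two 32-bit results share one 64-bit
    # physical register, so the renamer tracks more values per PRF entry.
    class PackedPRF:
        def __init__(self, num_entries):
            # Track each physical register as two independently allocatable halves.
            self.free_halves = [(i, half) for i in range(num_entries)
                                for half in ("lo", "hi")]
            self.rename_table = {}   # architectural register -> (entry, half)

        def alloc32(self, arch_reg):
            """Map a 32-bit destination (e.g., the W3 of a 32-bit LDR) to a half-entry."""
            slot = self.free_halves.pop(0)
            self.rename_table[arch_reg] = slot
            return slot

    prf = PackedPRF(num_entries=4)
    print(prf.alloc32("W3"))   # (0, 'lo')
    print(prf.alloc32("W5"))   # (0, 'hi'): two values share one physical register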

The new reorder buffer (ROB) takes a similar approach. The ROB does not store complete instructions, but each entry holds a set of status bits for an in-flight instruction. Different instruction types need different status bits, and in previous designs each ROB entry had to be wide enough for the most complex instructions. The new design is more efficient, storing only the bits each instruction actually needs; as a result, the ROB can sometimes pack two simple instructions into a single entry. The benefit depends on the instruction mix, but Arm estimates a 20% capacity gain for most programs.


Cortex-A78 retains its predecessor's 160 ROB entries, but the X1 expands the structure to 224 entries, providing about 270 effective entries given the 20% packing gain. The larger ROB exposes more instruction-level parallelism and helps feed the higher fetch and execution rates.
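
The effective-entry figure is simply the product of the physical size and Arm's packing estimate, as the check below shows; the actual gain varies with the instruction mix.

    # Effective reorder-buffer capacity with entry packing (Arm's 20% estimate).
    rob_entries  = 224    # physical ROB entries in Cortex-X1
    packing_gain = 1.20   # average capacity gain from packing simple instructions

    print(f"Effective ROB entries: ~{rob_entries * packing_gain:.0f}")   # ~269, i.e. about 270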

Both designs double the store bandwidth into the data cache to 32 bytes per cycle. The Cortex-A77 has two store ports but can execute at most one 128-bit store (or two 64-bit stores) per cycle; the doubled bandwidth allows two 128-bit stores. This change is especially important for the X1, given its two additional Neon units. The X1's data cache can also service three 128-bit loads per cycle to feed the third AGU, and the bandwidth between the L2 and the data cache is doubled as well. To further boost performance, the X1 enlarges several load and store queues by 33% and grows the L2 TLB by 67% to 2,048 entries; the A78 instead trims the L2 TLB to save area.

Both cores are compatible with Arm's DynamIQ architecture, which lets designers mix different Cortex CPUs within a single coherent cluster. Figure 4 shows a three-tier cluster containing one X1, three A78s, and four A55s. All eight cores connect to the DynamIQ shared unit (DSU), which provides coherence and a common L3 cache. The new DSU supports up to 8MB of L3 cache; the A78 supports up to 512KB of L2 cache, while the X1 supports up to 1MB.

Figure 4 Next-Generation DynamIQ CPU Cluster
(This octa-core design integrates a single Cortex-X1, three Cortex-A78s, and four Cortex-A55s. The blocks are drawn roughly in proportion to die area.)
Struck by Lightning

The only mobile Arm CPU that rivals Cortex-X1 is Apple's custom design. The latest, code-named Lightning, appears in the company's A13 SoC, which began shipping in the iPhone 11 last September. Although Apple conceals its microarchitectural details, Lightning reportedly can decode seven instructions per cycle and has six integer ALUs; as Table 1 shows, both figures are well beyond the X1's, giving Lightning an edge on ordinary integer code. The Apple CPU, however, has only two load/store units (lacking the X1's third load unit) and three Neon units (one fewer), putting it at a slight disadvantage on SIMD- and FP-intensive code. Its two Lightning CPUs share a massive 8MB cache that Apple calls an L2 but that functions much like Arm's shared L3.

In single-core benchmarks such as Geekbench 5, the A13 easily beats the fastest Cortex-A77 designs, scoring 1,328 in the iPhone 11 versus 839 for the Galaxy S20 (Snapdragon 865). On the basis of Arm's 30% claim, an X1 running at 2.8GHz should score around 1,100, narrowing the gap but still falling short. It should run somewhat faster in 5nm, but by the time X1 products ship, Apple will have its own 5nm processors.
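
Our 1,100 estimate comes from simply scaling the published Snapdragon 865 result by Arm's 30% claim, as shown below; the projection ignores any clock-speed or memory-system differences between the two designs.

    # Rough Geekbench 5 single-core projection for a 2.8GHz Cortex-X1.
    a77_score = 839     # Snapdragon 865 (Cortex-A77) in the Galaxy S20 (published)
    a13_score = 1328    # Apple A13 (Lightning) in the iPhone 11 (published)

    x1_estimate = a77_score * 1.30   # apply Arm's 30% claim; ignores clock/memory deltas
    print(f"Estimated X1 score: ~{x1_estimate:.0f}")   # ~1,091, i.e. about 1,100
    print(f"Apple's remaining lead: ~{(a13_score / x1_estimate - 1) * 100:.0f}%")   # ~22%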

Apple achieves this lead by devoting more silicon area to its CPUs than other Arm licensees do. By our measurements, Lightning occupies about 4.5 square millimeters, including its large L2 cache. Comparing that figure directly with the X1 is unfair, however, because of the cache-size difference: normalized to 512KB of L2, Lightning would be about 3.0mm2, still larger than the X1 (estimated at 2.1mm2 in 7nm). In performance per unit of area, the X1 falls between the A77 and Lightning.

Table 1 also includes Intel's newest Sunny Cove CPU, which appears in the Ice Lake processors for tablets and laptops (see MPR 2/4/19, "Intel's Sunny Cove Sits on an Icy Lake"). The X1 is comparable to the x86 design in many respects, even though it has fewer ROB entries. On Geekbench, the X1 should outscore Sunny Cove when the Intel CPU is limited to 3.8GHz, its maximum speed in mobile systems. Moreover, when configured with the same L2 size, the Arm design occupies about half the die area of Sunny Cove (the transistor density of Intel's 10nm process is roughly equivalent to that of TSMC's 7nm). Sunny Cove, however, offers several PC-oriented features that the X1 lacks, including multithreading, clock speeds above 5GHz, and more than three times the SIMD performance; these features, along with x86 compatibility, account for its larger die area.

Table 1 CPU Microarchitecture Comparison
(Cortex-X1 is a big step beyond Cortex-A77 but still trails Apple's largest CPU. *Tablet design; †single-core. (Source: vendors, except †geekbench.com, ‡The Linley Group estimates, and §die-photo analysis))

Pricing and Availability
Cortex-A78 and Cortex-X1 RTL has been available to lead customers since last year, and the first chips implementing these designs will enter mass production later this year. Arm doesn't disclose its license fees, but it says the X1 is available only to a few select customers that prepay for core development. For more information on the A78, access developer.arm.com/ip-products/processors/cortex-a/cortex-a78; for more information on the X1, access community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/arm-cortex-x-custom-program.

Now We All Come Together

Since Apple introduced its first "bigger" CPU in the A9 processor in 2015, it has consistently led other smartphone vendors in single-core performance. That performance matters because most mobile applications use only one or two cores at a time, and even then usually only for a few seconds, such as when opening a web page. Samsung (Apple's biggest smartphone competitor) and Qualcomm both developed custom CPUs but were unable to match Apple's performance. Qualcomm then shifted to Arm's Built on Cortex (BoC) program, releasing modified Cortex designs under the Kryo brand, but recently it has made only nominal customizations (see MPR 12/23/19, "Snapdragon 865 Dis-Integrates"). Qualcomm brands the Cortex-A77 as the Kryo 585, a confusing label that downplays the Cortex name.

The Cortex-X series addresses these issues by pushing performance well beyond mainstream Cortex-A products and giving licensees a unified brand behind which to challenge Apple's dominance. The program persuaded Samsung to halt its custom-CPU development and cancel its M6 design (see MPR 12/2/19, "An Epoch Conclusion"). Arm promotes Cortex-X as a "custom program," perhaps because it replaces the semi-custom BoC work, but Cortex-X1 is a standard CPU that licensees can configure like any other Arm core. The company says the program is open only to select customers, presumably including Huawei, Qualcomm, Samsung, and MediaTek, and future X-series designs may offer customer-specific features.

Although the X1 represents a leap in performance, it still misses the target: we expect Apple's A13 to score about 20% higher than the X1 on Geekbench, and by the time X1-based phones ship, the next iPhone will bring an even more powerful microarchitecture. Closing the gap will require more-complex microarchitectures in future Cortex-X designs, and that takes time. Fortunately, Arm is already developing a series of faster cores for its Neoverse line, aimed mainly at high-performance embedded and server applications, and that work can be leveraged for Cortex-X (and vice versa).

The new product line grabs the headlines, but Cortex-A78 will be the mainstay of 2021 smartphones: it will appear alongside the X1 in some high-end products and serve as the top CPU in midrange and upper-midrange designs. Because the X1 is optimized for peak performance, Arm tuned the A78 to maximize performance per watt, extending battery life and delivering better sustained performance. The Cortex-X1, meanwhile, consolidates high-end mobile-CPU development, helping Android phones close the gap with Apple's powerful cores.
