Unveiling ARM’s Cortex-A78, X1, and Mali-G78: A Comprehensive Analysis



In May 2019, ARM released the Cortex-A77 CPU and Mali-G77 GPU designs (strictly speaking, these are IP cores licensed to chip vendors), and the Dimensity 1000+, which has just entered mass production, is the first flagship 5G SoC to adopt that IP combination.
Last night, ARM officially launched its next-generation IP, a set of “Three Musketeers”: Cortex-X1, Cortex-A78, and Mali-G78. Starting with the Kirin 1000 expected this September, future 5G SoCs will benefit from them and are expected to further narrow the performance gap with Apple’s A-series SoCs.
So, what are the features of ARM’s new generation of “Three Musketeers”?
Cortex-A78: Regular Iteration and Updates
Currently, 5G SoCs such as the Snapdragon 865, Dimensity 1000, and Exynos 980 all use Cortex-A77 as the CPU “big core”, which is where much of their computing power comes from.
As the successor to Cortex-A77, Cortex-A78 brings no fundamental changes. Cortex-A76, A77, and A78 all belong to the same Austin microarchitecture family, and the three generations of cores share much of their design.
According to ARM, this continuity lets chip suppliers (such as Qualcomm and MediaTek) upgrade the CPU IP in their SoCs without much effort or cost, shortening the development cycle.
So do not expect too much from Cortex-A78’s performance. ARM’s official data shows that, compared with A77, A78’s IPC (instructions per clock, i.e. per-clock architectural performance) rises by only 7%, power consumption falls by 4%, core area shrinks by 5%, and the area of a quad-core cluster shrinks by 15%.
Fortunately, Cortex-A78 will be paired with the latest 5nm process, which inherently offers better energy efficiency.
Today, a single “big core” in an SoC draws about 1W at full load. Within that budget, a Cortex-A77 built on a 7nm process can run at 2.6GHz, while a Cortex-A78 built on a 5nm process can reach 3GHz, which ARM equates to a 20% performance increase at the same power.
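As a rough sanity check on that 20% figure, the sketch below assumes performance scales as IPC × clock frequency. That is a first-order simplification (memory latency and thermal limits are ignored), and the function name is purely illustrative. It uses the 7% IPC gain and the 2.6GHz and 3GHz clocks quoted above.

```python
def relative_performance(old_ghz: float, new_ghz: float, ipc_gain: float) -> float:
    """Speed-up under the assumed model: performance = IPC x clock frequency."""
    return (new_ghz / old_ghz) * (1.0 + ipc_gain)


# Figures quoted in the article: A77 at 2.6GHz (7nm), A78 at 3GHz (5nm), +7% IPC.
clock_only = relative_performance(2.6, 3.0, ipc_gain=0.0)
clock_and_ipc = relative_performance(2.6, 3.0, ipc_gain=0.07)

print(f"clock alone:   +{(clock_only - 1) * 100:.1f}%")
print(f"clock and IPC: +{(clock_and_ipc - 1) * 100:.1f}%")
```

Under this simple model the clock increase alone is worth about 15%, and clock plus IPC about 23%, so the 20% figure sits inside the range the model would suggest.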
Conversely, at the same performance level, a 2.1GHz Cortex-A78 on 5nm consumes 50% less power than a 2.3GHz Cortex-A77 on 7nm, which helps extend the battery life of 5G phones.
To be honest, this comparison is confusing and not quite fair, because it mixes architecture gains with process gains. If Cortex-A77 were also built on 5nm, its performance would improve significantly and its power consumption would drop as well.
And if Cortex-A78 were built on 7nm, its performance and power consumption would not necessarily beat Cortex-A77 by much.
Still, pairing a new architecture with a new process is how the technology advances, it is the most economical route, and it makes for good publicity. So let’s not take it too seriously.
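To put the equal-performance comparison in numbers, here is a small sketch under two assumptions: performance is IPC × clock, and “same performance” means exactly equal throughput. The helper names are hypothetical.

```python
def implied_per_clock_gain(old_ghz: float, new_ghz: float) -> float:
    """Per-clock gain needed for the slower-clocked core to match the faster one."""
    return old_ghz / new_ghz - 1.0


def perf_per_watt_ratio(power_saving: float) -> float:
    """Efficiency ratio when performance is equal and power drops by `power_saving`."""
    return 1.0 / (1.0 - power_saving)


# Article figures: 2.3GHz A77 (7nm) vs 2.1GHz A78 (5nm), 50% less power.
per_clock = implied_per_clock_gain(2.3, 2.1)
efficiency = perf_per_watt_ratio(0.50)

print(f"implied per-clock gain: {per_clock * 100:.1f}%")
print(f"performance per watt:   {efficiency:.1f}x")
```

The implied per-clock gain comes out at roughly 9.5%, close to the 7% IPC figure, while performance per watt doubles. Under these assumptions that suggests most of the efficiency gain comes from the process and the lower operating point rather than from the architecture.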
Cortex-X1: The Conclusion of Self-Research
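Starting with the iPhone 5, Apple’s A-series processors embarked on a “self-research” path, which is why almost every generation of iPhone has been able to outperform the processors in Android phones.
So-called “self-research” means buying ARM’s highest-tier license, the instruction set (architecture) license, and then designing an ARM-compatible microarchitecture to suit one’s own needs. How far it can pull ahead of ARM’s stock Cortex-A designs depends entirely on the chip vendor’s engineering ability.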
Qualcomm used its self-developed Krait architecture in the Snapdragon 600/800 era, and the later Snapdragon 820 used the self-developed Kryo. However, Qualcomm found it hard for a self-developed architecture to beat the stock Cortex-A designs on energy efficiency by a meaningful margin, so the effort was not economical. Starting with the Snapdragon 835, it therefore adopted the BoC (Built on Cortex) strategy, colloquially known as “magic modification”: a customized design based on an existing stock Cortex-A core.
Starting with the Kirin 980, Huawei adopted a similar approach: its big cores are “based on” the Cortex-A architecture, which is also a form of magic modification. It is worth noting that the stock Cortex-A designs leave little room for such modification. In most cases vendors mainly adjust the cache configuration, so for both Qualcomm and Kirin the performance difference between the modified cores and the stock architecture is small, and the key factor remains clock frequency.
Samsung joined the self-research camp with the Exynos 8890, launching a custom core called Mongoose. After four generations of in-house development, however, Samsung decided at the end of 2019 to abandon Mongoose and disband the entire R&D team in Austin, Texas, and will rely fully on ARM’s designs in the future.
Evidently, apart from Apple, the self-research path has been difficult and thankless for other chip makers.
The good news is that ARM’s “Three Musketeers” include Cortex-X1, an IP core that chip makers can customize heavily, offering an alternative to the arduous path of “self-research”.
According to the architectural details released by ARM, Cortex-X1 and Cortex-A78 both implement the ARMv8.2 instruction set and are fully compatible with each other, but Cortex-X1 is a customizable CPU core with wider resources. Its decode bandwidth grows from 4 instructions per cycle to 5, an increase of 25%. Its NEON floating-point units grow from 2 × 128-bit pipelines to 4 × 128-bit, doubling peak floating-point throughput. As for cache, Cortex-X1 supports a 64KB L1, a 1MB L2, and up to 8MB of L3, twice that of Cortex-A78.
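The ratios in this paragraph can be checked with a small sketch. The `CoreSpec` class and `uplift_percent` helper are illustrative names, and the 4MB L3 figure for Cortex-A78 is an assumption derived from the statement that 8MB is twice the A78 value. These are peak resource ratios, not measured speed-ups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreSpec:
    name: str
    decode_width: int  # instructions decoded per cycle
    neon_pipes: int    # number of 128-bit NEON pipelines
    max_l3_mb: int     # largest supported L3 cache, in MB


# Values from the article; the 4MB A78 figure follows from "8MB is twice the A78".
A78 = CoreSpec("Cortex-A78", decode_width=4, neon_pipes=2, max_l3_mb=4)
X1 = CoreSpec("Cortex-X1", decode_width=5, neon_pipes=4, max_l3_mb=8)


def uplift_percent(new: float, old: float) -> float:
    """Percentage increase of `new` over `old`."""
    return (new / old - 1.0) * 100.0


decode_gain = uplift_percent(X1.decode_width, A78.decode_width)
neon_gain = uplift_percent(X1.neon_pipes * 128, A78.neon_pipes * 128)
l3_gain = uplift_percent(X1.max_l3_mb, A78.max_l3_mb)

print(f"decode: +{decode_gain:.0f}%, NEON width: +{neon_gain:.0f}%, L3: +{l3_gain:.0f}%")
```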
Thanks to these improvements, Cortex-X1 delivers 30% higher single-core performance than the previous-generation A77, and its AI (machine learning) performance rises by 100%.
According to ARM’s plan, Cortex-X1 will serve as the “super big core” in future flagship 5G SoCs, with Cortex-A78 as the ordinary “big core” and Cortex-A55 as the little core, forming a “1+3+4” tri-cluster DynamIQ configuration that balances performance and power consumption.
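For readers who prefer to see the layout spelled out, the “1+3+4” arrangement can be written down as a simple data structure. The `Cluster` type is a hypothetical illustration, not anything defined by ARM.

```python
from typing import NamedTuple


class Cluster(NamedTuple):
    core: str
    count: int
    role: str


# The "1+3+4" layout described in the article.
LAYOUT = [
    Cluster("Cortex-X1", 1, "super big core"),
    Cluster("Cortex-A78", 3, "big core"),
    Cluster("Cortex-A55", 4, "little core"),
]

label = "+".join(str(c.count) for c in LAYOUT)
total_cores = sum(c.count for c in LAYOUT)

print(f"{label} = {total_cores} cores")
for c in LAYOUT:
    print(f"  {c.count} x {c.core} ({c.role})")
```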
The only regret is that the Cortex-X1 core occupies more die area. ARM’s data shows that 4 Cortex-A78 cores paired with 4MB of L3 cache deliver 20% more performance than the previous-generation A77 while the cluster area shrinks by 15%. With 1 Cortex-X1 + 3 Cortex-A78 paired with 8MB of L3 cache, the area grows by 15%, but peak performance improves by 30%.
In other words, at the cluster level Cortex-X1 adds only about 10 percentage points of peak performance over an all-A78 design, which doesn’t seem like much?
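The trade-off between the two cluster options can be worked through as follows. This is a sketch that assumes both sets of figures are measured against the same quad-core A77 cluster, normalized here to performance 1.0 and area 1.0. The dictionary layout is illustrative.

```python
# Cluster-level figures quoted in the article, both relative to a quad A77 cluster.
# Assumption: that baseline is normalized to performance 1.0 and area 1.0.
CONFIGS = {
    "4x A78 + 4MB L3": {"perf": 1.20, "area": 0.85},
    "1x X1 + 3x A78 + 8MB L3": {"perf": 1.30, "area": 1.15},
}

a78 = CONFIGS["4x A78 + 4MB L3"]
x1 = CONFIGS["1x X1 + 3x A78 + 8MB L3"]

point_gap = (x1["perf"] - a78["perf"]) * 100          # gap in percentage points
relative_gain = (x1["perf"] / a78["perf"] - 1) * 100  # X1 cluster vs A78 cluster
area_growth = (x1["area"] / a78["area"] - 1) * 100    # X1 cluster vs A78 cluster

print(f"gap: {point_gap:.0f} points, relative gain: {relative_gain:.1f}%")
print(f"area growth: {area_growth:.1f}%")
for name, cfg in CONFIGS.items():
    print(f"{name}: perf per unit area = {cfg['perf'] / cfg['area']:.2f}")
```

On these numbers the X1 configuration is about 8% faster than the all-A78 cluster at peak while being roughly 35% larger, so its performance per unit area is noticeably lower.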
Mali-G78: A Surge in Compute Units
In the Android world, ARM’s stock Mali GPUs are already dominant, while former rival PowerVR has been marginalized. The arrival of the new-generation Mali-G78 will further consolidate ARM’s leading position in this field.
Perhaps for lack of serious competitive pressure, Mali-G78 keeps the Valhall graphics architecture introduced with Mali-G77. It does, however, rework the global clock domain into a new two-level structure that decouples the clock of the shared top-level GPU blocks from that of the shader cores, i.e. the two are now asynchronous clock domains. The shader cores can therefore run at a different frequency from the rest of the GPU, which addresses the imbalance between geometry throughput and the compute and texture engines. The domains can also run at different voltages, reducing power consumption and improving energy efficiency. This is common practice in desktop CPUs and GPUs.
In addition, Mali-G78 has a completely rewritten FMA (fused multiply-add) engine, with new multiplier and adder designs and reworked FP32/FP16 floating-point paths, which can cut power consumption by 30%.
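Why separate clock and voltage domains save power can be illustrated with the textbook approximation for dynamic power, P ≈ C × V² × f. The sketch below is a toy model only: the capacitance split, voltages, and frequencies are made-up values, not Mali-G78 specifications.

```python
def dynamic_power(cap_share: float, volts: float, mhz: float) -> float:
    """Dynamic power in arbitrary units, using the textbook P ~ C * V^2 * f."""
    return cap_share * volts**2 * mhz


# Hypothetical split of switched capacitance between the two domains.
SHADER_SHARE, SHARED_SHARE = 0.8, 0.2

# Single clock domain: everything runs at the shader cores' operating point.
single = dynamic_power(SHADER_SHARE + SHARED_SHARE, volts=0.80, mhz=800)

# Two asynchronous domains: the shared top level runs slower and at lower voltage.
decoupled = (
    dynamic_power(SHADER_SHARE, volts=0.80, mhz=800)
    + dynamic_power(SHARED_SHARE, volts=0.70, mhz=600)
)

saving = 1.0 - decoupled / single
print(f"single domain: {single:.1f}, decoupled: {decoupled:.1f}, saving: {saving * 100:.1f}%")
```

Because voltage enters squared, even a modest voltage drop on the part of the chip that does not need the full clock reduces total power, which is the intuition behind the two-level design.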
The Mali-G77 could be configured with up to 16 compute units (Mali-G77 MC16), but because of cost, heat, and power constraints, even the most aggressive Exynos 990 used only 11 (Mali-G77 MC11), while the Dimensity 1000+ uses Mali-G77 MC9.
This time, Mali-G78 supports up to 24 compute units, a 50% increase over its predecessor. For the same reasons, however, even on the latest 5nm process the largest commercial configuration will probably be around 16 units, since anything larger would be hard to cool in a smartphone.
According to ARM’s data, thanks to improvements in architecture, process, and other areas, Mali-G78 can deliver up to 25% more performance than Mali-G77. Even on the same process, performance improves by 15%, energy efficiency by 10%, and machine learning performance by 15%.
That looks pretty good.
ARM has also launched the Mali-G68 GPU to fill the gap between the Mali-G7 and Mali-G5 series. Judging from the available data, Mali-G68 is identical to Mali-G78 in architecture and features, but it can be configured with at most 6 compute units.
In other words, a Mali-G78 design with 1 to 6 compute units is called Mali-G68, while one with more than 6 compute units is called Mali-G78.
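The naming rule can be captured in a few lines. This sketch assumes that configurations with more than 6 compute units carry the Mali-G78 name (since Mali-G68 is described as the same design capped at 6 units) and that 24 is the upper limit. The function name is hypothetical.

```python
MAX_UNITS = 24      # largest Mali-G78 configuration mentioned in the article
G68_MAX_UNITS = 6   # largest configuration sold as Mali-G68


def mali_product_name(compute_units: int) -> str:
    """Map a compute-unit count to the product name, per the rule in the article."""
    if not 1 <= compute_units <= MAX_UNITS:
        raise ValueError(f"compute_units must be between 1 and {MAX_UNITS}")
    return "Mali-G68" if compute_units <= G68_MAX_UNITS else "Mali-G78"


for units in (4, 6, 7, 16, 24):
    print(f"{units:>2} compute units -> {mali_product_name(units)}")
```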
The Kirin 1000 series, due in September, is expected to be the first 5G SoC to feature Cortex-A78 and Mali-G78, but whether it will use Cortex-X1 is still unknown. The Snapdragon 875, Dimensity 2000, and Exynos 1000 series launching next year will also use at least one member of the “Three Musketeers”. As for how much real-world performance will improve over today’s flagships, we will have to wait and see.
