Editor’s Note: This is an old article from several years ago that I came across today and found it to be quite insightful, so I would like to share it with everyone.
When it comes to domestic CPU products like Loongson, the biggest controversy has always been performance. Although Loongson's early products did not satisfy everyone, with the disclosure of the new GS464E microarchitecture and the official launch of the new-generation "3B2000" CPU built on it, outside observers have begun to eagerly anticipate how its performance stacks up against mainstream Intel products.
According to earlier test data provided by the "Journal of Information Science", an early version of the GS464E was already close to mainstream Intel and AMD architectures in performance, roughly matching the Intel Core i3-550 and AMD FX-8320 and slightly trailing the Intel Core i5-2300, while clearly outperforming low-power designs such as Intel Atom, VIA Nano, and ARM Cortex-A57.
Recently, the June issue of "Microcomputer" magazine carried out a more in-depth analysis. So how far has Loongson's performance actually come? How does its design level compare with international competitors? What considerations led to the choice of the MIPS architecture, and why is today's Loongson not based on the currently popular ARM architecture? This article provides a professional and detailed analysis of these long-debated questions.
In Black and White, Who’s Right Today?
Frankly speaking, it is no news that Loongson has been under siege in public opinion in recent years. Earlier this year, an article titled "How Good is Domestic Loongson?" stirred up a storm online, bluntly claiming that the Loongson 3B1500, billed as a processor developed for high-performance servers, performs worse than the current ARM Cortex-A57 mobile core.
GS464E Instruction Fetch Unit
Without resorting to conspiracy theories, I believe such drastic swings in public sentiment point to one fact: the outside world does not really understand Loongson's current situation. The general public has no way to infer the quality of its structural design from the dense papers published by the Loongson team, nor does it know the compilers, supporting software, and benchmark programs involved and their quirks, so performance comparisons are often biased.
For example, the Loongson 3B1500, criticized for being inferior to the Cortex-A57, was taped out in 2012, but its core design was completed around 2006. At that time its competitive targets were mainly Intel's Pentium III and early Pentium 4 processors, so it naturally lags behind today's flagship mobile CPUs.
As for the new-generation Loongson microarchitecture GS464E, said to rival Ivy Bridge in IPC: it has certainly made great progress over its predecessor, but until it achieves a breakthrough in clock frequency, what will it rely on to compete with Intel?
Historical Reasons for Choosing MIPS Instruction Set
The Loongson cores released so far fall into three series: GS1XX, GS2XX, and GS4XX. The GS132 targets the ARM Cortex-M0 and Cortex-M3; the GS232 and GS264 target the ARM9, ARM11, and Cortex-A12; and the GS464E introduced in this article targets Intel's Ivy Bridge.
The previously criticized Loongson 3A1000 and 3B1500 used the previous-generation GS464 core and its vector-enhanced variant GS464V respectively, hence the significant performance gap.
All of the Loongson products mentioned above are compatible with the MIPS instruction set. Note that this compatibility does not mean Loongson uses cores from MIPS; it merely means Loongson products can run the instructions that MIPS defines. For example, the encoding MIPS assigns to an addition instruction also means addition on a Loongson processor, and that is all it implies.
In hardware terms, everything from the microarchitecture down to circuit and layout design is Loongson's own independent work.
Many people also wonder why Loongson did not choose the currently popular ARM instruction set, which now appears to be squaring off against Intel.
In fact, the preliminary research for the Loongson project began around 2000, and ARM was indeed considered at the time. But measured against Loongson's initial goal of high performance, ARM's positioning was a poor fit: the strongest core ARM could offer then was the ARM11, which had no out-of-order execution, no multi-issue, and none of today's advanced cache systems.
The first ARM design supporting dual-issue, Cortex-A8, was only publicly announced in 2005, and the out-of-order executing Cortex-A9 came out around 2007. This was not entirely due to ARM’s weakness in high-performance design, but more because ARM positioned its products for embedded computing, where tight area and power budgets made many common high-performance design features difficult to implement. With technological advancements and the explosive demand for embedded computing capabilities, ARM began to focus on building high-performance CPUs.
GS464E Microarchitecture Framework Diagram, the red box section is the out-of-order execution engine
In the 1990s, several manufacturers, MIPS and DEC among them, achieved out-of-order four-issue designs around the mid-1990s, even overshadowing Intel at the time, with MIPS's R10000 and DEC's Alpha 21264 as the representative parts.
Among them, the 21264 remains a classic work that aspiring processor microarchitecture designers must study today, featuring deep pipelining, branch prediction, register renaming, and load-store speculation. Although MIPS and DEC Alpha gradually declined in the late 1990s, their legacy still holds significant weight. In the face of the unbreakable patent barriers built by the x86 camp, the options available for the nascent Chinese CPU were indeed limited.
Given the situation at the time, ARM had been unable to break out of the embedded market for many years, which meant its instruction set ecosystem was essentially non-existent outside of embedded, and no one would have been foolish enough to mass-produce personal computers and servers around the ARM11. For a domestic project, trying to counter the Wintel alliance on that basis would have been like an ant trying to topple a tree.
In contrast, MIPS and DEC Alpha, which had fought against Intel and once enjoyed the title of victor, naturally left behind a much stronger software ecosystem than ARM. The domestic projects of Loongson and Shenwei, both supported by the government, adopted the MIPS and DEC Alpha instruction sets, respectively.
Therefore, I believe that choosing MIPS/DEC Alpha was a correct decision under the premise of insisting on high-performance design and wanting to gain market support. To question the choice made at that time based on today’s ARM rise is inevitably hindsight bias; no one can surpass historical limitations to predict the future a decade later.
Just as no one could have foreseen that Apple, teetering on a tightrope in 2000, would be riding the crest of the wave a decade later. But history is history, and the challenges Loongson faces today are well known. Can the Loongson team achieve its original goals, and what does the new-generation GS464E actually look like?
Mixed Blessings for the New Generation GS464E Architecture’s Front-End Fetch
Overall, the instruction fetch unit is one of the most heavily revised parts of the new-generation GS464E. In broad outline, the first-level instruction cache (the lower part of the fetch-unit diagram) uses a parallel-access design for speed: the instruction data (IC Cache Data) and the corresponding address tags (IC Cache Tag) are read simultaneously in the IF2 and IF3 stages, and the hit is determined in IF4, where the matching instructions are selected. Judging from the block diagram of the fetch unit, the efficiency of the front end remains a mixed bag.
There are three positives, areas in which the GS464E's structural design has reached or even surpassed the international state of the art.
1. First, the size of the instruction cache has reached 64KB (four-way set associative), surpassing IBM Power7’s 32KB (four-way set associative);
2. Second, the fetch width reaches 8 instructions per cycle. Since Loongson uses MIPS as its base instruction set, with every instruction 32 bits wide, the first-level instruction cache's fetch bandwidth comes to 32 bytes per cycle, whereas Intel's Haswell can only sustain 16 bytes per cycle from its first-level instruction cache (although the decoded micro-ops held in its uop cache can sustain the equivalent of 32 bytes per cycle);
3. Third, the GS464E also adds a loop detector and a loop instruction buffer, structures Intel began shipping in the Sandy Bridge generation: while fetching, the CPU identifies which instructions form a loop, and when the loop repeats it shuts off the instruction cache and fetches only from the loop buffer. I believe the GS464E's loop buffer was inspired by Intel's Sandy Bridge; it is cleverly merged into the instruction queue that decouples fetch from decode, and, like Sandy Bridge, it can hold an inner loop of 56 instructions (a minimal sketch of the mechanism follows below).
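To make the idea concrete, here is a minimal Python sketch of a loop-buffer-style fetch path. It is a hypothetical toy model, not Loongson's or Intel's actual logic; the class name, the capture/replay policy, and everything except the 56-entry capacity quoted above are assumptions for illustration.

```python
LOOP_BUFFER_CAPACITY = 56   # entry count quoted above; everything else is invented


class LoopBuffer:
    """Toy loop-stream detector: capture a small loop, then replay it
    from the buffer so the instruction cache can be left idle."""

    def __init__(self, capacity=LOOP_BUFFER_CAPACITY):
        self.capacity = capacity
        self.body = []        # (pc, instruction) pairs of the captured loop
        self.locked = False   # True while fetch is served from the buffer

    def observe(self, pc, insn, backward_taken_branch):
        """Called as instructions stream past during normal fetch."""
        if self.locked:
            return
        self.body.append((pc, insn))
        if backward_taken_branch and len(self.body) <= self.capacity:
            self.locked = True          # small loop captured: start replaying
        elif len(self.body) > self.capacity:
            self.body.clear()           # loop body too large for the buffer

    def replay(self):
        """One pass over the captured loop, served without touching the I-cache."""
        assert self.locked
        return list(self.body)

    def loop_exit(self):
        """A taken exit branch (or misprediction) ends replay and re-arms capture."""
        self.locked = False
        self.body.clear()
```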
However, there are also three concerns:
1. First, when the first-level instruction cache misses, the missing address is sent to a cache miss queue for handling. Using a miss queue (known in the academic literature as MSHRs) to fetch missing and prefetched data from the lower levels of the memory hierarchy is standard practice, but Loongson's miss queue is shared between the first-level instruction and data caches and holds only 16 entries, i.e., at most 16 outstanding miss requests. I expect subsequent Loongson designs to split this queue or enlarge it;
2. Second, judging from the block diagram, the instruction TLB of the GS464E still lags behind the international state of the art. Intel's Sandy Bridge implements a 144-entry four-way set-associative first-level instruction TLB, and AMD's Bulldozer pairs a 72-entry fully associative first-level instruction TLB with a 512-entry four-way set-associative second-level TLB, whereas the GS464E has only a 64-entry fully associative first-level instruction TLB (a first-level instruction TLB is difficult to enlarge) and no second-level instruction TLB at all. The weak coverage of the instruction TLB may exacerbate the performance loss after instruction cache misses (a rough coverage calculation follows this list);
3. Third, IBM Power7's first-level instruction cache is organized quite similarly to Loongson's, but it adds way prediction, speculatively enabling only the portion of the instruction cache about to be accessed rather than all of it, which reduces power, and it aggressively splits the first-level instruction cache into 16 banks to minimize read-write conflicts. By contrast, Loongson's papers make no mention of way prediction, and its instruction cache is divided into only 4 banks. Overall, it cannot simply be asserted that Loongson's instruction-fetch efficiency matches the international mainstream.
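As a rough illustration of the coverage concern in point 2, an instruction TLB's reach can be estimated as entry count times page size. The calculation below assumes plain 4 KB pages (MIPS TLB entries can in fact map page pairs and larger page sizes, which would raise these numbers) and uses the entry counts quoted above.

```python
PAGE_SIZE = 4 * 1024                         # assume plain 4 KB pages


def itlb_reach_kb(l1_entries, l2_entries=0):
    """Code footprint (KB) addressable without an instruction-TLB miss."""
    return (l1_entries + l2_entries) * PAGE_SIZE // 1024


print("GS464E   :", itlb_reach_kb(64), "KB")        # 64-entry L1, no L2 ITLB -> 256 KB
print("Bulldozer:", itlb_reach_kb(72, 512), "KB")   # 72-entry L1 + 512-entry L2 -> 2336 KB
```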
It can be seen that the Victim Cache occupies a significant amount of space in the GS464E processor architecture
Next, let’s look at another important module in the front-end—the branch predictor.
The GS464E's branch predictor has been heavily reworked, and it is clear that substantial resources were invested to raise its specifications. On paper it already matches Sandy Bridge: a tournament branch predictor, a return address stack, and an indirect-jump predictor.
The tournament branch predictor has three main built-in components—a local history table (Local Branch History Table) that predicts branch direction based on local history, a global history table (Global Branch History Table) that predicts branch direction based on global history, and a global selection table (GSEL) responsible for determining which of the former two has a higher accuracy rate. All three have a storage capacity of 16K entries, which is speculated to be on par with Sandy Bridge and exceeds IBM Power7.
The return address stack, responsible for predicting function call return addresses, can store 16 entries, which is on par with AMD Jaguar and IBM Power7. With the basic parameters already reaching international standards, the factors that determine the accuracy of branch prediction now rest on other detailed designs, such as whether the return stack supports stack recovery on mispredictions and whether the tournament predictor has incorporated other design techniques to reduce access conflicts in the history tables.
I cautiously believe that as long as these detailed designs do not exhibit significant errors, the GS464E’s branch prediction capability can compete with Intel’s designs.
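For readers unfamiliar with the structure, here is a minimal Python sketch of a textbook tournament predictor in the spirit of what is described above: local and global history tables of 2-bit counters plus a chooser that learns which component to trust. The table size, indexing, and update policy are illustrative assumptions, not the GS464E's actual 16K-entry organization.

```python
SIZE = 1024                      # illustrative table size (power of two)


def _bump(counter, taken):
    """2-bit saturating counter: 0-1 predict not-taken, 2-3 predict taken."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)


class TournamentPredictor:
    def __init__(self):
        self.local_hist = [0] * SIZE    # per-branch local history registers
        self.local_pht  = [1] * SIZE    # counters indexed by local history
        self.global_pht = [1] * SIZE    # counters indexed by global history
        self.chooser    = [1] * SIZE    # <=1 trust local, >=2 trust global
        self.ghist = 0                  # global history register

    def predict(self, pc):
        p_local  = self.local_pht[self.local_hist[pc % SIZE]] >= 2
        p_global = self.global_pht[self.ghist] >= 2
        return p_global if self.chooser[self.ghist] >= 2 else p_local

    def update(self, pc, taken):
        li, gi = self.local_hist[pc % SIZE], self.ghist
        p_local, p_global = self.local_pht[li] >= 2, self.global_pht[gi] >= 2
        if p_local != p_global:         # train the chooser toward the winner
            self.chooser[gi] = _bump(self.chooser[gi], p_global == taken)
        self.local_pht[li]  = _bump(self.local_pht[li], taken)
        self.global_pht[gi] = _bump(self.global_pht[gi], taken)
        self.local_hist[pc % SIZE] = ((li << 1) | taken) & (SIZE - 1)
        self.ghist = ((gi << 1) | taken) & (SIZE - 1)
```

A real design adds exactly the details this sketch omits, such as speculative history update and repair on mispredictions and banking of the history tables to reduce access conflicts, which is where the "other detailed designs" mentioned above come in.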
Still Lagging Behind: The Out-of-Order Execution Engine of the New Generation GS464E
Despite both being out-of-order four-issue frameworks, as seen in Table 1, the basic parameters of the GS464E’s out-of-order execution engine significantly lag behind Intel’s Sandy Bridge.
First, the Re-Order Buffer (ROB) determines the window of instructions from which the out-of-order engine can extract instruction-level parallelism and pick independent instructions to execute out of order, while the number of physical integer registers determines how many integer registers can be renamed at a time; in both parameters Loongson still has a considerable gap to close.
Furthermore, Loongson opted for a distributed (split) issue-queue design, whose capacity is easier to grow but which is prone to uneven resource utilization. AMD and MIPS have historically favored this design: instructions awaiting out-of-order execution are stored separately by type, with integer instructions in their own issue queue and floating-point instructions in another. If an integer-heavy program fills the integer queue, the floating-point queue may sit completely empty.
In contrast, a centralized issue queue is complex and hard to grow substantially, but because all instructions wait in the same place, capacity is never left idle. Intel has stuck with this design for years: its first out-of-order multi-issue microarchitecture, P6, used a centralized issue queue; the Pentium 4's NetBurst switched to distributed queues; and from the Core architecture onward Intel reverted to a centralized queue and has kept it ever since, making it the staunchest advocate of the centralized design.
In the Core era, Intel's centralized issue queue held only 32 instructions, while AMD's K8, with its distributed queues, had a total capacity of 60 instructions, nearly double Intel's. Intel has nonetheless grown its issue queue year after year, reaching a 60-entry centralized queue with 8 issue ports in Haswell. Absent port conflicts, this centralized queue can dispatch 8 out-of-order instructions to the execution units every cycle, representing the pinnacle of centralized issue-queue design.
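The capacity-utilization argument above can be shown with a toy Python model. The queue sizes below are invented round numbers purely for illustration, not GS464E's or Intel's actual figures: an all-integer burst fills a split integer queue while the floating-point queue sits empty, whereas a unified queue of the same total size absorbs the whole burst.

```python
INT_Q, FP_Q, UNIFIED_Q = 32, 32, 64          # invented sizes: 32+32 split vs 64 unified


def buffered(instr_types, split=True):
    """Count how many of a burst of waiting instructions fit in the issue queue(s)."""
    if split:
        ints = sum(1 for t in instr_types if t == "int")
        fps = len(instr_types) - ints
        return min(ints, INT_Q) + min(fps, FP_Q)
    return min(len(instr_types), UNIFIED_Q)


burst = ["int"] * 64                          # an integer-heavy stretch of code
print("split  :", buffered(burst, split=True))    # 32 -- the FP queue sits idle
print("unified:", buffered(burst, split=False))   # 64 -- full capacity is used
```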
Loongson’s paper does not disclose its dispatch width, but based on the configuration of the issue queue and execution units, I estimate it may be between 4 and 6 instructions.
Of course, in the details there are also commendable aspects of the GS464E design. All of the most frequently used execution units complete their operations in a single cycle, and aggressive data-forwarding allows dependent instructions to issue back to back. A physical register file (PRF) with pointer-based issue-queue logic was also adopted early on, a route Intel once abandoned and later had to return to in order to introduce the AVX instruction set; Loongson has neatly avoided the detour Intel once took.
However, these detailed improvements are insufficient for GS464E to compete with Core i7 in terms of out-of-order execution capability. How long it will take for Loongson to reach the design level of Haswell’s out-of-order execution engine will depend on whether Loongson’s physical and circuit design standards can support larger issue queues, more complex data forwarding networks, and physical register files with more concurrent read-write ports; these key structures are crucial to supporting the design of out-of-order execution engines.
Sufficient Capacity: The Cache System of the New Generation GS464E
The first-level data cache of the GS464E, like the instruction cache, is 64KB and four-way set associative, but it has been changed to a serial-access design: the tag array is accessed first to determine the hit, and only then is the data array read. The intent is to trade a few cycles of access latency for lower power, and since the GS464E still keeps the first-level data cache's load-to-use latency at 4 cycles, the cost is acceptable.
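The trade-off between parallel access (used by the instruction cache) and serial access (used here) can be captured in a few lines of Python. This is a schematic model of a simple 4-way set, not the actual GS464E pipeline: parallel access reads every way's data array alongside the tags and selects the hit afterwards, while serial access resolves the tags first and reads only the hit way's data, spending less data-array energy in exchange for an extra stage.

```python
WAYS = 4                                     # toy 4-way set


def parallel_access(set_tags, set_data, lookup_tag):
    """Read all data ways speculatively while comparing tags (fast, power-hungry)."""
    data_reads = WAYS                        # every data way is read this cycle
    for way, tag in enumerate(set_tags):
        if tag == lookup_tag:
            return set_data[way], data_reads
    return None, data_reads


def serial_access(set_tags, set_data, lookup_tag):
    """Resolve the tags first, then read only the hit way (slower, lower power)."""
    for way, tag in enumerate(set_tags):     # stage 1: tag compare only
        if tag == lookup_tag:
            return set_data[way], 1          # stage 2: a single data-array read
    return None, 0


tags, data = [0x10, 0x22, 0x3F, 0x07], ["A", "B", "C", "D"]
print(parallel_access(tags, data, 0x3F))     # ('C', 4): four data-array reads
print(serial_access(tags, data, 0x3F))       # ('C', 1): one data-array read
```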
From more reliable SPEC CPU 2000 tests, the performance of the GS464E processor has increased by up to 300% in some sub-tests compared to the previous generation Loongson 3A.
Interestingly, below the first-level data cache, each GS464E core has an independent cache system, which the Loongson group refers to as Victim Cache.
Generally speaking, a victim cache is a small cache attached to the first-level cache, with very limited capacity, whose main job is to catch lines evicted from the first-level cache and return them quickly when they are needed again. Loongson's Victim Cache, however, is 256KB, which by that definition is no longer a victim cache but a proper private second-level cache. It is probably called a Victim Cache because of the exclusive relationship between it and the first-level cache: instructions and data present in the first-level cache are not duplicated in the second level.
For reference, AMD has adopted the same exclusive design, whereas Intel and IBM stick to an inclusive design, in which whatever appears in the first-level cache must also exist in the second level. The two approaches mainly affect cache hit rates and the maintenance of cache coherence in multi-core scenarios, and each has its pros and cons.
The advantage of an inclusive design is that it simplifies synchronization in multi-core operation: data in the first-level cache is guaranteed to exist at the lower levels, so checking a line's coherence state only requires querying the lower-level storage. The downside is obvious: cache space is wasted because several levels hold copies of the same data. The exclusive design avoids that waste, but every coherence operation must search the entire multi-level cache hierarchy, making multi-core coherence more complex.
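The exclusive, victim-style relationship described above can be modelled schematically as follows. This is not Loongson's actual controller logic; capacities and replacement policies here are toy assumptions. The key invariant is that a line lives in either the first-level cache or the "Victim Cache", never both.

```python
class ExclusiveHierarchy:
    """Toy model of an exclusive L1 + victim-style L2 pair."""

    def __init__(self, l1_capacity=4):        # tiny capacities for clarity
        self.l1, self.l2 = [], []              # line addresses; eviction is simple FIFO
        self.l1_capacity = l1_capacity

    def access(self, line):
        if line in self.l1:                     # L1 hit
            return "L1 hit"
        if line in self.l2:                     # victim hit: the line moves, not copies
            self.l2.remove(line)
            self._fill_l1(line)
            return "victim hit"
        self._fill_l1(line)                     # miss: fill L1 from memory
        return "miss"

    def _fill_l1(self, line):
        if len(self.l1) >= self.l1_capacity:
            self.l2.append(self.l1.pop(0))      # the evicted L1 line becomes the victim
        self.l1.append(line)


h = ExclusiveHierarchy()
for addr in [1, 2, 3, 4, 5]:
    h.access(addr)          # line 1 gets evicted from L1 into the victim cache
print(h.access(1))          # "victim hit": line 1 moves back to L1 and leaves L2
```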
From the latest publicly available test data, under the same frequency of 1GHz, the performance of the GS464E architecture has surpassed AMD FX-8320 in terms of floating-point performance, approaching the Core i5 2300, which uses Sandy Bridge cores.
The second-level cache of Loongson adopts a 16-way set associative design, using the same serial access mode as the first-level data cache. According to the paper from Loongson, this cache system uses an LRU replacement algorithm. I believe this might be a typographical error, or there was a communication gap between the paper’s author and the actual designer of the cache module.
A 16-way set-associative cache would need a state machine with 16! = 20,922,789,888,000 states per set to implement true LRU, which is clearly infeasible. Historically, hardly any cache design beyond four-way set associativity has implemented a true LRU replacement policy; what the GS464E most likely uses is a simplified pseudo-LRU algorithm.
It should be noted that using pseudo-LRU is not a performance defect: a good pseudo-LRU policy is nearly indistinguishable from true LRU in replacement accuracy. Where true LRU is impractical, cache designs beyond four-way set associativity, including Intel's, AMD's, and IBM's, all adopt pseudo-LRU replacement.
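As a concrete example of what such a simplification looks like, here is a Python sketch of the classic tree pseudo-LRU scheme (a common textbook approach; the paper does not say which pseudo-LRU variant the GS464E actually uses). For a 16-way set it needs only 15 bits of state instead of the log2(16!) ≈ 44 bits an exact LRU ordering would require.

```python
class TreePLRU:
    """Tree pseudo-LRU for one cache set; `ways` must be a power of two.
    A 16-way set needs only ways-1 = 15 tree bits, versus ~44 bits for true LRU."""

    def __init__(self, ways=16):
        self.ways = ways
        self.bits = [0] * (ways - 1)            # internal tree nodes, root at index 0

    def touch(self, way):
        """On an access, flip each node along the path to point away from `way`."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:                        # accessed the left subtree
                self.bits[node] = 1              # so the next victim search goes right
                node, hi = 2 * node + 1, mid
            else:                                # accessed the right subtree
                self.bits[node] = 0              # so the next victim search goes left
                node, lo = 2 * node + 2, mid

    def victim(self):
        """Follow the tree bits to the (approximately) least recently used way."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid     # victim lies somewhere on the left
            else:
                node, lo = 2 * node + 2, mid     # victim lies somewhere on the right
        return lo


plru = TreePLRU(ways=16)
for w in (3, 3, 7, 12):
    plru.touch(w)
print(plru.victim())   # 0: a way that has not been touched recently
```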
Below this Victim Cache sits a final cache called the SCache, an on-chip shared third-level cache. It is again 16-way set associative, with each SCache module 1MB in size; the four cores' modules together provide 4MB. Normally a last-level cache is split into multiple banks connected through a crossbar so that every core can access the shared last-level cache independently. Loongson instead connects each SCache module directly to a GS464E core, which may indicate that Loongson has adopted some Network-on-Chip (NoC) ideas in preparation for future expansion to multi-core and many-core designs.
It is commendable that Loongson's second- and third-level caches maintain large capacities and degrees of associativity, but the access latencies are relatively long: the second-level cache takes more than 20 cycles, roughly twice as long as Intel's processors, while the third-level cache takes more than 50 cycles, roughly on par with Intel.
Close Performance to Sandy Bridge: Empirical Data Analysis
The empirical data Loongson has released so far comes mainly from RTL simulation and hardware-accelerated verification platforms, with the frequency set at 1GHz. If the real chip can run at 1GHz and the interface timing is modeled correctly, these figures should differ little from actual silicon.
Table 2 shows that Loongson claims a 10-20x improvement in memory performance for the GS464E. Reportedly, the previous generation put too much emphasis on the core microarchitecture and neglected the memory controller, failing even to properly support burst transfer modes, which led to very poor memory performance. This time the dramatic jump in streaming memory performance also comes from fixing bugs in the memory controller and adding aggressive multi-level prefetching. In Memcpy and Stream-Copy tests, Loongson's memory controller with dual-channel DDR3-1000 trails Ivy Bridge with single-channel DDR3-1333 by about 20% in streaming access performance.
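For reference, the "Stream-Copy" pattern cited above is essentially the kernel below: a long unit-stride copy whose throughput is bounded by the memory system rather than the core. This is a rough numpy rendition for illustration only; the real STREAM benchmark is a C/Fortran program, and the array size here is simply an assumption chosen to exceed any on-chip cache.

```python
import time

import numpy as np

N = 20_000_000                       # ~160 MB per array, far larger than any cache
a = np.ones(N, dtype=np.float64)
b = np.empty_like(a)

t0 = time.perf_counter()
b[:] = a                             # the STREAM "Copy" kernel: b[i] = a[i]
t1 = time.perf_counter()

bytes_moved = 2 * N * 8              # one read stream plus one write stream
print("Copy bandwidth: %.2f GB/s" % (bytes_moved / (t1 - t0) / 1e9))
```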
Loongson has also released results for several small benchmarks such as Whetstone, Coremark, and Dhrystone, shown in Table 3. Generally speaking, these small tests carry less weight than larger suites such as SPEC and PARSEC, but they run easily on Loongson's RTL verification platform, which can produce static timing analysis results and simulate the chip from RTL code without a tape-out, making them convenient to use.
Design layout of Loongson 3A2000/3B2000
In other program tests, the GS464E processor architecture has seen more than a 40% performance improvement in tests like Dhrystone, which has many branch instructions, and Coremark, which involves fewer memory operations.
Fortunately, Loongson has also released SPEC CPU 2000 results, shown in Table 4. At 1GHz the GS464E scores 762 on the integer suite, up about 104% from the previous generation, while its floating-point score reaches 1125, an even more striking 278% increase. Its overall performance is now very close to that of the Sandy Bridge-based Core i5 2300 when normalized to the same 1GHz clock.
A rough estimate from these preliminary SPEC CPU 2000 results suggests that Loongson's IPC is fairly encouraging, but there is no room for premature celebration either. According to the latest disclosures, processors based on the GS464E architecture will mainly come in two models: the 3A2000, a single-socket quad-core desktop part, and the 3B2000, a server part supporting dual-socket eight-core and quad-socket sixteen-core configurations.
Since this is the first implementation of a new architecture, the manufacturing process is still 40nm and the clock frequency is only around 1GHz. Given that this is only about one third of the frequency of today's Intel and AMD processors, the absolute performance of the new generation of Loongson processors is roughly 20%-30% of Haswell's. When it will be able to move to a more advanced 28nm process, and whether the new architecture can raise operating frequencies significantly, remain big question marks; Loongson still has a long way to go.
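The "20%-30% of Haswell" figure follows from simple arithmetic: absolute performance is roughly IPC times frequency. The numbers below are assumptions for illustration (a typical desktop Haswell clock and a guessed IPC ratio), not measurements.

```python
gs464e_freq_ghz  = 1.0
haswell_freq_ghz = 3.5        # typical desktop Haswell clock (assumption)
relative_ipc     = 0.8        # assumed GS464E-vs-Haswell IPC ratio, not a measurement

relative_perf = relative_ipc * gs464e_freq_ghz / haswell_freq_ghz
print("estimated absolute performance vs Haswell: %.0f%%" % (relative_perf * 100))
# ~23%, consistent with the 20%-30% range quoted above
```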
Conclusion: Success Cannot Be Achieved Overnight
From what I have gathered, Loongson has already entered the military and aerospace markets, which place great emphasis on security while being relatively relaxed about performance. A radiation-hardened version of Loongson has also flown on Beidou satellites. The days when Chinese leaders personally negotiated the import of radiation-hardened chips can now be consigned to history; but in the civilian market it remains very hard for Loongson to compete against Intel and AMD, as the gap in absolute performance is too large and unlikely to change in the short term.
The Loongson project has been running for over fifteen years. It has shown the wisdom to refuse very long instruction word (VLIW) structures, but also a reckless, all-in fervor; there have been moments of boasting in the media about defeating Intel, and moments of frank acknowledgment of the large performance gap. All of this has been written into the history of Loongson's growth.
As time passes, I believe we need to set aside the past and view Loongson's progress today with enough calm and rationality. As Academician Li Guojie, former director of the Institute of Computing Technology, pointed out in a 2004 article in the "Science and Technology Daily":
"China's CPU/SoC design has a long way to go. In the coming years, the performance of Loongson CPUs will only reach about half that of the best foreign CPUs." We should always remember that this industry has been developing for more than fifty years (counting from the invention of out-of-order execution), sustained by hundreds of thousands of top-tier practitioners. That Loongson, with only a few hundred people and a fraction of the investment, has reached a meaningful fraction of that performance is already commendable; catching up and overtaking, however, will take patience.
Recently, during a visit to Loongson organized by the China Computer Federation, project leader Hu Weiwu candidly remarked that "for a beggar to compare treasures with the Dragon King, the more you compare, the further behind you fall," and said he hoped instead to "focus on whole-system performance and, in the places where individual parts are not yet as good as others', achieve an overall lead." Loongson has now set its roadmap for becoming a "pillar CPU industry" at 2020-2030, which will be a protracted war.
If it succeeds, China's CPU industry will gain a giant capable of standing on its own and competing with Intel and AMD; even if it fails, the years of investment in the Loongson project, and the experience and talent cultivated as the first domestic pioneer of out-of-order, multi-issue, high-performance CPUs, will still bring some solace.
(Source: mydrivers)