In-Depth Analysis of Loongson Domestic CPU Architecture

Editor’s Note: I came across this article from a few years ago today and found its analysis insightful, so I am sharing it with everyone as I study it.

When it comes to the domestic CPU product Loongson, the biggest controversy surrounding it is performance. Although early products of Loongson did not satisfy everyone, with the disclosure of the new architecture GS464E and the official launch of the next-generation CPU “3B2000” based on this architecture, there is growing anticipation about how its enhanced performance compares to mainstream Intel products.

In the previous data test results provided by the “Information Science Journal”, the early compiled GS464E was already close to the performance of mainstream architectures from Intel and AMD, comparable to Intel Core i3-550 and AMD FX-8320, slightly lagging behind Intel Core i5-2300, while clearly outperforming low-power architectures like Intel Atom, VIA Nano, and ARM Cortex-A57.

Recently, a deeper analysis was conducted in the June issue of “Microcomputer” magazine. So, how far has Loongson’s performance come, how does its design level compare to international competitors, and why did Loongson choose MIPS? Why is today’s Loongson not based on the currently popular ARM architecture? This article will provide professional and detailed analysis to answer these long-debated questions.

In Black and White, Who’s Right Today?

Frankly speaking, it’s no news that Loongson has been under attack in public opinion in recent years. Earlier this year, an article titled “What is the Level of Domestic Loongson?” stirred up a storm online, directly pointing out that the Loongson 3B-1500 processor, claimed to be developed for high-performance servers, is not even as good as today’s ARM Cortex-A57 mobile processor.

Figure: GS464E Instruction Fetch Unit

Without resorting to conspiracy theories, I believe that such drastic swings in public opinion point to one fact: outsiders have a very limited understanding of Loongson’s current situation. The general public cannot infer its structural design level from the complex papers published by the Loongson group, nor does it know how compilers, related software systems, and benchmark programs favor particular architectures, which leads to biased performance comparisons.

Take the Loongson 3B-1500, which has been criticized as inferior to the Cortex-A57. Although it was taped out in 2012, its core design was completed around 2006, when its competitive targets were mainly Intel’s Pentium 3 and early Pentium 4 processors, so it naturally falls behind today’s flagship mobile CPUs.

As for the new generation Loongson microarchitecture GS464E, which is described as being able to compete with Ivy Bridge in IPC, although it has made breakthrough progress compared to the previous generation, what will it rely on to compete with Intel before achieving breakthroughs in frequency metrics?

Historical Reasons for Choosing MIPS Instruction Set?

The currently released Loongson cores are mainly divided into three series: GS1XX, GS2XX, and GS4XX. The GS132 series targets ARM Cortex-M0 and Cortex-M3; GS232 and GS264 target ARM9, ARM11, and Cortex-A12; and GS464E, which will be introduced in this article, targets Intel Ivy Bridge.

The previously criticized Loongson 3A 1000 and Loongson 3B1500 both use the previous generation GS464 and its vector-enhanced GS464V core designs, with significant performance differences.

All of the above Loongson series products are compatible with the MIPS instruction set. Note that this compatibility does not mean that Loongson uses cores from MIPS; it only means that Loongson’s products can run instructions defined by MIPS. For example, the bit pattern that encodes an addition in MIPS (opcode field 000000 combined with the function field for add) encodes the same addition on Loongson processors, nothing more.
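To make "compatible at the instruction level" concrete, here is a minimal sketch of how a MIPS32 R-type instruction is packed into a 32-bit word. The helper name `encode_r_type` is hypothetical; the field layout (6-bit opcode, three 5-bit register fields, 5-bit shift amount, 6-bit function code) and the values for `add` follow the standard MIPS32 encoding. Any processor that decodes this word as an addition is compatible in the sense described above, whatever its internal design.

```python
ADD_FUNCT = 0b100000  # function field selecting "add" when the opcode field is 000000


def encode_r_type(rs, rt, rd, shamt, funct, opcode=0):
    """Pack MIPS32 R-type fields into one 32-bit instruction word."""
    fields = ((opcode, 6), (rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6))
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


# add $t0, $t1, $t2  ($t0 = register 8, $t1 = 9, $t2 = 10)
word = encode_r_type(rs=9, rt=10, rd=8, shamt=0, funct=ADD_FUNCT)
print(f"{word:032b}")
print(hex(word))
```

The top six bits of the resulting word are all zero, and the low six bits carry the function code that distinguishes `add` from the other R-type operations.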

On the hardware side, from Loongson’s microarchitecture to circuit and layout design, it is all independently conducted.

Many people are also puzzled as to why Loongson did not choose the currently popular ARM instruction set, which seems to be forming a competitive stance against Intel.

In fact, the initial research period for the Loongson project was around the year 2000, when ARM was indeed included in the considerations, but given Loongson’s high-performance requirements, ARM’s positioning at the time was a poor fit. The strongest core design ARM could offer then was ARM11, which lacked out-of-order execution, multi-issue, and the advanced caching systems we see today.

The first ARM design supporting dual-issue, Cortex-A8, was only publicly announced in 2005, and the out-of-order executing Cortex-A9 was not introduced until around 2007. This was not entirely due to ARM’s weakness in high-performance designs but more because ARM positioned its products for embedded computing, where tight area and power constraints made many common high-performance design features difficult to implement. With technological advancements and the explosive demand for embedded computing capabilities, ARM began to work on high-performance CPUs.

GS464E Microarchitecture Framework Diagram, with the red box indicating the out-of-order execution engine

In the 1990s, major manufacturers such as MIPS and DEC successively achieved out-of-order quad-issue designs, with MIPS’s R10000 and DEC’s Alpha 21264 even overshadowing Intel at that time.

Among them, the 21264 remains a classic work that aspiring processor microarchitecture designers must study today, featuring deep pipelines, branch prediction, register renaming, and load-store speculation. Although MIPS and DEC Alpha gradually declined in the late 1990s, their past influence still holds. In the face of the insurmountable patent barriers of the x86 camp, the battlefield for high-performance CPUs offered very few choices for the nascent Chinese CPU.

Given the circumstances at the time, ARM was unable to penetrate markets outside of embedded systems for many years, meaning that the ARM instruction set system was virtually rootless outside of embedded systems. No one would be foolish enough to mass-produce personal computers and servers based on ARM11. Attempting to counter the Wintel alliance with a domestic project was akin to an ant trying to shake a tree.

On the other hand, MIPS and DEC Alpha, which had fought against Intel and once enjoyed the title of victor, naturally left behind a much stronger software ecosystem than ARM. The two CPU projects, Loongson and Shenwei, which have been successively launched in China, adopted the MIPS and DEC Alpha instruction sets, respectively.

Therefore, I believe that choosing MIPS/DEC Alpha was a correct decision, given the commitment to high-performance design and the desire for market support. To question this choice in light of ARM’s rise today is inevitably hindsight; no one can surpass historical constraints to predict the future a decade ahead.

Just like no one could foresee that Apple, which was teetering on the edge of survival in 2000, would rise to the top of the wave a decade later. However, history has become history; the challenges faced by today’s Loongson are well-known. Can the Loongson team achieve their original goals, and what is the new generation GS464E they have produced?

Mixed Feelings About the New Generation GS464E Architecture’s Front-End Instruction Fetch

Overall, the instruction fetch unit is one of the most significantly modified components in the new generation Loongson GS464E. In terms of overall framework, the level 1 instruction cache (the lower half of the instruction fetch unit diagram) is designed for parallel access to pursue speed, reading both instructions (IC Cache Data) and related address tags (IC Cache tag) simultaneously in the IF2 and IF3 stages. In the IF4 stage, it checks for hits and retrieves the hit instructions. Judging from the framework diagram of the instruction fetch unit, the efficiency of the front-end part is still a mixed bag.

There are three encouraging aspects, in each of which the structural design of Loongson GS464E has reached or even surpassed international standards:

1. First, the size of the instruction cache has reached 64KB (four-way set associative), surpassing IBM Power7’s 32KB (four-way set associative);

2. Second, the instruction fetch width has reached 8 instructions per cycle. Considering that Loongson uses MIPS32 as the base instruction set, with each instruction being 32 bits wide, the instruction fetch width of the level 1 instruction cache has reached 32 bytes per cycle, while Intel Haswell processors can only achieve a throughput of 16 bytes per cycle for the level 1 instruction cache (though the uop cache for storing decoded instructions can achieve a throughput of 32 bytes/cycle);

3. Third, GS464E has also incorporated the loop detector and loop instruction buffer that Intel began equipping from the Sandy Bridge period. This structural design allows the CPU to continuously fetch instructions while identifying which instructions constitute a loop. When a loop is detected again, it disables the instruction cache and retrieves only from the loop buffer. I believe that the design of GS464E’s loop buffer drew some inspiration from Intel’s Sandy Bridge, cleverly integrating it with the Instruction Queue responsible for decoupling the instruction fetch and decode parts into one module, supporting the storage of 56 inner loop instructions just like Sandy Bridge.

However, there are also three concerns:

1. First, when a miss occurs in the level 1 instruction cache, the missing address will be sent to the cache miss handling queue. Introducing a cache miss handling queue (commonly referred to in academia as MSHR) to handle fetching missing and pre-fetched data from lower-level storage has long been a standard configuration. However, Loongson’s cache miss handling queue is shared between the level 1 instruction cache and the level 1 data cache, and this miss handling queue only has 16 entries, meaning it can only store 16 miss requests. I anticipate that Loongson’s future designs will attempt to separate the miss handling queue or increase its capacity;

2. Second, from the framework perspective, the instruction TLB part of GS464E still lags behind international standards. Intel’s Sandy Bridge microarchitecture has already achieved a level 1 instruction TLB with 144 four-way set associative entries, while AMD’s Bulldozer combines a 72-entry fully associative level 1 instruction TLB with a 512-entry four-way set associative level 2 TLB. In contrast, Loongson only has a 64-entry fully associative level 1 instruction TLB (a fully associative level 1 TLB is difficult to enlarge), and there is no level 2 instruction TLB. The weak coverage of the instruction TLB may exacerbate performance losses after instruction cache misses;

3. Third, while the level 1 instruction cache of IBM Power7 is quite similar to Loongson’s, it incorporates early way selection, speculatively activating only the way of the instruction cache that is about to be accessed instead of the entire cache, thereby reducing power consumption. In addition, it very aggressively splits the level 1 instruction cache into 16 banks to minimize read-write conflicts. In contrast, Loongson’s paper does not mention way prediction for the instruction cache, which is split into only 4 banks. Weighing these advantages and disadvantages, it cannot simply be concluded that Loongson’s instruction fetch efficiency matches international mainstream standards.
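The first concern above, a 16-entry miss queue shared by the instruction and data caches, can be illustrated with a toy model. This is a sketch under stated assumptions, not Loongson's implementation: the class name `SharedMissQueue`, the 64-byte line size, and the merge-on-same-line behavior (typical of MSHR designs in general) are all illustrative choices. Only the capacity of 16 and the sharing between the two level 1 caches come from the text.

```python
class SharedMissQueue:
    """Toy MSHR-style queue shared by the L1 instruction and data caches."""

    def __init__(self, capacity=16, line_bytes=64):  # line size is an assumption
        self.capacity = capacity
        self.line_bytes = line_bytes
        self.pending = {}  # cache line number -> set of requesters waiting on it

    def request(self, addr, source):
        """Record a miss from `source` ("icache" or "dcache")."""
        line = addr // self.line_bytes
        if line in self.pending:
            self.pending[line].add(source)  # secondary miss: merge, no new entry
            return "merged"
        if len(self.pending) >= self.capacity:
            return "stall"  # queue full: the requester has to wait
        self.pending[line] = {source}
        return "allocated"

    def fill(self, addr):
        """Data returned from the lower level: free the entry."""
        return self.pending.pop(addr // self.line_bytes, set())


queue = SharedMissQueue()
# A burst of 16 data misses to distinct lines occupies every entry ...
results = [queue.request(i * 64, "dcache") for i in range(16)]
# ... so an instruction miss arriving now cannot be tracked.
blocked = queue.request(0x100000, "icache")
```

In this model a data-heavy burst starves instruction fetch of miss-tracking entries, which is the kind of interference that separate or larger queues would avoid.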

It can be seen that the Victim Cache occupies a significant amount of space in the GS464E processor architecture.

Next, let’s look at another crucial module in the front-end: the branch predictor. The branch predictor of GS464E has undergone significant modifications, indicating substantial investments and improvements in specifications. From surface parameters, it can now rival the standards of Sandy Bridge—tournament branch predictor, return address stack, indirect jump predictor, all are present.

The tournament branch predictor has three main built-in components: a local history table (Local Branch History Table) that predicts branch directions based on local history, a global history table (Global Branch History Table) that predicts branch directions based on global history, and a global selection table (GSEL) responsible for determining which of the first two has a higher accuracy. The storage space for all three has reached a size of 16K entries, speculated to be comparable to Sandy Bridge and exceeding IBM Power7.
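The interplay of the three tables can be sketched in a few lines. The code below is a generic tournament predictor in the style of classic designs, not the GS464E implementation: the table size, the indexing functions, and the use of 2-bit saturating counters are assumptions made for illustration, and the class name `TournamentPredictor` is hypothetical.

```python
def _bump(counter, up):
    """Move a 2-bit saturating counter one step up or down."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class TournamentPredictor:
    def __init__(self, entries=1024):
        assert entries & (entries - 1) == 0, "entries must be a power of two"
        self.mask = entries - 1
        self.local_hist = [0] * entries  # per-branch outcome history
        self.local_ctr = [1] * entries   # counters indexed by local history
        self.global_ctr = [1] * entries  # counters indexed by pc XOR global history
        self.chooser = [1] * entries     # >= 2 means "trust the global prediction"
        self.ghr = 0                     # global history register

    def _lookup(self, pc):
        slot = (pc >> 2) & self.mask     # 4-byte instructions, as in MIPS32
        local_index = self.local_hist[slot]
        global_index = ((pc >> 2) ^ self.ghr) & self.mask
        return slot, local_index, global_index

    def predict(self, pc):
        _, li, gi = self._lookup(pc)
        local = self.local_ctr[li] >= 2
        glob = self.global_ctr[gi] >= 2
        return glob if self.chooser[gi] >= 2 else local

    def update(self, pc, taken):
        slot, li, gi = self._lookup(pc)
        local_ok = (self.local_ctr[li] >= 2) == taken
        global_ok = (self.global_ctr[gi] >= 2) == taken
        if local_ok != global_ok:        # train the chooser only on disagreement
            self.chooser[gi] = _bump(self.chooser[gi], global_ok)
        self.local_ctr[li] = _bump(self.local_ctr[li], taken)
        self.global_ctr[gi] = _bump(self.global_ctr[gi], taken)
        bit = 1 if taken else 0
        self.local_hist[slot] = ((li << 1) | bit) & self.mask
        self.ghr = ((self.ghr << 1) | bit) & self.mask
```

A caller would invoke `predict(pc)` at fetch time and `update(pc, taken)` once the branch resolves. After a short warm-up, a strictly alternating taken/not-taken branch is predicted correctly, because each history pattern maps to its own counter.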

The return address stack, responsible for predicting function call return addresses, can store 16 entries, which is on par with AMD Jaguar and IBM Power7. With the basic parameters already reaching international standards, the accuracy of branch prediction will depend on other detailed designs, such as whether the return stack supports stack recovery under mispredictions, and whether the tournament predictor incorporates additional design techniques to reduce historical table access conflicts, etc.
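A return address stack is simple enough to sketch directly. The version below assumes that on overflow the oldest entry is silently dropped, which is one common policy; whether GS464E does this, and how it recovers after mispredictions, is not stated in the text. The class name `ReturnAddressStack` is hypothetical; the depth of 16 is the figure quoted above.

```python
from collections import deque


class ReturnAddressStack:
    """Fixed-depth return address predictor."""

    def __init__(self, depth=16):
        # A bounded deque discards the oldest entry when a push overflows it.
        self.stack = deque(maxlen=depth)

    def on_call(self, return_addr):
        self.stack.append(return_addr)

    def on_return(self):
        """Predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(depth=16)
for depth in range(17):            # 17 nested calls overflow a 16-entry stack
    ras.on_call(0x1000 + 4 * depth)
predictions = [ras.on_return() for _ in range(17)]
```

The 16 innermost returns are predicted correctly; the outermost one finds the stack empty because its entry was overwritten, which is why call depth beyond the stack size costs a misprediction.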

I cautiously believe that as long as these detailed designs do not have obvious flaws, GS464E’s branch prediction capability will be able to compete with Intel’s designs.

Still Lagging Behind: The New Generation GS464E’s Out-of-Order Execution Engine

Despite both being based on out-of-order quad-issue frameworks, the basic parameters of GS464E’s out-of-order execution engine still significantly lag behind Intel’s Sandy Bridge, as seen in Table 1.

First, the reorder buffer (Re-Order Buffer, ROB) determines the window of instructions from which the out-of-order execution engine can extract instruction-level parallelism and select independent instructions for out-of-order execution. The number of integer physical registers determines how many renamed integer registers can be in flight at once. Loongson still has a significant gap to bridge in these parameters.

In addition, Loongson has opted for a separate issue queue design, which is easier to increase capacity but can lead to resource allocation imbalances. AMD and MIPS have historically used this design, where all instructions allowed for out-of-order execution are stored separately by type, such as integer instructions in their own independent issue queue and floating-point instructions in another independent issue queue. If an integer-intensive program fills the integer queue, the floating-point issue queue may be completely empty.

In contrast, a centralized issue queue design is more complex and difficult to significantly increase capacity, but all instructions are stored in one place, avoiding vacancy situations. Intel has been a proponent of centralized issue queue designs for many years, with the first generation of its out-of-order multi-issue microarchitecture, P6, adopting a centralized issue queue. From Pentium 4’s Netburst, it switched to a distributed issue queue, and from the Core era, it reverted to a centralized issue queue, continuing to this day.
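The imbalance described above can be shown with a small worst-case model: count how many instructions can be buffered before the first stall, assuming nothing issues in the meantime. The queue sizes and names below are illustrative assumptions, not Loongson or Intel parameters, and `buffered_before_stall` is a hypothetical helper.

```python
from collections import Counter


def buffered_before_stall(stream, capacity_of, queue_for):
    """Instructions accepted in order until one finds its queue full.

    stream      -- instruction kinds in program order, e.g. ["int", "fp", ...]
    capacity_of -- queue name -> number of entries
    queue_for   -- instruction kind -> queue name
    """
    used = Counter()
    accepted = 0
    for kind in stream:
        queue = queue_for[kind]
        if used[queue] >= capacity_of[queue]:
            break                      # dispatch stalls here
        used[queue] += 1
        accepted += 1
    return accepted


integer_heavy = ["int"] * 32

distributed = buffered_before_stall(
    integer_heavy,
    capacity_of={"int_queue": 16, "fp_queue": 16},
    queue_for={"int": "int_queue", "fp": "fp_queue"},
)
centralized = buffered_before_stall(
    integer_heavy,
    capacity_of={"unified": 32},
    queue_for={"int": "unified", "fp": "unified"},
)
```

With the same 32 entries in total, the distributed layout stalls after 16 integer instructions while its floating-point queue sits empty, whereas the centralized layout buffers all 32.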

In the Core era, Intel’s centralized issue queue capacity was only 32 instructions, while AMD’s K8 had a total capacity of 60 instructions in its distributed issue queue, almost double. However, Intel has consistently increased the capacity of its issue queue, ultimately achieving a design of 72 entries in the centralized issue queue and 8 issue ports in Haswell. In the absence of concurrency constraints, this centralized issue queue can dispatch 8 instructions per cycle for out-of-order execution, making it a pinnacle of centralized issue queue design.

Loongson has not disclosed its dispatch width in the paper, but based on the configuration of the issue queue and execution units, I estimate it may be between 4 and 6 instructions per cycle.

Of course, in specific details, there are also commendable aspects of Loongson GS464E’s design. All frequently accessed execution units can complete operations in a single cycle and support back-to-back issuing in the case of data dependencies through aggressive data forwarding design. The memory access pipeline supports speculative issuing of memory access instructions and instruction replay (a challenging memory optimization technique that can shorten memory latency).

More commendably, the physical register file (PRF) and pointer-based issue queue processing logic have also been introduced early on, a route that Intel once abandoned but later had to adopt again to introduce AVX instruction sets. Loongson has smartly avoided the detours Intel once took.

However, these detailed improvements are insufficient to enable GS464E to challenge the Core i7 in out-of-order execution capabilities. How long it will take for Loongson to reach the design level of Haswell’s out-of-order execution engine will depend on whether Loongson’s physical and circuit design levels can support a larger issue queue, a more complex data forwarding network, and a physical register file with more concurrent read-write ports. These key structures are the focal points that underpin the design of the out-of-order execution engine.

Ample Capacity: The New Generation GS464E’s Cache System

GS464E’s level 1 data cache is also 64KB, four-way set associative, but has been changed to a serial access design, meaning that it first accesses the tag array to determine a hit before accessing the data array. The intention behind this design is to sacrifice a few cycles of memory access latency for lower memory power consumption. Given that GS464E can maintain a load-to-use latency of 4 cycles for the level 1 data cache, this cost is acceptable.
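The power argument can be made concrete by counting how many data-array ways are activated per access, a rough proxy for dynamic energy. This is a simplified model under stated assumptions: it ignores tag-array energy, and the 95% hit rate in the example is a made-up figure, not a measured one. Both function names are hypothetical.

```python
def data_array_reads(ways, hit, serial):
    """Data-array ways read by one access to a set-associative cache."""
    if serial:
        # Tags are checked first, so only the hitting way (if any) is read.
        return 1 if hit else 0
    # Parallel access reads every way while the tags are still being compared.
    return ways


def average_reads(ways, hit_rate, serial):
    """Expected data-array ways read per access for a given hit rate."""
    return (hit_rate * data_array_reads(ways, True, serial)
            + (1 - hit_rate) * data_array_reads(ways, False, serial))


parallel_cost = average_reads(ways=4, hit_rate=0.95, serial=False)
serial_cost = average_reads(ways=4, hit_rate=0.95, serial=True)
```

For a four-way cache the parallel scheme always reads four ways, while the serial scheme reads at most one, at the price of placing the tag lookup in front of the data read.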

In the more reliable SPEC CPU 2000 tests, the performance improvement of GS464E over the previous generation Loongson 3A exceeds 300% in some sub-tests.

Interestingly, beneath the level 1 data cache, each GS464E core has an independent cache system, which the Loongson group refers to as Victim Cache.

Generally, a Victim Cache is a small cache attached to the level 1 cache, capable of storing very little capacity, primarily to catch data evicted from the level 1 cache and quickly return them when needed. However, Loongson’s Victim Cache has a capacity of 256KB; by terminology convention, this is no longer a Victim Cache but a proper private level 2 cache. The reason for calling it a Victim Cache should be that this cache is designed to be mutually exclusive with the level 1 cache, meaning that instructions and data present in the level 1 cache are definitely not backed up in the level 2 cache.

For reference, AMD has also used the same mutually exclusive design. In contrast, Intel and IBM adhere to an inclusive design, meaning that contents appearing in the level 1 cache are guaranteed to exist in the level 2 cache. These two design approaches primarily affect cache hit rates and cache coherence maintenance in multi-core scenarios, each with its advantages and disadvantages.

The advantage of inclusive design is that it simplifies synchronization issues in multi-core computing, as the data in the level 1 cache is guaranteed to exist in the lower levels. Therefore, when querying the data synchronization state, only the lower storage needs to be queried. However, it also has a significant drawback of wasting cache space since multiple layers of cache retain multiple copies of the same data. Conversely, mutually exclusive designs avoid space waste, but every time multi-core synchronization occurs, the entire multi-level cache system must be searched, making the coherence issues in multi-core setups more complex.
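The mutually exclusive arrangement can be sketched as a two-level model in which a line lives in exactly one level at a time. For brevity both levels are fully associative with LRU replacement and sizes are given in lines; these are simplifying assumptions, not the GS464E organization, and the class name `ExclusiveHierarchy` is hypothetical.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    """L1 plus a victim-style L2 that never holds a line present in L1."""

    def __init__(self, l1_lines, victim_lines):
        self.l1 = OrderedDict()       # oldest entry first
        self.victim = OrderedDict()
        self.l1_lines = l1_lines
        self.victim_lines = victim_lines

    def access(self, line):
        if line in self.l1:
            self.l1.move_to_end(line)
            return "l1"
        if line in self.victim:
            del self.victim[line]     # the line moves up, it is not copied
            where = "victim"
        else:
            where = "miss"
        self._fill_l1(line)
        return where

    def _fill_l1(self, line):
        if len(self.l1) >= self.l1_lines:
            evicted, _ = self.l1.popitem(last=False)
            if len(self.victim) >= self.victim_lines:
                self.victim.popitem(last=False)
            self.victim[evicted] = True  # L1 evictions land in the victim level
        self.l1[line] = True


cache = ExclusiveHierarchy(l1_lines=2, victim_lines=4)
first_pass = [cache.access(line) for line in range(6)]
second_pass = [cache.access(line) for line in range(6)]
```

Six distinct lines fit in a hierarchy of 2 + 4 lines because nothing is duplicated, so the second pass never goes to memory. An inclusive hierarchy with the same arrays could hold only four distinct lines.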

From the latest publicly available test data, at the same frequency of 1GHz, the GS464E architecture has already exceeded the AMD FX-8320 in floating-point performance, approaching that of the Core i5 2300, which uses Sandy Bridge cores.

The level 2 cache of Loongson adopts a 16-way set associative design, using the same serial access mode as the level 1 data cache. According to Loongson’s paper, this cache system employs an LRU replacement algorithm. I suspect this may be a typographical error, or there may have been a communication gap between the paper’s author and the actual cache module designer.

The reason is that a 16-way set associative cache using a true LRU replacement algorithm would have to maintain a state machine with 16! = 20,922,789,888,000 states, which is clearly impractical. Historically, no cache design exceeding four-way set associativity has ever used a true LRU replacement strategy; GS464E should be employing a simplified pseudo-LRU algorithm here.
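The 16! figure can be checked directly, along with the storage it implies. The sketch below counts the possible recency orderings of a set and the minimum number of bits needed to encode one of them, and compares this with a tree-based pseudo-LRU, which needs one bit per internal tree node. Tree pseudo-LRU is used here only as a familiar example; the text does not say which variant GS464E uses, and both function names are hypothetical.

```python
import math


def true_lru_bits(ways):
    """Minimum bits per set to encode any of the ways! recency orderings."""
    return (math.factorial(ways) - 1).bit_length()


def tree_plru_bits(ways):
    """Bits per set for a binary-tree pseudo-LRU (ways must be a power of two)."""
    return ways - 1


for ways in (4, 8, 16):
    print(ways, math.factorial(ways), true_lru_bits(ways), tree_plru_bits(ways))
```

For 16 ways this gives 45 bits of state per set for exact LRU against 15 bits for the tree scheme, and the exact scheme must also reorder its state on every access.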

It should be noted that using a pseudo-LRU algorithm is not a performance flaw; a good pseudo-LRU algorithm’s replacement accuracy is nearly indistinguishable from that of LRU. In scenarios where true LRU cannot be implemented, all cache designs exceeding four-way set associativity have adopted pseudo-LRU replacements, including Intel, AMD, and IBM.
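As an illustration of what a pseudo-LRU policy looks like, here is a minimal binary-tree pseudo-LRU for one cache set. It is one well-known variant, shown as a sketch; nothing in the text indicates that GS464E uses this particular scheme, and the class name `TreePLRU` is hypothetical.

```python
class TreePLRU:
    """Binary-tree pseudo-LRU for a single set; ways must be a power of two."""

    def __init__(self, ways=16):
        assert ways >= 2 and ways & (ways - 1) == 0
        self.ways = ways
        # Heap-style tree: node 1 is the root, children of n are 2n and 2n+1.
        # Bit 0 means "next victim is in the left half", 1 means "right half".
        self.bits = [0] * ways  # index 0 is unused

    def touch(self, way):
        """Record an access: point every node on the path away from `way`."""
        node, lo, hi = 1, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1
                node, hi = 2 * node, mid
            else:
                self.bits[node] = 0
                node, lo = 2 * node + 1, mid

    def victim(self):
        """Follow the bits from the root to the way to replace."""
        node, lo, hi = 1, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node, mid
            else:
                node, lo = 2 * node + 1, mid
        return lo


plru = TreePLRU(ways=4)
for way in range(4):
    plru.touch(way)
first_victim = plru.victim()   # way 0, the same choice true LRU would make
plru.touch(0)
second_victim = plru.victim()  # way 2, whereas true LRU would pick way 1
```

The second choice shows the approximation: the policy never evicts the most recently used way, but it does not always find the exact least recently used one.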

Under this Victim Cache, there is a final level of shared on-chip level 3 cache called SCache, which is still 16-way set associative, with each SCache module being 1MB in size. The four core SCache modules combine to form 4MB. Generally, last-level cache systems are split into multiple banks and connected to a crossbar, allowing independent cores to access shared last-level cache. Loongson has directly connected the SCache to the outside of the GS464E core, possibly indicating that Loongson has adopted some design ideas from NoC (Network on Chip) in preparation for future multi-core and many-core expansions.

It is commendable that both the level 2 and level 3 caches of Loongson maintain large capacities and set associativity, but access latencies are relatively long. The access latency of the level 2 cache exceeds 20 cycles, nearly twice as slow as Intel processors’ level 2 caches, while the level 3 cache requires over 50 clock cycles, roughly on par with Intel processors.

Close Performance to Sandy Bridge: Practical Data Analysis

The practical data currently disclosed by Loongson is primarily obtained from RTL simulations and hardware acceleration simulation verification platforms, set at a frequency of 1GHz. If the actual chip can run at 1GHz and the interface timing is set correctly, there should be no significant difference in performance from actual chip operation.

As seen in Table 2, Loongson GS464E claims a memory performance improvement of 10-20 times. It is reported that the previous generation Loongson overly focused on core microarchitecture while neglecting memory controller design, even failing to properly support burst transfer modes, resulting in very low memory performance. This time, the surge in stream memory performance is attributed to fixing bugs in the memory controller, along with the addition of an aggressive multi-level prefetching mechanism. In tests involving Memcpy and Stream-Copy, Loongson’s memory controller, when operating with dual-channel DDR3-1000, shows a 20% performance gap compared to Ivy Bridge + single-channel DDR3-1333 platform in memory access with good locality.

Loongson has also disclosed test results for several small benchmarks like Whetstone, Coremark, and Dhrystone, as shown in Table 3. Generally speaking, the credibility of these small tests is lower than that of larger test suites like SPEC CPU and PARSEC. However, these small tests can easily run on Loongson’s RTL testing platform, which provides static timing analysis results and simulates a chip through RTL code without needing tape-out, making it more convenient.

Figure: Design Layout of Loongson 3A2000/3B2000

In other program tests, the GS464E processor architecture shows over 40% performance improvement in tests with many branch instructions like Dhrystone and fewer memory operations like Coremark.

Fortunately, Loongson has also disclosed the SPEC CPU 2000 test results, as shown in Table 4. Currently, the integer performance score of GS464E at 1GHz is 762, a 104% increase over the previous generation, while the floating-point score reaches 1125, an even more astonishing increase of 278%. Its overall performance is now very close to that of a Core i5 2300, which uses Sandy Bridge cores, running at the same 1GHz.

Based on the preliminary SPEC CPU 2000 results, Loongson’s IPC looks quite promising, but Loongson cannot celebrate yet. According to the latest disclosed information, the Loongson processors based on the GS464E architecture come in two main versions: the 3A2000, a single-socket quad-core desktop part, and the 3B2000, a server part supporting dual-socket 8-core and quad-socket 16-core configurations.
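The percentages quoted above imply approximate scores for the previous generation at the same frequency. The short calculation below back-derives them; the results are not figures from Table 4, only what the stated scores and increases imply, and they inherit any rounding in those percentages. The function name `implied_baseline` is hypothetical.

```python
def implied_baseline(new_score, increase_percent):
    """Previous score implied by a new score and a stated percentage increase."""
    return new_score / (1 + increase_percent / 100)


previous_int = implied_baseline(762, 104)   # integer: 762 after a 104% increase
previous_fp = implied_baseline(1125, 278)   # floating point: 1125 after 278%

print(f"implied previous integer score: {previous_int:.1f}")
print(f"implied previous floating-point score: {previous_fp:.1f}")
```

A 104% increase means roughly 2.04 times the earlier score, and a 278% increase means roughly 3.78 times.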

As this is the first version of a new architecture, the manufacturing process is still at 40nm, with a clock frequency of only around 1GHz. Considering that this is only one-third of the frequency of today’s Intel and AMD processors, the overall absolute performance of the new generation Loongson processors is estimated to be only about 20-30% of Haswell’s performance. When will they adopt a more advanced 28nm process for production, and whether they can significantly improve clock frequency on the new architecture remains a big question mark. Loongson still has a long way to go.
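The estimate above can be unpacked with a first-order model in which performance is proportional to per-clock throughput times frequency. This ignores memory latency and other effects that do not scale with clock speed, so it is only a rough reading of the numbers in the text. The function name `implied_per_clock_ratio` is hypothetical.

```python
def implied_per_clock_ratio(overall_ratio, frequency_ratio):
    """Per-clock performance ratio implied by overall and frequency ratios."""
    return overall_ratio / frequency_ratio


frequency_ratio = 1 / 3                 # "one-third of the frequency"
low = implied_per_clock_ratio(0.20, frequency_ratio)
high = implied_per_clock_ratio(0.30, frequency_ratio)

print(f"implied per-clock performance: {low:.0%} to {high:.0%} of Haswell")
```

Read this way, the 20-30% overall estimate corresponds to roughly 60-90% of Haswell's per-clock performance, with the frequency gap accounting for the rest.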

Conclusion: Success Cannot Be Achieved Overnight

According to information I have gathered, Loongson has now entered the military and aerospace markets, both of which place great importance on security while being relatively lenient on performance. A radiation-hardened version of Loongson has also been installed on Beidou satellites. The days when national leaders had to negotiate in person to import radiation-hardened chips can now be consigned to history. It remains very challenging, however, for Loongson to compete with Intel and AMD in the civilian market: the absolute performance gap is too great and is unlikely to be closed in the short term.

The Loongson project has been in operation for over 15 years. There has been wisdom in rejecting very long instruction word (VLIW) structures, but also naivety in the “one step to success” mentality; there has been arrogance in publicly claiming to defeat Intel, as well as sincerity in openly acknowledging the significant performance gap. These have all been solidified in the history of Loongson’s growth.

As time goes by, I believe that we need to set aside the past and maintain sufficient calm and rationality towards today’s progress of Loongson. As pointed out by the former director of the Institute of Computing Technology, Academician Li Guojie, in a 2004 article in the “Science and Technology Daily”:

“China’s CPU/SoC design has a long way to go. In the coming years, the performance of Loongson CPUs can only reach about half of the highest-performing CPUs abroad.” We must always remain clear-headed in recognizing that this industry has developed for over fifty years (considering the time since out-of-order execution was invented). With hundreds of thousands of top-level practitioners supporting the industry, Loongson, with its mere hundreds of staff and a fraction of the investment, achieving a performance level of a fraction of the best is already something to be proud of. As for catching up and surpassing, patience is still required.

Not long ago, during a visit to Loongson organized by the China Computer Federation, the project leader Hu Weiwu candidly stated, “Comparing a beggar to a dragon king, the more you compare, the more you fall behind,” and expressed the hope to “emphasize overall performance, so that the whole can come out ahead even where each individual part is not as good as others.” Loongson has currently plotted its path towards becoming a “pillar CPU industry” by 2020-2030, which will be a protracted battle.

If successful, China’s CPU industry will gain a giant capable of self-sufficiency and competing with the US and UK. Even if it fails, the investment made by the Loongson project over the years, along with its contributions as the first domestic out-of-order multi-issue high-performance CPU pioneer, and the talent cultivated, will still be a consolation.

(Source: mydrivers)
