The Genius of RISC-V Microprocessors

Author | Erik Engheim
Translator | Dongyu
Editor | Chen Si
In the late 1990s there was a great battle between RISC and CISC, and ever since, people have said that the difference between the two no longer matters much. Many say an instruction set is just an instruction set, with little impact on the CPU. But that is not the case: the instruction set determines which optimizations we can easily make in a microprocessor. This article introduces how the RISC-V processor's instruction set is designed and the benefits of that design.

I have recently been learning more about the RISC-V instruction set architecture (ISA), and there are a few things about RISC-V ISA that have left a very deep impression on me:

  1. It is a RISC instruction set, small in size and easy to learn. Anyone interested in learning about microprocessors should definitely choose it. Please refer to: RISC-V cheat sheet (https://www.cl.cam.ac.uk/teaching/1617/ECAD+Arch/files/docs/RISCVGreenCardv8-20151013.pdf).

  2. It occupies an important position in university digital-design teaching: Why Universities Want RISC-V (https://www.eejournal.com/article/why-universities-want-risc-v/)

  3. Designed cleverly, it allows CPU designers to use the RISC-V ISA to create high-performance microprocessors.

  4. There are no licensing fees, and it is designed to allow simple hardware implementations; in theory, a hobbyist could complete their own RISC-V CPU design in a reasonable amount of time.

  5. Easy to modify and use open-source design: The Berkeley Out-of-Order (BOOM) RISC-V processor (https://boom-core.org/)

For more details, please click: What is innovative about RISC-V? (https://medium.com/swlh/what-is-innovative-about-risc-v-a821036a1568)

The Revenge of RISC

As I began to understand RISC-V more deeply, I realized that RISC-V represents a fundamental shift back to what many considered an outdated era of computing. In terms of design, RISC-V is like traveling back in time to the classic RISC era of the 1980s and 1990s.

In recent years, many have pointed out that the difference between RISC and CISC no longer matters, because RISC CPUs like ARM have added so many instructions, many of them quite complex, that they are now more hybrid than purely RISC. The same goes for other RISC CPUs, such as PowerPC.

In contrast, RISC-V is a true, hardcore RISC CPU. In fact, if you have seen discussions about RISC-V online, you will find claims that it comes from a group of old-school RISC radicals who refuse to keep up with the times.

Former ARM engineer Erin Shepherd wrote an interesting commentary on RISC-V a few years ago:

The RISC-V ISA is overly focused on minimalism. It emphasizes minimizing the number of instructions and normalizing the encoding. This pursuit of minimalism leads to false orthogonality (for example, reusing the same instruction for branches, calls, and returns) and a need for superfluous instructions, which hurts code density in terms of both instruction size and instruction count.

Let me briefly provide some background. Smaller code is good for performance, because smaller code fits more easily into the fast CPU cache.

This criticism says the RISC-V designers were too focused on keeping the instruction set small, which was indeed one of the original goals of RISC. The downside is that programs then need more instructions to get their work done, and thus consume more memory. For years the conventional wisdom has been that RISC processors should add more instructions and become more CISC-like, on the theory that more specialized instructions can replace sequences of general-purpose ones.

Compressed Instructions and Macro Fusion

However, there are two particularly innovative features in CPU design that make the strategy of adding more complex instructions redundant:

· Compressed instructions: instructions are compressed in memory and decompressed in the first stage of the CPU pipeline.

· Macro fusion: two or more simple instructions read by the CPU are fused into one complex instruction.

In fact, ARM has already adopted both strategies, and x86 CPUs have adopted the latter, so this cannot be considered a new tactic for RISC-V.

However, the key is that RISC-V gains greater advantages from these strategies for two important reasons:

  1. It included compressed instructions from the very beginning. The Thumb2 compressed instruction format used on ARM had to be added as a separate ISA, requiring an internal mode switch and a separate decoder to handle it. RISC-V compressed instructions can be added to a CPU with about 400 additional logic gates (AND, OR, NOR, NAND gates).

  2. RISC-V's insistence on keeping the number of unique instructions low pays off, leaving more encoding space free for compressed instructions.

Instruction Encoding

This section needs some explanation. Instructions in the RISC architecture are typically 32 bits wide. These bits need to be used to encode different information. For example, the following instruction:

ADD x1, x4, x8    # x1 ← x4 + x8

This instruction adds the contents of registers x4 and x8 and stores the result in x1. How many bits we need to encode this depends on how many registers we have. RISC-V and ARM64 have 32 registers. We can represent the number 32 with 5 bits:

2⁵ = 32

Since we need to specify 3 different registers, we need a total of 15 bits (3×5) to encode the operands (inputs for the addition operation).

Therefore, the more things we want to support in the instruction set, the more bits will be consumed from those 32 bits. Of course, we could use 64-bit instructions, but that would consume too much memory and degrade performance.

By deliberately keeping the number of instructions low, RISC-V frees up encoding space to mark which instructions are compressed. If the CPU sees a particular pattern in an instruction's lowest bits, it knows that instruction should be interpreted as a compressed one.
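Here is a minimal sketch, in Python for illustration, of how a decoder can tell the two formats apart: in the RISC-V encoding, full 32-bit instructions have their two lowest bits set to 11, while compressed 16-bit instructions use any other value there.

```python
def instruction_length(halfword: int) -> int:
    """Return the length in bytes of a RISC-V instruction,
    given its first 16 bits.

    In the base encoding, 32-bit instructions always have their
    two lowest bits set to 0b11; compressed 16-bit instructions
    use any other value in those two bits."""
    return 4 if (halfword & 0b11) == 0b11 else 2

# 0x8082 encodes C.JR ra (a compressed instruction): low bits are 0b10
assert instruction_length(0x8082) == 2
# ADD x1, x4, x8 encodes as 0x008200B3; its low 16 bits end in 0b11
assert instruction_length(0x00B3) == 4
```

Because only two bits are examined, this check costs almost nothing in hardware, which is part of why compressed-instruction support is so cheap to add.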

Compressed Instructions—Two to One

This means we can pack two 16-bit wide instructions into a 32-bit word, rather than having a 32-bit word contain only one instruction. Of course, not all RISC-V instructions can be represented in 16-bit format. Therefore, a subset of 32-bit instructions is selected based on their utility and usage frequency. Uncompressed instructions can accept 3 operands (inputs), while compressed instructions can only accept 2 operands. Therefore, the compressed ADD instruction should look like this: (the # sign indicates a comment)

C.ADD x4, x8     # x4 ← x4 + x8

RISC-V assembly uses the prefix C. to indicate that an instruction should be assembled into a compressed instruction. In practice you don't need to write it: the RISC-V assembler can choose compressed or uncompressed instructions as appropriate.

Essentially, compressed instructions reduce the number of operands. Three register operands will consume 15 bits, leaving only 1 bit to specify the operation. Therefore, by reducing the number of operands to two, we have 6 bits left to specify the opcode (the operation to be performed).
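The bit arithmetic above can be made concrete with a back-of-the-envelope calculation (a Python sketch; real encodings split these fields in more intricate ways):

```python
REG_BITS = 5          # 2**5 = 32 addressable registers

# Uncompressed 32-bit instruction with 3 register operands:
operand_bits = 3 * REG_BITS       # 15 bits for rd, rs1, rs2
remaining_32 = 32 - operand_bits  # bits left for opcode/function fields

# A 16-bit compressed instruction:
remaining_3ops = 16 - 3 * REG_BITS  # only 1 bit left -> too few for an opcode
remaining_2ops = 16 - 2 * REG_BITS  # 6 bits left to specify the operation

print(remaining_32, remaining_3ops, remaining_2ops)  # prints: 17 1 6
```

This is exactly the trade-off described above: dropping to two operands is what buys a usable 6-bit opcode field inside 16 bits.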

This is actually quite similar to how x86 assembly works, where there aren’t enough bits to retain 3 register operands. Instead, x86 uses some bits to allow instructions like ADD to read inputs from memory and registers.

Macro Fusion—One to Two

However, when we look at instruction compression and macro fusion together, we can see the real benefits. You see, if the CPU receives a 32-bit word containing two 16-bit compressed instructions, it can merge them into a single complex instruction.

This sounds ridiculous, are we back to square one? Are we back to a CISC-style CPU, which is exactly what we are trying to avoid?

No, because we avoid filling the ISA specification with a large number of complex instructions, as the x86 and ARM strategies do. Instead, we express what amounts to a whole set of complex instructions indirectly, through combinations of simple ones.

Under normal circumstances, macro fusion has a problem: while two instructions can be replaced by one, they still consume double the memory space. But with instruction compression, we do not consume more space. We achieve the best of both worlds.

Let’s take a look at an example from Erin Shepherd. In her criticism of the RISC-V ISA, she presented a simple C function. To explain it more clearly, I have rewritten it as follows:

int get_index(int *array, int i) {
    return array[i];
}

Compiled on x86 as:

mov eax, [rdi+rsi*4]
ret

When you call a function in a programming language, parameters are typically passed to the function in registers according to established conventions, which will depend on the instruction set you are using. On x86, the first parameter goes in register rdi, and the second in rsi. By convention, the return value must go in register eax.

The first instruction multiplies the contents of rsi, which holds the variable i, by 4. Why multiply? Because the array is made up of 4-byte integer elements, the spacing between elements is 4 bytes. The byte offset of element 3 in the array, for example, is 3×4 = 12.

Then we add this to rdi, which contains the base address of the array. That gives us the final address of the i-th element. We read the contents of the memory location at that address and store it in eax: mission accomplished.
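The address calculation performed by that single mov can be written out explicitly. A small Python sketch of scaled-index addressing (a hypothetical helper, for illustration only):

```python
def scaled_index_address(base: int, index: int, scale: int) -> int:
    """Compute base + index*scale, as in the x86 addressing mode
    [rdi + rsi*4]. The scale factor must be 1, 2, 4, or 8."""
    assert scale in (1, 2, 4, 8)
    shift = scale.bit_length() - 1   # 1->0, 2->1, 4->2, 8->3
    return base + (index << shift)   # shifting left = multiplying by the scale

# An array of 4-byte ints at base address 0x1000; look up element i = 3:
assert scaled_index_address(0x1000, 3, 4) == 0x1000 + 12
```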

On ARM, it is quite similar:

LDR r0, [r0, r1, lsl #2]
BX  lr ; return

Here we do not multiply by 4; instead we shift register r1 left by 2 bits, which is equivalent to multiplying by 4. This probably also reflects more honestly what happens in the x86 code: on x86, the scale factor can only be 1, 2, 4, or 8, all of which can be achieved by shifting left by 0, 1, 2, or 3 bits.

Enough about x86 and ARM. Now let's get to RISC-V, where the really interesting content begins! (the # sign indicates a comment)

SLLI a1, a1, 2     # a1 ← a1 << 2
ADD  a0, a0, a1    # a0 ← a0 + a1
LW   a0, 0(a0)     # a0 ← [a0 + 0]
RET

Registers a0 and a1 on RISC-V are just aliases for x10 and x11. They are where the first and second arguments of a function call are placed. RET is a pseudo-instruction, shorthand for:

JALR x0, 0(ra)     # pc ← ra + 0
                   # x0 ← pc + 4  (result discarded)

JALR jumps to the address referenced by ra, which is an alias for x1.
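To make the expansion clearer, here is a toy Python model of JALR's semantics (illustrative only): pc is the address of the JALR instruction itself, and x0 is hardwired to zero, so writes to it are discarded.

```python
def jalr(regs: dict, pc: int, rd: str, rs1: str, offset: int) -> int:
    """Model of RISC-V JALR: jump to regs[rs1] + offset and
    write the return address (pc + 4) into rd.
    Writes to x0 are discarded, since x0 is hardwired to zero."""
    target = (regs[rs1] + offset) & ~1   # RISC-V clears the lowest bit
    if rd != "x0":
        regs[rd] = pc + 4                # save the link (return) address
    return target                        # the new pc

regs = {"ra": 0x2000}          # return address left in ra by the caller
new_pc = jalr(regs, pc=0x1000, rd="x0", rs1="ra", offset=0)
assert new_pc == 0x2000        # control returns to the caller
assert "x0" not in regs        # the link value was thrown away
```

Using x0 as the destination is what turns a general jump-and-link into a plain return.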

Regardless of how you look at it, this seems pretty terrible, right? For such a simple and common operation as an indexed table lookup returning a value, it takes twice as many instructions.

It does indeed look terrible. This is why Erin Shepherd strongly criticized RISC-V’s design choices. She wrote:

RISC-V's simplifications make the decoder (i.e., the CPU front end) simpler, at the cost of executing more instructions. But the real hard problem is widening the pipeline: slightly (or even very) irregular instructions are not a big deal to decode; the main difficulty is determining instruction lengths, and x86 is especially bad here because of its many prefixes.

However, thanks to instruction compression and macro fusion, we can regain some ground.

C.SLLI a1, 2      # a1 ← a1 << 2
C.ADD  a0, a1     # a0 ← a0 + a1
C.LW   a0, 0(a0)  # a0 ← [a0 + 0]
C.JR   ra

Now, this occupies the same memory space as the example on ARM.

Okay, next let’s do some macro fusion!

In RISC-V, one of the rules that allows multiple operations to be fused into one is that they share the same destination register. The ADD and LW (load word) instructions above meet this condition, so the CPU can convert them into a single instruction.
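As a sketch of how such a fusion rule might look in a decoder (hypothetical Python, not how any real core is implemented): the ADD and the LW are fused only when they write the same destination register and the LW addresses memory through that register.

```python
def try_fuse(i1: dict, i2: dict):
    """Fuse ADD rd, rs1, rs2 followed by LW rd, offset(rd) into a
    single indexed-load micro-op, mirroring the rule that the two
    instructions must share a destination register."""
    if (i1["op"] == "ADD" and i2["op"] == "LW"
            and i1["rd"] == i2["rd"]        # same destination register
            and i2["base"] == i1["rd"]):    # LW addresses via that register
        return {"op": "LW_INDEXED", "rd": i1["rd"],
                "base": i1["rs1"], "index": i1["rs2"],
                "offset": i2["offset"]}
    return None  # not fusible; emit the two instructions separately

add = {"op": "ADD", "rd": "a0", "rs1": "a0", "rs2": "a1"}
lw  = {"op": "LW",  "rd": "a0", "base": "a0", "offset": 0}
assert try_fuse(add, lw) == {"op": "LW_INDEXED", "rd": "a0",
                             "base": "a0", "index": "a1", "offset": 0}
```

The fused micro-op reads the original base and index registers, so it behaves exactly like the ARM-style indexed load discussed below.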

If SLLI wrote to the same destination register as well, we could merge all three instructions into one. The CPU would then see something equivalent to the more complex ARM instruction:

LDR r0, [r0, r1, lsl #2]

Why can't we write this macro-operation directly in our code?

Because our ISA does not support it! Remember, the available bits are limited. Why not use longer instructions, then? Because that would consume too much memory and quickly fill up the precious CPU cache.

However, if we create these semi-complex long instructions inside the CPU, there is nothing to worry about. The CPU only deals with at most a few hundred instructions in flight at any given time, so spending 128 bits on each one is no big deal: there is plenty of silicon for that.

Thus, when the decoder receives a normal instruction, it typically converts it into one or more micro-operations. These micro-operations are the instructions the CPU actually processes. They can be very wide and contain a lot of additional useful information, so calling them "micro" seems ironic. The "micro" in fact refers to the limited amount of work each one performs.

Complexity of Instructions

Macro fusion slightly changes the decoder's job: instead of converting one instruction into multiple micro-operations, it converts multiple instructions into one micro-operation.

Therefore, what happens in modern CPUs looks quite strange:

  1. First, it merges two instructions into one through compression.

  2. Then it splits it into two parts through decompression.

  3. By macro fusion, it merges them into one operation.

Other instructions may end up being split into multiple micro-operations instead of being fused. Why do some get fused while others get split? Is there any system behind this apparent chaos?

The key is that the complexity of the micro-operation must be appropriate:

  • It cannot be too complex, otherwise it cannot be completed within the fixed number of clock cycles allocated for each instruction.

  • It cannot be too simple, because that purely wastes CPU resources. Executing two micro-operations takes twice as long as executing one micro-operation.

All of this began with CISC processors. Intel began breaking down complex CISC instructions into micro-operations so that they could adapt to pipelines more easily like RISC instructions. However, in later designs, they realized that many CISC instructions were so simple that they could easily be fused into a moderately complex instruction. The fewer instructions you execute, the faster you naturally complete them.

Benefits

Well, the above explains a lot of details, and you may find it difficult to grasp the main points at once. Why do compression and fusion? It sounds like a lot of extra work to do.

First, instruction compression is completely different from zip-style compression. The term "compression" is actually a bit of a misnomer, because decompressing a compressed instruction on the fly is very simple and wastes no time. Remember, for RISC-V this is cheap: decompression requires only about 400 logic gates.

Macro fusion is similar. While it seems complex, these methods have already been applied in modern microprocessors. Therefore, the complexity of this has already been paid for.

However, unlike the designers of ARM, MIPS, and x86, the RISC-V designers understood instruction compression and macro fusion when they began designing their ISA. Or, more accurately, those competitors lacked this knowledge when their original ISAs were designed. They may have been aware of it when designing the 64-bit versions of the x86 and ARM instruction sets, so why didn't they use it? We can only speculate. Perhaps these companies prefer not to deviate too far from earlier versions when creating a new ISA; the focus is typically on eliminating obvious past mistakes rather than overturning previous foundations.

Through various tests on the first minimal instruction set (https://arxiv.org/pdf/1607.02318.pdf), the designers of RISC-V made two important discoveries:

  1. Typically, RISC-V programs occupy about the same or less memory than programs for any other CPU architecture, including x86, long regarded as the most space-efficient thanks to its CISC ISA.

  2. RISC-V programs require fewer micro-operations to execute.

Essentially, because they considered fusion when designing the basic instruction set, they were able to fuse enough instructions so that for any given program, the number of micro-operations the CPU must execute is fewer than that of competitors.

This led the RISC-V team to double down on macro fusion and make it a core RISC-V strategy. The RISC-V manual contains many descriptions of which operations can be fused, and you can see which instructions were revised to make fusing common patterns easier.

Keeping the ISA small means students find it easier to learn. In other words, for a student learning CPU architecture, actually building a CPU that runs RISC-V instructions will be easier.

RISC-V has a small core instruction set that everyone must implement. All other instructions exist as extensions. Compressed instructions are just one optional extension. Therefore, if it is a simple design, it can be omitted.

Macro fusion is just an optimization. It does not change the overall behavior, so it does not need to be implemented in specific RISC-V processors.

In contrast, for ARM and x86, much of the complexity is not optional. The entire instruction set and all complex instructions must be implemented, even if you just want to create the smallest and simplest CPU core.

RISC-V Design Strategy

RISC-V leverages our understanding of modern CPUs today and uses this knowledge to guide their choices in design. For example, we know:

  • Modern CPU cores predict branches ahead of time, with accuracy exceeding 90%.

  • CPU cores are superscalar architectures, meaning they execute multiple instructions in parallel.

  • They achieve superscalar architecture through out-of-order execution.

  • They are pipelined.

This means there is no longer a need for features like conditional execution, which ARM supports. Supporting it on ARM consumes bits in the instruction format; RISC-V saves those bits.

The original purpose of conditional execution was to avoid branches because they are detrimental to the pipeline. To run quickly, CPUs typically prefetch the next instruction so that when the previous instruction completes its first phase, it can quickly select the next instruction.

With conditional branches, however, you do not know where the next instruction will be when you start filling the pipeline. Yet a superscalar CPU can simply execute both branch paths in parallel.

This is also why RISC-V has no status register: status flags create dependencies between instructions. The more independent each instruction is, the easier it is to run it in parallel with others.
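A toy dependency check illustrates the point (hypothetical Python, for illustration only): two x86-style compares both write the shared flags register and therefore cannot be reordered, while RISC-V-style compare-and-branch instructions share no such state.

```python
def independent(i1: dict, i2: dict) -> bool:
    """Two instructions can run in parallel or out of order only if
    neither one writes a register the other reads or writes."""
    return not (i1["writes"] & (i2["reads"] | i2["writes"]) or
                i2["writes"] & i1["reads"])

# x86-style: both compares write the shared FLAGS register
cmp1 = {"reads": {"eax", "ebx"}, "writes": {"flags"}}
cmp2 = {"reads": {"ecx", "edx"}, "writes": {"flags"}}
assert not independent(cmp1, cmp2)   # serialized by the flags register

# RISC-V-style: compare-and-branch instructions set no flags
blt1 = {"reads": {"a0", "a1"}, "writes": set()}
blt2 = {"reads": {"a2", "a3"}, "writes": set()}
assert independent(blt1, blt2)       # free to run in parallel
```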

The RISC-V strategy is essentially this: make the ISA as simple as possible, and the minimal RISC-V CPU implementation as simple as possible, without making design decisions that would hurt the performance of a high-end CPU.

For more content, please read: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V

What Does the Industry Say?

Well, this may sound good in theory, but does it really hold true in the real world? What do tech companies think about it? Do they believe that RISC-V ISA offers tangible benefits compared to commercial ISAs (like ARM)?

The Electronic Engineering Journal is a great publication, and many people enjoy reading about this topic there. It interviewed Dave Ditzel, a veteran microprocessor expert whose company, Esperanto Technologies, evaluated different options for building specialized hardware to accelerate machine learning. Jim Turley wrote (https://www.eejournal.com/article/another-risc-v-religious-conversion/):

RISC-V was not even on the procurement shortlist, but as Esperanto engineers researched it more and more, they gradually realized it was not just a toy or a teaching tool. “We think RISC-V (compared to Arm or MIPS or SPARC) might lose 30% to 40% of compilation efficiency because it is too simple,” Ditzel said. “But our compiler experts benchmarked it, and unbelievably it was only about 1%.”

Esperanto Technologies is still a small company. What about large companies like NVIDIA, which have a lot of experienced chip designers and resources? NVIDIA uses a general-purpose processor called “Falcon” on their boards. When evaluating alternatives, RISC-V ranked high (https://riscv.org/wp-content/uploads/2017/05/Tue1345pm-NVIDIA-Sijstermans.pdf).

Original English text:

https://erik-engheim.medium.com/the-genius-of-risc-v-microprocessors-b19d735abaa6

Translator’s Profile: Dongyu, a small tech enthusiast engaged in R&D process and quality improvement, focusing on programming, software engineering, agile, DevOps, cloud computing, etc., very willing to share fresh IT news and in-depth technical articles translated from abroad, has translated and published “In-depth Agile Testing” and “Continuous Delivery in Practice”.
