Why Apple’s M1 Chip is Superior to Intel’s i9

Selected from Debugger

Author: Erik Engheim
Translated by: Machine Heart
Contributors: Xiao Zhou, Ze Nan
Beyond light office work, Apple's Arm-based computers can also handle gaming, video playback, and even deep learning, all with respectable efficiency.
Recently, many people's M1-equipped Apple MacBooks and Mac Minis have arrived. In test after test we have seen exciting results: the M1's performance is comparable to that of high-end x86 processors such as the Ryzen 4900HS and Intel Core i9, and it can even compete with Nvidia's GTX 1050 Ti GPU. Is this 5-nanometer chip really that good?


Since Apple launched Mac products equipped with the self-developed M1 chip, people have been curious about the M1 chip, with various reviews emerging. Recently, developer Erik Engheim wrote a long article analyzing the technical reasons behind the M1 chip’s speed and the disadvantages of chip manufacturers like Intel and AMD.
This article will discuss the following questions about Apple’s M1 chip:
  1. The technical reasons behind the M1 chip’s fast speed.

  2. Did Apple use any special technology to achieve this?

  3. How difficult would it be for Intel and AMD to do the same?

Given the large number of technical terms in Apple’s official promotions, let’s start with the basics.
What is a CPU?
When we talk about Intel and AMD chips, we usually mean the Central Processing Unit (CPU), or microprocessor. It fetches instructions from memory and executes them one after another.


A basic RISC CPU (not the M1). Instructions move along the blue arrow from memory into the instruction register. A decoder interprets each instruction and, via the red control lines, activates the relevant parts of the CPU. The ALU adds and subtracts the numbers held in the registers.
A CPU is essentially a device containing many memory cells called registers and a computing unit called the Arithmetic Logic Unit (ALU). The ALU performs addition, subtraction, and other basic mathematical operations, but it is connected only to the CPU's registers. If you want to add two numbers, you must first fetch both from memory and place them into two of the CPU's registers.
Here are some typical instruction examples executed by the RISC CPU (the type of CPU in M1):

load r1, 150
load r2, 200
add  r1, r2

In this example, r1 and r2 are registers. A RISC CPU cannot operate on numbers that are not in registers; it cannot, for instance, add numbers sitting at two different memory locations. Both numbers must first be placed into separate registers.
In the example above, we must first load the number at memory address 150 into register r1 and the number at address 200 into r2; only then can the two numbers be added with the add r1, r2 instruction.
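The load/load/add sequence can be sketched as a toy register machine in Python. The memory contents (7 and 35) and the helper names are invented for this illustration; only the load/add semantics come from the text above:

```python
# Toy register machine modeling the load/load/add sequence above.
memory = {150: 7, 200: 35}   # pretend RAM: address -> value (values made up)
regs = {"r1": 0, "r2": 0}    # two CPU registers

def load(reg, addr):
    """Copy a value from memory into a register."""
    regs[reg] = memory[addr]

def add(dst, src):
    """Add two registers; the result stays in a register, never in memory."""
    regs[dst] += regs[src]

load("r1", 150)    # load r1, 150
load("r2", 200)    # load r2, 200
add("r1", "r2")    # add  r1, r2
print(regs["r1"])  # 42: the ALU only ever touched registers, not memory
```

Note that `add` never reads or writes `memory`: that restriction is exactly what distinguishes a RISC (load/store) design.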


This old mechanical calculator has two registers: an accumulator and an input register. Modern CPUs usually have dozens of registers, and they are electronic, not mechanical.
The concept of registers has been around for a long time. For example, in the mechanical calculator shown above, registers are used to store the numbers being added.
M1 is not just a CPU
The M1 chip is not just a CPU; it integrates multiple chips into one, with the CPU being just one part.
It can be said that the M1 puts the entire computer on a single chip. The M1 includes the CPU, GPU, memory, input/output controllers, and many other components required for a complete computer, which is the concept of SoC (System on Chip) we often see in smartphones.


The M1 is a system on a chip. That is, all the components that make up a computer are placed on a single silicon chip.
Nowadays, if you buy a chip from Intel or AMD, you are actually getting a microprocessor package, whereas in the past, computer motherboards had multiple individual chips.


Example of a computer motherboard, which contains memory, CPU, graphics card, IO controller, network card, and other components.
However, we can now integrate a large number of transistors on a single silicon chip, so companies like AMD and Intel are starting to place multiple microprocessors on a single chip. We call these chips CPU cores. A core is essentially a completely independent chip that can read instructions from memory and perform calculations.


Microchips with multiple CPU cores.
For a long time, adding more general-purpose CPU cores was the main way to raise chip performance, but one manufacturer has taken a different path.
Apple’s heterogeneous computing strategy is not that mysterious
In the pursuit of performance, Apple did not simply add more general-purpose CPU cores; instead it adopted another strategy: adding more specialized chips that each perform a specific task. The benefit is that a specialized chip can do its job using less power, and faster, than a general-purpose CPU core.
This is not a brand new technology. Specialized chips like Graphics Processing Units (GPUs) have existed in Nvidia and AMD graphics cards for years, performing graphics-related operations much faster than general-purpose CPUs.
Apple is just taking this direction further. In addition to general cores and memory, the M1 includes a series of specialized chips:
  • CPU (Central Processing Unit): The “brain” of the SoC, running most of the operating system and app code.

  • GPU (Graphics Processing Unit): Handles graphics-related tasks, such as visualizing app user interfaces and 2D/3D games.

  • IPU (Image Processing Unit): Accelerates common tasks undertaken by image processing applications.

  • DSP (Digital Signal Processor): Handles mathematically intensive work more efficiently than the CPU, such as decompressing music files.

  • NPU (Neural Processing Unit): Found in high-end smartphones; accelerates machine learning tasks such as speech recognition.

  • Video Encoder/Decoder: Handles video file processing and format conversion in a highly efficient manner.

  • Secure Enclave: Responsible for encryption, authentication, and maintaining security.

  • Unified Memory: Allows the CPU, GPU, and other cores to quickly exchange information.


This is part of the reason why using M1 Macs for image and video editing has improved speed. Many such tasks can run directly on specialized hardware, allowing the relatively inexpensive M1 Mac Mini to easily encode large video files.


You might wonder how unified memory differs from shared memory. Having the GPU share part of main memory as video memory hurts performance, because the CPU and GPU must take turns accessing it: sharing means contending for the data bus.
Unified memory is different. With unified memory the CPU and GPU can access memory at the same time, and each can simply tell the other where data is located. Previously, the CPU had to copy data from its own main-memory region into the region used by the GPU; with unified memory there is nothing to copy, since the GPU can use the data in place once it is told where the data lives.
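The copy-versus-reference difference can be loosely illustrated in Python. This is my analogy, not how the hardware literally works: copying a buffer stands in for the old shared-memory model, while a zero-copy `memoryview` stands in for telling the GPU where the data lives:

```python
# Analogy only: "gpu_copy" models copying data into a separate GPU region;
# "gpu_view" models handing the GPU a reference into the same memory.
data = bytearray(b"frame-pixels" * 1000)

# Old model: the "GPU" gets its own duplicate (time and memory cost).
gpu_copy = bytes(data)

# Unified-memory model: the "GPU" is just told where the data lives.
gpu_view = memoryview(data)        # zero-copy reference into the same buffer

data[0] = ord("F")                 # CPU mutates the buffer...
print(gpu_view[0] == ord("F"))     # ...the view sees the change immediately
print(gpu_copy[0] == ord("F"))     # ...the stale copy does not
```

The view reflects the CPU's write instantly; the copy is stale and would have to be re-copied, which is precisely the overhead unified memory avoids.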
This means that the various proprietary coprocessors on the M1 can quickly exchange information using the same memory pool, significantly enhancing performance.
Why don’t Intel and AMD follow this strategy?
Other ARM chip manufacturers are also increasingly investing in specialized hardware. AMD has started to install more powerful GPUs on some chips and is gradually moving towards some form of SoC with accelerated processors (APUs that place CPU and GPU cores on the same chip).


AMD Ryzen accelerated processors combine CPU and GPU on the same chip but do not include other coprocessors, IO controllers, or unified memory.
There are important reasons why Intel and AMD do not do this. SoC is essentially an entire computer on a chip, which makes it very suitable for actual computer manufacturers like HP and Dell. Computer manufacturers can simply obtain ARM intellectual property licenses and purchase the IP of other chips to add any specialized hardware they think their SoC should have. They then hand the completed design over to semiconductor foundries like GlobalFoundries and TSMC, which is now the chip foundry for AMD and Apple.
Here is the problem: Intel's and AMD's business models are built on selling general-purpose CPUs that simply plug into large PC motherboards. Computer manufacturers buy motherboards, memory, CPUs, and graphics cards from different suppliers and integrate them. But that approach is fading.
In the SoC era, computer manufacturers no longer need to assemble physical components from different suppliers, but rather assemble IP (intellectual property) from different suppliers. They purchase designs for graphics cards, CPUs, modems, IO controllers, etc., from different suppliers and use them to design SoCs, then look for foundries to complete the manufacturing process.
But Intel, AMD, and Nvidia will not give their intellectual property to Dell or HP to allow them to manufacture their own SoCs.
Of course, Intel and AMD may just start selling complete SoCs, but what would be included? PC manufacturers may have different ideas about what should be included in the SoC. Conflicts may arise between Intel, AMD, Microsoft, and PC manufacturers regarding what specialized chips the SoC should contain, as these chips require software support.
For Apple, this is simple. Apple controls the entire product, such as providing machine learning developers with libraries like Core ML. Whether Core ML runs on Apple’s CPU or Neural Engine is a detail that developers do not need to worry about.
Fundamental challenges of CPU speed
Heterogeneous computing is only part of the story. The M1's fast general-purpose CPU core, Firestorm, really is very fast, a dramatic jump over previous ARM CPU cores, which had always been weak compared to AMD's and Intel's. Firestorm, by contrast, beats most Intel cores and comes close to beating the fastest AMD Ryzen cores.
Before we get into why Firestorm is so fast, let's first understand what it really means for a CPU to be fast.
In principle, CPU acceleration tasks can be accomplished through the following two strategies:
  • Execute more instructions in order at a faster speed;

  • Execute a large number of instructions in parallel.

In the 1980s this was easy: just raise the clock frequency and instructions executed faster. Each clock cycle is the time the computer takes to perform one task, but that task may be very small; an instruction consists of several smaller tasks, so it may take multiple clock cycles.
However, it is now nearly impossible to increase clock frequency, so the second strategy of “executing a large number of instructions in parallel” has become the focus of current research and development.
Multicore or out-of-order processors?
There are two solutions to this problem. One is to introduce more CPU cores. From a software developer's perspective, this is like adding threads, with each CPU core acting as one hardware thread. A dual-core CPU can execute two separate tasks, i.e., two threads, simultaneously. The tasks can be two separate programs stored in memory, or the same program executed twice. Each thread keeps track of its own position in the program's instruction sequence, and each stores its temporary results separately.
In principle, a processor can run multiple threads on a single core: it pauses one thread, saves its current state, switches to another thread, and switches back later. This yields little performance improvement and is useful mainly when threads spend much of their time suspended, waiting for user input, data from a slow network, and so on. These can be called software threads. Hardware threads mean that actual extra physical hardware, such as additional cores, speeds up processing.


The problem is that developers must write code to take advantage of this. Some tasks (such as server software) are easy to write. You can imagine handling each user’s connection separately. These tasks are independent of each other, so having a large number of cores is an excellent choice for servers (especially cloud-based services).


The Ampere Altra Max ARM CPU with 128 cores is designed specifically for cloud computing, where a large number of hardware threads is an advantage.
That is why you see chips like the 128-core Ampere Altra Max ARM CPU. It is built for cloud computing, where crazy single-core performance is unnecessary; the cloud wants as many threads per watt as possible to handle as many concurrent users as possible.
Apple, on the other hand, produces single-user devices, where having a large number of threads is not an advantage. Apple’s devices are mostly used for gaming, video editing, development, etc. Apple wants desktops to have beautifully responsive graphics and animations.
Desktop software typically cannot take advantage of many cores. A computer game, for example, might make use of 8 cores, so 128 cores would be a complete waste. What such users need is fewer, but more powerful, cores.
Out-of-order execution is a way to execute more instructions in parallel without executing them in multithreaded form. Developers do not need to specifically code their software to take advantage of it. From the developer’s perspective, each core runs faster.
To understand how it works, we need a little background on memory. Requesting data from a specific memory location is slow; however, fetching 128 bytes takes barely longer than fetching 1 byte. Data travels over the data bus, which you can think of as a channel or pipe between memory and the various parts of the CPU. In reality it is just some copper traces that conduct electricity. If the data bus is wide enough, you can fetch many bytes at once.
Thus, the CPU fetches an entire block of instructions at once, even though those instructions were written to be executed one after another. Modern microprocessors perform "out-of-order execution": they can quickly analyze the buffer of instructions and see which depend on which. Here's an example:
01: mul r1, r2, r3    // r1 ← r2 × r3
02: add r4, r1, 5     // r4 ← r1 + 5
03: add r6, r2, 1     // r6 ← r2 + 1
The multiplication is a slow operation that requires multiple clock cycles to execute. The second instruction just has to wait because its computation depends on knowing the result placed in the r1 register first. However, the third instruction does not depend on the result of the previous instruction, so the out-of-order processor can compute this instruction in parallel.
But in reality, there are often hundreds of instructions, and the CPU can figure out all the dependencies between these instructions.
By looking at each instruction's inputs, the registers holding earlier computation results, the CPU determines whether the instruction depends on the output of one or more other instructions.
For example, in the example above, the add r4, r1, 5 instruction depends on the input from r1, which is obtained from the mul r1, r2, r3 instruction.
We can link these relationships together to form a detailed graph that the CPU can process. The nodes of the graph represent instructions, and the edges represent the registers connecting them. The CPU can analyze this type of node graph and determine which instructions can be executed in parallel and which need to wait for multiple related calculations to complete before proceeding.
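This dependency analysis can be sketched as a toy scheduler for the three-instruction example above. The representation is deliberately simplified (each instruction lists only its destination register and the registers it reads), so this is an illustration of the idea, not a real scheduler:

```python
# Toy out-of-order scheduler: build the dependency graph for the three
# instructions above and find which ones can issue in parallel.
instrs = [
    ("mul", "r1", ("r2", "r3")),   # 01: r1 <- r2 * r3
    ("add", "r4", ("r1",)),        # 02: r4 <- r1 + 5  (depends on 01)
    ("add", "r6", ("r2",)),        # 03: r6 <- r2 + 1  (independent)
]

def ready(i, done):
    """An instruction is ready when no still-pending instruction
    produces any of the registers it reads."""
    pending = {dst: j for j, (_, dst, _) in enumerate(instrs) if j not in done}
    return all(src not in pending for src in instrs[i][2])

done, waves = set(), []
while len(done) < len(instrs):
    # Everything ready in the same pass could execute in parallel.
    wave = [i for i in range(len(instrs)) if i not in done and ready(i, done)]
    waves.append(wave)
    done.update(wave)

print(waves)   # [[0, 2], [1]]: instructions 01 and 03 issue together
```

The first "wave" contains the mul and the independent add, exactly as described in the text; the dependent add must wait for the next wave.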
Many instructions finish early, but their results cannot be committed right away: committing them out of order would make the program's visible behavior incorrect, because results must appear in program order. So the CPU retires completed instructions from the front of the buffer, stopping as soon as it reaches one that is still unfinished.
This out-of-order machinery is what lets the M1's Firestorm core process so many instructions in parallel, and it does this better than Intel's or AMD's products.
Why is Intel and AMD’s out-of-order execution not as good as M1?
The machine-code instructions a CPU fetches from memory belong to its instruction set architecture (ISA): x86, ARM, PowerPC, and so on. A CPU's Re-Order Buffer (ROB) does not hold these regular ISA instructions.
Instead, the CPU internally uses a completely different instruction set that programmers never see: micro-operations (micro-ops, or μops). The ROB is filled with micro-operations.
Micro-operations are very wide (they contain many bits) and can carry all kinds of metadata. ARM or x86 instructions cannot carry this kind of information, because doing so would:
  • Bloat the program's binary files.

  • Expose details of how the CPU works internally, such as whether it has out-of-order execution units or register renaming.

  • Be pointless anyway, since much of the metadata is meaningful only within the current execution context.

Think of it the way programmers think of APIs: the public API must stay stable and be available to everyone; that is the instruction set (ARM, x86, PowerPC, MIPS, etc.). Micro-operations are essentially the private API used to implement the public one.
Micro-operations are typically easier for the CPU to work with, because each one accomplishes a single, limited task. A regular ISA instruction can be more complex, causing many things to happen, and therefore translates into multiple micro-operations.
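As a sketch of that translation, a complex memory-plus-arithmetic instruction might break into micro-operations like this. The mnemonics, the μop strings, and the single-instruction table are all invented for illustration, not any real CPU's internal format:

```python
# Toy decoder: split a CISC-style instruction into simpler micro-ops.
def to_micro_ops(instr):
    # "add r4, [0x200]" reads memory AND does arithmetic in one instruction,
    # so it splits into a load micro-op plus a pure ALU micro-op.
    if instr == "add r4, [0x200]":
        return ["load tmp, 0x200", "add tmp to r4"]
    # Simple register-only instructions map one-to-one.
    return [instr]

print(to_micro_ops("add r4, [0x200]"))  # two micro-ops
print(to_micro_ops("add r1, r2"))       # one micro-op
```

Each resulting micro-op does one limited thing (a memory access, or an ALU operation), which is what makes pipelining and out-of-order scheduling tractable.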
CISC CPUs usually have no choice but to use micro-operations; otherwise, their large, complex instructions would make pipelining and out-of-order execution nearly impossible.
RISC CPUs do have a choice: smaller ARM CPUs skip micro-operations entirely, but that also means they forgo out-of-order execution and similar techniques.
Why does any of this matter? Because it is the key to understanding why Intel's and AMD's out-of-order execution cannot match the M1's.
How fast a core can run depends on how quickly, and with how many micro-operations, you can fill the ROB. The faster it fills, the more instructions the CPU has to choose from for parallel execution, and the higher the performance.
Machine code instructions are broken down into multiple micro-operations by the instruction decoder. If there are more decoders, we can parallel decode more instructions, thus filling the ROB faster.
This is where the huge difference lies. Even the biggest, most powerful Intel and AMD microprocessor cores have only 4 decoders, meaning they can decode at most 4 instructions in parallel into micro-operations.
Apple, however, has 8 decoders. Not only that, the M1's ROB is also far larger, holding roughly three times as many instructions. No other mainstream chip maker puts that many decoders in a CPU.
Why can’t Intel and AMD add more instruction decoders?
This relates to RISC. The M1 Firestorm core uses the ARM RISC architecture.
On x86, an instruction can be anywhere from 1 to 15 bytes long. On RISC chips, instructions have a fixed size. When every instruction is the same length, splitting the byte stream into instructions and feeding them to 8 parallel decoders is trivial. But on an x86 CPU, the decoders have no idea where the next instruction starts; they must actually analyze each instruction to determine its length.
Intel and AMD attack this problem with brute force: they attempt to decode instructions at every possible starting point and then discard the many wrong guesses. This makes the decoder stage convoluted and very hard to extend with more decoders. For Apple, by contrast, adding decoders is easy.
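Why fixed-length decoding parallelizes and variable-length decoding does not can be shown with a toy byte stream. The "x86-like" format below, where the first byte of each instruction gives its length, is invented purely for illustration (real x86 encoding is far more involved):

```python
# Fixed-length (RISC-style, 4 bytes): decoder N starts at byte 4*N,
# so all boundaries are known up front with zero scanning.
fixed_stream = bytes(range(32))           # eight 4-byte "instructions"
fixed_decoded = [fixed_stream[i:i + 4]
                 for i in range(0, len(fixed_stream), 4)]
print(len(fixed_decoded))                 # 8 instructions, found instantly

# Variable-length (x86-like, toy format: first byte = total length).
# Each boundary is only known after decoding everything before it,
# forcing a sequential scan.
var_stream = bytes([3, 9, 9, 1, 5, 8, 8, 8, 8, 2, 7])
var_decoded, i = [], 0
while i < len(var_stream):
    length = var_stream[i]                # must decode to learn the length
    var_decoded.append(var_stream[i:i + length])
    i += length
print(len(var_decoded))                   # 4 instructions, found one by one
```

In the fixed-length case the list comprehension could be split across 8 decoders with no coordination; in the variable-length case each iteration depends on the previous one, which is exactly the serial bottleneck the text describes.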
In fact, adding more decoders creates so many other problems that, by AMD's own account, 4 is essentially its practical upper limit.
This is what lets the M1's Firestorm core process essentially twice as many instructions as AMD and Intel CPUs at the same clock frequency.
Some may argue that CISC instructions turn into more micro-operations, and their density is higher, so decoding an x86 instruction is similar to decoding two ARM instructions.
However, in reality, highly optimized x86 code rarely uses complex CISC instructions. In some respects, it has a RISC style.
But this does not help Intel or AMD, as even though 15-byte-long instructions are rare, decoders must be created to handle them. This leads to complexity, which prevents AMD and Intel from adding more decoders.
Isn’t AMD’s Zen3 core faster?
Reportedly, AMD's latest CPU core, Zen3, is slightly faster than Firestorm. But that is only because Zen3 runs at 5 GHz while Firestorm runs at 3.2 GHz: despite a nearly 60% clock-frequency advantage, Zen3 only barely surpasses Firestorm.
So why doesn't Apple raise the clock frequency? Because a higher clock makes the chip run hotter, and low heat is one of Apple's main selling points: unlike Intel and AMD machines, Apple's computers barely need cooling.
Essentially, the Firestorm core really does outperform the Zen3 core; Zen3 keeps its lead only by drawing more power and generating more heat, which Apple chooses not to do.
If Apple wants higher performance, they would just add more cores. This way, they can provide higher performance while reducing power consumption.
What does the future hold?
It seems AMD and Intel are boxed in on two fronts:
  • Their business models make it hard to pursue heterogeneous computing and SoC designs.

  • The traditional x86 CISC instruction set makes it difficult for them to improve out-of-order execution performance.

But this does not mean the game is over. They can certainly raise clock frequencies, use more cooling, add more cores, enlarge CPU caches, and so on. But right now they are at a disadvantage. Intel is in the worst position: its cores have already been beaten by Firestorm, and its GPUs are too weak to fold into an SoC solution.
The problem with adding more cores is that for typical desktop workloads, too many cores brings diminishing returns. Many cores are excellent for servers, of course, but that market is already served by companies like Amazon and Ampere with giant 128-core CPUs.
Fortunately for AMD and Intel, Apple does not sell its chips to other manufacturers, so PC users are stuck with whatever AMD and Intel offer. Some PC users may jump ship, but that is a slow process; people rarely abandon a platform they have invested in heavily right away.
However, young professionals who have not invested much in any platform may increasingly turn to Apple in the future, thus expanding Apple’s share in the high-end market and its share of total profits in the PC market.
Reference content:
https://erik-engheim.medium.com/why-is-apples-m1-chip-so-fast-3262b158cba2
