When learning embedded systems, it is essential to understand the overall process of how a CPU operates:
CPU
The Central Processing Unit (CPU) is one of the main components of a computer, primarily responsible for interpreting computer instructions and processing data calculations within computer software.
The introduction mentions that the CPU’s function is instruction processing and data calculation, so the CPU must have both a calculation part and an instruction analysis part.
ALU (Arithmetic Logic Unit)

One of the CPU's most important functions is performing arithmetic and logic operations, and the unit responsible for all of them is the Arithmetic Logic Unit (ALU). The ALU consists of several parts, which we can illustrate with a simple operation such as A + B = C.
• Input data: at least two numbers are needed for an operation, so the input side of the ALU takes two operands (like A and B in the diagram above).
• Instruction (OP): the operation instruction. Given the operands A and B, it determines what operation to perform on them: addition, subtraction, multiplication, division, or some combination.
• Output result: the result of the operation.
• Flags: what are flags for? Suppose the ALU is 4 bits wide, so the largest value it can hold is 15. If A = 10 and B = 6, then A + B exceeds the ALU's range. How does the ALU handle this? It raises an Overflow flag in the flag bits and stores the excess in the register group as input for the next calculation. The ALU can also compare the sizes of two operands: if A = 10, B = 6, and OP is an instruction to compare A and B, the ALU uses subtraction. The result of A − B produces a 0/1 flag, and checking that flag tells us which of A and B is larger.

The flag bits generated by ALU operations are usually placed in a dedicated register called the “Program Status Register” (PSR). Each time an instruction is executed, the corresponding status bits are updated. The status bits affected by each instruction are different, and the chip manual needs to be consulted.
What status does the status register need to store? Although various chips have their own types of flag bits, every chip must have the following flag bits.
• Z (Zero) flag: set when the result of an operation is 0 (for example, A − B = 0).
• N (Negative) flag: set when the result is negative.
• O (Overflow) flag: as above, A + B = 16 > 15 exceeds the calculation range and sets the overflow flag.
• C (Carry) flag: set when an operation produces a carry out of the highest bit.
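The flag logic above can be sketched in a few lines of C. This is a minimal illustration, not any particular chip's ALU: it models a 4-bit adder (matching the article's example, where the largest representable value is 15) and derives the Z, N, C, and O/V bits from the result. The function and type names are invented for the example.

```c
#include <stdint.h>

/* Flag bits, as in the article's status register description. */
typedef struct { int z, n, c, v; } Flags;

/* 4-bit add: 10 + 6 overflows a 4-bit ALU whose maximum value is 15.
   Returns the 4-bit result and fills in the flags. */
uint8_t alu_add4(uint8_t a, uint8_t b, Flags *f) {
    uint8_t sum = (uint8_t)((a + b) & 0x0F);      /* keep the low 4 bits      */
    f->c = ((a + b) & 0x10) != 0;                 /* carry out of bit 3       */
    f->z = (sum == 0);                            /* zero flag                */
    f->n = (sum & 0x08) != 0;                     /* sign bit (bit 3)         */
    /* signed overflow: operands share a sign but the result's sign differs */
    f->v = (~(a ^ b) & (a ^ sum) & 0x08) != 0;
    return sum;
}
```

With the article's numbers, 10 + 6 wraps to 0 with the carry flag set, which is exactly the "exceeded range" case described above.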
Data Storage
We know that the ALU operation requires two operands as input, so where do these numbers come from?
The sources of ALU operands are many; generally, there are the following sources.
Internal Registers

• Used for temporarily saving and retrieving operands. For example, the ALU's input data can come from the CPU's internal general-purpose register group, and the results of ALU operations can also be stored back into registers (note that storage in the CPU's internal registers is only temporary).
• Any CPU contains several general-purpose/special registers.
• The number and width of registers are important indicators of CPU performance.
Comparing CPU internal registers to scratch paper during an exam is very appropriate. The intermediate results of each calculation can be written on the scratch paper (the storage of ALU output results), and the intermediate results can also provide input for the next calculation (to provide input operands for the ALU).
When you calculate the final result, you will write the result on the exam paper. The data on the scratch paper is discarded when you submit the exam (temporary storage). Each person’s scratch paper is unique (special registers), but everyone writes intermediate calculation results on their scratch paper (general registers).
However, some people are special and prefer to bring their own scratch paper, which is larger than yours (the width of the register) and has two sheets (the number of registers). Therefore, they can store more intermediate calculation data. But, exam results do not necessarily exceed yours.
The initial calculation data on the scratch paper comes from the exam paper (the initial operands), and the data in the register group is temporarily stored. The calculated data from the scratch paper must be written on the exam paper (the final result), where the exam paper represents external storage.
External Storage
What is stored in external storage? As mentioned, the CPU's internal registers hold data only temporarily (like scratch paper during an exam), while what is finally submitted is the exam paper itself. The beginning of the article has already covered the storage hierarchy in detail; here I will collectively refer to RAM and ROM as external storage.
If we compare the entire CPU operation process to a math exam, the function of the ALU within the CPU is to help us calculate the result of the operation. Where do the data A and B, which are the sources of the operation, come from?
That is, from the math exam paper. The questions on the paper ask us to calculate A + B; this requirement is the instruction the CPU needs to execute. After working the problem out on the scratch paper, you write the result on the exam paper (this step is simply "data saved to the exam paper"); similarly, the result of the ALU operation is saved to external storage. From this process we conclude that external storage contains:
• The instructions executed by the CPU.
• The data saved during CPU operation.
Instructions Executed by the CPU
Data Used by the CPU
We previously compared the external storage of data to the storage of the ALU operation results (writing the results from the scratch paper to the exam paper), indicating that the storage has the final data required by the user. However, for the CPU’s external storage, in addition to storing the final data results, it can also store temporary data.
Both the internal registers of the CPU and the external storage can temporarily store data. So, is there a conflict between the two?
The temporary storage times for both are different. Since the internal registers are close to the CPU, retrieving data from internal registers is the fastest. Data that needs to be frequently accessed is best stored in the internal general registers.
However, data that is not frequently accessed, or that is only needed again after some time, should be stored in external storage. Of course, this is not absolute: a user can disable compiler optimization so that frequently accessed data stays in external storage instead of registers, but this approach will obviously reduce data processing speed.
Fetching Instructions
How does the CPU fetch instructions from external storage?
As mentioned earlier, the teacher seals the finished exam papers in a bag. Unfortunately, the next day the teacher finds that one exam paper is missing and angrily decides to create a new exam on the spot: he writes a question on the blackboard, and the students answer it. Now the content of the exam lives in the teacher's mind, and the students must work out the question he posed; the way the students obtain the exam questions is through the blackboard.
Similarly, the CPU needs to fetch instructions; doesn’t it also need a blackboard to store these instructions? In the CPU, this blackboard is the Program Counter (PC). Unlike the blackboard mentioned above, the PC register does not store instructions (questions) but rather the addresses of the instructions.
Program Counter

The Program Counter (PC) is a special register within the CPU that always points to the address of the next instruction. As shown in the diagram above, external storage consists of individual cells, each with an address number. Each cell stores an instruction that the CPU needs to execute, and the PC always points to the address of the next instruction. If the current instruction is the NOP at address 0000, the PC points to the next address, 0001.
Why does the PC register always point to the next instruction instead of the current instruction?

Storage
We know that the instructions and data of the CPU come from external storage, and the CPU’s storage can be classified as follows.

ROM (Read-Only Memory)
Can only read and cannot write information. Once information is written, it becomes fixed, and even when the power is cut off, the information will not be lost, hence it is also called fixed storage. In microcontrollers, the first instruction executed by the CPU upon power-up and reset is fetched from ROM. During the power-up process, what content does the CPU read from ROM?
• Allocating address space for global variables: initial values are copied from ROM to RAM. If no initial value was assigned, the value at the global variable's address is either 0 or indeterminate.
• Setting the length and address of the stack segment: in microcontroller programs developed in C, the stack length is often left at its default, but that does not mean it is unnecessary. The stack is primarily used to "save the scene" and "restore the scene" during interrupt handling, so its importance is self-evident.
• Allocating starting addresses for the data, constant, and code segments. The addresses of the code and constant segments can usually be ignored: they are fixed in ROM, and however they are arranged they will not affect the program. The address of the data segment, however, does matter. Its data must be copied from ROM to RAM, and RAM also holds the stack segment and the general working register group. Typically the working registers' addresses are fixed, so when absolutely addressing the data segment, it must not overlap the addresses of the working register group.
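The first bullet, copying initial values from ROM to RAM and zeroing uninitialized globals, is exactly what reset/startup code does before `main()`. A self-contained sketch follows; on real hardware the region boundaries come from linker-script symbols (names such as `__data_load__` or `_sbss` vary by toolchain and are only mentioned as examples), so plain arrays stand in for ROM and RAM here.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-ins for linker-provided regions: on a real MCU these
   would be symbols defined in the linker script, not arrays. */
static const uint8_t rom_data_image[4] = {1, 2, 3, 4}; /* initial values kept in ROM    */
static uint8_t ram_data[4];                            /* where .data lives at run time */
static uint8_t ram_bss[4] = {0xAA, 0xAA, 0xAA, 0xAA};  /* pretend-uninitialized .bss    */

/* What startup code does on reset, before calling main(). */
void startup_init(void) {
    for (size_t i = 0; i < sizeof ram_data; i++)
        ram_data[i] = rom_data_image[i];  /* copy initial values ROM -> RAM        */
    for (size_t i = 0; i < sizeof ram_bss; i++)
        ram_bss[i] = 0;                   /* zero globals that had no initializer  */
}
```

After `startup_init()` runs, initialized globals hold their ROM-recorded values and uninitialized ones read as 0, matching the behavior described above.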
Since ROM is read-only memory, why can program data still be written to it in practice?
Strictly speaking, ROM being truly read-only was only the case for early microcontrollers: data and programs were fixed into the chip at the factory and could not be changed. Forcing a change would simply ruin the chip.
So if ROM is fixed at the factory and cannot be written, how is program data read and written in daily development?
Due to the unprogrammable nature of early ROMs, which was criticized by many, PROM emerged with the development of technology.
PROM (Programmable Read-Only Memory)

PROM is a programmable ROM that is blank at the factory, allowing users to write information as needed, but once written, the information cannot be changed.
PROM is programmed by melting the fuses inside using a voltage of 20V, achieving a 0/1 write. Thus, PROM is a one-time programming process, as the fuses cannot be restored once melted. If new data needs to be written, a new PROM must be used. The most representative example of PROM is the Nintendo game cartridge.
PROM programming requires high working voltage and can only be programmed once, which is not suitable for long-term development.
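The one-shot nature of fuse programming can be modeled in a few lines of C. This is only an illustration of the idea, not any real PROM's interface (the names are invented): each bit starts as an intact fuse (1), programming can only blow fuses (clear bits to 0), and a blown fuse can never be restored.

```c
#include <stdint.h>

#define OTP_CELLS 4
/* Every bit starts as 1: an intact fuse. */
static uint8_t otp[OTP_CELLS] = {0xFF, 0xFF, 0xFF, 0xFF};

/* Programming ANDs the new value in: 1 -> 0 transitions (blowing a fuse)
   are possible, 0 -> 1 transitions are silently impossible, as with real
   fuses that cannot be restored once melted. */
void otp_program(int cell, uint8_t value) {
    otp[cell] &= value;
}

uint8_t otp_read(int cell) { return otp[cell]; }
```

Writing `0xA5` to a fresh cell succeeds, but a later attempt to write `0xFF` over it changes nothing: the blown bits stay 0, which is why a new PROM is needed for new data.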
EPROM (Erasable Programmable Read-Only Memory)
EPROM is a programmable and erasable memory.
The working principle of EPROM is to program the chip at high voltage by trapping charge in the memory cells, and then to erase it by irradiating the chip with ultraviolet light, which releases the trapped charge.
This method of high-voltage programming and ultraviolet erasing can be reused. However, it is too expensive and cannot be widely used.
EEPROM (Electrically Erasable Programmable Read-Only Memory)

EEPROM is electrically erasable and programmable, allowing erasure and programming at conventional voltage levels. It can be programmed and erased under 3.3V-5V, making it suitable for multiple programming and erasing.
Flash Memory

Flash memory allows block erasure and writing, with the advantages of light weight, low energy consumption, and small size. However, a block must be erased before it can be rewritten, the number of erase cycles per block is limited, and reads and writes can disturb neighboring cells.
Common examples include USB flash drives and SD cards. In the block diagram, if a user downloads the binary file of the program code to an external USB drive, the CPU can boot from that drive by controlling the USB control line in the startup device controller.
This means that when the CPU is powered on, the first instruction comes from the file on the USB drive. Of course, during the writing process it is necessary to ensure that the USB drive is clean and contains only the program the CPU needs.
When reinstalling a PC's operating system, the first step is to write the Windows installer to a clean USB drive; the installer then writes the operating system data to the hard drive. After the computer is powered on, it boots the operating system from the hard drive. The USB drive in this process is flash memory (as is a solid-state drive).
Returning to the earlier question: if ROM cannot be written, how is program data read and written in everyday development? Take the STM32 as an example. Its ROM is the program storage: the data survives power-off and does not change while the program runs. Early microcontroller mask ROMs have been replaced by flash memory, because they were troublesome and expensive to erase and modify, or by low-cost OTP types whose data cannot be modified at all.
Because flash memory is easy to erase and write, some modern microcontrollers support in-system programming, allowing the contents of flash to be modified by executing specific program sequences, thus achieving online modification of the program storage. This does not conflict with the earlier statement that program storage cannot be changed during operation, since during normal execution it still works in read-only mode.
RAM (Random Access Memory)
Random Access Memory is the internal storage that exchanges data directly with the CPU. It can be read and written at any time and is usually used as temporary data storage for operating systems or other running programs. RAM can be further divided into Static Random Access Memory (SRAM) and Dynamic Random Access Memory (DRAM). SRAM has the advantage of fast access but is more expensive to produce, typically used as cache. DRAM, due to its lower unit capacity price, is widely used as the main memory of systems.
Characteristics of DRAM:
• Random access: "random access" means the time required to read or write a piece of data does not depend on its location in memory. In contrast, serial access memory includes sequential access memory (like tape) and direct access memory (like disk).
• Volatile: when power is turned off, RAM cannot retain data. If data needs to be preserved, it must be written to a long-term storage device (like a hard disk). The biggest difference between RAM and ROM is that data stored in RAM automatically disappears after power-off, whereas data in ROM does not.
• High access speed: modern random access memory is among the fastest storage devices for reading and writing, and its access latency is negligible compared with storage devices that involve mechanical operations (like hard disks and optical drives). It is, however, still slower than the SRAM used as CPU cache.
• Needs refreshing: DRAM stores each bit in a capacitor. A charged capacitor represents 1; an uncharged one represents 0. Due to leakage, the charge gradually dissipates over time if not specially handled, causing data errors. Refreshing means recharging the capacitors to compensate for the lost charge. A DRAM read inherently refreshes the cells involved, but a regular timed refresh does not require a complete read: selecting one row of the chip refreshes that entire row, and all related memory chips can select the same row simultaneously, so sequentially refreshing every row within the refresh period refreshes all of memory. The need for refreshing is part of what makes this memory volatile.
• Sensitive to static electricity: like other fine integrated circuits, random access memory is very sensitive to static electricity. Static can disturb the charge of the capacitors in the memory, causing data loss or even damaging the circuit. Therefore, one should touch grounded metal to discharge before handling RAM.
SRAM
Static Random Access Memory (SRAM) is a type of random access memory. The term "static" means that as long as power is maintained, the stored data remains intact; in contrast, Dynamic Random Access Memory (DRAM) must be refreshed periodically to maintain its stored data. When power is lost, however, the data stored in SRAM still disappears (it is volatile memory), which distinguishes it from ROM or flash memory, which retain data after power-off. SRAM is more expensive than DRAM, but it is faster and consumes very little power (especially when idle). SRAM is therefore preferred where high bandwidth, low power consumption, or both are required.

As mentioned above, when the CPU powers on, the first instruction is obtained from ROM. The process is shown in the red line in the diagram above. The disk contains the initialization of the operating system (if it’s a bare machine, it includes the interrupt vector table and some values to initialize). Then, after the CPU receives the instructions, it writes this initialization data into RAM (as shown by the yellow line in the diagram), and then reads data from RAM (as shown by the green line in the diagram) to execute the instruction. The process is illustrated in the diagram.
From the diagram, we can see an issue: why does the CPU write data from ROM into RAM and then read it into the CPU? Isn’t it supposed to be faster to read data directly from ROM?

For example: if you have a 1TB video in your mechanical hard drive that you want to share with your friend, there are two options: First, you take your computer and drive for an hour to your friend’s house, copy the video to his computer, taking a total of two hours for your friend to watch it. Second, you send your friend a link to download the video from Baidu Netdisk, which may take him 11 days to download. This is indeed slower than going to his house, but he can watch the already cached parts of the video while downloading.
The time taken to use data directly from ROM to CPU is indeed longer than ROM to CPU to RAM to CPU. However, the speed of retrieving data from RAM is much faster than from ROM. Since using data directly from ROM consumes a lot of time, is there a way to write the data directly from ROM into RAM?

Modern computers avoid the cost of routing every read through the CPU by using DMA, which transfers data from ROM (or disk) into RAM directly, as shown in the red line of the diagram. Additionally, a faster storage layer can be inserted between the slow storage and the CPU to speed up the reading of program data; in a PC this role is commonly played by a solid-state drive acting as a fast layer in front of the mechanical disk.
In the above process, we clarified the general flow of the CPU reading data. At the moment the CPU is powered on, ROM holds our program data and operating system. After the CPU reads data from ROM, it writes the data into RAM, then reads it from RAM. If the user presses the save button, the CPU reads the data from RAM and writes it back (here the "ROM" is actually a flash disk, which is both readable and writable). With this foundation, we can delve into the three major processes of the CPU and the role of the PC (Program Counter).
In discussing storage, the concept of the stack is often mentioned. So what is the stack? What is its purpose?
Stack
The stack is a data structure in which data items are arranged in order and can be inserted and deleted only at one end (called the top of the stack). In microcontroller applications, the stack is a special storage area that temporarily holds data and addresses, usually used to protect breakpoints and save the scene; its size is generally defined by the programmer. When programmers speak of putting data "on the stack," they usually mean the stack area in memory, since that is where the data actually lives.
In simple terms, the purpose of the stack is to allocate storage (RAM and ROM) more effectively. Since every piece of data in storage has a corresponding address, C provides a special type of data called a pointer, which holds the address where a piece of data is stored.
Think of storage as a series of small black rooms containing data, each with a lock (address). A pointer is like a key, allowing the user to unlock the data in the small black room (the pointer points to the address, enabling access to the data inside). The storage capacity is vast (the number of small black rooms is numerous), so it is essential to manage storage allocation effectively, and stack management is undoubtedly the best method.
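The key-and-room analogy maps directly onto C pointers. A minimal sketch (the function names `unlock` and `relock` are invented for the analogy, not standard functions):

```c
/* The pointer is the key: it holds an address. Dereferencing it (*key)
   opens the room and reaches the data stored there. */
int unlock(const int *key) {
    return *key;              /* read the data behind the address */
}

void relock(int *key, int v) {
    *key = v;                 /* store new data at that address   */
}
```

Given `int room = 42;`, calling `unlock(&room)` returns 42, and `relock(&room, 7)` changes the room's contents through its address, without ever naming the variable directly.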
From the above, we can see that the PC register fetches instructions one by one from instruction storage. However, a C program does not run as a single straight-line sequence; sometimes it jumps from one function to another. After the jump, how should the return address in the original function be saved?
Method 1: Use a register inside the CPU.
A register can be added to the CPU specifically for storing the address of the function before the jump (call it the "address register"). After the sub-function finishes running, the PC register can be reloaded from the address register to complete the return. While this method handles a single function call, if sub-functions themselves contain nested sub-functions, N levels of nesting would require N address registers, which clearly does not scale. So what should be done for deeply nested function calls?
Method 2: Use a stack to store addresses during function nesting.
To better understand this process, some basic knowledge of stacks is required.

As shown in the diagram, since storage is a contiguous address storage space, the stack is also a segment of contiguous storage space. Additionally, the stack stores a large amount of data that the CPU needs to use. When the CPU needs data, it must retrieve it from the stack, and the results of CPU operations must also be stored in the stack.
This process of “fetching” and “storing” inevitably changes the amount of data. Therefore, it is necessary to establish a mechanism to determine how much data remains in the stack; this is the function of the stack pointer register (SP). Its function is illustrated in the diagram below:

We know that the stack is a segment of storage space for storing data. In the diagram above, the stack has a high address at the bottom and a low address at the top. The stack is like a storage jar that is closed on one side and open on the other. The SP is a pointer register that points to the current address of the stack. In the diagram, when the data 1111111 has not been stored in the stack, the stack is empty, so the current SP points to the bottom.
When the CPU pushes data onto the stack, it writes 1111111 and SP decreases (making room for the data); SP now points to the address of 1111111. This process is called "pushing onto the stack." Similarly, when 2222222 is pushed, SP points to the address of 2222222. On a pop, the SP address is incremented (reclaiming the space): 2222222 is popped first, and the CPU writes it into the internal register holding A, assigning B's value to A.
Next, the PC moves on to the following instruction, SP increments again, and 1111111 (the old value of A) is popped; the CPU writes it into the internal register holding B, assigning A's value to B.
Through these steps, the values of A and B are exchanged. Notice that the value of C never appears in this process: C is merely defined as a temporary but never actually needed, so the compiler optimizes it away.
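The push/pop sequence above can be sketched as a descending stack in C. This is a simplified model, with a fixed-size array and an index standing in for memory and the SP register, not any particular CPU's stack:

```c
#include <stdint.h>

/* A descending stack like the one in the figure: SP starts at the high
   address (the bottom) and decreases on each push. */
#define STACK_WORDS 8
static uint32_t stack_mem[STACK_WORDS];
static int sp = STACK_WORDS;                       /* empty stack: SP at the bottom */

void push(uint32_t v)  { stack_mem[--sp] = v; }    /* SP moves down, then store */
uint32_t pop(void)     { return stack_mem[sp++]; } /* load, then SP moves up    */

/* Swap two "registers" through the stack, as in the article's example:
   push A, push B, then pop B into A and pop A into B. */
void swap_via_stack(uint32_t *a, uint32_t *b) {
    push(*a);
    push(*b);
    *a = pop();   /* last in, first out: B comes off first */
    *b = pop();
}
```

Because the stack is last-in, first-out, the two pops come back in reverse order, which is exactly what makes the swap work without any third variable.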
In the example above, we can see that both instructions and data live in storage, so how are they organized? Generally, there are two arrangements for managing programs and data: the von Neumann architecture and the Harvard architecture.
Von Neumann Architecture and Harvard Architecture
Von Neumann Architecture
Early computers were composed of various gate circuits, assembled into a fixed circuit board to execute a specific program. Once a modification to the program function was needed, the circuit board had to be reassembled, making early computer programs hardware-based!
In early computers, programs and data were two entirely different concepts; data was stored in memory, while the program was part of the controller.
This resulted in extremely low computing efficiency. Later, von Neumann proposed encoding programs and storing both data and encoded programs in memory, allowing the computer to call the stored program to process the data.
This means that no matter what program it is, it will ultimately be stored in the form of data in memory. To execute the corresponding program, it only needs to sequentially retrieve instructions from memory, analyze, and execute them. The essence of von Neumann architecture lies in this: it reduces hardware connections, leading to a separation of hardware and software, meaning hardware design and program design can be executed separately.
The core ideas of von Neumann architecture are as follows:

• The final form of both programs and data is binary encoding, and both are stored in memory in binary form (an executable binary file: a .bin file).
• Programs, data, and instruction sequences are pre-loaded into main (internal) memory, enabling the computer to fetch instructions from memory quickly during operation for analysis and execution.
• It defines the five basic components of a computer: arithmetic unit, control unit, memory, input devices, and output devices.
The following code analysis illustrates the entire process of the von Neumann architecture.


This analysis skips the process of loading ROM data into RAM at startup (as described above, ROM -> CPU -> RAM) and focuses on the general process of the CPU with RAM in fetching, analyzing, and executing instructions.
• Step 1: upon powering on, the PC supplies the address of the first instruction, which is placed in the address register (MAR).
• Step 2: the MAR locates address 0 in storage, and the word stored there is placed into the MDR (data register).
• Step 3: the MDR passes its value into the IR (instruction register) via the data bus.
• Step 4: the IR sends the opcode to the CU, which decodes it, recognizes a data-fetch instruction, and directs the MAR to fetch the operand.
• Step 5: the MAR locates the operand in storage and places it into the MDR, which, under CU control, is stored in the ACC register.
• Step 6: the PC automatically increments, and steps 1–4 repeat.
• Step 7: for the multiplication, the CU places the value of b from the MDR into MQ and the value of a from ACC into X, then instructs the arithmetic logic unit (ALU) to perform the calculation and place the product back into ACC (with MQ assisting if the product overflows). Finally, the result y = a * b + c is obtained.
From the instruction execution process in the diagram above, it is evident that each word in storage is 16 bits, with the first 6 bits as the opcode and the last 10 bits as the address code. Through the CU's decoding, the 6-bit opcode indicates a "read," "write," or "jump" instruction, while the 10-bit address code is used to find the corresponding operand in storage. The fact that one word carries both an opcode and an address confirms the central idea of the von Neumann architecture: data and instructions are stored in the same memory.
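Steps 1–7 can be condensed into a toy fetch-decode-execute loop. The 16-bit word format (6-bit opcode, 10-bit address) follows the article; the specific opcode values and the tiny instruction set are invented for illustration, and a single `acc` variable stands in for the ACC/X/MQ register set:

```c
#include <stdint.h>

/* Invented opcodes for a toy von Neumann machine. */
enum { OP_HALT = 0, OP_LOAD = 1, OP_MUL = 2, OP_ADD = 3, OP_STORE = 4 };

/* Pack an instruction: high 6 bits opcode, low 10 bits address. */
#define INS(op, addr) ((uint16_t)(((op) << 10) | ((addr) & 0x3FF)))

/* Instructions and data share the same memory array: the essence of
   the von Neumann architecture. Returns ACC at halt. */
uint16_t run(uint16_t *mem) {
    uint16_t pc = 0, acc = 0;
    for (;;) {
        uint16_t ir = mem[pc++];     /* fetch; PC moves to the next instruction */
        uint16_t op = ir >> 10;      /* decode: high 6 bits are the opcode      */
        uint16_t ad = ir & 0x3FF;    /* low 10 bits are the address             */
        switch (op) {                /* execute */
        case OP_LOAD:  acc = mem[ad];   break;
        case OP_MUL:   acc *= mem[ad];  break;
        case OP_ADD:   acc += mem[ad];  break;
        case OP_STORE: mem[ad] = acc;   break;
        default:       return acc;      /* HALT */
        }
    }
}
```

A five-instruction program `LOAD a; MUL b; ADD c; STORE y; HALT`, with a, b, and c placed in the same memory at addresses 10–12, reproduces the article's y = a * b + c example end to end.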
Harvard Architecture
The von Neumann architecture is widely used in today’s computers. However, we cannot assume that all computer architectures are alike. In today’s society, besides computers, there are high-performance microprocessors like 89C51, STM32, ARM, etc. The main differences between these microprocessors and traditional computers, in my opinion, are two points.

In computers, RAM corresponds to the computer's memory sticks, and ROM corresponds to the computer's disk. Computer RAM is generally 8 GB or more, while disks are usually over 500 GB. For small microcontrollers, RAM and ROM are often only tens of KB; for application-class ARM chips, RAM may range from hundreds of MB to 1 GB, while ROM may be several GB.
For microprocessors, the chip typically integrates ROM, RAM, SPI, ADC, USB, NVIC, and other peripherals alongside the CPU, forming a "CPU + peripherals" model.

Secondly, during instruction operations, data and instructions are not stored in the same memory. Instructions have instruction storage, and data has data storage.
The structure where instructions and data are stored separately and do not interfere with each other is the most significant characteristic of the Harvard architecture.

As shown in the diagram, this is a typical Harvard architecture: the program's instructions and data are stored separately. Compared with the von Neumann architecture, the Harvard architecture can fetch an instruction and access data at the same time over separate buses, which undoubtedly improves operational efficiency.
Clock
Why does the CPU need a clock?

First, let’s analyze the logic circuit in the above diagram: When A = B = 1, C = 0. When the input signal changes, the logical element does not immediately respond to the input change, leading to a propagation delay.
When B changes to 0, since B is also a direct input to the XOR gate, the XOR will immediately detect that one input has changed to 0, causing the XOR output to become 1. However, due to the propagation delay, the output of the AND gate will take a little longer to change to 0. The output will thus appear as shown in the diagram:

This phenomenon is called a race condition, where an unwanted pulse signal appears in the output. A simple solution is to place an edge-triggered flip-flop at the output.

The edge-triggered flip-flop only allows data from input D to affect its output when the CLK input changes from 0 to 1. This way, all propagation delays will be hidden by the edge-triggered flip-flop, stabilizing the output at C. For example:

The dashed part represents the output state of C without the edge-triggered flip-flop. We can see that once the edge-triggered flip-flop is introduced, the output of C stabilizes, essentially eliminating propagation delays.
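The race can be reproduced with a short simulation of the circuit C = (A AND B) XOR B, giving the AND gate one time step of propagation delay while the XOR sees B immediately. This is a behavioral sketch of the timing diagram, not a gate-level model:

```c
#define STEPS 4

/* Each array index is one time step. The AND gate's output lags its
   inputs by one step (propagation delay); the XOR reacts at once. */
void simulate(const int A[STEPS], const int B[STEPS], int C[STEPS]) {
    int and_out = A[0] & B[0];       /* settled state before t = 0        */
    for (int t = 0; t < STEPS; t++) {
        C[t] = and_out ^ B[t];       /* XOR sees B immediately            */
        and_out = A[t] & B[t];       /* AND output only appears next step */
    }
}
```

With A held at 1 and B dropping from 1 to 0, C shows a one-step glitch to 1 before settling back at 0, exactly the unwanted pulse in the timing diagram. An edge-triggered flip-flop that samples C only on clock edges spaced wider than the delay would never capture the transient.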
From the example above, we can see why the CPU needs a clock: Currently, the vast majority of microprocessors are driven by synchronous sequential circuits, which are composed of various logic gates.
As mentioned, logic gates need a small amount of time to react to changes in input (propagation delay). Therefore, a clock cycle is needed to accommodate the propagation delay, and the clock cycle should be long enough to accommodate the propagation delays of all logic gates.
Of course, there are also asynchronous sequential logic circuits that do not require clock signals for synchronization. However, while this asynchronous logic is faster than synchronous sequential circuits, it is much more complex to design and faces the race condition mentioned above. Hence, the vast majority of CPUs still require clocks for signal synchronization.
Original text:https://zhuanlan.zhihu.com/p/468286383
Article sourced from the internet, copyright belongs to the original author. If there is any infringement, please contact for removal.