I am an embedded developer who has worked on many models of microcontrollers and embedded systems. I often read programs written by others and frequently find problems in the details: the functionality may be the same, but the stability varies greatly. I would like to discuss some of the problems I have observed, purely for learning purposes and with no intention to start an argument.

Whenever I see a Delay() function in a program, I wonder: how can it be acceptable for a microcontroller to idle just to stay in step with the outside world? Imagine if a PC's CPU idled for one second: the music would stop for a second, the screen would freeze for a second, file downloads would halt for a second. How could that be acceptable? Yet I have seen many microcontroller programs in which the chip spends 99.9% of its working time idling. If that figure sounds alarming, let's do the math. On a microcontroller with an 8 MHz internal clock, a single instruction cycle takes only 1/8 = 0.125 µs. How many single-cycle instructions can run in one millisecond? 1000 µs ÷ 0.125 µs = 8000 instructions. Yet in the vast majority of programs downloaded from forums, the code executed between two delay calls takes far less than 8000 instruction cycles. Frankly, many programs over 16K would hardly run for a few milliseconds if all the delay functions were removed. In other words, saying the microcontroller's utilization is under 0.01% is actually being generous.

To solve a problem we must first identify it, so let me ask: why do we use delays in programs at all? There are many reasons: slow peripherals, the persistence of human vision, and so on. In short, it is all about synchronizing with the outside world, and that synchronization is genuinely needed. In purpose, then, these delays are quite reasonable; I do not deny that.
However, the key issue is that these delays are spent idling; why can't we reclaim that time to do something else? Think about it: reclaiming that 99.9% of the time would recover an enormous resource. Many people have ad-hoc ways of reclaiming idle time, such as doing some work inside the delay function, but these tricks are rarely general. Below I will describe two methods of my own.

1. Reclaiming delay time in the foreground/background model. The foreground/background model is the familiar main-loop-plus-interrupt pattern. First we deal with slow peripherals such as serial ports, keyboards, LCDs, and SD cards by setting up buffers for them. For example, serial transfer can be completed in the interrupt and the data saved to a buffer, while the main program operates on the buffer instead of touching the serial port directly. This much is widely practiced. But I rarely see anyone buffering a matrix keyboard, capturing key presses in an interrupt and saving them to a buffer; and when writing to an LCD, pushing bytes to display memory one at a time is wasteful, so a buffer with batched updates is worthwhile there too. There are some technical challenges in building such buffers. In serial reception, for instance, how do we decide that a transmission is complete? One approach is a timer: if no further byte arrives within 1 ms of the last one, we treat the reception as complete. That is only an idea; the details depend on your own application. Some will ask: buffers handle slow peripherals, but what about persistence of vision? We cannot let the running lights flash so fast that the eye cannot see them, can we? That brings me to the next question: how do we reclaim those delay times? We can move them all into the timer interrupt!
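The buffered serial reception with a 1 ms idle timeout can be sketched as below. This is a minimal illustration, not a specific chip's driver: uart_rx_isr() stands in for the real UART receive interrupt and timer_1ms_isr() for a real 1 ms timer tick, so the logic can be exercised on a host.

```c
#include <stdint.h>

/* Sketch of interrupt-fed reception: bytes land in a ring buffer, and a
   1 ms tick decides when a frame has ended (no new byte for 1 ms). */

#define RX_BUF_SIZE 64

static uint8_t  rx_buf[RX_BUF_SIZE];
static volatile uint8_t rx_head, rx_tail;   /* ring buffer indices      */
static volatile uint8_t rx_idle_ms;         /* ms elapsed since a byte  */
static volatile uint8_t rx_frame_done;      /* set after 1 ms of idle   */

void uart_rx_isr(uint8_t byte)              /* called once per received byte */
{
    rx_buf[rx_head] = byte;
    rx_head = (uint8_t)((rx_head + 1) % RX_BUF_SIZE);
    rx_idle_ms = 0;                         /* a byte arrived: restart idle count */
}

void timer_1ms_isr(void)                    /* called every 1 ms        */
{
    if (rx_head != rx_tail && ++rx_idle_ms >= 1)
        rx_frame_done = 1;                  /* quiet for 1 ms: frame complete */
}
```

The main program then consumes rx_buf between rx_tail and rx_head when rx_frame_done is set, never blocking on the serial port itself.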
Some will object that books and tutorials say interrupt handlers must be kept short, so stuffing long code and many decisions into the interrupt cannot work. That is dogma, rigidly following the textbook. We can structure the interrupt like this: declare a function pointer void (*Task)(void); and in the ISR simply call (*Task)();. The work done in the interrupt is reached through the pointer, so the main program can change which task executes, and on what cycle, whenever it needs to. All the decision-making runs in the main program, which then updates the pointer to select the next task for the interrupt. This way the main program allocates the tasks, and there is still plenty of headroom for other work. For instance, keyboards, LED matrices, and digital tube displays all need real-time attention, which can complicate a program and make it sluggish; once the delay time is reclaimed, handling them becomes easy. Some will say their project has no strict timing requirements, so why bother reclaiming the time? Even then, you can use the freed main loop to scan for events, and your system will respond far faster than one built on idle delay loops.

2. A mutated cooperative kernel. Let's talk about the kernel of an embedded operating system. Simply put, it is a task scheduler that lets multiple tasks run on one CPU "simultaneously". The simultaneity is relative: the first task runs for a few milliseconds, then the second runs for a few milliseconds, and so on, so they merely appear to run at once. For the difference between preemptive and cooperative kernels, you can look it up. When embedded operating systems for microcontrollers come up, people mention systems such as uC/OS-II and FreeRTOS. Many people are quite resistant to these operating systems, for two reasons: 1. these operating systems occupy a lot of RAM and ROM; 2.
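The function-pointer dispatch described above can be sketched as follows. The task names and schedule() helper are illustrative, not from any particular project; timer_isr() stands in for the real timer interrupt body.

```c
#include <stddef.h>

/* The ISR only calls whatever Task currently points to; the main loop
   decides which task runs next by changing the pointer. */

static void task_idle(void) { }             /* default: do nothing     */

static int keys_scanned;                    /* stand-in for real work  */
static void task_scan_keys(void) { keys_scanned++; }

static void (*Task)(void) = task_idle;      /* task the ISR will run   */

void timer_isr(void)                        /* timer interrupt body    */
{
    (*Task)();                              /* dispatch through the pointer */
}

void schedule(void (*t)(void))              /* main loop selects a task */
{
    Task = t ? t : task_idle;               /* NULL falls back to idle */
}
```

The interrupt itself stays tiny and constant-time; all judgment about what to run, and when to switch, lives in the main program.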
the so-called real-time of these RTOSes is only relative to non-real-time operating systems; compared with bare metal they are actually slower. These objections are not without merit. Such commercial operating systems generally use preemptive kernels, which guarantee that the highest-priority task responds within a bounded time. Their advantage is that task-switching time is predictable and does not grow with the number of tasks; that determinism gives them temporal stability, which is why they shine in commercial products. But the disadvantages are just as evident: the interrupt-level tick wastes a great deal of time, and concurrent calls between tasks create synchronization problems, which in turn require semaphores, mailboxes, and similar mechanisms that consume a lot of RAM and ROM. In short, preemptive kernels are stable and quantifiable and work well on higher-end microcontrollers, but on 8-bit machines they need heavy trimming and may simply not fit. The core idea of a cooperative kernel is different: it does not guarantee the fastest response for the highest-priority task, but instead gives all tasks the fastest average speed. As I noted earlier, the code between two consecutive delay calls rarely takes even 1 ms, often not 100 µs, so its cost can be ignored. With 10 tasks, the first finishes its slice and voluntarily hands the microcontroller to the second, which does the same for the third, and so on; there are no gaps between the 10 tasks, and whenever a task needs a delay it simply yields. On this idea we achieve the goal of reclaiming idle delay time, and because every task gives up control voluntarily at a point of its own choosing, the synchronization problems of preemptive kernels do not arise. There is essentially no need for mailboxes or semaphores, so the RAM and ROM requirements are very low. Cooperative kernels are therefore very well suited to 8-bit machines.
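The yield-by-returning idea can be sketched in its simplest form as a round-robin pass over task functions. This is a minimal illustration under the assumption that each task is written to do one short piece of work and return (its "voluntary yield"); the task bodies here are stand-ins.

```c
/* Minimal cooperative round-robin: the scheduler calls each task in
   turn, and a task yields simply by returning. Time one task would
   have burned in a delay loop is spent running the others instead. */

#define NUM_TASKS 3

static int run_count[NUM_TASKS];            /* proof each task got CPU time */

static void task0(void) { run_count[0]++; } /* stand-ins for real task work */
static void task1(void) { run_count[1]++; }
static void task2(void) { run_count[2]++; }

static void (*const tasks[NUM_TASKS])(void) = { task0, task1, task2 };

void scheduler_round(void)                  /* one full pass over all tasks */
{
    int i;
    for (i = 0; i < NUM_TASKS; i++)
        tasks[i]();                         /* each return is a yield       */
}
```

Because a task is never interrupted mid-operation, shared data needs no semaphores or mailboxes: a task always leaves its data consistent before it returns.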
However, many embedded-systems books endorse preemptive kernels without qualification, which breeds misunderstanding, and the authority of systems like uC/OS has led many RTOS authors to follow suit without analyzing what actually suits an 8-bit machine. Commercial systems offer no cooperative kernels, and the few good cooperative kernels in hobbyist use are all built on the traditional tick. A traditional cooperative kernel needs a timer interrupt as its clock reference, and that tick keeps breaking into the tasks, causing needless overhead we do not want. Instead, we can let the timer free-run at a large division factor and never give it the chance to raise an interrupt. When a task is about to yield, it reads the timer to obtain the time reference and then resets it. The method cannot be explained in a sentence or two, but the effect is this: tasks run undisturbed, as if on bare metal, and give up control only when a delay is needed, handing the delay time to other tasks. That matches the purpose of this whole article: reclaiming idle delay time. Such a kernel is very small and behaves much like bare metal, merely doing other work during what would have been idle delays. It places no special demands on the user, unlike the complex systems before it.
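One way to sketch the tick-free time reference is shown below. It is an assumption-laden illustration, not the author's exact method: hw_timer_read() stands in for reading the free-running counter register, and instead of resetting the timer the sketch keeps a wrap-safe delta, which achieves the same bookkeeping on a host.

```c
#include <stdint.h>

/* No periodic tick interrupt: the kernel samples a free-running 16-bit
   timer only at the moment a task yields, and accumulates the elapsed
   counts as its clock reference for expiring task delays. */

static uint16_t sim_counter;                /* simulated free-running timer */
static uint16_t hw_timer_read(void) { return sim_counter; }

static uint16_t last_read;                  /* counter value at last yield  */
static uint32_t elapsed_ticks;              /* kernel's running time base   */

void kernel_on_yield(void)                  /* called at each voluntary yield */
{
    uint16_t now = hw_timer_read();
    elapsed_ticks += (uint16_t)(now - last_read);  /* unsigned math survives wrap */
    last_read = now;
}
```

Between yields no interrupt ever fires, so a task runs exactly as it would on bare metal; the cost of timekeeping is paid only at the moments the task was going to give up the CPU anyway.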
Below are ten lessons from experience that I would like to share with everyone.
1. Avoid calling delay functions wherever possible. A program without an operating system can only loop in while(1). If it calls many delays there, CPU resources are squandered: a delay is essentially the CPU idling, and only the interrupt code still executes. If all you are making is a program that blinks an LED once a second, calling a delay function directly is simple enough. But real projects usually have many tasks in the main loop, and delays will not do where timing matters. To avoid them, use a timer interrupt to produce a flag: when the time is up, the interrupt sets the flag to 1; the main program only needs to test the flag, execute once when it sees 1, and then clear it. The rest of the time it is free to do other work instead of waiting. The best example is digital tube display driven from an interrupt, as in our examples. Key detection is another: typical programs do while(!key) to wait for the key to be released, so a held key blocks everything that follows. Instead, detect both the falling and rising edges with flags to avoid the problem.

2. Write code as concisely as possible and avoid repetition. In the book 'Learn Microcontrollers in 10 Days' I saw digital tube display code that selects one digit, sends its data, selects the next digit, sends its data, and so on. The repetition is far too high: it wastes code storage, runs inefficiently, and reads poorly; it merely implements the functionality. In real programming a loop should be used, either for or while. Such code looks far more professional.

3. Use macro definitions sensibly. If a variable or register appears frequently in a program, you can use a macro to define a new name that replaces it.
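The flag pattern from point 1 can be sketched as below. The 500 ms period and the names are illustrative; timer_1ms_isr() stands in for a real 1 ms timer interrupt, and main_loop_step() is one pass of the while(1) loop, so the logic is testable on a host.

```c
#include <stdint.h>

/* A 1 ms tick counts up and raises a flag every 500 ms; the main loop
   merely tests and clears the flag, never blocking in a delay. */

static volatile uint16_t ms_count;
static volatile uint8_t  flag_500ms;

void timer_1ms_isr(void)                /* hook to a real 1 ms timer IRQ */
{
    if (++ms_count >= 500) {
        ms_count = 0;
        flag_500ms = 1;                 /* time is up: signal the main loop */
    }
}

static int led_toggles;                 /* counts the periodic action    */

void main_loop_step(void)               /* one pass of the while(1) loop */
{
    if (flag_500ms) {
        flag_500ms = 0;                 /* consume the flag              */
        led_toggles++;                  /* the LED would toggle here     */
    }
    /* ...free to do other work here instead of idling in a delay...    */
}
```

Between flag events the loop costs only one compare, so keyboard scanning, display refresh, and the rest all fit in the time a Delay() would have wasted.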
The benefit is ease of modification: if the LCD's data bus is wired to P1 and you want to move it to P0, you change only the macro definition. The compiler automatically substitutes the real name for the defined name at compile time.

4. Use the smallest data type that fits. If a variable ranges over 0-255, define it as unsigned char. You could define it as unsigned int, but that wastes memory and slightly lowers efficiency. If the data can never be negative, make it unsigned. Avoid float and double (the latter occupies 8 bytes); both consume significant CPU resources during arithmetic. For instance, if a voltage ranges over 0-5 V and must be accurate to three decimal places, multiply the sampled value by 1000; even the maximum is only 5000. Collect several samples, apply a filtering algorithm, and to display the voltage simply place a decimal point after the first digit. An unsigned int is then entirely sufficient.

5. Avoid multiplication and division. They are highly CPU-intensive: look at the assembly and you will find that one multiply or divide can compile into ten or even dozens of instructions. When multiplying or dividing by a power of 2, use << or >> instead; a shift by a constant compiles to just a few simple shift instructions, giving concise code and higher efficiency. Be particularly careful about operator precedence, though.

6. Prefer compound assignment operators. What is the difference between a=a+b and a+=b? The former computes a+b, holds the result in the accumulator, and then stores the accumulator back into a; the latter adds b into a directly, saving a step.
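Points 4 and 5 can be combined in one sketch: keep the voltage as a scaled unsigned integer (millivolts) instead of a float, and turn the power-of-two division into a shift. The 10-bit ADC and 5 V reference here are hypothetical examples, not from the original text.

```c
#include <stdint.h>

/* Scaled-integer voltage conversion: mv = raw * 5000 / 1024, with the
   divide-by-1024 done as a right shift by 10. A 32-bit intermediate
   keeps raw * 5000 from overflowing 16 bits. */

uint16_t adc_to_mv(uint16_t raw)        /* hypothetical 10-bit ADC, 5 V ref */
{
    return (uint16_t)(((uint32_t)raw * 5000u) >> 10);
}

uint16_t times_eight(uint16_t x)        /* x * 8 expressed as a shift */
{
    return (uint16_t)(x << 3);
}
```

The result stays in an unsigned int the whole way; only when displaying do you place the decimal point, e.g. 4995 prints as 4.995 V.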
Although this saves only one instruction, when the operation loops thousands or tens of thousands of times the effect becomes apparent. The same applies to -=, *=, /=, %=, and so on.

7. Try not to define global variables. First, compare local variables, global variables, static local variables, and static global variables:

(1) Local variables: defined inside a function or compound statement, allocated in the dynamic storage area, created on each call and released automatically when the function or compound statement ends.

(2) Static local variables: a local variable declared with static is allocated in the static storage area and is not released while the program runs. It can be used only within its function. Its initial value is set at compile time (if none is given, it defaults to 0 for numeric variables or '\0' for character variables), and it keeps its value between calls rather than being released.

(3) Global variables: variables defined outside any function. They are allocated in the static storage area and are not released while the program runs. Functions in the same file can use them directly; functions in other files must first declare them extern.

(4) Static global variables: variables defined outside any function with static. They are allocated in the static storage area, are not released while the program runs, are initialized at compile time (defaulting to 0 or '\0' as above), and can be used only within the current file.
In general, prefer local variables: they not only run more efficiently but also make code easier to port. Local variables mostly live in the MCU's internal registers, and on most MCUs register operations are faster than data-memory operations, with more flexible instructions, which helps the compiler generate better code. Moreover, the registers and data memory occupied by local variables can be reused by different modules. Variables needed inside interrupts should be defined as globals and marked volatile to prevent the compiler from optimizing accesses away. Read-only data, such as the segment codes for digital tubes or Chinese character font tables, should be placed in ROM to save RAM: on the 51 series this is done with the code keyword, while higher-end microcontrollers use the const qualifier.

8. Choose suitable algorithms and data structures. Be familiar with common algorithms and the trade-offs between them; many computer science books cover this. Replace slow sequential search with faster binary search or other indexed lookups, and replace insertion sort or bubble sort with quicksort, merge sort, or heap sort to improve execution efficiency significantly. Choosing the right data structure matters just as much. Pointers, being variables that hold addresses, make it easy to address the variables they point to and to move from one variable to the next, so they are particularly suited to operating on large numbers of variables. Arrays and pointers are closely related; in general pointers are more flexible and concise, while arrays are more intuitive and easier to understand. With most compilers, pointer code comes out shorter than array code and executes more efficiently.
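The storage advice in point 7 can be sketched as below: a read-only lookup table that belongs in ROM (Keil C51 would mark it with the code keyword; standard C uses const), and an ISR-shared variable marked volatile. The table holds the common-cathode 7-segment patterns for digits 0-9.

```c
#include <stdint.h>

/* const puts the table in ROM on most toolchains (Keil C51 users would
   write `code` instead); volatile stops the compiler from caching the
   ISR-written variable in a register. */

static const uint8_t seg_code[10] = {   /* common-cathode digits 0-9 */
    0x3F, 0x06, 0x5B, 0x4F, 0x66,
    0x6D, 0x7D, 0x07, 0x7F, 0x6F
};

static volatile uint8_t current_digit;  /* written by an ISR, read by main */

uint8_t segments_for(uint8_t d)         /* look up one digit's pattern */
{
    return seg_code[d % 10];
}
```

Keeping the 10-byte table out of RAM is a small win here, but the same rule applied to a multi-kilobyte Chinese font table is the difference between fitting in an 8-bit part and not.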
With Keil, however, the opposite holds: array code comes out shorter than pointer code.

9. Use conditional compilation. Normally every line of a C program participates in compilation, but sometimes you want only part of the code compiled, and only under certain conditions. That is what conditional compilation is for: it selects different compilation ranges according to the situation, producing different code.

10. Embedded assembly, the killer skill. Assembly is the most efficient computer language at runtime. In ordinary project development C is preferred, because embedding assembly hurts portability and readability: assembly instructions differ between platforms and are incompatible. But some dedicated programmers who need the utmost runtime efficiency embed assembly inside their C, a technique known as inline assembly.
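The conditional compilation of point 9 can be sketched as follows. The two boards and the inverted-bus wiring are hypothetical; the point is only that the unselected branch never reaches the compiled output.

```c
#include <stdint.h>

/* One source file, two hypothetical build targets: flip BOARD_A to 0
   and only the other branch is compiled into the final code. */

#define BOARD_A 1                   /* pick the build target here */

uint8_t lcd_data_port_value(uint8_t v)
{
#if BOARD_A
    return v;                       /* board A: bus wired straight through */
#else
    return (uint8_t)~v;             /* board B: bus wired inverted         */
#endif
}
```

Unlike an if at runtime, the excluded branch costs no code space at all, which matters on parts measured in kilobytes.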