Being able to view embedded issues from the perspective of PC programming is the first step; learning to apply embedded programming concepts is the second step; and combining PC and embedded thinking in practical projects is the third step. Many friends transition from PC programming to embedded programming. In China, very few friends in embedded programming have graduated from formal computer science programs; most come from automatic control or electronics-related majors. These individuals have strong practical experience but lack theoretical knowledge. A large portion of those who graduated from computer science programs tend to work on online games or web pages, higher-level applications independent of the operating system, and are generally reluctant to engage in the embedded industry, as this path can be challenging. They possess solid theoretical knowledge but lack knowledge in circuits and other relevant areas, making it difficult to learn specific knowledge required in embedded systems.
Although I have not conducted an industry survey, from what I have seen and the people I have recruited, engineers in the embedded industry tend to lack either theoretical knowledge or practical experience; it is rare to find someone with both. The root cause lies in the problems of China's university education. I will not delve into that, to avoid a war of words. Instead, I would like to present a few practical examples from my own experience to raise awareness of certain issues when working on embedded projects.

First Issue:

A colleague developed a serial port driver under uC/OS-II, and no issues were found while testing the driver and its interface. An application was then developed that required a communication program. The serial port driver provided a function, GetRxBuffCharNum(), to query the number of characters in the driver buffer. The higher layer needed to receive a certain number of characters before it could parse a packet. The code my colleague wrote can be represented in pseudocode as follows:
bExit = FALSE;
do {
    if (GetRxBuffCharNum() >= 30)
        bExit = ReadRxBuff(buff, GetRxBuffCharNum());
} while (!bExit);
This code checks whether there are at least 30 characters in the current buffer and, if so, reads all of them out, repeating until the read succeeds. The logic is clear and the thought process is straightforward. However, this code does not work properly. On a PC there would be no issue and it would run normally, but in an embedded system the outcome is uncertain. My colleague was quite frustrated and could not understand why. When he asked me for help, I looked at the code and asked him how GetRxBuffCharNum() was implemented. Upon inspection, I found:
unsigned GetRxBuffCharNum(void)
{
    cpu_register reg;
    unsigned num;

    reg = interrupt_disable();
    num = gRxBuffCharNum;
    interrupt_enable(reg);
    return (num);
}
It is evident that the region between interrupt_disable() and interrupt_enable() is a global critical section, which protects the integrity of gRxBuffCharNum. However, because the outer do { } while() loop disables and re-enables interrupts so frequently, the window during which interrupts are enabled is very short, and the CPU may not get to service the UART interrupt in time. Whether this happens depends on the UART baud rate, the size of the hardware FIFO, and the speed of the CPU. The baud rate we used was very high, about 3 Mbps. The UART start and stop bits occupy one bit each, so one byte takes 10 bit times to transmit; at 3 Mbps, that is about 3.3 us per byte. How many CPU instructions can execute in 3.3 us? On a 100 MHz ARM, about 150 instructions. And how long does it take to disable interrupts? On ARM, disabling interrupts generally takes more than 4 instructions, and re-enabling them takes another 4 or so, while the UART receive interrupt handler itself is more than 20 instructions. There is therefore a real possibility of losing received data, which shows up at the system level as unstable communication.

Fixing this code is relatively simple; the easiest way is to change it at the higher level, like this:
bExit = FALSE;
do {
    DelayUs(20);    // delay 20us, typically implemented as a busy loop
    num = GetRxBuffCharNum();
    if (num >= 30)
        bExit = ReadRxBuff(buff, num);
} while (!bExit);
This way the CPU has time to execute the interrupt handler, avoiding the data loss caused by disabling interrupts so frequently. In embedded systems, most RTOSes do not ship with a serial port driver, and when people design their own they often do not consider carefully enough how the code integrates with the kernel, which can lead to subtle problems. An RTOS is called a real-time OS because it responds to events quickly, and quick response depends on the CPU's responsiveness to interrupts. In systems such as Linux, drivers are tightly integrated with the kernel and run in kernel mode. An RTOS cannot simply copy the Linux structure, but it can draw some lessons from it.

From this example, it is clear that embedded development requires developers to understand every aspect of their code.
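For context, the receive path that GetRxBuffCharNum() reports on typically looks something like the sketch below. This is only an illustration under assumed names (gRxBuff, UART_RX_REG, the register address); the actual driver code is not shown in this article. It shows why the interrupt handler needs interrupts to be enabled long enough to run at all.

#include <stdint.h>

#define RX_BUFF_SIZE 256

/* Hypothetical receive-data register; the real address comes from the chip manual. */
#define UART_RX_REG (*(volatile uint8_t *)0x40001000u)

static uint8_t    gRxBuff[RX_BUFF_SIZE];   /* circular receive buffer */
static unsigned   gRxHead;
volatile unsigned gRxBuffCharNum;          /* bytes currently buffered, read by GetRxBuffCharNum() */

/* UART receive interrupt handler. It only runs while interrupts are enabled;
 * if the application keeps interrupts disabled almost all the time, incoming
 * bytes overrun the small hardware FIFO and are lost. Overflow handling of the
 * software buffer is omitted for brevity. */
void UartRxIsr(void)
{
    gRxBuff[gRxHead] = UART_RX_REG;        /* fetch the received byte */
    gRxHead = (gRxHead + 1) % RX_BUFF_SIZE;
    gRxBuffCharNum++;                      /* interrupts do not nest here, so this is safe */
}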
Second Example:

Driving a 14094 serial-to-parallel chip, with the serial signal bit-banged on ordinary I/O pins because there was no dedicated hardware. A colleague wrote a quick driver, but after 3 or 4 days of debugging there were still problems. I could not stand it any longer, so I took a look and found that the parallel outputs were sometimes correct and sometimes not. The code can be roughly represented in pseudocode as:
for (i = 0; i < 8; i++)
{
    SetData((data >> i) & 0x1);
    SetClockHigh();
    for (j = 0; j < 5; j++);
    SetClockLow();
}
Shifting the 8 bits of data out from bit0 to bit7, one bit per clock pulse, looks as if it should work, and at first I could not see where the problem was. After thinking it over, I checked the 14094 datasheet and realized that the 14094 requires the clock-high period to last at least 10 ns, and the clock-low period to last at least 10 ns as well. This code only implements a delay for the high period; there is no delay at all for the low period. If an interrupt happens to occur while the clock is low, the low period gets stretched and the code works; if no interrupt lands there, the low period is far too short and it does not work. Hence the intermittent behavior.

The fix is also simple:
for (i = 0; i < 8; i++)
{
    SetData((data >> i) & 0x1);
    SetClockHigh();
    for (j = 0; j < 5; j++);    // hold the clock high
    SetClockLow();
    for (j = 0; j < 5; j++);    // hold the clock low as well
}
This ensures correct operation. However, the code is still not very portable, because if the compiler optimizes aggressively it may eliminate these two delay loops. If they are removed, there is no guarantee that the high and low periods last 10 ns, and the code will again fail. Truly portable code should implement this delay as a nanosecond-level DelayNs(10). As Linux does at boot, measure how long a nop instruction takes to execute, work out how many nops are needed to cover 10 ns, and then execute that many nops. Use compiler directives, or special keywords, to prevent the delay loop from being optimized away, for example in GCC:
__asm__ __volatile__("nop");
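A minimal sketch of such a calibrated delay is shown below, assuming the calibration value is measured once at startup (for example against a hardware timer). Names such as gNopsPerTenNs and CalibrateDelay() are illustrative only, not taken from the original code.

static unsigned gNopsPerTenNs = 1;   /* calibrated at startup; never zero */

/* Hypothetical calibration hook: the startup code measures how many nops
 * fit in a known timer interval and scales the result down to 10 ns. */
void CalibrateDelay(unsigned nopsPerTenNs)
{
    gNopsPerTenNs = (nopsPerTenNs > 0) ? nopsPerTenNs : 1;
}

void DelayNs(unsigned ns)
{
    unsigned loops = ((ns + 9) / 10) * gNopsPerTenNs;

    while (loops--) {
        __asm__ __volatile__("nop");   /* the volatile asm keeps the loop from being optimized away */
    }
}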
From this example, it is clear that writing good code requires substantial supporting knowledge. What do you think?
Embedded systems often lack operating system support, and even when an operating system is present, the services it provides can be very limited, so much of the code cannot be written as freely as in PC programming. Today, let's talk about memory allocation. Everyone is familiar with memory fragmentation, but in embedded systems fragmentation is the most dreaded issue and the number one killer of system stability.

I once worked on a project with many malloc and free operations of varying sizes, from sixty-odd bytes up to 64 KB, running on an RTOS. At the time I had two options: use malloc and free from the C standard library, or use the fixed-size memory allocation provided by the operating system. The system was designed to run stably for more than three months; in reality it crashed after about six days. All sorts of causes were suspected, but in the end the root cause turned out to be memory allocation: long-running, high-volume allocation had left memory fragmented and non-contiguous. Although plenty of space remained, no contiguous block could be found, and when a large request came in, the system crashed.

To meet the original design requirements, we simulated the entire hardware on a PC, ran the embedded code there, and wrapped malloc and free with a fairly elaborate statistics program. After running for several days we extracted and analyzed the data, and despite the varied request sizes, patterns emerged. We grouped the requests into classes of under 100 bytes, 512 B, 1 KB, 2 KB, and under 64 KB, added a 30% margin to each class, and switched to fixed-size allocation, which greatly extended the time the system could run stably. Embedded systems are like that: what matters is not how primitive the method looks, but whether the performance and stability requirements are met.
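As an illustration of the fixed-size approach, here is a minimal fixed-block pool sketch. It is not the project's actual allocator (most RTOSes provide an equivalent partition or pool service); the block size, block count, and names are assumptions, and in a real system the alloc/free functions would be wrapped in a critical section.

#include <stddef.h>

#define BLOCK_SIZE  512     /* one pool class, e.g. the 512-byte class */
#define BLOCK_COUNT 64      /* sized from measured demand plus ~30% margin */

typedef union PoolBlock {
    union PoolBlock *next;            /* free-list link while the block is free */
    unsigned char    data[BLOCK_SIZE];
} PoolBlock;

static PoolBlock  gPool[BLOCK_COUNT];
static PoolBlock *gFreeList;

void PoolInit(void)
{
    int i;

    gFreeList = NULL;
    for (i = 0; i < BLOCK_COUNT; i++) {
        gPool[i].next = gFreeList;    /* thread every block onto the free list */
        gFreeList = &gPool[i];
    }
}

/* O(1) allocation of a fixed-size block; fragmentation is impossible. */
void *PoolAlloc(void)
{
    PoolBlock *blk = gFreeList;

    if (blk != NULL)
        gFreeList = blk->next;
    return blk;
}

void PoolFree(void *p)
{
    PoolBlock *blk = (PoolBlock *)p;

    blk->next = gFreeList;
    gFreeList = blk;
}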
Memory overflow issues are even more terrifying in embedded systems than on a PC, because they often go undetected. They are hard to track down, especially for C/C++ beginners who are not yet comfortable with pointers and have no idea where to start troubleshooting. On a PC, the MMU guards against severe memory overflows and prevents catastrophic consequences. Embedded systems often have no MMU, which makes a big difference: the system's code may be corrupted and yet keep running, and only the CPU and God know what is actually being executed. Let's take a look at this piece of code:
#include <assert.h>

char *strcpy(char *dest, const char *src)
{
    char *ret = dest;    /* strcpy returns the original dest pointer */

    assert(dest != NULL && src != NULL);
    while (*src != '\0') {
        *dest++ = *src++;
    }
    *dest = '\0';
    return (ret);
}
This code copies a string, and on a PC it works fine. In an embedded system, however, one must guard against src not actually being terminated with '\0'. If it is not, the copy has no way to stop; when it will terminate, only God knows. Even if the code miraculously runs to completion, do not expect the program to behave correctly afterwards, because the memory beyond dest will likely have been badly corrupted. To stay compatible with the standard C/C++ library there is not much that can be done here, so this check is left to the programmer. Similarly:
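A typical memory-copy routine, of the kind the next paragraph refers to, might look like the following sketch; the original article's snippet is not reproduced here, and the signature simply follows the standard memcpy.

#include <stddef.h>
#include <assert.h>

void *memcpy(void *dest, const void *src, size_t n)
{
    unsigned char       *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;

    assert(dest != NULL && src != NULL);
    /* n is size_t (unsigned): a "negative" value passed by the caller
     * silently becomes a huge positive count and overruns dest. */
    while (n--) {
        *d++ = *s++;
    }
    return (dest);
}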
The same issue arises with memory copying: one must guard against a negative value being passed as n. That parameter specifies how many bytes to copy, and a negative value coerced to an unsigned type becomes a very large positive number, corrupting all the memory after dest. Pointers in embedded systems must be strictly checked before use, and buffer lengths must be carefully verified; otherwise tragedy is hard to avoid. For instance, consider a function pointer. On ARM, even if it has been assigned NULL or 0, calling it does not necessarily trigger an exception; it simply executes the code at address 0, which is the first instruction executed after power-up, so the system appears to reset. This kind of failure is much harder to notice than on a PC, where the MMU would raise a fault and draw the programmer's attention. In embedded systems, all of this is left to the programmer to find.
Memory overflows can occur at any unguarded moment. How large is the heap allocated for the whole foreground/background system (or the operating system)? What is the maximum call depth in the system, and how much stack space does it occupy? It is not enough to verify that the program is functionally correct; these parameters must also be tracked. An overflow in any one place can be fatal to the system. Embedded systems must run continuously for long periods, with strict stability and reliability requirements, so these figures need to be carefully tuned.
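One common way to track the worst-case stack usage mentioned above is to fill the stack with a known pattern early at startup and later scan for the deepest overwritten word. Below is a minimal sketch under assumptions: a full-descending stack, and linker symbols named __stack_start and __stack_end marking the region (real names depend on the toolchain and linker script).

#include <stdint.h>

#define STACK_FILL_PATTERN 0xDEADBEEFu

/* Assumed linker symbols: lowest and one-past-highest word of the stack region. */
extern uint32_t __stack_start[];
extern uint32_t __stack_end[];

/* Normally called from the reset/startup code, before the stack is in use,
 * so that the fill does not destroy live stack frames. */
void StackFill(void)
{
    uint32_t *p = __stack_start;

    while (p < __stack_end) {
        *p++ = STACK_FILL_PATTERN;
    }
}

/* Scan from the bottom of the stack upward; the first word that no longer
 * holds the fill pattern marks the deepest point the stack has ever reached. */
uint32_t StackUnusedBytes(void)
{
    const uint32_t *p = __stack_start;

    while (p < __stack_end && *p == STACK_FILL_PATTERN) {
        p++;
    }
    return (uint32_t)((const uint8_t *)p - (const uint8_t *)__stack_start);
}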
Debugging embedded systems is often very complex, and the available means are not as abundant as in PC programming, which makes development costs significantly higher. The main debugging methods for embedded systems are single-step tracing, typified by JTAG, and printf-style logging.
However, these two methods do not solve every problem. JTAG requires a debug probe (which can be quite expensive) connected between the host and the target; software such as a GDB client then logs into the probe and traces the running program. To be honest, this is the fundamental debugging approach for embedded systems and is quite effective. It still has shortcomings, though. When too many breakpoints are set, exceeding what the hardware supports, low-end CPUs cannot provide more hardware breakpoints, so the JTAG software must simulate them or implement them with software traps (software interrupts or exceptions), and the mechanics become quite involved. In short: first, long debugging sessions are impractical and relatively unstable; second, the debugger can change the program's behavior through its effect on timing. With a JTAG probe attached, hardware breakpoints do not affect the system's running speed, but software breakpoints inevitably cost some performance, so fidelity suffers. And when many breakpoints are set and the system enters a critical section, breakpoints may stop working: global critical sections in embedded systems are usually implemented by disabling interrupts, some CPUs have no non-maskable interrupt, and once the breakpoint count exceeds the hardware limit the remaining breakpoints are implemented in software, which needs interrupts to be working...
JTAG is especially unhelpful for debugging timing problems and high-speed communication code. Communication is often fast, with packets arriving one after another to complete a single action; with high-speed communication, breakpoints simply do not let the program function, so printf-style debugging is the only option. It works, but several points need attention. Embedded systems usually have no screen, so printf output goes out over a serial port. A serial port can be driven by polling, by interrupts, or by DMA; debugging output from printf should only ever be sent by polling, never by interrupt or DMA. Whether in a foreground/background program or under an operating system, there are awkward moments: you may need to print inside a global critical section (interrupts disabled), inside an interrupt handler (nested interrupts not allowed), or inside certain drivers (where other devices have not yet been initialized and memory allocation and interrupts may not work). In those situations, using the UART interrupt to output characters is unwise, so debugging output should be done by polling only. Do not dream up clever schemes here; they are unnecessary and, above all, unreliable, and reliable output is the whole point of debug printing. Because of this, printf also costs performance: the serial port's maximum baud rate is 115200 bps, and the faster the CPU, the more time it wastes waiting for the previous character to finish, time that is simply thrown away. So use printf with some strategy, printing only at key points, not scattering it everywhere until the output drowns out the bug.
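A minimal polled debug-output routine along these lines might look like the sketch below; the UART status and data registers (UART_STATUS_REG, UART_TX_REG) and their addresses are hypothetical, not taken from any particular chip.

#include <stdint.h>

/* Hypothetical memory-mapped UART registers; real addresses and bit
 * definitions come from the chip's reference manual. */
#define UART_STATUS_REG (*(volatile uint32_t *)0x40002000u)
#define UART_TX_REG     (*(volatile uint32_t *)0x40002004u)
#define UART_TX_READY   (1u << 0)

/* Busy-wait until the transmitter is free, then send one character.
 * No interrupts, no DMA, no buffering: safe inside critical sections,
 * interrupt handlers, and early driver init code. */
void DebugPutChar(char c)
{
    while ((UART_STATUS_REG & UART_TX_READY) == 0) {
        /* poll */
    }
    UART_TX_REG = (uint32_t)(unsigned char)c;
}

void DebugPutStr(const char *s)
{
    while (*s != '\0') {
        DebugPutChar(*s++);
    }
}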
These two methods still do not cover everything. In practice, if the board has one or two LEDs, turning them on and off from the code at particular points is also a way of showing the program's state. This is well suited to debugging interrupts and critical sections: toggling an LED takes almost no time, essentially a single memory read/write if the I/O register is memory-mapped, so the impact on timing is minimal. For complex timing problems, spare I/O pins can be driven low or high at the points of interest and then observed with a digital oscilloscope or logic analyzer. This is especially valuable for analyzing how often a piece of code runs, how long it takes, and how much an optimization actually improves overall performance. For simple microcontrollers, the vendor's development tools often include cycle-counting features, but for parts with caches and an MMU such statistics are inaccurate, often less precise than a measurement with an oscilloscope. If no oscilloscope is available, the CPU's internal timer can be used for timing statistics instead, in combination with printf.
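A sketch of the pin-toggling technique follows, assuming hypothetical GPIO set/clear registers (GPIO_SET_REG, GPIO_CLR_REG) and a spare debug pin; the addresses are illustrative only.

#include <stdint.h>

/* Hypothetical GPIO set/clear registers and debug pin; the real addresses
 * come from the chip manual. */
#define GPIO_SET_REG  (*(volatile uint32_t *)0x40003000u)
#define GPIO_CLR_REG  (*(volatile uint32_t *)0x40003004u)
#define DEBUG_PIN     (1u << 5)

/* Bracket the code under test with a pin pulse; the pulse width seen on an
 * oscilloscope or logic analyzer is the execution time, and the pulse rate
 * shows how often the code runs. */
void ProcessPacket(void)
{
    GPIO_SET_REG = DEBUG_PIN;    /* pin high: measurement starts */

    /* ... code being measured ... */

    GPIO_CLR_REG = DEBUG_PIN;    /* pin low: measurement ends */
}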
I once had a colleague who was debugging a Philips ARM7 part. Its external RAM was all static RAM, so even if the CPU crashes, the data in the SRAM is not lost as long as power is not removed. Since the external SRAM is mapped into the address space just like the internal SRAM, accessing it costs a single read/write instruction and is very fast. Exploiting this, he wrote markers into the SRAM at every program module and checkpoint. Whenever the system behaved abnormally, the first thing the code did after the ARM7 reset was to retrieve and print the data recorded before the reset. This clever trick made debugging the ARM7 code very effective. It does not work if only SDRAM is available, because once the system resets, the SDRAM loses its contents without refresh.
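A sketch of this trace-marker idea is shown below, under the assumptions that a fixed SRAM address (0x81000000 here is purely illustrative) is neither zeroed by the startup code nor cleared on reset, and that it survives a reset while power stays on.

#include <stdint.h>

extern void DebugPutStr(const char *s);   /* the polled output from the earlier sketch */

/* Hypothetical SRAM words reserved for trace markers. */
#define TRACE_BASE   ((volatile uint32_t *)0x81000000u)
#define TRACE_MAGIC  0x54524143u           /* "TRAC" */

/* One cheap write at each module or checkpoint. */
#define TRACE_MARK(id)  (TRACE_BASE[1] = (uint32_t)(id))

void TraceInitAfterReset(void)
{
    if (TRACE_BASE[0] == TRACE_MAGIC) {
        /* A previous run left a marker: report the last checkpoint reached. */
        DebugPutStr("last checkpoint before reset: ");
        /* ... format and print TRACE_BASE[1] ... */
    }
    TRACE_BASE[0] = TRACE_MAGIC;
    TRACE_BASE[1] = 0;
}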
Everyone knows that the biggest challenge in embedded work is that the hardware and the software mature at the same time. When problems arise, it is often unclear whether they come from the software or the hardware. Many issues can be worked out in simulation, but simulation is only simulation; once the code runs on the real board, plenty of problems still appear. The low-level embedded world is made up of both software (drivers) and hardware, and solving these problems requires knowledge of both, which raises the bar for the people involved. I have run into many tricky problems, all of them messy system-level issues.

1. A system had to run continuously 24 hours a day and, on a power failure, preserve its state at the moment of the outage; when power returned, it had to restore that state and continue working. We implemented this in software, but the results were not as expected: out of ten thousand power failures there were always a few dozen abnormal cases, and since they could not be reproduced, we could only guess. Debugging was difficult precisely because the system lost power; with JTAG attached, when the target lost power the debug session died with it, so step-by-step debugging was impossible. The original design used capacitors to store a small amount of energy so the system could keep working briefly after the outage, save its state, and then enter standby. Tests of the power-failure detection signal showed nothing wrong, and for a while the problem remained a mystery. The system consisted of two modules, a working module and a control module. The control module had capacitors to ride through the outage; the working module did not. So when the power failed, the two modules did not lose power at the same moment: by the time the control module detected the failure, the working module had often already lost power and could no longer send back the related data correctly, so the control module misbehaved. The two moments were so close together that it was hard to tell which came first. The fix was simple: make the working module the master for power-failure detection and synchronize the control module to it, and the problem disappeared.
2. Again on power-failure protection: after simulating power failures with a relay successfully thousands of times, we finally ran a whole-machine test, only to find that the protection frequently failed. Close examination of the circuits showed everything normal and identical, so the engineering department blamed our R&D department for sloppy testing and shipping a defective product. Very frustrating. After careful analysis we judged that a software fault was very unlikely and the problem had to be in the hardware: under frequent power failures, the supercapacitors might not store enough energy to complete the protection sequence. But what was causing such frequent power failures? It sounds unbelievable, but by continuously monitoring the control board's supply with a digital oscilloscope we discovered that the three-phase AC input went through a phase protector, which kept switching on and off while the system was running (possibly depending on the system's state). The fix was simple: take the controller's supply from before the phase protector.
These issues may seem like hardware problems, but they are often encountered during product debugging. These problems require software personnel to confirm whether bugs in the software can cause such situations, and hardware engineers must also confirm the hardware. Of course, the process of confirming hardware is lengthy and complex, and debugging methods are very limited. In contrast, debugging embedded software tends to be more cost-effective and efficient. Therefore, embedded software personnel often spend considerable time confirming software issues before suspecting hardware. As embedded developers, understanding the basic principles of hardware, combining them with software operational principles, and collaborating with hardware engineers to experiment and pinpoint errors is a very effective approach.
Some friends online often ask me questions regarding low-level knowledge, including issues related to multiprocessors. I have limited knowledge about multiprocessor issues, but I would like to discuss the application of multiple CPUs in the embedded field. Embedded systems are, after all, one of the application areas of computer science. To excel in this field, solid theoretical knowledge of computer science is essential.
First, multiprocessor systems can be grouped into several types:

1. Processors of the same model, all identical, connected through some communication medium such as multi-port RAM, RapidIO, Gigabit Ethernet, or PCI-E;
2. Processors of different models, possibly with completely different architectures, connected through the same kinds of media (multi-port RAM, RapidIO, and so on);
3. Multiple CPUs integrated into a single chip (CMP). These CPUs share everything and form a much more tightly coupled system than one built on multi-port RAM.

Why use multiple processors?

1. For large-scale parallel computing;
2. To exploit the strengths of different CPUs, for example a DM642-style scheme in a complex video product, using the DSP's floating-point horsepower alongside the ARM's strength at control and transaction processing;
3. Simply to raise system performance.

For ordinary applications, raising performance is the basic motivation. However, using multiple processors in an embedded system is not a simple matter: the software design is hard, and debugging is a serious problem. Without an operating system, in a foreground/background design, you must devise the communication scheme and the mechanism for gathering results yourself; much of the system has to be built from scratch, and the reliability and fault tolerance of the interconnect design are crucial. So, if at all possible, using a mature, stable operating system that supports multiple processors greatly reduces the development difficulty. Finding such an operating system, though, is not easy.

First clarify your requirements: do you need thread or process migration? Do you need load balancing across processors? If thread or process migration is not supported, there can be no dynamic balancing of work across processors; you can only decide in advance which processor each thread or process runs on. For heterogeneous multiprocessors, thread and process migration has little practical meaning, and for profit-driven companies it currently has no practical value worth discussing, so migration is really a topic for symmetric processors. Even on symmetric processors, not every process can migrate. For symmetric multiprocessing, the operating system hides the underlying details, so users can develop almost as if for a single CPU; it is not completely identical to single-CPU development, but it removes a lot of the difficulty.

Many friends ask me whether RTEMS can run on a CMP multiprocessor such as x86. It can, but the design differs from an ordinary symmetric multiprocessor. The CPUs in a CMP share many resources, including interrupts, memory, and buses, and their address spaces are generally the same. RTEMS takes a heterogeneous, distributed approach even on symmetric hardware: if there are several CPUs, each must run its own instance of RTEMS. Communication therefore becomes critically important. Multiple RTEMS instances need multiple system ticks, and where do those ticks come from? Because a CMP shares so many resources, users must manually assign interrupt sources and partition memory among the RTEMS instances, which means the several CPUs running RTEMS end up with many different drivers. Such a tightly coupled system is very hard to manage.
Compared with CMP, an SMP system built from identical discrete CPUs is simpler, because all the drivers are the same; only the communication driver may need special handling depending on the interconnect. That greatly reduces the development pressure and the debugging difficulty. If every CPU had to be treated as its own differently configured core, it would be a disaster, especially for debugging. So, from an economic point of view, I prefer multiprocessor systems built from several identical single CPUs.
For heterogeneous processors, RTEMS can also cope, but there is still a problem: each core needs its own RTEMS support, which makes development inconvenient, and debugging an operating system is relatively complex. The practical arrangement is therefore that, among the heterogeneous processors, the one responsible for control and transaction work runs the operating system, while the one responsible for computation runs a simple foreground/background program and answers the operating system's computation requests over shared memory. This greatly reduces the development difficulty: the operating system treats the DSP almost like a hardware register, writing a few registers to get a result, or feeding in a large block of data to get a complex result back. In short, this request/response style of processing is what the vast majority of engineering projects adopt. It is simple, reliable, and practical.
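A sketch of such a shared-memory request/response mailbox, as seen from the control processor, is shown below. The mailbox address and field layout are assumptions for illustration, and in a real design the flag updates would need the appropriate memory barriers or cache handling for the chip in question.

#include <stdint.h>

/* Hypothetical mailbox placed in memory visible to both processors. */
typedef struct {
    volatile uint32_t request;    /* set by the control CPU when a job is posted */
    volatile uint32_t done;       /* set by the compute CPU when the result is ready */
    uint32_t          cmd;        /* what to compute */
    uint32_t          arg[8];     /* input data */
    uint32_t          result;     /* output */
} Mailbox;

#define MAILBOX ((Mailbox *)0x80100000u)   /* illustrative shared address */

/* Control-processor side: post a job and wait for the answer, much like
 * writing to and polling a hardware peripheral. */
uint32_t ComputeRemote(uint32_t cmd, const uint32_t *arg, int n)
{
    int i;

    for (i = 0; i < n && i < 8; i++)
        MAILBOX->arg[i] = arg[i];
    MAILBOX->cmd  = cmd;
    MAILBOX->done = 0;
    MAILBOX->request = 1;                  /* hand the job to the compute CPU */

    while (MAILBOX->done == 0) {
        /* poll; a real system might wait on an interrupt instead */
    }
    MAILBOX->request = 0;
    return MAILBOX->result;
}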
It seems that the use of multiprocessors in embedded systems is highly application-dependent.