This article is sourced from: Arm Tech Academy
This article introduces seven techniques for developing high-reliability embedded systems.
They are easy to apply and remain useful over the long term, helping a system run more reliably and catch abnormal behavior.
1►
Fill ROM with Known Values
Embedded software developers are often very optimistic; they simply expect their code to run faithfully for a long time. A microcontroller jumping out of the application space and executing from an unexpected region of memory seems like a rare event.
However, the chance of this happening is no lower than that of a buffer overflow or the dereferencing of an errant pointer. It does happen! The system's behavior afterward is unpredictable: unused memory typically defaults to 0xFF, and because the area has usually never been written to, its contents may be anyone's guess.
Fortunately, most linkers and IDEs offer features that help detect such events and recover the system. The trick is to use the FILL command to fill unused ROM with a known bit pattern. Many different patterns can be used to fill unused memory, but if the goal is a more reliable system, the most obvious choice is to place a fault-handler ISR in those locations.
If the system encounters an error, and the processor starts executing code outside the program space, it will trigger the ISR and provide an opportunity to store the processor, register, and system state before deciding on corrective action.
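How the fill pattern routes execution to a handler depends on the toolchain and architecture, so it is not shown here. The "store the state" step, however, can be sketched in plain C. This is only an illustration under stated assumptions: the record layout, the magic value, and the function names are hypothetical, and on a real target the record would need to live in RAM that the startup code does not clear.

```c
#include <stdint.h>

/* Hypothetical marker that tells a valid record from random RAM contents. */
#define FAULT_MAGIC 0xFA17C0DEu

/* Hypothetical record layout; a real handler would capture whatever
 * registers the target's exception entry makes available. */
typedef struct {
    uint32_t magic;       /* FAULT_MAGIC when the record is valid */
    uint32_t pc;          /* program counter at the time of the fault */
    uint32_t lr;          /* return address */
    uint32_t sp;          /* stack pointer */
    uint32_t fault_count; /* number of faults recorded so far */
} fault_record_t;

/* Assumption: on target this is placed in a no-init RAM section so that
 * it survives the reset that follows the fault. */
static fault_record_t g_fault_record;

/* Called from the fault handler before corrective action is taken. */
void fault_record_save(uint32_t pc, uint32_t lr, uint32_t sp)
{
    uint32_t count = 0u;

    if (g_fault_record.magic == FAULT_MAGIC) {
        count = g_fault_record.fault_count; /* keep history across faults */
    }
    g_fault_record.pc = pc;
    g_fault_record.lr = lr;
    g_fault_record.sp = sp;
    g_fault_record.fault_count = count + 1u;
    g_fault_record.magic = FAULT_MAGIC;     /* written last: record complete */
}

/* Called at startup; returns 1 and copies the record if one is present. */
int fault_record_load(fault_record_t *out)
{
    if (g_fault_record.magic != FAULT_MAGIC) {
        return 0;
    }
    *out = g_fault_record;
    return 1;
}
```

At the next boot, the application could call fault_record_load() and report or log the saved values before resuming normal operation.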
2►
Check Application CRC
A significant advantage for embedded engineers is that our IDEs and toolchains can automatically generate checksums for the application or memory space, allowing us to verify the integrity of the application based on this checksum. Interestingly, in many of these cases, checksums are only used when loading the program code onto the device.
However, if the CRC or checksum is kept in memory, verifying that the application is still intact at startup (or even periodically, for long-running systems) is an excellent way to detect unexpected changes. The probability of a programmed application changing is very low, but considering the billions of microcontrollers delivered each year and the potentially harsh environments they operate in, the chance of an application image becoming corrupted is not zero. More likely, a defect in the system could trigger an unintended flash write or erase of a sector, compromising the integrity of the application.
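A minimal sketch of such a startup check is shown below. It assumes the toolchain stores a standard CRC-32 (reflected polynomial 0xEDB88320) of the image somewhere the firmware can read; the function names are illustrative, and a real project must match whatever algorithm and parameters its own tools use.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 with the reflected polynomial 0xEDB88320, initial value
 * 0xFFFFFFFF and a final inversion. Slow but small; a table-driven or
 * hardware CRC could be substituted. */
uint32_t crc32_compute(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Returns 1 if the image matches the stored CRC, 0 otherwise.
 * `image`, `len` and `expected_crc` would come from linker symbols or a
 * header written by the build tools (assumption). */
int app_image_is_intact(const uint8_t *image, size_t len, uint32_t expected_crc)
{
    return crc32_compute(image, len) == expected_crc;
}
```

If the check fails, the system can refuse to start the application, fall back to a bootloader, or raise an error, depending on what the product requires.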
3►
Perform RAM Checks at Startup
To build a more reliable and robust system, it is crucial to ensure that the system hardware functions correctly. After all, hardware can fail. (Fortunately, software never fails; it only does what the code tells it to do, whether right or wrong). Verifying that there are no issues with the internal or external RAM at startup is a good way to ensure that the hardware can operate as expected.
There are many different ways to perform a RAM check, but a common approach is to write a known pattern, wait a short period, and then read it back. What is read should match what was written. In most cases the RAM check passes, which is the result we want. However, there is a very small chance that the check will fail, which gives the system an excellent opportunity to flag a hardware problem.
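The write-then-read-back idea can be sketched as follows. This is a simplified illustration, not a complete memory test: the two patterns and the function name are assumptions, the short delay between writing and reading is omitted, and the test destroys the contents of the region, so it has to run before that RAM is in use.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes each pattern to every word in the region, then reads it back.
 * Returns 1 if every word held both patterns, 0 on the first mismatch.
 * The pointer is volatile so the compiler cannot optimize the accesses away. */
int ram_pattern_test(volatile uint32_t *start, size_t words)
{
    /* Alternating bits, then their complement, so each bit is tested
     * as both 0 and 1. */
    static const uint32_t patterns[2] = { 0xAAAAAAAAu, 0x55555555u };

    for (size_t p = 0; p < 2; p++) {
        for (size_t i = 0; i < words; i++) {
            start[i] = patterns[p];
        }
        for (size_t i = 0; i < words; i++) {
            if (start[i] != patterns[p]) {
                return 0; /* hardware problem: report it */
            }
        }
    }
    return 1;
}
```

A simple pattern test like this catches stuck bits; address-line faults need a test that writes different values to different addresses.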
4►
Use Stack Monitors
For many embedded developers, the stack seems to be a rather mysterious force. When strange things start happening and engineers are stumped, they begin to suspect that something is going on in the stack. The result is blind adjustment of the stack's size and location. But the error is often unrelated to the stack, and how can anyone be sure either way? After all, how many engineers have actually performed a worst-case stack size analysis?
The stack size is statically allocated at compile time, but the stack is used dynamically. As the code executes, variables needed by the application, return addresses, and other information are continuously stored on the stack, so the stack grows within its allocated memory. Sometimes, however, this growth exceeds the limit fixed at compile time, and the stack overflows and corrupts data in adjacent memory areas.
One way to be confident that the stack is working correctly is to implement a stack monitor as part of the system’s “health” code (how many engineers would do this?). The stack monitor creates a buffer zone between the stack and “other” memory areas and fills it with a known bit pattern.
Then, the monitor continuously checks whether there are any changes in the pattern. If the bit pattern changes, it means that the stack has grown too large and is about to push the system into the dark abyss! At this point, the monitor can log the occurrence of the event, the system state, and any other useful data for later diagnosis of the problem.
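The guard-zone check described above might look like the sketch below. The guard size, the pattern, and the function names are assumptions chosen for illustration; on a real target the guard words would be placed by the linker directly beyond the end of the stack.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_GUARD_WORDS   8u          /* size of the buffer zone (assumed) */
#define STACK_GUARD_PATTERN 0xC0DEC0DEu /* known bit pattern (assumed) */

/* Fill the buffer zone with the known pattern; call once at startup. */
void stack_guard_init(volatile uint32_t *guard)
{
    for (size_t i = 0; i < STACK_GUARD_WORDS; i++) {
        guard[i] = STACK_GUARD_PATTERN;
    }
}

/* Returns 1 if the pattern is untouched, 0 if any word has changed,
 * which means the stack has grown into the buffer zone. */
int stack_guard_is_intact(const volatile uint32_t *guard)
{
    for (size_t i = 0; i < STACK_GUARD_WORDS; i++) {
        if (guard[i] != STACK_GUARD_PATTERN) {
            return 0;
        }
    }
    return 1;
}
```

The health code would call stack_guard_is_intact() periodically and, on failure, log the event and system state before taking corrective action.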
Most real-time operating systems (RTOS), and microcontrollers that implement a memory protection unit (MPU), provide stack monitoring. The scary part is that these features are often turned off by default, or developers intentionally disable them. A quick search online reveals that many people recommend turning off the stack monitor in a real-time operating system to save 56 bytes of flash memory, which is a foolish practice!
5►
Use MPUs
In the past, it was difficult to find a memory protection unit (MPU) in a small, inexpensive microcontroller, but that situation has begun to change. Now, from high-end to low-end microcontrollers, MPUs are available, providing embedded software developers with an opportunity to significantly enhance the robustness of their firmware.
MPUs are increasingly being coupled with operating systems to set up memory spaces in which processes are isolated from one another, so that tasks can execute their code without worrying about being stomped on. If something goes wrong, the runaway process is terminated and other protective measures are taken. Keep an eye out for microcontrollers that include an MPU; if one is available, make good use of it.
6►
Establish a Robust Watchdog System
A commonly favored watchdog implementation enables the watchdog (a good start) but then uses a periodic timer to reset it, and that timer runs completely isolated from whatever is happening in the program.
The purpose of a watchdog is to ensure that if an error occurs, the watchdog is not reset, so that when work halts the system is forced through a hardware reset and recovers. A timer that is independent of system activity defeats this purpose, because it keeps resetting the watchdog even after the system has failed.
Embedded developers need to carefully consider and design how application tasks integrate into the watchdog system. For example, one technique is to have each task that runs within a certain period indicate that it completed its work successfully. If any task fails to check in, the watchdog is not reset and a forced reset occurs. There are also more advanced techniques, such as using an external watchdog processor that monitors the performance of the main processor, and vice versa. Establishing a robust watchdog system is crucial for a reliable system.
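The task check-in technique can be sketched as follows. The task names, the bitmask scheme, and hw_watchdog_kick() are hypothetical placeholders; on real hardware the kick would write to the watchdog peripheral, and the shared flags would need atomic access if tasks can preempt one another.

```c
#include <stdint.h>

/* One bit per supervised task (hypothetical task set). */
#define WDG_TASK_SENSOR  (1u << 0)
#define WDG_TASK_CONTROL (1u << 1)
#define WDG_TASK_COMMS   (1u << 2)
#define WDG_ALL_TASKS    (WDG_TASK_SENSOR | WDG_TASK_CONTROL | WDG_TASK_COMMS)

static uint32_t g_checkins;   /* tasks that reported in during this period */
static uint32_t g_kick_count; /* stands in for the hardware watchdog */

/* Placeholder: a real version would refresh the watchdog peripheral. */
static void hw_watchdog_kick(void)
{
    g_kick_count++;
}

/* Each task calls this after successfully completing its work. */
void wdg_task_checkin(uint32_t task_bit)
{
    g_checkins |= task_bit;
}

/* Called once per supervision period. The watchdog is reset only if
 * every task checked in; otherwise it is left to expire and force a
 * hardware reset. Returns 1 if the watchdog was reset. */
int wdg_supervise(void)
{
    int all_alive = ((g_checkins & WDG_ALL_TASKS) == WDG_ALL_TASKS);

    g_checkins = 0u; /* start a fresh period */
    if (all_alive) {
        hw_watchdog_kick();
    }
    return all_alive;
}

/* Exposed so that the number of kicks can be observed. */
uint32_t wdg_kick_count(void)
{
    return g_kick_count;
}
```

Because the kick depends on every task reporting in, a single hung task is enough to let the watchdog expire, which a free-running timer would never allow.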
7►
Avoid Dynamic Memory Allocation
Engineers who are not accustomed to working in resource-constrained environments may attempt to use features of their programming languages that allow them to use dynamic memory allocation. After all, this is a technique commonly used in computing systems where memory is allocated only when necessary.
For example, when developing in C, engineers may tend to use malloc to allocate space on the heap. An operation executes, and once it completes, the allocated memory can be returned to the heap with free.
In resource-constrained systems, this can be a disaster! One problem with using dynamic memory allocation is that errors or improper techniques can lead to memory leaks or fragmentation. When these issues occur, most embedded systems do not have the resources or knowledge to monitor the heap or handle it properly. And when they do occur, what happens if the application requests space but there is no available space to use?
The problems arising from dynamic memory allocation are complex, and handling them properly can be a nightmare! One alternative is simply to allocate memory statically. For example, create a 256-byte buffer in the program instead of requesting a buffer of that size via malloc. This memory remains allocated throughout the application's lifetime, with no concerns about heap exhaustion or memory fragmentation.
The techniques above can help developers build better embedded systems. None of them is complicated, yet together they allow designers to develop more reliable embedded systems.
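As a small illustration of the static alternative, the sketch below reserves a 256-byte buffer at compile time and guards it with a bounds check. The buffer's purpose (received data) and the function names are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_BUFFER_SIZE 256u

/* Reserved at link time: no malloc, no free, no fragmentation. */
static uint8_t g_rx_buffer[RX_BUFFER_SIZE];
static size_t  g_rx_length;

/* Copies `len` bytes into the static buffer.
 * Returns 1 on success, 0 if the data does not fit (buffer unchanged). */
int rx_buffer_store(const uint8_t *data, size_t len)
{
    if (len > RX_BUFFER_SIZE) {
        return 0;
    }
    memcpy(g_rx_buffer, data, len);
    g_rx_length = len;
    return 1;
}

/* Read-only access to the stored bytes. */
const uint8_t *rx_buffer_data(void)
{
    return g_rx_buffer;
}

/* Number of valid bytes currently stored. */
size_t rx_buffer_length(void)
{
    return g_rx_length;
}
```

The memory footprint is known at build time, so running out of space shows up as a link error or a rejected request rather than as a failed allocation in the field.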
