Sharing Some Highly Practical Embedded Software Debugging Techniques

In the process of embedded software development, it is generally true that more time is spent on testing than on coding, often at a ratio of 3:1 (or even more).

This ratio decreases as your programming and testing skills improve, but regardless, software testing is crucial for most people.

Many years ago, a developer asked a question to gain a deeper understanding of embedded systems: "How can I know and understand what my system is really doing?"

The question was surprising because no one had asked it before; at the time, embedded developers mostly asked questions like "How can I make my program run faster?" or "Which compiler is the best?"

Faced with this unusual yet mature question, many people might not know how to respond. Below, I will discuss common methods and tips for finding bugs through software testing.

1

<strong>Understand the Use of Tools</strong>

Embedded systems typically have high reliability requirements. A failure in a safety-critical embedded system can have catastrophic consequences, and even in non-safety-critical systems a defect can cause significant economic loss, because the products are mass-produced.

This calls for rigorous testing, verification, and validation of embedded systems, including their software. As more fields rely on software and microprocessors to control embedded devices, testing increasingly complex embedded software quickly and effectively becomes ever more important.

Just as a mechanic needs tools, a good programmer should be proficient in using various software tools. Different tools have different scopes of use and functionalities. With these tools, you can see what your system is doing, what resources it is consuming, and how it interacts with external elements. Problems that have troubled you for days may be easily resolved with a tool, but unfortunately, you might not know about it.

So why do so many people only think of using testing tools after struggling for a long time? There are many reasons, but two main ones are:

One is fear;
The other is laziness.

Fear arises because integrating testing tools or modules into the code takes skill and may introduce new errors. Developers therefore often prefer to eliminate bugs by repeatedly modifying and recompiling the code, which frequently proves futile.

Laziness stems from the habit of relying on simple testing methods such as printf.

Here are some commonly used testing tools in embedded systems:

(1) <strong>Source-level Debugger</strong>
This type of debugger generally provides single-stepping, breakpoints, memory inspection, variable viewing, and similar functions, making it the most fundamental and effective debugging method for embedded systems. For example, the gdb debugger provided with VxWorks Tornado II belongs to this category.

(2) <strong>Simple and Practical Print Display Tool (printf)</strong>
printf and similar print tools are probably the simplest and most flexible debugging tools. Printing variables as the code runs tells you about its execution state. However, printf can significantly disturb normal execution (it generally consumes a considerable amount of CPU time), so use it cautiously, preferably behind a print switch that controls the output.
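A print switch can be as small as one flag checked before any formatting happens. The sketch below is only an illustration: the names dbg_enabled and dbg_printf are made up for this example, and it assumes a target where printf output is actually routed somewhere, such as a UART.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical runtime switch: 0 = silent, 1 = print. */
int dbg_enabled = 0;

/* Forwards to vprintf only while the switch is on.
 * Returns the number of characters printed, or 0 when output is off. */
int dbg_printf(const char *fmt, ...)
{
    va_list args;
    int written;

    if (!dbg_enabled) {
        return 0;               /* early exit: no formatting cost */
    }
    va_start(args, fmt);
    written = vprintf(fmt, args);
    va_end(args);
    return written;
}
```

With the switch off, a call costs one comparison, so the debug statements can stay in the code and be turned on only while investigating a problem.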

(3) <strong>ICE (In-Circuit Emulator) or JTAG Debugger</strong>
An ICE is a device that emulates the CPU core, allowing real-time inspection of the CPU's internal workings without interfering with normal operation. It can provide complex conditional breakpoints, advanced real-time tracing, performance analysis, and port analysis, similar to what desktop debugging software offers.

An ICE typically contains a special CPU known as a bond-out CPU. This is a variant of the chip whose internal signals, normally invisible once the CPU is packaged, are brought out through special connections so that they can be observed. When used together with powerful debugging software on a workstation, an ICE can provide the most comprehensive debugging capabilities available.

However, an ICE also has drawbacks: it is expensive; it may not run at full speed; and not every CPU is available in a bond-out version. Bond-out versions also tend to lag behind newly released CPUs. JTAG (Joint Test Action Group), although initially developed to test ICs and circuit connections, has since been extended to include debugging support.

(4) <strong>ROM Monitor</strong>
A ROM monitor is a small program that resides in the embedded system's ROM and communicates with debugging software running on a workstation via serial or network connections. This is a cost-effective method and also the most basic technology. It requires only a communication port and a small amount of memory space, with no other specialized hardware needed.

It provides functions such as code downloading, run control, breakpoints, single-step execution, and observation and modification of registers and memory. Since the ROM monitor is part of the operating software, it only works when your application is running. If you want to check the CPU and application status, you must stop the application and re-enter the ROM monitor.

(5) <strong>Data Monitor</strong>
This monitor can display specified variable contents without stopping CPU operation and can also collect and graphically display the changes of various variables.

(6) <strong>Operating System Monitor</strong>
The operating system monitor can display events such as task switches, semaphore sends and receives, and interrupts. On one hand, these monitors can present the relationships and timing between events; on the other hand, they can help diagnose issues such as priority inversion, deadlocks, and interrupt latency.

(7) <strong>Profiler</strong>
This tool can be used to test where the CPU is spending its time. Profiler tools can inform you about system bottlenecks, CPU usage, and areas that need optimization.

(8) <strong>Memory Tester</strong>
This tool can identify memory usage issues, such as memory leaks, fragmentation, and corruption. If the system exhibits unpredictable or intermittent problems, it is advisable to investigate with memory testing tools.

(9) <strong>Execution Tracer</strong>
This tool can show which functions the CPU executed, who called them, what the parameters were, and when they were called. This tool is mainly used to test code logic and can help identify anomalies among numerous events.

(10) <strong>Coverage Tester</strong>
This tool primarily shows which code the CPU has executed and indicates which code branches have not been executed. This helps improve code quality and eliminate dead code.

(11) <strong>Home-made Tester</strong>
In embedded applications, sometimes it is necessary to write custom tools for specific testing purposes. I once developed a video stream recording tool that greatly assisted in testing video conferencing data flow and changes, helping the company identify several deeply hidden bugs.
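As an illustration of how small a home-made tool can be, here is a sketch of an event trace kept in a ring buffer, in the spirit of the execution tracer described above. It is not the video recording tool mentioned in the text; the names trace_event and trace_recent, the entry layout, and the depth of 8 are all assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define TRACE_DEPTH 8u   /* assumed depth; must be a power of two */

typedef struct {
    uint16_t event_id;   /* which event happened (application-defined) */
    uint32_t arg;        /* one parameter worth remembering */
} trace_entry_t;

static trace_entry_t trace_buf[TRACE_DEPTH];
static uint32_t      trace_head = 0;   /* total events recorded so far */

/* Records one event, overwriting the oldest entry once the buffer is full. */
void trace_event(uint16_t event_id, uint32_t arg)
{
    trace_entry_t *e = &trace_buf[trace_head & (TRACE_DEPTH - 1u)];
    e->event_id = event_id;
    e->arg = arg;
    trace_head++;
}

/* Copies the n-th most recent event (0 = newest) into *out.
 * Returns 1 on success, 0 if that event is not in the buffer. */
int trace_recent(uint32_t n, trace_entry_t *out)
{
    uint32_t stored = trace_head < TRACE_DEPTH ? trace_head : TRACE_DEPTH;

    if (out == NULL || n >= stored) {
        return 0;
    }
    *out = trace_buf[(trace_head - 1u - n) & (TRACE_DEPTH - 1u)];
    return 1;
}
```

Because recording an event is only a few stores, calls to trace_event can stay in the code permanently, and the buffer can be read out with a debugger or over a serial port after a failure.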

2

<strong>Detect Memory Issues Early</strong>

Memory issues can be very harmful and difficult to troubleshoot. They fall mainly into three categories: memory leaks, fragmentation, and corruption. The right attitude towards memory issues is clear: early detection leads to early "treatment." Memory leaks are the most notorious of the three. They occur when memory keeps being allocated but is never released, which eventually exhausts the system's memory.

Even careful, veteran programmers sometimes run into memory leaks. Anyone who has hunted one knows the experience: leaks are often deeply hidden and difficult to detect through code review. Some leaks even occur inside libraries, either because of bugs in the library itself or because programmers misused it without properly understanding its interface documentation.

In many cases, memory leaks go undetected and show up only as random failures, which programmers often blame on the hardware. If users do not expect much stability, rebooting the system may not be a big deal; but if they expect high stability, such failures erode their confidence in the product, and that means the project has failed.

Because memory leaks cause so much harm, many tools have been developed to address them. These tools detect leaks by searching for unreferenced or reused memory blocks, by garbage collection, by library tracking, and by other techniques. Each tool has its pros and cons, but overall, using them is better than not using them. In summary, responsible developers should test for memory leaks to prevent problems before they arise.
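The library-tracking idea can be sketched in a few lines: route every allocation through a wrapper and count the blocks that are still outstanding. This is only a minimal illustration with made-up names (tracked_malloc, tracked_free, live_allocs). Real leak detectors also record who allocated each block, and the sketch assumes the target has a working malloc.

```c
#include <stdlib.h>

/* Number of blocks handed out and not yet returned. */
long live_allocs = 0;

/* malloc wrapper that counts successful allocations. */
void *tracked_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL) {
        live_allocs++;
    }
    return p;
}

/* free wrapper that counts releases; NULL is ignored, as with free(). */
void tracked_free(void *p)
{
    if (p != NULL) {
        live_allocs--;
        free(p);
    }
}
```

If live_allocs keeps growing while the system repeats the same workload, something is being allocated and never released.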

Memory fragmentation is even more hidden than memory leaks. As memory is continuously allocated and released, large blocks of memory break down into smaller blocks, leading to fragmentation. Over time, when a large block of memory is requested, it may fail. If the system has sufficient memory, it may last longer, but ultimately, it cannot escape the fate of allocation failure. Memory fragmentation often occurs in systems that use dynamic allocation.

Currently, the most effective way to address this issue is to use tools that display memory usage in the system to identify the culprits causing memory fragmentation, allowing for improvements in the relevant areas. Due to various issues with dynamic memory management, many companies in embedded applications simply disable malloc/free to eliminate potential problems.
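One common alternative to malloc/free, shown here as a sketch and not as something the text prescribes, is a pool of fixed-size blocks. Because every block has the same size, the pool cannot fragment. The block size, block count, and function names below are assumptions chosen for illustration.

```c
#include <stddef.h>

#define POOL_BLOCK_SIZE  32   /* bytes per block (assumed) */
#define POOL_BLOCK_COUNT 8    /* number of blocks (assumed) */

/* While a block is free, its storage holds the link to the next free block. */
typedef union pool_block {
    union pool_block *next;
    unsigned char     data[POOL_BLOCK_SIZE];
} pool_block_t;

static pool_block_t  pool_storage[POOL_BLOCK_COUNT];
static pool_block_t *pool_free_list = NULL;

/* Puts every block on the free list; call once at startup. */
void pool_init(void)
{
    size_t i;

    pool_free_list = NULL;
    for (i = 0; i < POOL_BLOCK_COUNT; i++) {
        pool_storage[i].next = pool_free_list;
        pool_free_list = &pool_storage[i];
    }
}

/* Returns one block, or NULL when the pool is exhausted. */
void *pool_alloc(void)
{
    pool_block_t *blk = pool_free_list;

    if (blk != NULL) {
        pool_free_list = blk->next;
    }
    return blk;
}

/* Returns a block obtained from pool_alloc() to the pool. */
void pool_free(void *p)
{
    pool_block_t *blk = (pool_block_t *)p;

    if (blk != NULL) {
        blk->next = pool_free_list;
        pool_free_list = blk;
    }
}
```

The trade-off is wasted space when requests are smaller than the block size. The sketch is also not thread-safe or interrupt-safe as written.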

Memory corruption is the most severe consequence of memory misuse. It is caused primarily by out-of-bounds array access, writes to already freed memory, pointer arithmetic errors, and out-of-bounds access to stack addresses. Corruption of this kind causes random system failures and is difficult to trace, with few tools available for troubleshooting.

In summary, if you are going to use dynamic memory management, you must be cautious and strictly follow its usage rules, such as: whoever allocates memory must also free it.
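One inexpensive way to catch out-of-bounds writes early is to surround a buffer with known guard words and check them periodically. The sketch below assumes a fixed 16-byte buffer and an arbitrary guard pattern; the type and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define GUARD_WORD 0xDEADBEEFu   /* arbitrary, easy-to-spot pattern */
#define BUF_SIZE   16            /* assumed payload size */

typedef struct {
    uint32_t head_guard;         /* damaged by writes before the buffer */
    uint8_t  data[BUF_SIZE];
    uint32_t tail_guard;         /* damaged by writes past the buffer */
} guarded_buf_t;

/* Clears the payload and arms both guard words. */
void guarded_init(guarded_buf_t *b)
{
    b->head_guard = GUARD_WORD;
    memset(b->data, 0, sizeof b->data);
    b->tail_guard = GUARD_WORD;
}

/* Returns 1 while both guards are intact, 0 once either was overwritten. */
int guarded_ok(const guarded_buf_t *b)
{
    return b->head_guard == GUARD_WORD && b->tail_guard == GUARD_WORD;
}
```

Calling guarded_ok() from an idle task or after each suspicious operation narrows down when the overrun happened, which is often the hardest part of tracing it.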

3

<strong>Deeply Understand Code Optimization</strong>

Beyond system stability, people tend to care most about real-time performance and speed, since code efficiency is crucial for embedded systems. Knowing how to optimize code is a skill every embedded software developer must possess. Just as someone who wants to lose weight must first know where the excess is before buying the right products or equipment, you must know where the time goes before you optimize.

Thus, the premise of code optimization is to identify the areas that truly need optimization and then target those areas for improvement. The profiler mentioned earlier (a performance analysis tool that some feature-rich IDEs provide as a built-in tool) can record various situations, such as CPU usage rates for different tasks, whether task priorities are appropriately assigned, how many times a certain data has been copied, how many times the disk has been accessed, whether network send/receive programs have been called, and whether test code has been disabled, etc.

However, profilers still have limitations when analyzing real-time system performance. People often reach for a profiler only when the system is already in trouble, for example when the CPU is exhausted, and the profiler itself consumes a significant amount of CPU time, so it may be of little use in exactly those situations. This is the observer effect (often loosely called the "Heisenberg effect"): any testing method alters system operation to some degree, and profilers are no exception.

In summary, improving operational efficiency requires knowing exactly what the CPU is doing and how it is performing.
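When no profiler is available, a rough picture of where the time goes can come from manual instrumentation. The sketch below uses the standard clock() function as a stand-in time source. On a real target you would probably substitute a hardware timer or cycle counter, and that substitution, like the names prof_enter and prof_leave, is an assumption of this example.

```c
#include <time.h>

typedef struct {
    const char   *name;         /* label for reports */
    unsigned long calls;        /* how many times the section ran */
    clock_t       total_ticks;  /* accumulated clock() ticks */
    clock_t       started_at;   /* timestamp of the current entry */
} prof_section_t;

/* Marks the start of a measured section. */
void prof_enter(prof_section_t *s)
{
    s->started_at = clock();
}

/* Marks the end of a measured section and accumulates its cost. */
void prof_leave(prof_section_t *s)
{
    s->total_ticks += clock() - s->started_at;
    s->calls++;
}
```

Wrapping a suspect function body in prof_enter() and prof_leave() and comparing total_ticks across sections shows which one deserves optimization first. Like any measurement, the instrumentation itself adds some overhead.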

4

<strong>Don't Let Yourself Search for a Needle in a Haystack</strong>

Searching for a needle in a haystack is a vivid metaphor for debugging. I often hear people in the group exclaiming "shit!" about the code they are debugging. This is understandable because the code is not theirs, and they have every reason to criticize the buggy code, as long as they don't write such code themselves; otherwise, one day, others in the group might similarly criticize their code. Why is there a needle in a haystack? Surely someone dropped it there; why did it fall into the haystack? It must be due to someone's carelessness or negligence.

So when you complain about how hard it is to find the needle, have you considered that it might be you who carelessly dropped it? Similarly, when you are debugging to the point of exhaustion, have you thought about reflecting on whether you might have cut corners by not strictly adhering to good coding design standards, not checking the validity of certain assumptions or algorithms, or not marking potentially problematic code?

For guidance on writing high-quality code, refer to Lin Rui's "High-Quality C++/C Programming Guide" or the "0x8 Scriptures on C."

If you have indeed dropped the needle in the haystack, to prevent pricking yourself before finding it, you must take some precautions, such as wearing safety gloves. Similarly, to expose and capture the root of problems as much as possible, we can design comprehensive error tracking code. How to do this?

Try to handle failures for every function call, check the validity of every parameter input and output, including pointers, and verify whether a certain process has been called too many or too few times. Error tracking can help you identify approximately where you dropped the needle.
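These checks can be sketched as follows: validate every parameter, return an error code that callers must handle, and count calls so that "too many" or "too few" can be detected. The CHECK macro, the err_t codes, and the set_level function are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdio.h>

typedef enum { ERR_OK = 0, ERR_NULL_PTR, ERR_RANGE } err_t;

/* Logs where a check failed, then returns the given error code. */
#define CHECK(cond, code)                                         \
    do {                                                          \
        if (!(cond)) {                                            \
            fprintf(stderr, "%s:%d: check failed: %s\n",          \
                    __FILE__, __LINE__, #cond);                   \
            return (code);                                        \
        }                                                         \
    } while (0)

/* Call counter, useful for spotting unexpected call patterns. */
unsigned long set_level_calls = 0;

/* Stores level (0..100) in *out after validating both arguments. */
err_t set_level(int *out, int level)
{
    set_level_calls++;
    CHECK(out != NULL, ERR_NULL_PTR);
    CHECK(level >= 0 && level <= 100, ERR_RANGE);
    *out = level;
    return ERR_OK;
}
```

The log line names the file, the line, and the failed condition, which is the "approximately where you dropped the needle" information described above.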

5

<strong>Reproduce and Isolate Problems</strong>

If you haven't dropped the needle in the ocean but rather in a haystack, it is easier to deal with. At least we can divide the haystack into many parts and search through them one by one. For large projects with independent modules, using isolation methods is often the last resort for dealing with deeply hidden bugs.

If the problem occurs intermittently, we need to find a way to reproduce it and document the entire process that leads to its reproduction so that we can use these conditions to reproduce the problem next time. If you are confident that you can reproduce the problem using the recorded conditions, then we can proceed to isolate the issue.

How to isolate it?

We can use <code>#ifdef</code> to disable some code that may not be related to the problem, minimizing the system to the point where the problem can still be reproduced. If the problem still cannot be located, it may be necessary to open the "toolbox." You can try using ICE or data monitors to observe changes in a suspicious variable; you can use tracing tools to obtain function call information, including parameter passing; and check for memory crashes and stack overflow issues.
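A minimal sketch of this kind of isolation is shown below. The feature macros ENABLE_NETWORK and ENABLE_LOGGING and the module functions are hypothetical. In a real project the macros would more likely come from the build system than from the source file.

```c
/* Hypothetical feature switches: comment one out (or drop it from the
 * compiler's -D flags) to remove that module from the build. */
#define ENABLE_NETWORK
/* #define ENABLE_LOGGING */   /* disabled while hunting the bug */

int modules_started = 0;

#ifdef ENABLE_NETWORK
void network_start(void) { modules_started++; }
#endif

#ifdef ENABLE_LOGGING
void logging_start(void) { modules_started++; }
#endif

/* Starts only the modules that are compiled in. */
void system_start(void)
{
#ifdef ENABLE_NETWORK
    network_start();
#endif
#ifdef ENABLE_LOGGING
    logging_start();
#endif
}
```

If the problem still reproduces with logging compiled out, that module is probably not involved, and the next one can be disabled in the same way.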

6

<strong>Retreat to Advance</strong>

To avoid getting lost in the forest, hunters often leave marks on trees so that if they ever get lost, they can find their way back using these markers. Tracking changes made to past code can be very helpful for debugging when issues arise in the future.

If one day the program you recently modified crashes after running for a long time, your first reaction will be to wonder what you changed, since the previous version worked fine. So how do you find the differences between this version and the last? A source code control system (SCS), also known as a version control system (VCS), can effectively solve this problem.

Check out the previous version and compare it with the current test version. You can use the diff tool built into your SCS/VCS (CVS, for example) or a more powerful comparison tool such as Beyond Compare or ExamDiff. The comparison lets you list every changed line and analyze all potentially problematic code.

7

<strong>Ensure the Completeness of Testing</strong>

How do you know how comprehensive your testing is? Coverage testing can answer this question. Coverage testing tools tell you which code the CPU has actually executed. They often reveal that testing has exercised only about 20% to 40% of the code, which can be considered checked, while the rest has never run and may still contain bugs. Coverage tools offer different testing levels, and users can choose a level based on their needs.

Even if you are confident that your unit tests are comprehensive and there is no dead code, coverage tools can still point out some potential issues.

Consider the following code:
if(i >= 0 && (almostAlwaysZero == 0 || (last = i)))
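Because || short-circuits, the assignment last = i runs only when i >= 0 and almostAlwaysZero is non-zero. In the common case where almostAlwaysZero is 0, the assignment is skipped, which may not be what you expect.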

Such issues can be easily discovered through the conditional testing functionality of coverage tools. In summary, coverage testing is very helpful for improving code quality.
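The behavior of that condition can be checked directly. The sketch below wraps the expression in a hypothetical helper, last_after_check, which returns the value of last after the condition has been evaluated; the sentinel value -1 is an assumption used to mean "never assigned".

```c
/* Evaluates the condition from the text and reports what happened to last. */
int last_after_check(int i, int almostAlwaysZero)
{
    int last = -1;   /* sentinel: "never assigned" */

    if (i >= 0 && (almostAlwaysZero == 0 || (last = i))) {
        /* the body is irrelevant here; only the side effect matters */
    }
    return last;
}
```

With almostAlwaysZero equal to 0, the right-hand side of || is never evaluated and last keeps its sentinel value. Only a test case with a non-zero almostAlwaysZero and a non-negative i exercises the assignment, and this is exactly the untested condition a coverage tool would flag.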

8

<strong>Improving Code Quality Means Saving Time</strong>

Research shows that over 80% of software development time is spent on the following aspects: debugging one's own code (unit testing), debugging one's own and other related code (module testing), and debugging the entire system (system testing). Worse yet, you may spend 10 to 200 times longer finding a bug that could have been easily identified at the start.

A small bug can cost you a lot, even if it does not significantly impact the overall system performance; it may affect the visible parts of the system. Therefore, we must cultivate good coding and testing practices to achieve higher code quality, thereby shortening debugging time.

9

<strong>Discover It, Analyze It, Solve It</strong>

There is no universal remedy in this world. No matter how powerful the profiler is, it can sometimes be powerless; no matter how good the memory monitor is, there are times it cannot detect issues; and no matter how useful the coverage tool is, there are areas it cannot cover.

Some deeply hidden problems may not be traceable even with all tools at your disposal. In such cases, what we can do is to discover patterns or anomalies through the external manifestations of these problems or some data outputs. Once any anomaly is detected, it is essential to deeply understand and trace its root cause until it is resolved.

10

<strong>Utilize Beginner's Mindset</strong>

It has been said that "in the beginner's mind there are many possibilities, but in the expert's mind there are few." Sometimes an "expert mindset" overcomplicates simple problems and makes simple systems too complex. When you are stumped by a problem, turn off the computer, go for a walk, and discuss the problem with your friends or even your dog; they might provide unexpected insights.

11

<strong>Conclusion</strong>

Embedded debugging is also an art. As with other arts, success requires wisdom, experience, and knowing how to use your tools. As long as we thoroughly grasp these ten tips, I believe we can succeed in embedded testing.

