Understanding Reliability Design in Embedded System Software

In the more than 40 years since embedded systems first appeared, advances in technology and changing requirements have made software an increasingly important part of them. Today, two products may share identical hardware and differ only in their software (switches and routers, for example).

Embedded systems are used in many fields, each with its own requirements and priorities (industrial control, for example, emphasizes reliability), but the basic requirements are the same everywhere: powerful functionality, stable performance, and reliable operation. These three goals do not reinforce one another, however; they pull against each other.

A system’s functionality, stability, and reliability depend on both its hardware and its software. This article discusses only reliability design for embedded system software, and therefore assumes that the hardware is stable and reliable. Some applications do achieve a reliable product on unreliable hardware through software design (in a USB drive, for example, NAND flash is an unreliable storage medium, yet software design yields a reliable storage device, and this is even more true of hard drives), but that is outside the scope of this article.

The Relationship Between Reliability and Stability

Law 1: Simpler things are easier to make reliable

Compared to a hammer, a mechanical watch is quite complex. If both were dropped from the tenth floor onto ordinary concrete, which is more likely to be damaged?

Of course, at a high enough cost, such as using the best materials and adding a shock absorption system, a mechanical watch can be made to survive the same fall as a hammer. Don’t believe it? Plenty of pilots have fallen from tens of thousands of meters without injury (of course, they had parachutes).

As this shows, simple things are easy to make highly reliable, while making complex things highly reliable costs far more. This general principle also applies to embedded software. If so, why do people still build complex things? That brings us to the second law.

Law 2: More complex things are easier to make stable

I remember that when I first entered college we had military training, and the last exercise was target shooting. The day before, our class was ordered to clean the semi-automatic rifles we would use. I don’t remember the specific model, but it had certainly been produced in the early years after the founding of the People’s Republic of China. Before we started, the instructor warned us: “One person cleans one rifle. Don’t mix up the parts, or the rifles won’t go back together.” In other words, two rifles of the same model could not exchange the same part! The rifles of that era were made with simple tools, so the dimensions and quality of the parts were unstable, while some parts had tight tolerance requirements. Compatible parts therefore had to be selected by hand and assembled into a finished rifle, and parts of the same model were not interchangeable.

Now look at modern firearms: it is normal for 60% of the parts to be interchangeable even between different models. That is partly a matter of design, but it also owes much to precise, complex manufacturing tools that produce parts with stable dimensions and quality.

Embedded system software is the same: our code keeps growing more complex because we want the software to run stably under a wide range of conditions.
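As a rough illustration of this law in code, the sketch below contrasts a trivial sensor read with a more complex one that tolerates a single glitched sample. All names here (reading_simple, median3, reading_stable) and the 12-bit ADC range are assumptions made for the example, not something taken from a real driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed plausible range of a hypothetical 12-bit ADC. */
#define ADC_MIN 0
#define ADC_MAX 4095

/* Simple version: trusts whatever sample arrives. */
int32_t reading_simple(int32_t raw)
{
    return raw;
}

/* Middle value of three samples; a single glitch cannot win. */
int32_t median3(int32_t a, int32_t b, int32_t c)
{
    if ((a <= b && b <= c) || (c <= b && b <= a)) return b;
    if ((b <= a && a <= c) || (c <= a && a <= b)) return a;
    return c;
}

/* More complex version: median filter plus a range check.
 * Returns false and leaves *out untouched if the result is implausible. */
bool reading_stable(const int32_t samples[3], int32_t *out)
{
    assert(samples != NULL && out != NULL);
    int32_t m = median3(samples[0], samples[1], samples[2]);
    if (m < ADC_MIN || m > ADC_MAX) return false;
    *out = m;
    return true;
}
```

The second version has more code and more paths to get wrong (Law 1), but it behaves consistently when one sample is corrupted, which is the kind of stability the added complexity is meant to buy.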

Law 3: Every system has a minimum complexity

An ordinary hammer must have a handle and a head. The simplest handle is probably a cylinder, and so is the simplest head, so the simplest hammer consists of two cylinders; I cannot imagine a simpler one. Making a hammer more complex is easy, and there are many ways to do it, such as casting dragons and phoenixes onto it.

In other words, for a given functionality and stability, every system has a minimum complexity. A hammer’s function is to strike things. For that function alone a plain solid block would do, but it could easily hurt the user’s hand (poor stability), so a handle is needed. The same applies to embedded system software.

Conclusion

From the above three laws, it can be seen that system stability and reliability are in tension: the easiest way to improve reliability is to reduce system complexity, which often reduces system stability. Conversely, increasing system stability can easily reduce system reliability. Achieving high stability and high reliability at the same time comes at considerable cost.

Relationship Between Functionality, Reliability, and Stability

As shown above, a system’s functionality does not stand alone; it is interrelated with, and constrained by, reliability and stability. This relationship is analyzed in detail below.

Law 1: The increase in functionality relies on the increase in complexity

An ordinary hammer can only strike things. Suppose we now add a nail-pulling function. This requires reshaping one end of the head, which clearly makes the hammer harder to manufacture (complexity increases). The hammer can do more, but it is also harder to use and easier to damage (imagine flipping it over and striking something with the nail-pulling end).

As complexity increases, maintaining the same reliability costs more, so functionality and reliability are also in tension.

Law 2: The increase in functionality may reduce the complexity of individual functions

Take the best camera phone currently on the market and compare its photos with those of an ordinary digital camera. The digital camera will almost certainly do better. The reason is that a camera phone, because of its various constraints, cannot make its built-in camera as complex as an ordinary digital camera (the lens is less precise, the flash can only be an LED or a low-grade xenon lamp, and the image sensor has to be simple), so its stability is naturally worse. Embedded software is the same: limits on storage space, the human-machine interface, and so on often mean that the code for each function must be simplified before the functions can be integrated.

As the complexity of an individual function decreases, maintaining the same stability costs more and may even become impossible, so functionality and stability are also in tension.

Conclusion

From the above two laws, it can be seen that a system’s functionality is in tension with its stability and reliability. Having rich functionality together with stability and reliability comes at considerable cost.

Effective Methods to Increase the Reliability and Stability of Embedded System Software

Optimizing System Framework Design Can Improve System Stability and Reliability

For a given level of stability and reliability, a system theoretically has a minimum complexity, but in practice that minimum is impossible to reach. In real work, complexity is often added the way flowers are carved on a hammer: it not only fails to improve system stability but, done poorly, can even reduce it.

Increasing system complexity makes it more difficult to maintain existing reliability and does not help improve system reliability.

If one wants to improve system stability and reliability at a relatively low cost, a good method is to reduce unnecessary complexity. The factor that most affects system complexity is the system framework; a good system framework can suppress unnecessary increases in system complexity and minimize the impact on existing functional modules when system functionality changes.

This lowers the cost of improving system stability and reliability, which indirectly improves both.
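One common way a framework can keep a functional change from disturbing existing modules is a fixed module interface plus a registration table. The sketch below is only a minimal illustration under that assumption; module_t, the led and uart stand-in modules, and the framework_* functions are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical module interface: every feature plugs in through it. */
typedef struct {
    const char *name;
    int  (*init)(void);  /* returns 0 on success */
    void (*poll)(void);  /* called once per main-loop pass */
} module_t;

/* Two stand-in modules; the counters make their activity observable. */
int led_polls;
int uart_polls;
int  led_init(void)  { led_polls = 0;  return 0; }
void led_poll(void)  { led_polls++; }
int  uart_init(void) { uart_polls = 0; return 0; }
void uart_poll(void) { uart_polls++; }

/* Adding a feature means adding one row here; existing rows stay untouched. */
const module_t modules[] = {
    { "led",  led_init,  led_poll  },
    { "uart", uart_init, uart_poll },
};
#define MODULE_COUNT (sizeof modules / sizeof modules[0])

/* Initialise every module; returns the number of modules that failed. */
int framework_init(void)
{
    int failures = 0;
    for (size_t i = 0; i < MODULE_COUNT; i++) {
        assert(modules[i].init != NULL && modules[i].poll != NULL);
        if (modules[i].init() != 0) failures++;
    }
    return failures;
}

/* One pass of the main loop: give every module a turn. */
void framework_poll_once(void)
{
    for (size_t i = 0; i < MODULE_COUNT; i++) modules[i].poll();
}
```

A main loop would call framework_init() once and then framework_poll_once() repeatedly. The point of the design is that the loop and the existing modules do not change when a new row is added to the table.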

Stability and Reliability Come from Strict Testing

Humans can never fully understand the world, so no designer can anticipate every situation. Stability and reliability therefore cannot simply be claimed, nor can they be established by analyzing the system design alone.

The second step toward stability and reliability is strict testing, first by the designers and later by a third party. If testing finds a problem, the design must be modified and retested, and this cycle repeats until no problems are found over a defined period of testing.
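The "retest until nothing fails for a defined period" rule can be sketched as a loop that counts consecutive clean runs and starts over after any failure. This is an illustrative sketch only: soak_until_clean and the stand-in test flaky_until_fixed are hypothetical, and a real process would involve a design fix between the failure and the retest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A test takes the run number and reports pass (true) or fail (false). */
typedef bool (*test_fn)(unsigned run);

/* Repeat the test until it passes required_streak times in a row.
 * Returns the run number at which the streak was completed,
 * or 0 if max_runs was exhausted first. */
unsigned soak_until_clean(test_fn test, unsigned required_streak, unsigned max_runs)
{
    assert(test != NULL && required_streak > 0);
    unsigned streak = 0;
    for (unsigned run = 1; run <= max_runs; run++) {
        if (test(run)) {
            if (++streak == required_streak) return run;
        } else {
            streak = 0;  /* a failure means fix and retest: count again from zero */
        }
    }
    return 0;
}

/* Hypothetical stand-in for a real test: fails only on runs 3 and 7. */
bool flaky_until_fixed(unsigned run)
{
    return run != 3 && run != 7;
}
```

With this stand-in, soak_until_clean(flaky_until_fixed, 5, 100) completes its streak on run 12, because the failures on runs 3 and 7 each reset the count.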

Stability and Reliability Depend on Time Validation

A product goes through rigorous internal testing and small-batch trial production, is then given to friendly customers to use (external testing), and only after that reaches the market in volume. Even so, world-class companies still suffer large-scale product recalls. Why?

As mentioned earlier, humans can never fully understand the world, so even the strictest testing cannot simulate every situation that arises in actual use. When users’ environments and usage differ from those covered in testing, latent instability or unreliability in the product is exposed. If the defect is critical, the product must be recalled; if not, the design must still be improved. This cycle repeats, and a system that has been used widely and for a long time without needing improvement can be considered stable and reliable.

Stability and Reliability Come from Professionalism

In ancient times, if you and a professional blacksmith each made hammers, whose output and quality would be more stable and reliable? Clearly the blacksmith’s.

Why does professionalism lead to stability and reliability?

First and most importantly, professionals have already invested heavily in their field to improve the stability and reliability of their systems (otherwise they would not be professionals). They no longer share a starting line with amateurs, and amateurs cannot overtake them in a short time.

Secondly, they are very familiar with the situations in their field, and their testing methods align closely with actual conditions, enhancing stability and reliability.

Thirdly, they can use proven systems as a foundation for new systems, or even directly use old systems, limiting the uncontrollable increase in complexity, ensuring stability and reliability at a relatively low cost.
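A small-scale analogy in code is to build a new function on a long-proven library routine instead of writing a new one. The sketch below computes a median with the C standard library’s qsort rather than a hand-written sort; median_i32 and cmp_i32 are hypothetical helper names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Comparator for qsort: negative, zero, or positive, without overflow. */
static int cmp_i32(const void *a, const void *b)
{
    int32_t x = *(const int32_t *)a;
    int32_t y = *(const int32_t *)b;
    return (x > y) - (x < y);
}

/* Sorts v in place using the standard library, then returns the
 * middle element (the upper middle when n is even). */
int32_t median_i32(int32_t *v, size_t n)
{
    assert(v != NULL && n > 0);
    qsort(v, n, sizeof v[0], cmp_i32);
    return v[n / 2];
}
```

The only new logic is the comparator and the index, so the new complexity that has to be tested is small; the sorting itself rests on code that has already stood the test of time.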

Conclusion: Professional Division of Labor is the Fastest and Most Economical Way to Improve Embedded System Software

With the development of technology and social progress, users now expect embedded systems to be powerful, stable, and reliable. A powerful and stable system is comparatively complex, but not all complexity significantly harms system reliability: a reliable module that has stood the test of time has only a minimal negative impact.

However, a powerful system often involves various knowledge areas, many of which may not fall within one’s professional scope. Developing it oneself to achieve reliability incurs significant costs, which may exceed the benefits. At this point, finding professional partners to provide stable and reliable modules to integrate into one’s system while focusing on one’s professional area is a good approach. This way, the negative impact of each component’s complexity on reliability is minimized, while overall complexity is also easier to control, allowing products to reach the market faster.

Embedded system software is more suited to this model. This is because software is easily replicable, and the reliability, stability, and complexity of the replicas do not change. Professional companies’ software modules are generally used by multiple companies in completely different environments, and their functionality, stability, and reliability have been rigorously tested, posing minimal negative impact on one’s system. Multiple companies using the same modules can also share the costs of software development, resulting in lower direct usage costs. Additionally, professional companies have a deep understanding of their fields, allowing them to assist users in development, further reducing user costs.

Thus, the professional division of labor is the fastest and most economical way to improve embedded system software. (This article is written by Chen Mingji, copyright belongs to Guangzhou Zhiyuan Electronics Co., Ltd.)
