
Why Custom Embedded Systems Are Less Stable Than Industrial Products

Reprinted | Embedded Guest House

I saw a question from a user on Zhihu: Why is the stability of custom-designed embedded systems far lower than that of industrial products?

I think this is a very good question, especially since many small and medium-sized enterprises in China tend to focus more on product functionality than on performance. The functionality is flashy, but stability is often lacking. So let me share my understanding and experience of product stability.

What is Stability?

What this question calls stability is studied in English under the name reliability. Reliability engineering is a sub-discipline of systems engineering that primarily studies the ability of equipment to operate without failure. Reliability describes the ability of a system or component to perform its required functions under stated conditions for a specified period of time.

From a design perspective, a product generally consists of several main parts. A typical embedded system involves mechanical design, hardware design, and software design. Therefore, to discuss the reliability of a product, one must look at each of these aspects. Additionally, why do industrial-grade products perform more reliably?

To delve deeper into this issue, let’s take a look at some relevant terminology and metrics that describe the reliability of a system or component.

What is Reliability Probability?

Reliability is defined as the probability that a device will perform its intended function under specified conditions for a specified period of time. Mathematically, the reliability of a system with respect to failures of type F, written here as $R_F(t)$, is the probability that no failure of type F occurs within time t.

How to understand this? Time t is measured from the moment the system starts working, which is also the moment from which type F failures may occur. $R_F(t)$ therefore characterizes, as a probability, how the system holds up against type F failures over that period.

What is Failure Probability?

Alongside reliability probability there is naturally a failure probability, which is its complement. Writing $R_F(t)$ for the reliability and $Q_F(t)$ for the failure probability with respect to type F failures, the two satisfy the following relationship:

$$R_F(t) + Q_F(t) = 1$$

The concepts above are based on statistical laws for a particular type of failure. They also generalize: with the subscript F dropped, the reliability and failure probability of the system as a whole satisfy the same relationship. A system is composed of different components.

System Failure Rate

When the lifespan of a system follows an exponential distribution, the reliability of the system is $R(t) = e^{-\lambda t}$, where $\lambda$ is the failure rate.
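As a rough illustration of the exponential case, where reliability takes the usual form R(t) = exp(-λt) with λ the failure rate, the sketch below computes the reliability and its complement, the failure probability. The failure rate and operating time are hypothetical values chosen only for the example; they do not come from the article.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """R(t) = exp(-lambda * t), valid for a constant failure rate."""
    return math.exp(-failure_rate * t)


def failure_probability(failure_rate: float, t: float) -> float:
    """Complement of reliability: Q(t) = 1 - R(t)."""
    return 1.0 - reliability(failure_rate, t)


# Hypothetical numbers, for illustration only.
lam = 2e-6      # assumed failure rate, in failures per hour
hours = 10_000  # assumed operating time, in hours

r = reliability(lam, hours)
q = failure_probability(lam, hours)
print(f"R = {r:.4f}, Q = {q:.4f}")
```

Note that the time unit of t must match the unit of the failure rate, otherwise the exponent is meaningless.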

[Figure: bathtub curve, failure rate over the product's lifetime]

This curve should be familiar to many; it is the bathtub curve of products. In the early-life phase of a product, failures are more likely to occur, and the failure rate is relatively high. Don't ask me why; this is a statistical law built on the experience of countless predecessors. For engineering applications, just trust it (theoretical research is another matter, of course). This law also explains why some manufacturers require products to undergo aging (burn-in) tests before leaving the factory: aging tests expose early failures, so the products that pass the screening are less likely to fail at the customer's site.
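One common way to picture the bathtub shape is to add up three hazard (failure rate) terms: a decreasing one for early-life failures, a constant one for random failures, and an increasing one for wear-out. The sketch below does this with Weibull hazard functions. It is only an illustrative model, not something taken from the article, and every shape and scale value in it is made up.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (k / s) * (t / s) ** (k - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t: float) -> float:
    """Sum of three hazard terms; all parameters are hypothetical."""
    early = weibull_hazard(t, shape=0.5, scale=2_000.0)     # shape < 1: decreasing
    random_failures = 1e-4                                   # constant rate
    wear_out = weibull_hazard(t, shape=5.0, scale=50_000.0)  # shape > 1: increasing
    return early + random_failures + wear_out


# Sample the curve at a few operating times (hours, assumed unit).
for hours in (10, 1_000, 20_000, 60_000):
    print(f"t = {hours:>6} h  hazard = {bathtub_hazard(hours):.6f}")
```

In this toy model the hazard is highest at the start, falls to a flat floor, and rises again late in life. An aging test effectively moves the product past the steep early part before it ships.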

Why spend so much time discussing failure rates? Let’s look at this example table:

Automotive Embedded System Components	Failure Rate
Military-grade Microprocessor	0.022
Automotive-grade Microprocessor	0.12
Electric Motor	16.9

The failure rates of different components vary. Therefore, at the design stage, within the constraints of cost, it is essential to choose components with lower failure rates as much as possible.
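To see how component choice feeds into the system figure, here is a minimal sketch that combines the failure rates from the table. It assumes a series system (any component failure brings the system down) with independent components and constant failure rates, in which case the system failure rate is simply the sum of the component rates. That structure is an assumption made for the example, and the source does not state the unit of the table values, so only rates and ratios are computed.

```python
def series_failure_rate(rates: dict[str, float]) -> float:
    """System failure rate of a series system with independent,
    constant-rate components: the sum of the component rates."""
    return sum(rates.values())


# Values copied from the table; the unit is not given in the source.
baseline = {
    "Automotive-grade Microprocessor": 0.12,
    "Electric Motor": 16.9,
}
upgraded = {
    "Military-grade Microprocessor": 0.022,
    "Electric Motor": 16.9,
}

total = series_failure_rate(baseline)
upgraded_total = series_failure_rate(upgraded)

for name, rate in baseline.items():
    print(f"{name}: {rate / total:.1%} of the system failure rate")

improvement = 1.0 - upgraded_total / total
print(f"Upgrading the microprocessor lowers the system rate by {improvement:.2%}")
```

Under these assumptions the motor dominates the total, so it pays to check which component contributes most before spending the cost budget on a lower failure rate elsewhere.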

Failure Modes

The failure modes of different components are not the same. What does this mean? It means that components fail in different ways. For example, a resistor on a circuit board may fail through a short circuit, an open circuit, or parameter drift. Software can have many failure modes, such as stack overflow, RAM data errors, chip bus errors, etc. Each failure mode has its own failure rate. To understand these metrics more deeply, one can refer to IEC 61508 or other equivalent standards.

How to Improve Product Reliability?

So how can we improve product reliability? I think we can generally approach it from these aspects:

Successful Development Process
Successful Project Management
Strict Quality Control

Development Process

This is defined by IEC 61508, under which both software and hardware must adopt the following V&V (verification and validation) development model (note that the diagram in IEC 61508 is somewhat different). Let's briefly introduce this model:

Many development processes are popular at the moment, and agile development models in particular are widely favored. Personally, I do not oppose agile development, but from the perspective of product reliability, I prefer this double V model. In fact, many agile projects can incorporate this double V model into each iteration. The model requires that every step down, from design requirements to architecture to detailed sub-module design, is verified against the level above it, and that validation is ultimately carried out to confirm the design.

This approach is reflected in standards such as IEC 61508 (Functional Safety Level Standard). Let’s briefly look at related standards:

IEC 61508 is a basic functional safety standard applicable to various industries. It defines functional safety as: “A part of the overall safety related to the EUC (equipment under control) and the EUC control system, which depends on the correct operation of E/E/PE safety-related systems, other technology safety-related systems, and external risk reduction facilities.” The basic concept is that any safety-related system must operate correctly or fail in a predictable (safe) manner.

My understanding of the V&V model includes several key points:

It must be a process that is actually executed, not a formality!
The rigorous practices specified by the standards must be followed, ensuring bi-directional traceability from requirements to design and from design to testing.
Reliability must be introduced at the very beginning, when market demands are turned into requirements. Small and medium-sized enterprises often focus on functional realization while neglecting performance and reliability requirements. Note that these reliability requirements are design goals; without goals, how can one systematically produce a quality product? Without a rigorous, systematic process, a company may occasionally produce a high-quality product, but I believe it will find it very difficult to guarantee the overall quality of everything it produces.
Requirement Phase: Reliability needs to be stated in the requirements. Here are some examples:

Environmental requirements, such as temperature, humidity, vibration, etc., should be defined in the requirements phase, along with the relevant test levels. Even if a company's product does not require mandatory certification, stating such requirements forces the corresponding design and test work needed to meet them, and this is what enhances product reliability.
EMC requirements: For instance, conventional immunity test requirements must be clearly stated in the requirements phase.
…..

Design Phase: Reliability starts as requirements and has to be realized through design. Every reliability requirement can be turned into design check items for traceability. For example, hardware has failure mode and effects analysis (FMEA) at the device level; software likewise has corresponding failure modes that can be analyzed for their effects. Based on the failure mode analysis, one can refer to the practices recommended in the standards and take measures at the design level.
Testing Phase: Testing includes verification and validation. In the V&V model, every step down should be verified against the previous step. Each design output should have corresponding verification to back it up, and the final product still requires a validation phase. Another important point is that in industrial-grade product development, design is already involved during the requirements phase.
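The bi-directional traceability mentioned in the key points above can be checked mechanically once requirements, design items, and test cases carry identifiers. The sketch below is a minimal example of such a check; the ID scheme (REQ-, DES-, TST-), the requirement texts, and the data layout are all hypothetical and are not prescribed by IEC 61508 or by the article.

```python
# Hypothetical project data: each design item lists the requirements it
# implements, and each test case lists the design items it verifies.
requirements = {
    "REQ-1": "Operate across the specified temperature range",
    "REQ-2": "Pass the specified EMC immunity tests",
}
design_items = {"DES-1": ["REQ-1"], "DES-2": ["REQ-2"]}
test_cases = {"TST-1": ["DES-1"]}


def untraced(parents: dict, links: dict) -> list[str]:
    """Forward check: parent IDs that no child item refers to."""
    covered = {ref for refs in links.values() for ref in refs}
    return sorted(set(parents) - covered)


def dangling(parents: dict, links: dict) -> list[str]:
    """Backward check: child IDs that refer to an unknown parent."""
    return sorted(
        child for child, refs in links.items()
        if any(ref not in parents for ref in refs)
    )


print("Requirements without design:", untraced(requirements, design_items))
print("Design items without tests:", untraced(design_items, test_cases))
print("Design items with unknown requirements:", dangling(requirements, design_items))
print("Tests with unknown design items:", dangling(design_items, test_cases))
```

In this made-up data set, DES-2 has no test case, which is exactly the kind of gap a traceability review is meant to surface.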

Every company has its own development process, but from the perspective of the product lifecycle and the inherent laws of product reliability, I believe these key points should be considered. There is a great deal of material on development processes and functional safety standards; what I have summarized here is only my personal view.

Successful Project Management

Project management is the process of leading a team to achieve specific goals and meet success criteria within a specified timeframe. The main challenge of project management is to achieve all project objectives within given constraints. This information is usually described in project documentation created at the beginning of the development process. The primary constraints are scope, time, quality, and budget. The secondary challenge is optimizing the allocation of necessary inputs and applying them to meet predetermined objectives.

The development of a highly reliable product is closely tied to effective project management. Project management plays a crucial role throughout the execution of the project, managing the project from initiation, planning, execution, monitoring, to closure.

For the knowledge required and the activities to be undertaken at each stage of project management, refer to the following diagram (source: PMBOK 6th Edition):

Strict Quality Control

A good design released to production without strict production quality control cannot guarantee the quality of the products sold. I won't elaborate on this point, as I don't have much experience with it.

Summary

Returning to the question itself, to obtain a reliable product, I believe that there should be comprehensive requirements and good implementation in product design processes, project management, and production quality control. The significant investment in certifications for industrial-grade products is not without reason. Many of these certifications are specific assessments of reliability.

———— END ————
