

This article is sourced from: Automotive Electronics and Software (ID: aestech)

Introduction to Fault Injection Methods
In key scenarios related to functional safety, intensive testing activities are crucial to ensure that new systems and built-in fault tolerance mechanisms operate as expected. Ensuring that the system operates normally in the event of a failure (Fail Operational) is a more complex issue than traditional testing content. The process of introducing faults into the system to evaluate its behavior and measure the efficiency of fault tolerance mechanisms (i.e., coverage and latency) is called fault injection.
The development of fault injection methods has progressed in tandem with digitalization.
Initially, digital systems only utilized simple hardware systems. Therefore, the first fault injection methods involved injecting physical faults into the target system hardware by assuming simple hardware fault models (such as bit flips or bit freezes) (e.g., using radiation, pin levels, power interference, etc.).
The increasing complexity of hardware has made the use of these physical methods quite difficult, if not impossible, leading to the popularity of a new fault injection method based on runtime simulation of hardware faults through software (Software-Implemented Fault Injection, SWIFI).
As critical systems expand into other application areas, we see the software portions of these systems becoming increasingly complex, which has become a significant cause of system failures. An example is the maiden flight of the Ariane 5 rocket (June 4, 1996). During the test flight, the rocket deviated from its flight path and exploded less than a minute after launch, resulting in a loss of $500 million. The explosion was caused by erroneous data conversion in the software, from a 64-bit floating point to a 16-bit signed integer representation. This vulnerability stemmed from the reuse of software subsystems from previous missions without substantial retesting, as developers believed the task was compatible with the new system.
SWIFI tools are used to inject errors into program states (e.g., data and address registers, stack and heap memory) and program code (e.g., memory areas storing code before or during program execution). Unfortunately, in complex software-intensive systems, SWIFI cannot accurately simulate the effects of real software faults. This is because the number of lines of code used in automobiles has exponentially increased from tens of thousands to hundreds of millions over the past thirty years.
Compared to the first fault injection method, using fault injection to simulate the effects of real software faults (i.e., bugs), known as Software Fault Injection (SFI), is a relatively new approach. In fact, software fault injection involves introducing small changes into the target program code, creating different versions of the program (each version with one injected software fault).
The ISO 26262 standard specifies the use of error detection and handling mechanisms in software, as well as verification through fault injection.
Software fault injection is a hypothetical experiment that can originate at any stage of the software development process, including requirements analysis, design, and coding activities. The goal is to execute the target under a given workload and insert faults into specific software components of the target system. The main objective is to observe the behavior of the system in the presence of injected faults, considering that these faults reproduce reasonable failures that may affect the given software components of the system during runtime.
Key Characteristics of Fault Injection Methods
Faults are the causes of incorrect system state determinations or assumptions, referred to as errors. A fault occurs when erroneous services are provided, that is, when a user or external system perceives an erroneous state.
The accuracy of the results obtained from fault injection activities largely depends on several key characteristics of the experiment, namely:
This refers to the ability of the fault load and workload to represent the real faults and inputs that the system will experience during operation. Representativeness of faults can be achieved by defining a real fault model and accurately reproducing that fault model during the experiment.
This requires that the instruments used during the fault injection process (such as fault insertion and data collection) should not significantly alter the actual behavior of the system. For example, executing additional code to corrupt the software state may lead to intrusiveness.
This refers to the characteristic that ensures statistically equivalent results when executing fault injection activities multiple times using the same program in the same environment. Achieving this characteristic is not easy due to the many sources of uncertainty in computer systems, such as thread scheduling and event timing.
This refers to the effectiveness of fault injection in terms of cost and time. These factors include the time required to implement and set up the fault injection environment, the time to execute experiments, and the time to analyze results. This attribute requires experiments to be supported by automated tools to meet time and budget constraints.
This requires that fault injection techniques or tools can be easily applied to different systems for comparison. The portability of fault injection tools also refers to the ability of the tool to support multiple fault models and be extended with new fault models.
Characteristics of Software Faults
Injecting software faults requires precise definitions of the faults to be injected, which in turn requires a clear understanding and description of software faults. This is not easy to achieve, as software faults are caused by human errors that occur during development, which affect software artifacts in the form of erroneous instructions in the program.
To improve software reliability, several fault classification models have been proposed. Among these, Orthogonal Defect Classification (ODC) is one of the most widely adopted models by researchers and practitioners and has been used in multiple studies to define fault models for software fault injection.
ODC is a framework for classifying software faults, aimed at obtaining metrics and quantitative feedback from the software development process;
Software Fault Injection Techniques
Many SFI techniques and tools have been developed over a period of more than 20 years. Here, we illustrate and discuss this work by distinguishing two basic approaches: injecting fault effects (also known as error injection), where errors are introduced by perturbing the system state, and injecting actual faults, where changes are made to the program code to simulate software faults within the code. The following subsections review software fault injection techniques:
· Data error injection methods, which were the earliest methods based on the hardware fault injection techniques that existed at the time;
· Interface error injection methods, aimed at testing the robustness of components interacting with other components;
· Injecting actual faults, which introduces small fault changes into the program code.
The early methods of injecting fault effects were developed in the context of studying hardware faults through SWIFI. SWIFI aims to reproduce the effects of hardware faults (such as CPU, bus, and memory faults) by interfering with the state of memory or hardware registers (i.e., errors). According to the following criteria, the SWIFI method replaces the contents of memory locations or registers with corrupted values:
· What to inject. The content of a single bit, byte, or word in a memory location or register has been corrupted. The types of errors are defined based on the analysis of errors caused by electrical or gate-level faults. Common types of errors include replacing bits with fixed values (stuck-at-0 and stuck-at-1 faults) or inverted values (bit flips).
· Where to inject. Due to the large number of memory locations, errors injected into memory typically target a subset of locations. Injection can focus on randomly selected locations within specific memory areas (e.g., stack, heap, global data) or user-selected locations (e.g., specific variables in memory). Errors injected into registers can target those that are accessible via software (e.g., data and address registers).
· When to inject. Error injection may be time or event-related. In the former case, errors are injected after a given experimental time, which is selected by the user or based on a probability distribution. In the latter case, errors are injected when a specific event occurs during execution, such as during the first access or every access to the target location. Three types of hardware faults can be simulated: transient faults (i.e., occasional faults), intermittent faults (i.e., repeated faults), and permanent faults.
It is worth noting that hardware errors injected by SWIFI tools can be injected into program states (e.g., data and address registers, stack and heap memory) and program code (e.g., memory areas storing code before or during program execution). This is an important distinction from software fault injection: corruption in the program state aims to reflect the effects of software faults, i.e., errors caused by executing erroneous programs, such as erroneous pointers, flags, or control flow, which SWIFI tools can directly introduce; conversely, faults in the program code aim to reflect actual software faults within the code.
4.2 Interface Error Injection
Injecting errors at input parameters aims to simulate the effects of external faults generated outside the target, including the impacts of software faults in external software components, and to evaluate the target’s ability to detect and handle corrupted inputs. Similarly, the corruption of output values is used to simulate the outputs of faulty components and can be used to assess the impact of faults on the rest of the system.
Faults in input parameters may reveal defects in the design and implementation of the target’s error detection and recovery mechanisms (e.g., input processing code). It is often employed in robustness testing, which evaluates the extent to which a system or component can operate correctly in the presence of invalid inputs or stressful environmental conditions. It should be noted that the objectives of robustness testing and interface error injection differ from functional testing techniques, such as black-box testing: robustness testing aims to assess the robust behavior of software modules in the face of invalid inputs (e.g., avoiding process crashes or generating warning signals), which is unrelated to the functional correctness of the target.
Interface error injection can be performed in two ways. The first method is based on test drivers linked to the target component (e.g., programs using APIs exported by the target) and executes the program by submitting invalid inputs. This method is similar to unit testing, but in this case, robustness, rather than functional correctness, is evaluated. The second method involves intercepting and corrupting the interactions between the target and the rest of the system, that is, triggering an interceptor program when calling the target component and modifying the original input to introduce corrupted input. In this case, the target component is tested in the context of the entire system integrating the target. This method is similar to SWIFI, as the original data flowing through the system (in this case, interface inputs) is replaced by corrupted data.
In interface error injection experiments, among several input parameters and several calls to the target API occurring during the experiment, typically only one input parameter and one call are corrupted. Common methods for generating invalid input values include three approaches:
· Fuzzing: The original value is replaced with a randomly generated value.
· Bit Flipping: The corrupted value is generated by flipping one or more bits of the original value.
· Data Type-Based Injection: The original value is replaced with an invalid value selected based on the type of the corrupted input parameter, where the type is derived from the API exported by the target. This method defines a pool of invalid values for each data type, selected from analysis of the type domain (e.g., “NULL” in the case of C pointers).
4.3 Injecting Code Changes
The previous subsections primarily discussed simulating software faults by injecting fault effects (i.e., errors) using the SWIFI method. A public issue with these methods is the representativeness of injected errors (such as bit flips), which may not necessarily match the errors generated by software faults.
To address the representativeness issue, recent research on SFI has focused on injecting errors into program code (i.e., code changes). Injecting code changes can simulate real software faults, as the injected faults produce errors and failures similar to those generated by real software faults. Generally, errors can be injected into the code storage area of the process or the binary executable file using SWIFI. However, it should be noted that thorough testing of these programs requires injecting software faults within a limited scope, necessitating specialized tools and techniques specifically for software fault injection.
When selecting methods for the system, the characteristics of the discussed fault injection methods should be considered.
Error injection is typically used to evaluate the robustness of individual components and improve error handling in specific parts of the code. The main reason is that error injection allows experimentation on specific parts of the system, as it can assess the impact of errors on specific component interfaces or program variables. In fact, error injection does not require waiting for errors to be generated and propagated to specific parts of the program state being evaluated. Moreover, since error injection can be applied to individual components, it can be executed early in the software verification process.
Conversely, injecting code changes aims to quantitatively evaluate and compare fault-tolerant systems as a whole across alternative design choices. Code changes are better suited for these objectives because they are based on representative models of software faults and closely simulate the behavior of faulty software. This is an important requirement for quantitative evaluation and comparison, as it considers the relative probabilities of fault occurrences to reflect the behavior exhibited by the system during operation. This makes injecting code changes more suitable for the later stages of software verification, when system components have already been integrated, and the goal of developers is to assess the expected fault tolerance of the system during its operational lifespan (and derived metrics such as availability).
– End –






Disclaimer:
Any works indicated as “Source: XXX (non-Zhiche Technology)” in this public account are reprinted from other media, with the purpose of conveying and sharing more information, and do not represent this platform’s endorsement of their views or responsibility for their authenticity. Copyright belongs to the original author, and if there is any infringement, please contact us for deletion.