Fault Injection: An Effective Method for Testing Observability Maturity

As technologies such as cloud-native architecture and microservices make enterprises more competitive, they also make systems more complex. In increasingly complex systems the root cause of a failure is hard to identify, and most troubleshooting time ends up being spent on locating the problem. A clear understanding of what is happening inside the system is a prerequisite for locating problems. How to monitor a system and how to obtain the operational status of large-scale systems have therefore become new challenges, which in turn drive the development of the observability field.

The Goal of Observability

Many mature enterprises have already built monitoring systems such as APM and NPM, as well as trace and log analysis systems. Some newly established enterprises, however, may still be in the early stages of building observability.

So, do the requirements for observability differ for enterprises and technical teams at different stages?

Overall, observability represents the current ability to gain insight into the system. The higher the maturity of observability, the deeper and more complete that insight is, and the more quickly and accurately the root cause of a discovered issue can be identified. Therefore, regardless of an enterprise’s stage or its current level of observability capability, the goals for building observability should be the same.

The specific goals include:

More comprehensive data collection
More effective correlation of various types of data
Faster and automated confirmation of root causes of issues

The differences in observability maturity among enterprises mainly show up in how far each of these goals has been achieved.

Observability Maturity

To help enterprises build observability and to measure and evaluate their current level, many organizations and companies have released observability maturity models. This article uses the “2023 Observability Maturity Model White Paper”, jointly published by the Longxia Community and the Stability Assurance Laboratory of the China Academy of Information and Communications Technology, as an example. The model is a framework for measuring and evaluating the internal observability of enterprise software systems, and for reporting the maturity level of an enterprise’s observability system.

This model includes five levels:

Level 1: Monitoring. Determine whether system components are functioning normally as expected.
Level 2: Basic Observability. Determine why the system is not working.
Level 3: Causal Observability. Find the root cause of the problem, determine its impact, and prevent it from happening again.
Level 4: Proactive Observability. Automatically identify the root causes of problems, automate the response, and intelligently predict and prevent abnormal risks from developing into failures.
Level 5: Business Observability. Determine the impact on the business, how to reduce costs, increase business revenue, improve conversion rates, and assist business decision-making.

The higher the maturity, the more a team can discover and fix problems automatically from the right data, and even identify and prevent issues proactively. Put simply, the more failures are discovered, or even prevented, by observability tools, the higher the maturity. If many problems are still first reported through customer service or other channels, the maturity is insufficient.
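The rule of thumb above can be turned into a rough number. The following is a minimal sketch, assuming each incident is labeled with the channel that first noticed it; the channel names and the sample data are hypothetical and are not taken from the white paper.

```python
from collections import Counter

# Hypothetical labels for channels that count as "observability tools".
TOOL_CHANNELS = {"alert", "dashboard", "prediction"}


def detection_ratio(incident_channels):
    """Return the share of incidents first noticed by observability tooling."""
    if not incident_channels:
        return 0.0
    counts = Counter(incident_channels)
    by_tools = sum(counts[channel] for channel in TOOL_CHANNELS)
    return by_tools / len(incident_channels)


# Illustrative data only, not real measurements.
incidents = ["alert", "customer_service", "alert", "dashboard", "customer_service"]
ratio = detection_ratio(incidents)
print(f"{ratio:.0%} of incidents were first seen by observability tools")
```

A ratio that stays low, with many incidents arriving through customer service, would suggest that the team is still at the lower maturity levels.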

Using Fault Injection to Test Observability Maturity

What is Fault Injection

Chaos engineering is a methodology, and at its core is fault injection. Simply put, predictable faults are injected into an application system under various environments and conditions to verify the application’s service quality and stability when it faces those faults.

Fault Injection as an Effective Standard for Measuring Observability Construction Quality

The most direct way to measure the maturity and quality of observability construction in a production environment is to assess how many faults are discovered or even prevented by observability tools.

This is the most intuitive standard. If a great deal of effort, resources, and manpower has been invested in building a comprehensive observability system, but a large number of faults still cannot be observed, or P0-level faults still occur, no one would agree that the system is mature or high-quality; it is merely a pile of observability data and tools.

Fault injection simulates real faults, so it comes close to real scenarios and is an effective way to evaluate the system’s response and recovery capabilities when facing actual faults. It also shows whether the observability system really does its job in actual problem scenarios, which is where it provides practical value for problem-solving. Leading companies in the industry often use fault injection drills to test the robustness of their systems, continuously raising the proportion of problems that observability tools discover and prevent.

Although fault injection cannot cover all fault issues, current mainstream tools can simulate most common network, system, code, and container problems, effectively helping organizations evaluate, improve, and develop their observability capabilities. Kindling-OriginX also uses this method during product design and development to test capabilities and iterate products.
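One way to make such a drill measurable is to record, for every injected fault, whether the observability tooling noticed it and whether the root cause was located. The following is a minimal sketch under that assumption; the `DrillResult` fields, the fault names, and the sample values are all hypothetical and do not describe how Kindling-OriginX or any particular tool records drills.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DrillResult:
    fault: str                         # name of the injected fault
    category: str                      # e.g. "network", "cpu"
    detected_after_s: Optional[float]  # None if no tool noticed the fault
    root_cause_found: bool             # whether the tooling located the cause


def summarize(results):
    """Count injected, detected, and root-caused faults per category."""
    per_category = {}
    for r in results:
        stats = per_category.setdefault(
            r.category, {"injected": 0, "detected": 0, "root_cause": 0}
        )
        stats["injected"] += 1
        if r.detected_after_s is not None:
            stats["detected"] += 1
            # A root cause only counts if the fault was detected at all.
            if r.root_cause_found:
                stats["root_cause"] += 1
    return per_category


# Illustrative drill log, not real measurements.
results = [
    DrillResult("packet_loss", "network", 12.0, True),
    DrillResult("dns_fault", "network", None, False),
    DrillResult("cpu_contention", "cpu", 45.0, False),
]
summary = summarize(results)
for category, stats in summary.items():
    print(category, stats)
```

Categories where faults are injected but rarely detected, or detected without a root cause, point to the gaps worth closing first.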

Conclusion

If you want to test your own observability capabilities, you can deploy soma-chaos in the target environment, in the same way as the Kindling-OriginX Demo.

The types of faults currently supported by soma-chaos include:

Network Fault Cases. For example, high packet loss rate, high retransmission rate, bandwidth saturation, DNS faults, high TCP connection delay
Storage Fault Cases. For example, high IO delay
CPU Fault Cases. High CPU usage by the code itself, CPU contention from other processes in a shared environment
Memory Fault Cases. High frequency of FULL GC, memory contention from other processes in a shared environment
Code Fault Cases. Code throwing exceptions leading to error codes being returned, HTTP requests returning error codes

soma-chaos is an open-source collection of fault simulation cases. The project is open-sourced under the Longxia Community System Operations Alliance and comprises Train-Ticket, the business simulation system open-sourced by Fudan University SELab; Chaos-Mesh, the open-source cloud-native chaos engineering platform; and a collection of real fault cases. Organizations and individuals are welcome to contribute fault cases and to discuss fault injection practices or any ideas and questions that arise during use.
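To give a sense of what a network fault case can look like on Chaos-Mesh, the following sketch builds a packet-loss experiment as a Python dictionary and prints it as JSON. It is an illustration only: the field names follow the Chaos Mesh `NetworkChaos` resource as commonly documented and should be checked against the version you run, while the experiment name, namespaces, and label selector are hypothetical. The cases shipped with soma-chaos may be defined differently.

```python
import json


def network_loss_manifest(name, namespace, target_namespace, target_labels,
                          loss_percent, duration):
    """Build a NetworkChaos manifest that drops a share of packets."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "loss",   # packet-loss fault
            "mode": "all",      # apply to every pod that matches the selector
            "selector": {
                "namespaces": [target_namespace],
                "labelSelectors": target_labels,
            },
            # Loss percentage is passed as a string.
            "loss": {"loss": str(loss_percent), "correlation": "0"},
            "duration": duration,
        },
    }


# Hypothetical names: adjust to the services in your own environment.
manifest = network_loss_manifest(
    name="demo-packet-loss",
    namespace="chaos-testing",
    target_namespace="train-ticket",
    target_labels={"app": "example-service"},
    loss_percent=30,
    duration="60s",
)
print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed manifest can be saved to a file and applied to a test cluster. Whether the resulting packet loss shows up in alerts and traces is then a direct test of the observability system.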

About Kindling-OriginX

Kindling-OriginX, through technologies such as eBPF and TraceProfiling, can locate not only system-level faults such as network or storage issues but also application-level faults. Let’s take a closer look at its fault localization capabilities at different levels:

System-Level Faults: Kindling-OriginX utilizes eBPF technology to access and analyze kernel-level metrics. This is crucial for diagnosing system-level issues such as network or storage problems, which are often not captured by traditional monitoring systems. By providing deep insights into kernel behavior, Kindling-OriginX is well-suited for identifying and resolving such faults.

Application-Level Faults: The TraceProfiling technology in Kindling-OriginX is particularly suitable for troubleshooting application-level faults. It can accurately capture every call in the application, restoring the execution process of user requests completely by linking thread execution with the tracing system. This function is crucial for diagnosing application-level issues, such as specific request problems, performance bottlenecks, or code execution errors. By associating thread execution details with application behavior, Kindling-OriginX can effectively pinpoint application-level issues.

Integration of Metrics and Logs: Kindling-OriginX aggregates and analyzes data from various sources (including metrics and logs), enhancing its ability to resolve application-level faults. This comprehensive perspective allows for more thorough analysis, enabling it to identify complex issues across systems and applications.

Kindling-OriginX is not limited to solving system-level faults. It employs advanced approaches, utilizing eBPF for deep system insights and TraceProfiling for detailed application-level analysis, enabling it to effectively identify and resolve both system and application-level issues. By integrating various data types, it further supports its capability to handle a wide range of faults, making it a versatile tool for fault diagnosis and resolution in complex IT environments.
