Xiaomi’s Practice of Fault Injection Platform Based on Chaosblade


Source: Xiaomi Technology
Author: Li Qianming – Big Data SRE
Currently, domestic practices lean towards fault testing: conducting fault injection experiments in specific scenarios and verifying whether expectations are met. The risks of this type of testing are relatively controllable, but the downside is that it does not use fault injection experiments to explore a broader range of scenarios and expose more latent issues. This article shares Xiaomi's practical experience with fault injection and platform construction.

1. Background

Internet applications are now everywhere and their user bases are growing rapidly. As people's dependence on internet services deepens, so do their demands for availability and user experience. How, then, can we ensure that a service remains stable, uninterrupted, and reliable throughout its operation?
Consider a financial product: an outage could cause significant losses. The system architecture and service logic of financial products are complex, so the first people who come to mind are testing engineers, who validate service stability through unit testing, integration testing, performance testing, and more. Yet even these measures are far from sufficient, because errors can occur at any time and in any form, especially in distributed systems. This is why we need to introduce Chaos Engineering.

2. Introduction to Chaos Engineering

Chaos Engineering is a methodology that involves conducting experiments on system infrastructure to proactively identify vulnerabilities in the system. It was first proposed by Netflix and related teams. It aims to eliminate failures in their infancy, meaning identifying them before they cause interruptions. By actively creating failures, we can test how the system behaves under various pressures, identify and fix fault issues, and avoid severe consequences. In 2012, Netflix open-sourced Chaos Monkey. Today, many companies (including Google, Amazon, IBM, Nike, etc.) adopt some form of chaos engineering to enhance the reliability of modern architectures.
2.1 Differences Between Chaos Engineering and Fault Testing
Similarities: Both are introduced based on fault injection.
Differences:
  • Chaos Engineering is a practice that generates new information, whereas fault testing is a specific method for testing one known situation.
  • Fault testing runs injection experiments and verifies expectations in specific scenarios, while chaos engineering experiments validate the system's “steady state” across a much broader range of scenarios.
  • Chaos Engineering recommends conducting experiments in production environments.
Chaos Engineering is particularly suitable for revealing unknown weaknesses in production systems, but if it is determined that chaos engineering experiments will cause serious issues in the system, then running such experiments is meaningless.
Currently, the practices of relevant domestic companies lean towards fault testing: implementing fault injection experiments in specific scenarios to verify whether expectations are met. The risks of this type of testing are relatively controllable; however, it does not use fault injection experiments to explore a broader range of scenarios and expose more latent issues, and the outcomes depend heavily on the experience of the implementer. Of course, we believe this is a necessary stage on the road to achieving the goals of chaos engineering.
The expectations and verifications of fault testing differ from the steady-state behavior defined by chaos engineering, and the difference fundamentally stems from organizational form. In 2014, the Netflix team created a new role, the Chaos Engineer, and began promoting it within the engineering community. Most companies in China, however, still have no dedicated position for implementing chaos engineering, and differences in project goals, business scenarios, team structure, and implementation methods mean there is no standard definition of steady-state behavior. We, too, currently have no Chaos Engineer. So, given different definition standards, organizational forms, and stages of maturity, our adoption of chaos engineering experiments also leans towards fault testing.

3. Principles of Chaos Engineering

The following principles describe the ideal way to apply chaos engineering: the more closely an implementation matches them, the greater our confidence in running large-scale distributed systems.
3.1 Build Hypotheses Around Steady State Behavior
Focus on the measurable outputs of the system rather than its internal attributes. Measure these outputs over a short window and treat them as a representation of the system's steady state. Overall throughput, error rate, latency percentiles, and so on can all serve as key indicators of steady-state behavior. By focusing on systemic behavior patterns during experiments, chaos engineering verifies that the system works, without needing to decipher how it works.
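To make this concrete, here is a minimal sketch of turning raw request samples into the steady-state indicators named above (throughput, error rate, latency percentile). The sample format and function name are illustrative assumptions, not the platform's actual code.

```python
def steady_state(samples, window_seconds):
    """Compute steady-state indicators from request samples.

    samples: list of (ok: bool, latency_ms: float) observed in the window.
    """
    total = len(samples)
    errors = sum(1 for ok, _ in samples if not ok)
    latencies = sorted(lat for _, lat in samples)
    # nearest-rank p99 latency; 0.0 if the window held no samples
    p99 = latencies[max(0, int(len(latencies) * 0.99) - 1)] if latencies else 0.0
    return {
        "throughput_qps": total / window_seconds,
        "error_rate": errors / total if total else 0.0,
        "p99_latency_ms": p99,
    }

# Four requests observed in a 2-second window, one of them failed:
metrics = steady_state([(True, 12.0), (True, 15.0), (False, 250.0), (True, 11.0)], 2)
print(metrics)  # throughput_qps 2.0, error_rate 0.25, p99_latency_ms 15.0
```

An experiment then compares these numbers during injection against a baseline taken beforehand.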
3.2 Diversify Real-World Events
Chaos variables should directly reflect real-world events. We should consider all kinds of faults, related or not: any event that could disrupt the steady state should be regarded as a potential variable in chaos experiments. Because our businesses are diverse and our system architectures highly complex, many types of faults can occur; we have prioritized analyzing P1 and P2 faults, drawing fault portraits from the perspectives of the IaaS, PaaS, and SaaS layers, as shown below:
[Figure: fault portrait across the IaaS, PaaS, and SaaS layers]
3.3 Run Experiments in Production Environments
From the perspective of functional fault testing, fault injection in non-production environments can meet expectations, which is why the earliest strong/weak dependency tests were completed in daily (test) environments. However, system behavior varies with environment and traffic patterns, so to guarantee that experiments are authentic and relevant to the currently deployed system, the recommended approach is still to run them in production. That said, many companies have not yet reached the point of experimenting in production.
3.4 Continuously Automate Experiment Execution
Replace manual experiments with a fault injection platform that automates experiment orchestration and analysis.
3.5 Minimize Blast Radius
Chaos engineering recommends experimenting in online environments, which carries the risk of causing, or even amplifying, system anomalies. In practice, it is best to contain the impact through technical means: stopping injections promptly, rate limiting, switching traffic, isolating data, and other methods driven by the system's steady-state indicators to reduce the impact on the business.
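One way to limit the blast radius is an automatic guard that watches a steady-state indicator during the experiment and stops the injection as soon as it drifts past an allowed threshold. The sketch below assumes hypothetical `read_error_rate` and `stop_injection` callbacks standing in for whatever the platform uses to poll metrics and destroy an experiment.

```python
def guarded_experiment(read_error_rate, stop_injection,
                       max_error_rate=0.05, checks=10):
    """Poll an indicator during injection; abort on threshold breach.

    Returns True if the experiment stayed within bounds, False if aborted.
    """
    for _ in range(checks):
        if read_error_rate() > max_error_rate:
            stop_injection()  # e.g. destroy the experiment, switch traffic
            return False
    return True

# Usage: abort as soon as the observed error rate breaches 5%.
readings = iter([0.01, 0.02, 0.09])
aborted = []
ok = guarded_experiment(lambda: next(readings), lambda: aborted.append(True))
print(ok, aborted)  # False [True]
```

In a real platform the polling loop would sleep between checks and combine several indicators, but the control flow is the same.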

4. Building a Fault Injection Platform

4.1 Platform Objectives
  • Provide a diversified, visualized, and automated fault injection platform.
  • Serve as a unified entry point for various drills and fault tests and verifications.
  • Accumulate and solidify various experimental plans, establishing a baseline for system robustness assessment.
4.2 Functional Objectives
  • Help the business discover more unknown issues affecting business stability.
  • Verify the effectiveness and completeness of business alerts.
  • Validate whether business contingency plans are effective.
4.3 Platform Construction
There are various tools in the industry for simulating faults, each with its own strengths and weaknesses in supported functions and scenarios. Our comparison showed that ChaosBlade supports a rich set of functions and scenarios and has an active community. After fully validating most of its injection capabilities, we chose it as the core module for underlying injection.
For the scenarios supported by chaosblade, please refer to the documentation: https://chaosblade-io.gitbook.io/chaosblade-help-zh-cn/
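As a rough illustration of how a platform can drive these scenarios, here is a minimal Python sketch that shells out to the `blade` CLI. The `blade create` and `blade destroy` sub-commands follow the ChaosBlade documentation linked above; the wrapper functions and JSON handling are illustrative assumptions, not our platform's actual code.

```python
import json
import shlex
import subprocess

def build_blade_command(scenario, **params):
    """Build a `blade create` argv, e.g. for scenario "cpu fullload"."""
    cmd = ["blade", "create", *scenario.split()]
    for key, value in params.items():
        cmd += ["--" + key.replace("_", "-"), str(value)]
    return cmd

def inject(scenario, **params):
    """Run an injection; blade prints JSON whose "result" is the experiment UID."""
    out = subprocess.run(build_blade_command(scenario, **params),
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)["result"]

def destroy(uid):
    """Roll back a running experiment by its UID."""
    subprocess.run(["blade", "destroy", uid], check=True)

# The argv built for two of the scenarios used later in this article:
print(shlex.join(build_blade_command("cpu fullload", cpu_percent=90)))
# blade create cpu fullload --cpu-percent 90
print(shlex.join(build_blade_command("network delay", time=100, interface="eth0")))
# blade create network delay --time 100 --interface eth0
```

The platform's job on top of this is orchestration: choosing scenarios, tracking UIDs, and destroying experiments on stop or timeout.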
The entire fault injection is divided into four stages:
[Figure: the four stages of fault injection]
The platform is built around these four main processes. By category, the platform modules are illustrated in the following diagram:
[Figure: platform modules by category]
Currently, we support fault injection scenarios at the system and network layers, mainly network latency, packet loss, CPU overload, memory overload, and disk IO overload. These scenarios are integrated into the platform in a hot-pluggable way, allowing flexible adjustment and expansion of platform scenarios.
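The "hot-pluggable" idea can be sketched as a scenario registry: each scenario registers itself under a name, so the platform can list, add, or swap injection scenarios without touching the core. The registry, decorator, and scenario names are illustrative assumptions; the argv each scenario builds follows the ChaosBlade CLI documentation.

```python
SCENARIOS = {}

def scenario(name):
    """Decorator registering a scenario builder under a name."""
    def register(build_fn):
        SCENARIOS[name] = build_fn
        return build_fn
    return register

@scenario("network_delay")
def network_delay(time_ms=100, interface="eth0"):
    return ["blade", "create", "network", "delay",
            "--time", str(time_ms), "--interface", interface]

@scenario("cpu_load")
def cpu_load(percent=90):
    return ["blade", "create", "cpu", "fullload", "--cpu-percent", str(percent)]

print(sorted(SCENARIOS))         # ['cpu_load', 'network_delay']
print(SCENARIOS["cpu_load"](90))
```

Adding a new scenario (say, disk IO overload) is then just another decorated builder; nothing in the orchestration layer changes.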

5. Practical Drills

For the practical drill, we selected a relatively stable business A (a multi-node task scheduling and target acquisition service) and ran the experiments in business A's testing environment. The business side was eager to see whether such a stable system had any hidden defects.
The overview of the entire process is as follows:
  • Define and measure the system’s “steady state”, accurately define indicators.
  • Create hypotheses.
  • Continuously import online traffic into the testing environment.
  • Simulate possible real-world events.
  • Prove or disprove the hypotheses.
  • Summarize and optimize feedback issues.
5.1 Steady State Indicators
During the operation of the business task system, the following indicators can represent the system’s steady state.
  • Overall CPS of the cluster service.
  • Task submission response delay.
  • Task submission success rate.
  • Task parsing failure rate.
  • Target acquisition success rate.
  • Data consistency and integrity.
5.2 Creating Hypotheses
Hypothesize that certain nodes may experience task submission response delays when performance or network-related issues exist, without affecting the steady state.
Hypothesize that node failures do not affect the steady state (not covered here).
5.3 Importing Online Traffic
Traffic replication and testing environment import will not be covered here.
5.4 Simulating Faults
We simulate using randomly injected system-layer and network-layer scenarios.
The fault scenarios are configured in the fault injection platform as follows:
  • CPU Injection: CPU load reaches 90%.
  • Memory Injection: memory consumption reaches 90%.
  • Network Packet Loss: packet loss rate of 30%.
  • Network Latency: latency increased by 100 ms.
In the platform, we select from the provided system-scenario fault templates and injection targets (in a single injection task, each machine can only receive one scenario; both scenario and machine are chosen at random).
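The "one scenario per machine, chosen at random" constraint can be sketched as a tiny plan generator. The function name, template names, and host names are illustrative assumptions.

```python
import random

def generate_plan(machines, templates, seed=None):
    """Assign each machine exactly one randomly chosen scenario template."""
    rng = random.Random(seed)  # seedable for reproducible plans
    return {machine: rng.choice(templates) for machine in machines}

templates = ["cpu_load_90", "mem_load_90", "net_loss_30", "net_delay_100ms"]
plan = generate_plan(["host-1", "host-2", "host-3"], templates, seed=42)
for host, template in plan.items():
    print(host, "->", template)
```

Because the dict is keyed by machine, no machine can end up with two concurrent scenarios, which mirrors the platform's restriction.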
By selecting the injection templates and targets, an injection plan is generated, as shown in the following diagram:
[Screenshot: the generated injection plan]
Execute the injection, as shown in the following diagram. During the injection process, the operator can end and pause the injection task at any time.
[Screenshot: executing the injection task]
Manually pause the specified target; clicking continue will resume the injection.
5.5 Tracking and Observation
In the overall observation, the following issues were found:
  • Service CPS dropped sharply.
  • Tasks that were submitted successfully showed a significant increase in execution failures.
  • User task submissions experienced varying degrees of delay.
  • Associated service processes exited abnormally.
  • Monitoring is not granular enough: it can detect the changes in CPS and failure rate, but pinpointing the cause is difficult.
The sharp drop in the overall service CPS indicates a severe decline in the service’s processing capacity.
The increased number of task execution failures indicates that tasks were submitted, but the service did not complete them as expected.
User submission delays and timeouts increased.
End the entire injection task.
After the injection task ended, all business indicators gradually returned to normal.
5.6 Proving or Disproving Hypotheses
  • Hypothesis: certain nodes may experience task submission response delays when performance or network issues exist, without affecting the steady state.
  • Proved: task submission response delays did occur, reflected indirectly in the drop in CPS.
  • Disproved: we found unexpected increases in task execution failures and abnormal exits of associated processes, as well as monitoring that was insufficiently granular and detailed.
5.7 Summary and Feedback
The summary and feedback from this injection were recorded in the platform.

6. Other Issues

Q: Can any user perform injections on any machine?
A: No.
The platform currently enforces permissions along multiple dimensions: user employment status, platform user permissions, user-to-machine permissions, and platform-to-machine permissions.
Q: If the target machine becomes overloaded or experiences severe packet loss due to injection, and the SRE team cannot log in, how can the platform terminate normally?
A: When injecting into a target machine, the system whitelists certain conditions. For example, packet loss or network interruption scenarios may prevent other users from logging in, but our task machines remain unaffected. For overload injections, we cap the overload level so the machine never becomes completely inoperable. Users can also tune personalized parameters within the configured limits.
Q: Can the platform automatically determine when to stop injections? If not, will the injection continue indefinitely if the operator forgets to terminate it?
A: Currently, the platform cannot automatically decide when to stop an injection. However, each injection template or custom injection plan carries a maximum injection duration, set when the plan is created; if the operator forgets to terminate manually, the injection task ends automatically once that duration is exceeded.
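The timeout safeguard described in this answer can be sketched as a watchdog: start the injection, wait out the template's maximum duration, then force-stop it. The `inject` and `destroy` callbacks are hypothetical stand-ins for the platform's start and stop actions.

```python
import time

def run_with_timeout(inject, destroy, max_seconds, poll=0.01):
    """Start an injection and guarantee it is destroyed after max_seconds."""
    inject()
    deadline = time.monotonic() + max_seconds
    while time.monotonic() < deadline:
        time.sleep(poll)  # a real platform would also honor manual stop here
    destroy()  # automatic termination when the max duration elapses

events = []
run_with_timeout(lambda: events.append("inject"),
                 lambda: events.append("destroy"),
                 max_seconds=0.05)
print(events)  # ['inject', 'destroy']
```

In production this would run asynchronously per experiment, with the manual stop and pause actions racing the same deadline.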

7. Future Planning

7.1 Functional Planning
  • Gradually improve the fault scenarios depicted in the fault portraits drawn from the perspectives of IaaS, PaaS, and SaaS layers mentioned in this article.
  • Combine SLOs, degradation, and traffic scheduling to minimize the blast radius and prepare for injection experiments in production.
  • Be able to actively probe and map the topology of traffic and service call relationships.