Chaos Engineering Practice: Fault Injection and Monitoring System with Chaos Mesh

1. Let’s clarify what Chaos Engineering is. A few days ago my colleague Wang was grumbling in the break room: “Our system claims to be highly available, but who knows if it can really hold up when something goes wrong?” That hits the nail on the head. Chaos Engineering is essentially the practice of proactively injecting faults to find out where a system breaks. It’s like going to the gym: you deliberately train the weak spots so you don’t get injured later.

Our company has paid for this before: during a major promotion a cache node went down and paralyzed the entire transaction chain. After that we learned our lesson and started using Chaos Mesh for destructive testing, like hiring a sparring partner to deliberately probe the system’s weak spots.

2. Deploying the tool is easier than expected. At first I assumed the documentation would be a slog, but in practice it felt more like building with Lego. Just run a couple of commands (the examples below pin v2.0.0; swap in whatever release you’re actually using):

curl -sSL https://mirrors.chaos-mesh.org/v2.0.0/install.sh | bash
kubectl apply -f https://raw.githubusercontent.com/chaos-mesh/chaos-mesh/v2.0.0/manifests/chaos-mesh.yaml
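Before touching the dashboard, it’s worth confirming the components actually came up. A quick check, assuming the same chaos-mesh namespace that the port-forward command below uses:

kubectl get pods -n chaos-mesh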

Ran both commands and still can’t open the dashboard? Rookie mistake on my part: you also need to set up port forwarding:

kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333 

After entering localhost:2333 in the browser, the interface looks like a high-end task manager.

3. The correct way to cause disruptions. Last month, we conducted a “health check” on our order system, mainly testing these two destructive methods:

1. Simulating network latency. Writing the YAML is like ordering takeout; you just pick the parameters:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: delay-test
spec:
  action: delay
  mode: all                 # required field; apply to every matched pod
  selector:
    namespaces: ["order-service"]
  delay:
    latency: "300ms"
    jitter: "100ms"
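If you prefer the CLI over the dashboard’s execute button, the same manifest can be applied and inspected with kubectl (I’m assuming you saved it as delay-test.yaml):

kubectl apply -f delay-test.yaml
kubectl describe networkchaos delay-test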

After hitting the execute button, the call-chain monitoring immediately turned red. That was the moment we discovered the frontend team had quietly changed the order interface and hadn’t even configured a circuit breaker.

2. Randomly killing database pods. We couldn’t touch the production MySQL, so we tried it out on a cloned test cluster instead:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: db-kill
spec:
  action: pod-kill
  mode: one                 # required field; kill one matched pod at random
  selector:
    namespaces: ["test-mysql"]

The result gave our ops engineer a scare: he was in the middle of checking data when the connection suddenly dropped, and he thought he had brought the machine down himself.

4. Monitoring must be armed to the teeth. After taking a few silent hits, we upgraded the monitoring setup to this combination:

1. Prometheus handles data collection, and the alert rules are more convoluted than high-school math problems. For example, if a service’s CPU stays above 80% for 2 minutes, WeChat can fire off ten alerts (a sketch of such a rule follows this list).

2. Grafana dashboards now show the spike curves during failures, with fluctuations that look like an ECG. After one network-jitter drill, the CTO stared at the ECG-like line and said, “If this really happens online…”

3. Log investigation relies on the ELK stack, combined with Jaeger’s call-chain tracing. We once had a strange issue where the database connection pool suddenly ran dry at midnight; following the error stack and the slow-query logs, we traced it back to a scheduled task whose retry mechanism had never been turned off.

5. Hard-earned practical experience. Here are some pitfalls that textbooks won’t tell you:

Never casually launch node-failure drills in production. Once I forgot to filter by labels and knocked out the gateway node directly, halting the entire business for half an hour. My back was soaked with sweat; now I check three times before every run.
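What “filter by labels” means in practice: scope the experiment’s selector as tightly as possible. A minimal sketch, with hypothetical app/tier labels you would replace with your own:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: scoped-kill
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: ["order-service"]
    labelSelectors:         # hypothetical labels; without these the selector
      app: "order-api"      # matches every pod in the namespace
      tier: "backend"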

Don’t let a drill run longer than the fault-recovery time you expect. For example, our payment system is designed to self-recover within 5 minutes, so we cap the experiment at 3 minutes; otherwise it can set off a domino effect.
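In Chaos Mesh that cap is just the duration field on the experiment spec. A sketch of the 3-minute cutoff; the namespace and latency values here are illustrative:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-delay
spec:
  action: delay
  mode: all
  duration: "3m"            # the experiment is recovered automatically after 3 minutes
  selector:
    namespaces: ["payment-service"]   # illustrative target
  delay:
    latency: "200ms"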

Always save drill snapshots! When a disk-IO stress test once caused system anomalies, replaying the recorded monitoring data was the only way we tracked down a file-handle leak.

Post-drill reviews cannot be a formality. One drill went smoothly, but a real failure the next day revealed we had never tested the service-mesh layer. Now every review summary has to be signed off by every team involved.

6. My personal debugging tips. I’ve discovered a few useful but lesser-known features:

1. Scheduled fault triggers: automatically start network packet loss at 3 AM and just read the report the next morning (a sketch of such a schedule follows this list). But remember to give people a heads-up; one run almost convinced the on-call colleague we were under a hacker attack.

2. Linked drills: first cut the database, then switch the network, and finally kill the container. This combination uncovers plenty of hidden bugs, but ramp it up gradually, like adding heat in Sichuan cooking.

3. Resource-consumption faults are actually the most dangerous. Once, simulating CPU overload in the test cluster crashed the host’s Docker outright. Now whenever we run such tests, we halve the parameters first.
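The scheduled trigger from point 1, roughly as I would write it with Chaos Mesh’s Schedule resource. The cron line is the 3 AM trigger; the target namespace, loss percentage, and duration are placeholders for your own values:

apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: nightly-packet-loss
spec:
  schedule: "0 3 * * *"       # every day at 3 AM
  concurrencyPolicy: Forbid   # don't start a new run while one is still active
  historyLimit: 5
  type: NetworkChaos
  networkChaos:
    action: loss
    mode: all
    duration: "10m"           # auto-recover after 10 minutes
    selector:
      namespaces: ["order-service"]   # placeholder target
    loss:
      loss: "10"              # drop 10% of packets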

7. A veteran’s musings. After three years of doing Chaos Engineering, my biggest takeaways are:

1. True stability lives not in the architecture diagram but in the red-line areas of the drill reports. Ironically, our “strongest” caching service was the one that crashed hardest in the first round of drills.

2. Don’t chase 100% stability; it will burn through the budget. Our rule of thumb: core-link failures must recover within 10 minutes, while non-core services are allowed to run degraded for half an hour.

3. The most valuable findings are the “you thought” moments that drills expose. One service claimed to support retries, but testing showed its retry interval was a fixed 3 seconds, which only made the avalanche effect worse.

Recently I’ve been thinking about using machine learning to analyze historical failure data, and I may try an intelligent drill system next. For now, though, I’m sticking with this Chaos Mesh setup: hands-on fault injection beats any amount of theory.
