Late-night overtime, just about to head home, when suddenly the production alarms go off! You dig through logs and monitoring, and after two hours of troubleshooting you find it was a cascading failure caused by a timeout in a dependent service. Sigh! Sound familiar? In a microservices architecture, system stability is like a row of dominoes: when one service falls, it can trigger a chain reaction.
We all hope the system is robust enough, but how do we verify that? Wait until a real problem occurs in the production environment to “practice”? The cost is too high! This is where chaos engineering comes into play. Its concept is simple: rather than waiting for failures to find you, proactively create failures to test the system’s resilience.
Why is Go a perfect partner for chaos engineering?
When it comes to building chaos engineering tools, Go is simply a match made in heaven. Why? Go’s concurrency model is perfect for simulating failure scenarios. Want to inject multiple types of faults simultaneously? Goroutines handle it in no time. Need precise control over how long a fault lasts? Timers and contexts make it easy.
Moreover, Kubernetes (K8s) is written in Go. Implementing Pod-level fault injection in Go feels like going back home—familiar and natural. High performance, easy deployment, convenient cross-compilation… it couldn’t be better!
Remember, true high availability is not about preventing failures, but about maintaining service when failures occur. Chaos engineering is the practice of this philosophy.
Pod-Level Fault Injection: Simply put, it’s “precise disruption”
Traditional fault testing is too blunt, directly pulling the network cable or shutting down machines… While direct, it’s not precise. Pod-level fault injection is like “precision medicine”—it only targets specific Pods for “surgery,” while other Pods continue to run normally.
What are the benefits of this approach? You can accurately test the impact of a single service failure on the overall system, rather than taking down all services at once. More importantly, this testing can be conducted in a non-production environment, making it safe and controllable.
Our fault injection module primarily supports the following types of faults:
– CPU stress fault: Cause Pod CPU usage to spike
– Memory leak simulation: Gradually consume Pod memory
– Network latency: Increase communication delay between services
– Network packet loss: Simulate an unstable network environment
– Process crash: Force a process within the container to exit
Code Implementation: Just a few dozen lines of Go code to handle the core logic of fault injection
Let’s see how to implement a simple CPU stress fault injection in Go:
```go
func InjectCPUStress(ctx context.Context, podName, namespace string, cpuPercent int, duration time.Duration) error {
	// Get the k8s client (a helper that builds a *kubernetes.Clientset; implementation omitted)
	clientset, err := getK8sClient()
	if err != nil {
		return fmt.Errorf("failed to get k8s client: %w", err)
	}

	// Build the command to execute: stress-ng loads all cores ("--cpu 0" means
	// every CPU) at the requested percentage for the given duration
	cmd := []string{
		"stress-ng", "--cpu", "0",
		"--cpu-load", fmt.Sprintf("%d", cpuPercent),
		"--timeout", fmt.Sprintf("%ds", int(duration.Seconds())),
	}

	// Target the exec subresource of the Pod
	req := clientset.CoreV1().RESTClient().Post().
		Resource("pods").
		Name(podName).
		Namespace(namespace).
		SubResource("exec")

	// Run the command inside the Pod and return the result
	return executeCommand(req, cmd, ctx)
}
```
Looks simple, right? The core idea is to use the K8s API to execute the fault injection command in the target Pod. Here we used the stress-ng tool to simulate CPU stress, but you can use other methods as well.
The command execution function is also quite straightforward:
```go
func executeCommand(req *rest.Request, cmd []string, ctx context.Context) error {
	option := &v1.PodExecOptions{
		Command: cmd,
		Stdin:   false,
		Stdout:  true,
		Stderr:  true,
		TTY:     false,
	}
	// Attach the exec options to the request — without this, the command is
	// never actually sent (scheme is k8s.io/client-go/kubernetes/scheme)
	req.VersionedParams(option, scheme.ParameterCodec)

	// config is the *rest.Config the clientset was built from (package-level here)
	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		return err
	}

	// Stream the command's output until it finishes or ctx is cancelled
	return exec.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdout: os.Stdout,
		Stderr: os.Stderr,
	})
}
```
Practical Application: Chaos Experiments Make Your System More Robust
Talk is cheap! Let’s conduct a simple chaos experiment: inject a 5-second network delay into the order service and see how the user ordering process is affected. Before the experiment, our system’s average order time was 1.2 seconds. After injecting the fault? Ha! It jumped to 6.5 seconds, and some requests even timed out.
Problems Exposed: We found that after the order service timed out, the inventory service did not roll back correctly, leading to “phantom inventory.” If this were in a production environment, it would be a major bug! Thanks to the chaos experiment, we discovered this issue in advance.
Another example: we injected faults into the database connection pool and found that the system lacked reasonable circuit-breaking and degradation mechanisms. After fixing it, even if the database occasionally misbehaves, core functions remain available, while non-core functions are temporarily degraded.
Remember one principle: actively triggering faults in a controlled environment is always better than passively responding to them in production.
How easy is it to implement this tool in Go? A junior Go developer can set up the basic framework within a week. If you already have a foundational knowledge of K8s, you can do it even faster.
What’s most surprising is the practical results. After our team implemented chaos engineering, the online failure rate decreased by 46%, and the average recovery time was reduced by 38%. Numbers speak for themselves!
Chaos engineering is not about destruction, but about construction. It enables our systems to withstand “small winds and waves” and be capable of facing “huge storms.” Just like practicing martial arts, you have to take hits to improve your skills. Has your system been “hit” today? No? Then hurry up and write a fault injection tool in Go to test it out!