Injecting Faults and Debugging with ChaosBlade-Operator in K8S

▌Introduction: A New Tool for Chaos Engineering

ChaosBlade is an open-source chaos engineering toolchain from Alibaba. Its ChaosBlade-Operator project models chaos experiments as Kubernetes custom resources, defined through a CRD (Custom Resource Definition). Because experiments are ordinary Kubernetes resources, they can be created, inspected, and deleted with the same declarative workflow used to deploy applications, which makes fault injection feel as natural as a deployment. Each experiment has a clear lifecycle state, giving cloud-native environments an out-of-the-box chaos engineering solution.

▌Installation Steps

1. Download the Helm chart package and create a namespace

$ wget https://github.com/chaosblade-io/chaosblade-operator/releases/download/v1.7.1/chaosblade-operator-1.7.1.tgz
# Create a dedicated namespace
$ kubectl create namespace chaosblade

2. Install the Operator using Helm

$ helm install chaosblade-operator chaosblade-operator-1.7.1.tgz --namespace chaosblade \
--set blade.repository=private-repo-url \
--set operator.repository=private-repo-url

3. Verify Installation

$ kubectl get pod -n chaosblade | grep chaosblade

Once installed, ChaosBlade-Operator runs a chaosblade-operator Pod, plus one chaosblade-tool Pod on each node. If all of them are in the Running state, the installation succeeded.
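
The check above can be scripted. The following is a minimal sketch, not part of ChaosBlade: all_running is a hypothetical helper that reads `kubectl get pod --no-headers` output and succeeds only when every listed Pod reports Running. It assumes the default column layout, in which STATUS is the third column.

```shell
# all_running: hypothetical helper, not part of ChaosBlade.
# Reads `kubectl get pod --no-headers` output on stdin and returns 0 only if
# at least one Pod is listed and every STATUS (assumed to be column 3) is Running.
all_running() {
  awk '
    NF >= 3 { total++; if ($3 != "Running") bad++ }
    END     { if (total > 0 && bad == 0) exit 0; exit 1 }
  '
}

# Usage against a live cluster (not executed here):
#   kubectl get pod -n chaosblade --no-headers | all_running && echo "install OK"
```

Requiring at least one listed Pod keeps an empty namespace from being reported as a successful installation.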

4. Uninstall the Operator

# Uninstall Helm Chart
$ helm uninstall chaosblade-operator -n chaosblade
# Delete CRD and namespace
$ kubectl delete crd chaosblades.chaosblade.io
$ kubectl delete namespace chaosblade

▌Testing Network Packet Loss

1. Prepare loss-pod-network.yaml

apiVersion: chaosblade.io/v1alpha1
kind: ChaosBlade
metadata:
  name: loss-pod-network
spec:
  experiments:
  - scope: pod
    target: network
    action: loss
    desc: "30% packet loss on eth0"
    matchers:
    - name: names
      value: ["k8s-pod-01-name"]
    - name: percent
      value: ["30"]
    - name: interface
      value: ["eth0"]
    - name: destination-ip
      value: ["172.10.0.0/16"]
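
To repeat the experiment with different targets or loss rates, the manifest can be generated instead of edited by hand. The sketch below is only an illustration: render_loss_blade is a hypothetical helper that prints the same fields as loss-pod-network.yaml, filled in from its arguments.

```shell
# render_loss_blade: hypothetical helper that prints a ChaosBlade manifest
# with the same fields as loss-pod-network.yaml.
# Arguments: POD_NAME PERCENT INTERFACE DESTINATION_CIDR
render_loss_blade() {
  cat <<EOF
apiVersion: chaosblade.io/v1alpha1
kind: ChaosBlade
metadata:
  name: loss-pod-network
spec:
  experiments:
  - scope: pod
    target: network
    action: loss
    desc: "$2% packet loss on $3"
    matchers:
    - name: names
      value: ["$1"]
    - name: percent
      value: ["$2"]
    - name: interface
      value: ["$3"]
    - name: destination-ip
      value: ["$4"]
EOF
}

# Usage against a live cluster (not executed here):
#   render_loss_blade k8s-pod-01-name 30 eth0 172.10.0.0/16 | kubectl apply -f -
```

Called with the values shown in the usage comment, the helper reproduces the manifest above.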

2. Fault Injection

$ kubectl apply -f loss-pod-network.yaml

After applying the configuration, check the experiment status with kubectl get blade loss-pod-network -o json. If the Status field shows Running, the fault has been injected.
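
The status may take a few seconds to reach Running, so polling can be more convenient than a single query. The following sketch uses a hypothetical helper, wait_for_phase, that reruns any command until it prints the wanted value. The field path .status.phase in the usage comment is an assumption about the ChaosBlade status layout and should be confirmed against the JSON output of the command above.

```shell
# wait_for_phase WANT TRIES CMD...
# Hypothetical helper: runs CMD up to TRIES times until it prints WANT.
# Returns 0 on success, 2 if CMD prints "Error", 1 if TRIES is exhausted.
# POLL_INTERVAL (seconds between attempts) defaults to 2.
wait_for_phase() {
  wfp_want=$1
  wfp_tries=$2
  shift 2
  wfp_i=0
  while [ "$wfp_i" -lt "$wfp_tries" ]; do
    wfp_phase=$("$@" 2>/dev/null) || wfp_phase=""
    if [ "$wfp_phase" = "$wfp_want" ]; then return 0; fi
    if [ "$wfp_phase" = "Error" ]; then return 2; fi
    wfp_i=$((wfp_i + 1))
    sleep "${POLL_INTERVAL:-2}"
  done
  return 1
}

# Usage against a live cluster (not executed here); .status.phase is an assumed field path:
#   wait_for_phase Running 30 kubectl get blade loss-pod-network -o jsonpath='{.status.phase}'
```

Returning a distinct code for Error lets a script stop early instead of waiting out the full retry budget.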

Once the fault is in effect, traffic sent from the Pod named k8s-pod-01-name to addresses in the 172.10.0.0/16 subnet sees a packet loss rate of about 30%.
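
One way to confirm the loss rate is to ping a destination in the affected subnet from inside the Pod and read the summary line. The sketch below assumes the container image ships ping; loss_percent is a hypothetical helper and `<target-ip>` is a placeholder for a reachable address in 172.10.0.0/16.

```shell
# loss_percent: hypothetical helper that reads ping output on stdin and prints
# the percentage from its "N% packet loss" summary line.
loss_percent() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage against a live cluster (not executed here; assumes ping exists in the image):
#   kubectl exec k8s-pod-01-name -- ping -c 100 <target-ip> | loss_percent
```

Because packets are dropped at random, the measured value fluctuates around the configured 30%; a larger packet count gives a steadier reading.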

▌Fault Troubleshooting

When the experiment does not work as expected, follow these steps to troubleshoot.

1. Basic Checks

$ kubectl get crd | grep chaosblade  # Confirm CRD is registered
$ kubectl -n chaosblade get pod  # Check Operator running status

2. Experiment Status Diagnosis

$ kubectl describe chaosblade loss-pod-network

Observe the Events field for scheduling information. If the Status field shows Error, check the parameter configuration.

3. Log Tracking

$ kubectl logs -l app=chaosblade-operator -n chaosblade --tail=200

Focus on the “Experiment Execute” log segment for detailed fault injection process information.

4. Kernel-Level Verification

# Get the container ID and its host PID
$ CONTAINER_ID=$(kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].containerID}' | cut -d '/' -f3)
$ PID=$(crictl inspect $CONTAINER_ID | jq .info.pid)
# Enter the network namespace and check the tc rules
$ nsenter -t $PID -n tc -s qdisc show dev eth0
# Verify cgroups configuration (for memory fault injection)
$ cat /sys/fs/cgroup/memory/<container_id>/memory.usage_in_bytes
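
Two of the steps above are plain text processing and can be wrapped in small helpers. The sketch below is illustrative only: both function names are hypothetical, and netem_loss assumes the loss was applied through a netem qdisc, which is how tc normally expresses packet loss.

```shell
# strip_runtime_prefix: turns "containerd://<id>" (or "docker://<id>") into "<id>",
# the same result as the `cut -d '/' -f3` step above.
strip_runtime_prefix() {
  printf '%s\n' "${1##*://}"
}

# netem_loss: reads `tc -s qdisc show` output on stdin and prints the loss
# percentage of every netem qdisc line (prints nothing if no loss rule exists).
netem_loss() {
  sed -n 's/.*netem.* loss \([0-9.]*\)%.*/\1/p'
}

# Usage on the node (not executed here):
#   nsenter -t "$PID" -n tc -s qdisc show dev eth0 | netem_loss
```

If netem_loss prints nothing while the experiment reports Running, the rule was either not created on the expected interface or not expressed through netem.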

5. Side Effect Troubleshooting

# Check for iptables rules that could interfere with the tc rules
$ nsenter -t $PID -n iptables -t nat -L

▌ChaosBlade Principles

1. ChaosBlade-Operator

ChaosBlade-Operator implements fault injection in cloud-native environments using a three-layer architecture.


2. Namespace

Namespaces are a Linux kernel isolation mechanism that gives processes an independent view of system resources, isolating the environments of different process groups. Processes in a namespace can only see the resources that belong to it and cannot directly access resources in other namespaces.

Main Types:

– PID namespace: Isolates process IDs, allowing processes within a container to see only their own process tree.

– Mount namespace: Isolates filesystem mount points, giving containers an independent view of file directories.

– Network namespace: Isolates the network stack, allowing containers to configure independent IP addresses, routing tables, and firewall rules.

– UTS namespace: Isolates hostnames and domain names.

– IPC namespace: Isolates inter-process communication (such as semaphores and message queues).

ChaosBlade uses namespaces to simulate network faults:

– Enter the target Pod’s network namespace (netns) so that the rules affect only that Pod

– Use the tc (Traffic Control) tool to add network rules

– Use iptables/nftables for packet filtering

$ nsenter -t <pid> -n
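
Network namespace isolation can be observed directly through /proc. As a minimal sketch (both helper names are hypothetical), the functions below compare the namespace identifiers of two processes; a container process and a host process normally report different identifiers.

```shell
# netns_id PID: prints the network namespace identifier of a process,
# which is the target of the /proc/<pid>/ns/net symlink (of the form "net:[<inode>]").
netns_id() {
  readlink "/proc/$1/ns/net"
}

# same_netns PID_A PID_B: succeeds only if both processes share a network namespace.
same_netns() {
  sn_a=$(netns_id "$1") && sn_b=$(netns_id "$2") && [ -n "$sn_a" ] && [ "$sn_a" = "$sn_b" ]
}

# Usage on the node (not executed here; reading another process's link may need root):
#   same_netns 1 "$PID" || echo "container has its own network namespace"
```

This explains why nsenter takes a process ID: the namespace to enter is identified through a process that already lives in it.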

3. Cgroups

Cgroups (control groups) are a resource management mechanism provided by the Linux kernel, used to limit, allocate, and monitor resources (such as CPU, memory, I/O, and network bandwidth) for groups of processes. Their core principle is to organize processes into hierarchical groups and assign a resource quota to each control group, ensuring fair use of system resources in a multitasking environment.

Main Subsystems:

– cpu subsystem: Limits CPU time slice allocation

– memory subsystem: Controls memory usage limits

– blkio subsystem: Limits block device I/O

$ cat /sys/fs/cgroup/memory/<container-id>/memory.usage_in_bytes
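
The file name memory.usage_in_bytes belongs to the cgroup v1 layout; on hosts running cgroup v2 the equivalent file is memory.current. The sketch below (cgroup_mem_usage is a hypothetical helper) reads whichever file exists in a given cgroup directory. The directory itself varies by container runtime and must be supplied by the caller.

```shell
# cgroup_mem_usage DIR: prints the current memory usage in bytes for the
# cgroup directory DIR, supporting both cgroup layouts.
cgroup_mem_usage() {
  if [ -r "$1/memory.current" ]; then
    cat "$1/memory.current"            # cgroup v2
  elif [ -r "$1/memory.usage_in_bytes" ]; then
    cat "$1/memory.usage_in_bytes"     # cgroup v1
  else
    echo "no memory usage file in $1" >&2
    return 1
  fi
}

# Usage on the node (not executed here; the path is a placeholder):
#   cgroup_mem_usage /sys/fs/cgroup/memory/<container-id>
```

Sampling this value before and during a memory fault shows whether the injected load is being charged to the target container.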

▌Conclusion

ChaosBlade-Operator achieves declarative fault management by modeling chaos experiments as Kubernetes resources. Building on the isolation mechanisms of cgroups and namespaces, it provides precise fault simulation while keeping experiments safe. Mastering the fault injection methods and troubleshooting techniques in this article can help engineering teams quickly build a reliability verification practice and lay a solid foundation for the stability of cloud-native systems.
