Chaos Engineering Practice: Fault Injection with SMI and Linkerd

Author: Alex Leong

Translator: Liu Yameng

Application fault injection is a form of chaos engineering that artificially increases the error rate of certain services in a microservices application to observe the impact on the system as a whole. Traditionally, this required adding some kind of fault injection library to the service code. Fortunately, service meshes provide a way to inject faults into applications without modifying or refactoring the services themselves.

A hallmark of well-structured microservice applications is that they can gracefully tolerate the failure of individual services. When these failures take the form of service crashes, Kubernetes recovers well by replacing crashed Pods with new ones. However, failures can also be more subtle, such as an elevated error rate in a service's responses. Kubernetes cannot recover from this type of failure automatically, and it can still cause a partial loss of functionality.
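The crash-recovery loop mentioned above is usually driven by a liveness probe. The following is an illustrative sketch (the image, port, and probe path are hypothetical, not from this article) of the kind of check Kubernetes can act on:

```yaml
# Illustrative Deployment fragment: Kubernetes restarts this container if the
# liveness probe fails. Note that a probe like this cannot detect a service
# that stays up but returns errors for a fraction of requests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: app
          image: example/app:latest   # hypothetical image
          livenessProbe:
            httpGet:
              path: /healthz          # hypothetical health endpoint
              port: 8080
            failureThreshold: 3
```

A liveness probe only answers "is this Pod alive?" A service with, say, a 10% error rate passes such a check, which is exactly the class of failure this article targets.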

Injecting Errors Using the SMI Traffic Split API

The Traffic Split API of the Service Mesh Interface (SMI) makes application fault injection straightforward. Because SMI is implementation-agnostic, this technique works across service meshes.

To implement this form of fault injection, we first deploy a new service that only returns errors. It can be as simple as an NGINX instance configured to return HTTP 500, or a more complex service designed to return specific errors that exercise particular conditions. Next, we create a Traffic Split resource that directs the service mesh to send a percentage of the target service's traffic to the error service. For example, sending 10% of a service's traffic to the error service artificially injects a 10% error rate into that service.
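In SMI terms, that 10% split can be sketched like this (the service names here are hypothetical placeholders; a concrete version for the demo application appears later in this article):

```yaml
# Hedged sketch: split 10% of traffic for a hypothetical "my-service"
# to a hypothetical "my-error-service". Weights are relative.
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: my-service-error-split
spec:
  service: my-service            # the apex service that clients call
  backends:
  - service: my-service          # real backend keeps 90% of traffic
    weight: 900m
  - service: my-error-service    # "always fail" backend receives 10%
    weight: 100m
```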

Let’s look at an example using Linkerd as the service mesh implementation.

Example

First, we install the Linkerd CLI and deploy it to the Kubernetes cluster:

curl https://run.linkerd.io/install | sh
export PATH=$PATH:$HOME/.linkerd2/bin
linkerd install | kubectl apply -f -
linkerd check

Then, we install a “booksapp” sample application:

linkerd inject https://run.linkerd.io/booksapp.yml | kubectl apply -f -

One of the services in this application ships with an artificial error rate already configured. Since the point of this example is to show that we can inject errors without any support from the application itself, we first remove that built-in error rate:

kubectl edit deploy/authors
# Find and remove these lines:
#        - name: FAILURE_RATE
#          value: "0.5"

We can see that the application is running normally:

linkerd stat deploy
NAME             MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
authors             1/1   100.00%   6.6rps           3ms          58ms          92ms          6
books               1/1   100.00%   8.0rps           4ms          81ms         119ms          6
traffic             1/1         -        -             -             -             -          -
webapp              3/3   100.00%   7.7rps          24ms          91ms         117ms          9

Now, we create the error service. Here, I use an NGINX instance configured to return only the HTTP 500 status code. Create a file named error-injector.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: error-injector
  labels:
    app: error-injector
spec:
  selector:
    matchLabels:
      app: error-injector
  replicas: 1
  template:
    metadata:
      labels:
        app: error-injector
    spec:
      containers:
        - name: nginx
          image: nginx:alpine
          ports:
          - containerPort: 80
            name: nginx
            protocol: TCP
          volumeMounts:
            - name: nginx-config
              mountPath: /etc/nginx/nginx.conf
              subPath: nginx.conf
      volumes:
        - name: nginx-config
          configMap:
            name: error-injector-config
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: error-injector
  name: error-injector
spec:
  clusterIP: None
  ports:
  - name: service
    port: 7002
    protocol: TCP
    targetPort: nginx
  selector:
    app: error-injector
  type: ClusterIP
---
apiVersion: v1
data:
  nginx.conf: |
    events {
        worker_connections  1024;
    }
    http {
        server {
            location / {
                return 500;
            }
        }
    }
kind: ConfigMap
metadata:
  name: error-injector-config

Deploy the error-injector.yaml file:

kubectl apply -f error-injector.yaml

Now, we create a Traffic Split resource that redirects 10% of the traffic destined for the “books” service to the “error-injector” service. Weights in a TrafficSplit are relative: 900m and 100m sum to 1000m, so “books” keeps 90% of the traffic and “error-injector” receives the remaining 10%. Save this resource as error-split.yaml:

apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: error-split
spec:
  service: books
  backends:
  - service: books
    weight: 900m
  - service: error-injector
    weight: 100m

Deploy the error-split.yaml file:

kubectl apply -f error-split.yaml

Now, we can see that roughly 10% of calls from webapp to books fail:

linkerd routes deploy/webapp --to service/books
ROUTE       SERVICE   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
[DEFAULT]     books    90.66%   6.6rps           5ms          80ms          96ms

We can also see how the application gracefully handles these faults:

kubectl port-forward deploy/webapp 7000 &
open http://localhost:7000

If we refresh the page a few times, we sometimes see an internal service error page.

[Screenshot: the booksapp page showing an internal service error]

We have learned some valuable lessons about how the application handles service errors. Now, we restore our application by simply deleting the Traffic Split resource:

kubectl delete trafficsplit/error-split

Conclusion

In this article, we demonstrated a quick and easy way to perform fault injection at the service level by dynamically redirecting a portion of traffic to a simple “always failing” target service using the SMI API (supported by Linkerd). The advantage of this approach is that we can achieve fault injection solely through the SMI API without changing any application code.

Of course, fault injection is a broad topic, and there are many more complex fault injection methods, including routing failures, causing requests matching specific conditions to fail, or propagating a single “poison pill” request throughout the entire application topology. These types of fault injection require more supporting mechanisms than what is covered in this article.
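As a small taste of condition-matched failures, the NGINX error service used above could be extended to fail only requests matching a particular path. This is a hedged sketch, not part of the original demo; the path and status codes are illustrative:

```nginx
# Illustrative nginx.conf: fail only requests under a specific path prefix.
events {
    worker_connections  1024;
}
http {
    server {
        # Requests matching this prefix are forced to fail.
        location /api/books {
            return 500;
        }
        # Everything else returns 404 here; in a real setup you might
        # proxy_pass unmatched traffic to the genuine backend instead.
        location / {
            return 404;
        }
    }
}
```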

Linkerd is a community project hosted by the Cloud Native Computing Foundation (CNCF). Linkerd is hosted on GitHub, and the community is active on Slack, Twitter, and mailing lists. Interested developers can download and try it out.

Original link:

https://linkerd.io/2019/07/18/failure-injection-using-the-service-mesh-interface-and-linkerd/index.html
