Author: Alex Leong
Translator: Liu Yameng
A hallmark of a well-structured microservices application is that it can gracefully tolerate the failure of individual services. When those failures take the form of service crashes, Kubernetes recovers admirably, replacing crashed Pods with new ones. Failures can also be more subtle, however, such as a service that starts returning an elevated error rate. Kubernetes cannot automatically recover from that kind of failure, yet it still causes a partial loss of functionality.
The Traffic Split API of the Service Mesh Interface (SMI) makes it easy to inject faults into an application. It is an implementation-agnostic way to perform fault injection that works across service meshes.
To implement this form of fault injection, we first deploy a new service that does nothing but return errors. This can be as simple as an NGINX instance configured to return HTTP 500, or something more sophisticated that returns specific errors designed to exercise particular conditions. We then create a Traffic Split resource that instructs the service mesh to send a percentage of the service's traffic to the error service. For example, by sending 10% of a service's traffic there, we artificially inject a 10% error rate into that service.
Let’s look at an example using Linkerd as the service mesh implementation.
First, we install the Linkerd CLI and deploy it to the Kubernetes cluster:
curl https://run.linkerd.io/install | sh
export PATH=$PATH:$HOME/.linkerd2/bin
linkerd install | kubectl apply -f -
linkerd check
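linkerd check already validates the installation; as an extra sanity check (not part of the original walkthrough), you can also list the control-plane pods directly:
# All Linkerd control-plane pods should be Running
kubectl -n linkerd get pods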
Then, we install a “booksapp” sample application:
linkerd inject https://run.linkerd.io/booksapp.yml | kubectl apply -f -
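The inject command adds Linkerd's sidecar proxy to each pod spec before it is applied to the cluster. To confirm the injection took effect (an extra check, not from the original article), look at the pods' container counts:
# Meshed pods carry an extra linkerd-proxy container, so READY shows 2/2
kubectl get pods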
One of the services in this application is configured with an artificial error rate. Since the point of this example is to show that we can inject errors without any application support, we first remove that built-in error rate:
kubectl edit deploy/authors
# Find and remove these lines:
# - name: FAILURE_RATE
#   value: "0.5"
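If you prefer not to edit the manifest interactively, a one-line alternative (not from the original article) is kubectl set env, where the trailing dash means "unset this variable":
# Remove the FAILURE_RATE env var from every container in the deployment
kubectl set env deploy/authors FAILURE_RATE-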
We can see that the application is running normally:
linkerd stat deploy
NAME      MESHED   SUCCESS   RPS      LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
authors   1/1      100.00%   6.6rps   3ms           58ms          92ms          6
books     1/1      100.00%   8.0rps   4ms           81ms          119ms         6
traffic   1/1      -         -        -             -             -             -
webapp    3/3      100.00%   7.7rps   24ms          91ms          117ms         9
Now we create the error service. Here, I use NGINX configured to return the HTTP 500 status code for every request. Create a file named error-injector.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: error-injector
  labels:
    app: error-injector
spec:
  selector:
    matchLabels:
      app: error-injector
  replicas: 1
  template:
    metadata:
      labels:
        app: error-injector
    spec:
      containers:
      - name: nginx
        image: nginx:alpine
        ports:
        - containerPort: 80
          name: nginx
          protocol: TCP
        volumeMounts:
        - name: nginx-config
          mountPath: /etc/nginx/nginx.conf
          subPath: nginx.conf
      volumes:
      - name: nginx-config
        configMap:
          name: error-injector-config
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: error-injector
  name: error-injector
spec:
  clusterIP: None
  ports:
  - name: service
    port: 7002
    protocol: TCP
    targetPort: nginx
  selector:
    app: error-injector
  type: ClusterIP
---
apiVersion: v1
data:
  nginx.conf: |
    events {
      worker_connections 1024;
    }
    http {
      server {
        location / {
          return 500;
        }
      }
    }
kind: ConfigMap
metadata:
  name: error-injector-config
Deploy the error-injector.yaml file:
kubectl apply -f error-injector.yaml
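Before wiring it into the mesh, you can smoke-test the error service directly (an extra step, not in the original article), assuming local port 8080 is free:
# Forward a local port to the error-injector pod and confirm it returns 500
kubectl port-forward deploy/error-injector 8080:80 &
curl -i http://localhost:8080/
# Expect: HTTP/1.1 500 Internal Server Error
kill %1   # stop the port-forward when done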
Now we create a Traffic Split resource that redirects 10% of the traffic destined for the "books" service to the error-injector service. The weights use Kubernetes resource-quantity notation and are relative, so 900m and 100m yield a 90/10 split. Save the following as error-split.yaml:
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: error-split
spec:
  service: books
  backends:
  - service: books
    weight: 900m
  - service: error-injector
    weight: 100m
Deploy the error-split.yaml file:
kubectl apply -f error-split.yaml
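Since TrafficSplit is an ordinary custom resource, it can be inspected like any other Kubernetes object (an extra check, not in the original walkthrough):
# Confirm the split exists and inspect its backends and weights
kubectl get trafficsplit error-split -o yaml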
Now, we can see that the call error rate from webapp to books is 10%:
linkerd routes deploy/webapp --to service/books
ROUTE       SERVICE   SUCCESS   RPS      LATENCY_P50   LATENCY_P95   LATENCY_P99
[DEFAULT]   books     90.66%    6.6rps   5ms           80ms          96ms
We can also see how the application gracefully handles these faults:
kubectl port-forward deploy/webapp 7000 &
open http://localhost:7000
If we refresh the page a few times, we sometimes see an internal service error page.
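To observe the injected error rate more systematically than by refreshing the browser, a quick loop like the following (my addition, assuming the port-forward above is still running) prints the status code of 20 consecutive requests, roughly two of which should be 500s:
# Fire 20 requests and print just the HTTP status code of each
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost:7000/
done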
Having learned something valuable about how the application handles service errors, we now restore it to normal operation simply by deleting the Traffic Split resource:
kubectl delete trafficsplit/error-split
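Re-running the earlier routes command should show the success rate returning to 100% once the split is gone:
# With the TrafficSplit deleted, all traffic goes to the real books service again
linkerd routes deploy/webapp --to service/books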
In this article, we demonstrated a quick and easy way to perform fault injection at the service level: using the SMI APIs (here backed by Linkerd) to dynamically redirect a portion of a service's traffic to a simple "always fail" target service. The beauty of this approach is that it works purely through the SMI APIs, without any changes to application code.
Of course, fault injection is a broad topic, and there are many more complex fault injection methods, including routing failures, causing requests matching specific conditions to fail, or propagating a single “poison pill” request throughout the entire application topology. These types of fault injection require more supporting mechanisms than what is covered in this article.
Linkerd is a community project hosted by the Cloud Native Computing Foundation (CNCF). The code lives on GitHub, and the community is active on Slack, Twitter, and the mailing lists. Interested developers are encouraged to download it and try it out.
Original link:
https://linkerd.io/2019/07/18/failure-injection-using-the-service-mesh-interface-and-linkerd/index.html