Exploring Fault Injection in Chaos Engineering

Authors | Ye Fei, Qiong Gu

Introduction: Chaos engineering may seem remote, but you do not choose the moment a failure occurs; that moment chooses you, and all you can do is prepare for it. Alibaba has practiced chaos engineering internally for many years, and the open-source project ChaosBlade distills that experience of combating failures through fault injection. To explain how it works and how to extend it with fault injection for additional components, we have prepared a series of detailed technical analyses covering architecture, model, protocol, bytecode, plugin, and practical application.

Original title: “Technical Analysis of Java Scene Chaos Engineering Implementation Series (1) | Architecture”

Preface

In a distributed system architecture, dependencies between services grow increasingly complex, making it hard to assess how a single service failure affects the whole system. Long request chains and incomplete monitoring and alerting also make problems harder to discover and locate. With business and technology iterating rapidly, keeping systems stable and highly available is a significant challenge.

You do not choose the moment a failure occurs; that moment chooses you, and all you can do is prepare for it. Chaos engineering is therefore a crucial part of building a stable system: it injects faults within a controlled scope or environment to continuously improve the system's stability and availability.

ChaosBlade is a chaos engineering tool that follows the principles of chaos engineering experiments and provides a rich set of fault scenarios to help distributed systems improve their fault tolerance and recoverability. It injects faults at a low level and is simple to use, non-intrusive, and highly extensible. The chaosblade-exec-jvm project provides zero-cost fault injection for Java application services. It supports mainstream framework components such as Dubbo, Servlet, and RocketMQ, can inject delays and exceptions into specified classes and methods, and supports complex experiment scenarios written as Java or Groovy scripts.

  • ChaosBlade GitHub Address: https://github.com/chaosblade-io/chaosblade

  • chaosblade-exec-jvm GitHub Address: https://github.com/chaosblade-io/chaosblade-exec-jvm

To explain its implementation principles and how to extend it with fault injection for additional components, we have split the technical analysis into six articles: architecture, model, protocol, bytecode, plugin, and practical application. This article introduces the overall architectural design of chaosblade-exec-jvm to give readers a working understanding of the project.

System Design

chaosblade-exec-jvm modifies bytecode based on JVM-Sandbox. Running the ChaosBlade tool attaches the fault injection Java Agent to the specified application process. The Java Agent follows the chaos experiment model and uses a pluggable design to support different Java components, so plugins can easily be added to cover more fault scenarios. Plugins are defined following AOP concepts, including Advice, Enhancer, and PointCut, combined with the ModelSpec, Target, Matcher, and Action defined by the chaos experiment model.

chaosblade-exec-jvm downloads the JVM-Sandbox release package during the make build compilation and packaging process. Once built, chaosblade-exec-jvm runs as a module of JVM-Sandbox. After the Agent is loaded, the module listens to JVM-Sandbox events to manage the entire chaos experiment process, and uses Java Agent technology to transform classes for fault injection.

Principle Analysis

In everyday backend development, we often expose API interfaces to clients, and these interfaces inevitably experience timeouts and exceptions caused by the network, system load, and other factors. In Java, HTTP API interfaces are typically served through Servlets. chaosblade-exec-jvm provides a Servlet plugin that can inject timeouts and custom exceptions. This article analyzes the fault injection process of chaosblade-exec-jvm by injecting a delay into a Servlet API interface.

To delay the Servlet API interface /topic by 3 seconds, the steps are as follows:

// Mount the Agent
blade prepare jvm --pid 888
{"code":200,"success":true,"result":"98e792c9a9a5dfea"}

// Inject fault capability
blade create servlet --requestpath=/topic delay --time=3000 --method=post
{"code":200,"success":true,"result":"52a27bafc252beee"}

// Revoke fault capability
blade destroy 52a27bafc252beee

// Unload the Agent
blade revoke 98e792c9a9a5dfea

1. Execution Process

Below is a detailed introduction to the fault injection process using Servlet request delay as an example.

  1. ChaosBlade issues a mount command, for example blade p jvm --pid 888, which mounts the Sandbox to the application process and activates the Java Agent;

  2. After the Sandbox is mounted, the chaosblade-exec-jvm module is loaded, along with plugins such as ServletPlugin and DubboPlugin;

  3. The pointcut of the ServletPlugin is matched, and event listeners are registered for HttpServlet's doPost and doGet methods;

  4. ChaosBlade issues a fault rule command: blade create servlet --requestpath=/topic delay --time=3000 --method=post;

  5. Incoming requests are matched against the fault rule; for example, with --requestpath=/topic, a request to http://127.0.0.1/topic matches the rule;

  6. After the fault rule is matched, the fault is triggered, such as a delay or a custom exception being thrown;

  7. ChaosBlade issues a command to unload the Java Agent, such as blade revoke 98e792c9a9a5dfea.

2. Code Analysis

1) Mounting the Agent

blade p jvm --pid 888

After this command is issued, the Agent will be mounted in the target Java application process, triggering the SandboxModule onLoad() event, initializing PluginLifecycleListener to manage the lifecycle of the plugin, and also triggering the SandboxModule onActive() event to load some plugins and their corresponding ModelSpec.

// Agent loading event
public void onLoad() throws Throwable {
  ManagerFactory.getListenerManager().setPluginLifecycleListener(this);
  dispatchService.load();
  ManagerFactory.load();
}
// ChaosBlade module activation implementation
public void onActive() throws Throwable {
  loadPlugins();
}

2) Loading Plugins

Plugins are loaded in one of two ways, each with its own condition:

  • Loading method: SandboxModule onActive() event. Loading condition: PointCut, ClassMatcher, and MethodMatcher are all non-empty;

  • Loading method: blade create command issued. Loading condition: the ModelSpec is of PreCreateInjectionModelHandler type, and the ActionSpec is not of DirectlyInjectionAction type.

During plugin loading, an event listener is created through SandboxEnhancerFactory (createBeforeEventListener(plugin) in the code below) and listens for events of interest, such as BeforeAdvice and AfterAdvice. The specific implementation is as follows:

// Load plugin
public void add(PluginBean plugin) {
    PointCut pointCut = plugin.getPointCut();
    if (pointCut == null) {
        return;
    }
    String enhancerName = plugin.getEnhancer().getClass().getSimpleName();
    // Create filter PointCut match
    Filter filter = SandboxEnhancerFactory.createFilter(enhancerName, pointCut);

    // Event listener
    int watcherId = moduleEventWatcher.watch(filter, SandboxEnhancerFactory.createBeforeEventListener(plugin), Event.Type.BEFORE);
    watchIds.put(PluginUtil.getIdentifier(plugin), watcherId);
}

3) Matching PointCut

After the SandboxModule onActive() event triggers the loading of the Plugin, the SandboxEnhancerFactory creates a Filter. The Filter internally filters through the ClassMatcher and MethodMatcher of the PointCut.

public static Filter createFilter(final String enhancerClassName, final PointCut pointCut) {
  return new Filter() {
    @Override
    public boolean doClassFilter(int access, String javaClassName, String superClassTypeJavaClassName,
                                 String[] interfaceTypeJavaClassNameArray,
                                 String[] annotationTypeJavaClassNameArray) {
      // Class-level filtering is delegated to the PointCut's ClassMatcher
      ClassMatcher classMatcher = pointCut.getClassMatcher();
      ...
    }

    @Override
    public boolean doMethodFilter(int access, String javaMethodName,
                                  String[] parameterTypeJavaClassNameArray,
                                  String[] throwsTypeJavaClassNameArray,
                                  String[] annotationTypeJavaClassNameArray) {
      // Method-level filtering is delegated to the PointCut's MethodMatcher
      MethodMatcher methodMatcher = pointCut.getMethodMatcher();
      ...
    }
  };
}
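
The two-stage check performed by the Filter can be illustrated with a small standalone sketch. The ClassMatcher, MethodMatcher, and PointCutSketch types below are simplified stand-ins written for this example, not the project's actual interfaces, and the sketch matches the class name exactly, whereas a real filter would presumably also consider superclasses. The class and method names come from step 3 of the execution process.

```java
import java.util.Set;

// Simplified stand-ins for the article's matcher types (assumed shapes,
// not the project's actual interfaces).
interface ClassMatcher {
    boolean isMatched(String className);
}

interface MethodMatcher {
    boolean isMatched(String methodName);
}

final class PointCutSketch {
    private final ClassMatcher classMatcher;
    private final MethodMatcher methodMatcher;

    PointCutSketch(ClassMatcher classMatcher, MethodMatcher methodMatcher) {
        this.classMatcher = classMatcher;
        this.methodMatcher = methodMatcher;
    }

    // A method is enhanced only when both matchers are present and both match.
    boolean matches(String className, String methodName) {
        if (classMatcher == null || methodMatcher == null) {
            return false;
        }
        return classMatcher.isMatched(className) && methodMatcher.isMatched(methodName);
    }

    // Hypothetical pointcut for the Servlet scenario: HttpServlet's doGet and doPost.
    static PointCutSketch forServlet() {
        Set<String> methods = Set.of("doGet", "doPost");
        return new PointCutSketch("javax.servlet.http.HttpServlet"::equals, methods::contains);
    }
}
```

With this sketch, PointCutSketch.forServlet().matches("javax.servlet.http.HttpServlet", "doPost") is true, while any other class or method name is rejected before a listener would fire.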

4) Triggering Enhancer

Once the plugin has been loaded and the target application's classes match the Filter, the EventListener is triggered. However, chaosblade-exec-jvm tracks experiment state internally through StatusManager, so no fault capability is triggered until an experiment has been registered.

For example, when the BeforeEventListener invokes the beforeAdvice() method of BeforeEnhancer, execution returns early at the check ManagerFactory.getStatusManager().expExists(targetName). The specific implementation is as follows:

public void beforeAdvice(String targetName, 
                         ClassLoader classLoader, 
                         String className,
                         Object object,
                         Method method, 
                         Object[] methodArguments) throws Exception {

  // Check the status of the experiment
  if (!ManagerFactory.getStatusManager().expExists(targetName)) {
    return;
  }
  EnhancerModel model = doBeforeAdvice(classLoader, className, object, method, methodArguments);
  if (model == null) {
    return;
  }
  ...
  // Injection phase
  Injector.inject(model);
}
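
The effect of this early return can be shown with a minimal sketch: the listener is invoked on every matching call, but injection happens only while an experiment exists for the target. StatusGateSketch and its methods are hypothetical names for this example and only imitate the expExists check; they are not the project's StatusManager.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the status check; not the project's StatusManager.
final class StatusGateSketch {
    // Targets that currently have a registered experiment.
    private final Set<String> activeTargets = ConcurrentHashMap.newKeySet();
    // Counts how many times the "injection phase" was reached.
    final AtomicInteger injections = new AtomicInteger();

    void register(String target) {
        activeTargets.add(target);
    }

    void remove(String target) {
        activeTargets.remove(target);
    }

    boolean expExists(String target) {
        return activeTargets.contains(target);
    }

    // Mirrors the early return in beforeAdvice: the listener always fires,
    // but nothing is injected unless an experiment exists for the target.
    void beforeAdvice(String target) {
        if (!expExists(target)) {
            return;
        }
        injections.incrementAndGet();
    }
}
```

Because the gate is a cheap set lookup, a mounted Agent with no active experiment adds little work to each intercepted call.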

5) Creating Chaos Experiment

blade create servlet --requestpath=/topic delay --time=3000

After this command is issued, it triggers the SandboxModule method annotated with @Http("/create"), which dispatches the event to com.alibaba.chaosblade.exec.service.handler.CreateHandler for processing.

After determining the required uid, target, action, and model parameters, the handler calls handleInjection, which registers the experiment with the status manager. If the ModelSpec is of PreCreateInjectionModelHandler type, it performs pre-creation processing. If the Action type is DirectlyInjectionAction, the fault capability is injected directly without going through the Enhancer, as with the JVM OOM fault capability.

public Response handle(Request request) {
  if (unloaded) {
    return Response.ofFailure(Code.ILLEGAL_STATE, "the agent is uninstalling");
  }
  // Check suid, suid is the context ID of an experiment
  String suid = request.getParam("suid");
  ...
  return handleInjection(suid, model, modelSpec);
}

private Response handleInjection(String suid, Model model, ModelSpec modelSpec) {
  RegisterResult result = this.statusManager.registerExp(suid, model);
  if (result.isSuccess()) {
    // Run the pre-create hook if the ModelSpec provides one
    applyPreInjectionModelHandler(suid, modelSpec, model);
    ...
  }
  ...
}

ModelSpec:

  • com.alibaba.chaosblade.exec.common.model.handler.PreCreateInjectionModelHandler: pre-create

  • com.alibaba.chaosblade.exec.common.model.handler.PreDestroyInjectionModelHandler: pre-destroy

private void applyPreInjectionModelHandler(String suid, ModelSpec modelSpec, Model model)
  throws ExperimentException {
  if (modelSpec instanceof PreCreateInjectionModelHandler) {
    ((PreCreateInjectionModelHandler)modelSpec).preCreate(suid, model);
  }
}
...
DirectlyInjectionAction:

If the ModelSpec is of PreCreateInjectionModelHandler type, and the ActionSpec type is DirectlyInjectionAction, it will directly perform fault capability injection, such as JVM OOM fault capability. If the ActionSpec type is not DirectlyInjectionAction, it will load the plugin.

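The behavior depends on the ModelSpec handler type and on the action type:

  • PreCreateInjectionModelHandler (ModelSpec), DirectlyInjectionAction: directly perform fault capability injection;

  • PreCreateInjectionModelHandler (ModelSpec), not DirectlyInjectionAction: load plugin;

  • PreDestroyInjectionModelHandler (ModelSpec), DirectlyInjectionAction: stop fault capability injection;

  • PreDestroyInjectionModelHandler (ModelSpec), not DirectlyInjectionAction: unload plugin.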
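
The four combinations above can be written as one small decision function. This is only an illustration of the decision table; DispatchSketch, Phase, and the returned strings are names invented for this sketch and do not exist in the project.

```java
// Illustration of the decision table; all names here are hypothetical.
final class DispatchSketch {
    // CREATE corresponds to the pre-create handler, DESTROY to the pre-destroy handler.
    enum Phase { CREATE, DESTROY }

    // Returns what happens for a given handler phase and action type.
    static String decide(Phase phase, boolean directlyInjectionAction) {
        switch (phase) {
            case CREATE:
                return directlyInjectionAction ? "inject fault directly" : "load plugin";
            case DESTROY:
                return directlyInjectionAction ? "stop fault injection" : "unload plugin";
            default:
                throw new IllegalArgumentException("unknown phase: " + phase);
        }
    }
}
```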

private Response handleInjection(String suid, Model model, ModelSpec modelSpec) {
    // Register
    RegisterResult result = this.statusManager.registerExp(suid, model);
    if (result.isSuccess()) {
        // handle injection
        try {
            applyPreInjectionModelHandler(suid, modelSpec, model);
        } catch (ExperimentException ex) {
            this.statusManager.removeExp(suid);
            return Response.ofFailure(Response.Code.SERVER_ERROR, ex.getMessage());
        }

        return Response.ofSuccess(model.toString());
    }
    return Response.ofFailure(Response.Code.DUPLICATE_INJECTION, "the experiment exists");
}

Upon successful registration, the uid is returned. If the fault capability was injected directly at this stage, or if the custom Enhancer's advice returns null, the fault is not triggered through the Injector class.
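
The register-then-roll-back flow of handleInjection can be sketched in isolation. The sketch assumes experiments are kept in a concurrent map keyed by suid and uses plain strings in place of Response and Model; RegistrySketch and PreCreate are hypothetical names, and the real StatusManager may store its state differently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of register / pre-create / roll back; strings stand in
// for the project's Response and Model types.
final class RegistrySketch {
    // suid -> model description; putIfAbsent makes duplicate detection atomic.
    private final Map<String, String> experiments = new ConcurrentHashMap<>();

    // Stand-in for a pre-create hook that may fail.
    interface PreCreate {
        void preCreate(String suid, String model) throws Exception;
    }

    String handleInjection(String suid, String model, PreCreate hook) {
        if (experiments.putIfAbsent(suid, model) != null) {
            return "DUPLICATE_INJECTION";
        }
        try {
            hook.preCreate(suid, model);
        } catch (Exception ex) {
            // Roll back the registration so the same experiment can be retried.
            experiments.remove(suid);
            return "SERVER_ERROR: " + ex.getMessage();
        }
        return "OK";
    }

    boolean exists(String suid) {
        return experiments.containsKey(suid);
    }
}
```

Removing the entry on failure matters: without it, a failed pre-create step would leave a registered experiment behind, and every retry would be rejected as a duplicate.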

6) Injecting Fault Capability

Fault capability injection ultimately calls an ActionExecutor to execute the fault. There are two paths:

  • Injection through Injector;

  • Direct injection by DirectlyInjectionAction, which skips the Injector call stage, as with the JVM OOM fault capability.

With DirectlyInjectionAction, injection bypasses the Enhancer's parameter wrapping and matching and goes straight to the ActionExecutor stage that triggers the fault. With Injector injection, the StatusManager has already registered the experiment, so when the event is triggered again the check ManagerFactory.getStatusManager().expExists(targetName) no longer returns early, and execution continues into the custom Enhancer. The custom Enhancer can access the original method's parameters, types, and so on, and can even invoke other methods of the original type through reflection. Invoking arbitrary methods this way carries significant risk, so in practice an Enhancer usually only reads member variables or calls getter methods to obtain the values used for parameter matching during the inject phase.

7) Wrapping Matching Parameters

A custom Enhancer, such as ServletEnhancer, wraps the values that must be matched against the command-line flags in a MatcherModel, then wraps that in an EnhancerModel and returns it. For example, with --requestpath=/index, the requestpath flag is compared against the request's requestURI, while --querystring="name=xx" is handled by a custom matcher. After the parameters are wrapped, they are checked during the Injector.inject(model) stage.

public EnhancerModel doBeforeAdvice(ClassLoader classLoader, String className, Object object,
                                    Method method, Object[] methodArguments)
        throws Exception {
    Object request = methodArguments[0];
    String requestURI = ReflectUtil.invokeMethod(request, ServletConstant.GET_REQUEST_URI, new Object[]{}, false);
    String requestMethod = ReflectUtil.invokeMethod(request, ServletConstant.GET_METHOD, new Object[]{}, false);

    MatcherModel matcherModel = new MatcherModel();
    matcherModel.add(ServletConstant.METHOD_KEY, requestMethod);
    matcherModel.add(ServletConstant.REQUEST_PATH_KEY, requestURI);

    Map<String, Object> queryString = getQueryString(requestMethod, request);

    EnhancerModel enhancerModel = new EnhancerModel(classLoader, matcherModel);
    // Custom parameter matching
    enhancerModel.addCustomMatcher(ServletConstant.QUERY_STRING_KEY, queryString, ServletParamsMatcher.getInstance());
    return enhancerModel;
}
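
How the wrapped values might then be compared with the command-line flags can be sketched as follows. This is an assumption-laden illustration, not the project's compare method: CompareSketch is a hypothetical name, both sides are reduced to string maps, and the case-insensitive comparison is an assumption made here so that --method=post matches a request method of POST.

```java
import java.util.Map;

// Hypothetical sketch of flag matching; not the project's compare method.
final class CompareSketch {
    // Every flag given on the command line must be present in the runtime
    // values and equal (ignoring case). Runtime values without a
    // corresponding flag are not checked.
    static boolean compare(Map<String, String> commandFlags, Map<String, String> runtimeValues) {
        for (Map.Entry<String, String> flag : commandFlags.entrySet()) {
            String actual = runtimeValues.get(flag.getKey());
            if (actual == null || !actual.equalsIgnoreCase(flag.getValue())) {
                return false;
            }
        }
        return true;
    }
}
```

The asymmetry is deliberate: the experiment narrows the set of affected requests only by the flags the user actually supplied.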

8) Judging Precondition

The inject stage first retrieves the experiments registered with the StatusManager and compares parameters through compare(model, enhancerModel); if the comparison fails, that experiment is skipped. Then limitAndIncrease(statusMetric) checks --effect-count and --effect-percent to limit the number and percentage of affected requests.

public static void inject(EnhancerModel enhancerModel) throws InterruptProcessException {
    String target = enhancerModel.getTarget();
    List<StatusMetric> statusMetrics = ManagerFactory.getStatusManager().getExpByTarget(
        target);
    for (StatusMetric statusMetric : statusMetrics) {
      Model model = statusMetric.getModel();
      // Match command line input parameters
      if (!compare(model, enhancerModel)) {
        continue;
      }
      // Accumulate attack counts and check if the attack count has reached the effect count
      boolean pass = limitAndIncrease(statusMetric);
      if (!pass) {
        break;
      }
      enhancerModel.merge(model);
      ModelSpec modelSpec = ManagerFactory.getModelSpecManager().getModelSpec(target);
      ActionSpec actionSpec = modelSpec.getActionSpec(model.getActionName());
      // ActionExecutor executes fault capability
      actionSpec.getActionExecutor().run(enhancerModel);
      break;
    }
}
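
The limiting step can be approximated with a counter and a random draw. The sketch below is one plausible reading of --effect-count and --effect-percent, not the project's limitAndIncrease: EffectLimiterSketch is a hypothetical name, and applying both limits together in one method is an assumption of this example.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of effect limiting; not the project's limitAndIncrease.
final class EffectLimiterSketch {
    private final AtomicLong hits = new AtomicLong();
    private final long effectCount;  // maximum number of injections; 0 or less means unlimited
    private final int effectPercent; // chance (0..100) that a matching call is affected

    EffectLimiterSketch(long effectCount, int effectPercent) {
        this.effectCount = effectCount;
        this.effectPercent = effectPercent;
    }

    // Returns true when this call should be affected by the fault.
    boolean tryAcquire() {
        // Percentage check: draw 0..99 and skip the call when the draw
        // is not below the configured percentage.
        if (ThreadLocalRandom.current().nextInt(100) >= effectPercent) {
            return false;
        }
        if (effectCount <= 0) {
            hits.incrementAndGet();
            return true;
        }
        // Compare-and-set loop so concurrent requests never exceed the cap.
        while (true) {
            long current = hits.get();
            if (current >= effectCount) {
                return false;
            }
            if (hits.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    long count() {
        return hits.get();
    }
}
```

A limiter built with new EffectLimiterSketch(2, 100) affects the first two matching calls and rejects every call after that.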

9) Triggering Fault Capability

Whether triggered through Injector or directly by DirectlyInjectionAction, a custom ActionExecutor, such as DefaultDelayExecutor, is finally called to generate the fault. At this point, the fault capability has taken effect.

public void run(EnhancerModel enhancerModel) throws Exception {
    String time = enhancerModel.getActionFlag(timeFlagSpec.getName());
    Integer sleepTimeInMillis = Integer.valueOf(time);
     // Trigger delay
    TimeUnit.MILLISECONDS.sleep(sleepTimeInMillis);
}
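
A slightly more defensive variant of the same idea is sketched below, assuming the --time flag arrives as a string. DelaySketch is a hypothetical class for illustration, not DefaultDelayExecutor; it validates the value before sleeping and restores the interrupt flag if the sleep is interrupted.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical delay helper for illustration; not the project's DefaultDelayExecutor.
final class DelaySketch {
    // Parses the time flag as a non-negative number of milliseconds.
    static long parseMillis(String time) {
        if (time == null) {
            throw new IllegalArgumentException("time flag is required");
        }
        long millis;
        try {
            millis = Long.parseLong(time.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("time must be a whole number of milliseconds: " + time, e);
        }
        if (millis < 0) {
            throw new IllegalArgumentException("time must not be negative: " + time);
        }
        return millis;
    }

    // Blocks the calling thread, which is what makes the intercepted request slow.
    static void delay(String time) {
        long millis = parseMillis(time);
        try {
            TimeUnit.MILLISECONDS.sleep(millis);
        } catch (InterruptedException e) {
            // Preserve the interrupt so the application thread can react to it.
            Thread.currentThread().interrupt();
        }
    }
}
```

Since the sleep runs on the request thread itself, a 3000 ms delay holds that thread for the full duration, which is the behavior a timeout experiment is meant to expose.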

3. Destroying Experiment

blade destroy 52a27bafc252beee

After this command is issued, it triggers the SandboxModule method annotated with @Http("/destroy"), which dispatches the event to com.alibaba.chaosblade.exec.service.handler.DestroyHandler for processing and unregisters the status of this fault. From this point on, when the Enhancer is triggered again, the StatusManager determines that the experiment has been destroyed, and no fault capability is injected.

// StatusManager checks experiment status
if (!ManagerFactory.getStatusManager().expExists(targetName)) {
    return;
}

If the plugin’s ModelSpec is of PreDestroyInjectionModelHandler type, and the ActionSpec type is DirectlyInjectionAction, it will stop fault capability injection. If the ActionSpec type is not DirectlyInjectionAction, it will unload the plugin.

// DestroyHandler unregisters experiment status
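public Response handle(Request request) {
    String uid = request.getParam("suid");
    ...
    // Without a uid, fall back to destroying by target and action
    if (StringUtil.isBlank(uid)) {
        if (StringUtil.isBlank(target) || StringUtil.isBlank(action)) {
            // handle() returns a Response, so missing parameters are reported as a failure
            // (the error code and message shown here are illustrative)
            return Response.ofFailure(Response.Code.ILLEGAL_PARAMETER, "uid, or target and action, is required");
        }
        // Unregister status
        return destroy(target, action);
    }
    return destroy(uid);
}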

4. Unloading Agent

blade revoke 98e792c9a9a5dfea

After this command is issued, it triggers the SandboxModule onUnload() event; the plugins are unloaded and all resources created by the Agent are reclaimed.

public void onUnload() throws Throwable {
    dispatchService.unload();
    ManagerFactory.unload();
    watchIds.clear();
}

Conclusion

This article uses the Servlet scenario as an example to introduce the architectural design and implementation principles of the chaosblade-exec-jvm project in detail. Subsequent articles will cover the model, protocol, bytecode, plugin, and practical application, so that readers can quickly extend the project with the plugins they need.

As a chaos engineering experiment tool, ChaosBlade is simple to use, supports a rich variety of experiment scenarios, and makes it easy to add new ones. The supported scenario domains are as follows:

  • Basic Resources: such as CPU, memory, network, disk, process, etc.;

  • Java Applications: such as databases, caches, messaging, the JVM itself, and microservices; any class method can also be targeted to inject a variety of complex experiment scenarios;

  • C++ Applications: such as injecting delays, or tampering with variables and return values, at any specified method or line of code;

  • Docker Containers: such as killing containers, and running CPU, memory, network, disk, and process experiments inside containers;

  • Kubernetes Platform: such as CPU, memory, network, disk, and process experiments on nodes; Pod network experiments and experiments on Pods themselves, such as killing Pods and Pod I/O exceptions; and the container experiments listed under Docker Containers;

  • Cloud Resources: such as Alibaba Cloud ECS downtime experiments.

The ChaosBlade community welcomes everyone to join us to discuss practices in the field of chaos engineering or any ideas and issues that arise during the use of ChaosBlade.

Author Introduction:

Ye Fei: GitHub @tiny-x, open-source community enthusiast, ChaosBlade Committer, involved in promoting the chaos engineering ecosystem of ChaosBlade.

Qiong Gu: GitHub @xcaspar, project leader of ChaosBlade, chaos engineering evangelist.

Source:

Alibaba Cloud Native (ID: Alicloudnative)
