Using Failpoint for Fault Injection in Go

Recently, while reading the TiDB source code, I noticed that it uses failpoint for fault injection, which I found very interesting. Failpoint implements fault injection through code generation: it parses the source into an AST and replaces specific nodes. In this article I analyze how it does that and, along the way, learn how to parse and rewrite a Go AST to generate code.

So, this article mainly explores the detailed usage of failpoint and its implementation principles.

Introduction

Failpoint is a tool for injecting errors during testing; it is the Go implementation of FreeBSD Failpoints. To improve system stability we usually test many scenarios, but some are very difficult to simulate: random delays or an unavailable service in a microservice system, or, in game development, unstable player networks, frame drops, and excessive latency.

To conveniently test these issues, failpoint was created, which greatly simplifies our testing process, helping us simulate various errors in different scenarios to debug code bugs.

Failpoint is designed around the following goals:

Failpoint-related code should add no extra overhead.
It should not affect normal functional logic and must not intrude on functional code.
Failpoint code must be easy to read and write, and must be checked by the compiler.
The generated code must be readable.
In the generated code, the line numbers of functional logic must not change (to facilitate debugging).

Usage

First, we need to build the tool from source:

git clone https://github.com/pingcap/failpoint.git
cd failpoint
make
ls bin/failpoint-ctl

This builds the failpoint-ctl binary, which is used to transform the code.

Then, we can use failpoint in the code to inject faults:

package main

import "github.com/pingcap/failpoint"
import "fmt"

func test() {
    failpoint.Inject("testValue", func(v failpoint.Value) {
        fmt.Println(v)
    })
}

func main() {
    for i := 0; i < 100; i++ {
        test()
    }
}

If we look at the definition of the Inject function, we see:

func Inject(fpname string, fpbody interface{}) {}

When failpoint is not enabled, it is just an empty implementation and does not affect the performance of our business logic. When our service code is compiled and built, this piece of code will be inlined and optimized away, which is the zero-cost fault injection principle implemented by failpoint.

Next, we will convert the above test function into usable fault injection code:

$ failpoint/bin/failpoint-ctl enable .

Call the compiled failpoint-ctl to rewrite the current code:

package main

import (
 "fmt"
 "github.com/pingcap/failpoint"
)

func test() {
 if v, _err_ := failpoint.Eval(_curpkg_("testValue")); _err_ == nil {
  fmt.Println(v)
 }
}

func main() {
 for i := 0; i < 100; i++ {
  test()
 }
}

Next, we will perform the injection on the code:

$ GO_FAILPOINTS='main/testValue=2*return("abc")' go run main.go binding__failpoint_binding__.go 
abc
abc

In the above case, 2 indicates that the injection will only execute twice, and the parameter in return("abc") corresponds to the variable v obtained in the injection function.

Additionally, we can also set the activation probability:

$ GO_FAILPOINTS='main/testValue=5%return("abc")' go run main.go binding__failpoint_binding__.go 
abc
abc
abc
abc

In the above case, 5% indicates that it will only return abc with a 5% probability.
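
The two modifiers above can be pictured as a small gate placed in front of the action. The following is only a minimal sketch of that idea, not failpoint's actual implementation: the gate type, its fields, and countFired are hypothetical names invented for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// gate is a hypothetical stand-in for a failpoint modifier.
// It is NOT the library's real type; it only mirrors the behavior
// described above for "2*" (count) and "5%" (probability).
type gate struct {
	remaining int     // how many more times the action may fire; -1 means unlimited
	percent   float64 // probability in [0, 100] that a single evaluation fires
}

// allow reports whether the action should fire on this evaluation.
func (g *gate) allow() bool {
	if g.remaining == 0 {
		return false
	}
	// rand.Float64 returns a value in [0, 1), so percent=100 always passes
	// and percent=0 never passes.
	if rand.Float64()*100 >= g.percent {
		return false
	}
	if g.remaining > 0 {
		g.remaining--
	}
	return true
}

// countFired evaluates the gate n times and returns how often it fired.
func countFired(g *gate, n int) int {
	fired := 0
	for i := 0; i < n; i++ {
		if g.allow() {
			fired++
		}
	}
	return fired
}

func main() {
	// Analogous to 2*return("abc"): fires on the first two evaluations only.
	fmt.Println(countFired(&gate{remaining: 2, percent: 100}, 100))
	// Analogous to 5%return("abc"): fires on roughly 5 of 100 evaluations.
	fmt.Println(countFired(&gate{remaining: -1, percent: 5}, 100))
}
```

The first call always reports 2, while the second varies from run to run, which matches the two outputs shown above.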

Besides the simple examples above, we can also use it to generate more complex scenarios:

package main

import (
 "fmt"
 "github.com/pingcap/failpoint"
 "math/rand"
)

func main() {
 failpoint.Label("outer")
 for i := 0; i < 100; i++ {
  failpoint.Label("inner")
  for j := 1; j < 1000; j++ {
   switch rand.Intn(j) + i {
   case j / 5:
    failpoint.Break()
   case j / 7:
    failpoint.Continue("outer")
   case j / 9:
    failpoint.Fallthrough()
   case j / 10:
    failpoint.Goto("outer")
   default:
    failpoint.Inject("failpoint-name", func(val failpoint.Value) {
     fmt.Println("unit-test", val.(int))
     if val == j/11 {
      failpoint.Break("inner")
     } else {
      failpoint.Goto("outer")
     }
    })
   }
  }
 } 
}

In this example, we used failpoint.Label, failpoint.Break, failpoint.Continue, failpoint.Fallthrough, and failpoint.Goto to implement code jumps. The final generated code is:

func main() {
outer:
 for i := 0; i < 100; i++ {
 inner:
  for j := 1; j < 1000; j++ {
   switch rand.Intn(j) + i {
   case j / 5:
    break
   case j / 7:
    continue outer
   case j / 9:
    fallthrough
   case j / 10:
    goto outer
   default:
    if val, _err_ := failpoint.Eval(_curpkg_("failpoint-name")); _err_ == nil {
     fmt.Println("unit-test", val.(int))
     if val == j/11 {
      break inner
     } else {
      goto outer
     }
    }
   }
  }
 }
}

We can see that our failpoint code has all been transformed into Go language jump keywords.

After testing, we can finally restore the code using disable:

$ failpoint/bin/failpoint-ctl disable .

Other usage methods can be found in the official documentation:

https://github.com/pingcap/failpoint

Implementation Principles

Code Injection

Example Explanation

When using failpoint, we will use a series of Marker functions it provides to construct our fault points:

func Inject(fpname string, fpblock func(val Value)) {}
func InjectContext(fpname string, ctx context.Context, fpblock func(val Value)) {}
func Break(label ...string) {}
func Goto(label string) {}
func Continue(label ...string) {}
func Fallthrough() {}
func Return(results ...interface{}) {}
func Label(label string) {}

failpoint-ctl then parses the code into an AST and replaces the marker statements with the final injected code. For example, given the following code:

package main

import (
 "fmt"
 "github.com/pingcap/failpoint"
)

func test() {
 failpoint.Inject("testPanic", func(val failpoint.Value){
  fmt.Println(val)
 })
}

func main() {
 for i := 0; i < 100; i++ {
  test()
 }
}

After conversion:

package main

import (
 "fmt"
 "github.com/pingcap/failpoint"
)

func test() {
 if val, _err_ := failpoint.Eval(_curpkg_("testPanic")); _err_ == nil {
  fmt.Println(val)
 }
}

func main() {
 for i := 0; i < 100; i++ {
  test()
 }
}

failpoint-ctl conversion not only replaces the code content but also generates a binding__failpoint_binding__.go file, which contains a _curpkg_ function to get the current package name:

package main

import "reflect"

type __failpointBindingType struct {pkgpath string}
var __failpointBindingCache = &__failpointBindingType{}

func init() {
 __failpointBindingCache.pkgpath = reflect.TypeOf(__failpointBindingType{}).PkgPath()
}
func _curpkg_(name string) string {
 return  __failpointBindingCache.pkgpath + "/" + name
}
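
The trick used by the binding file can be tried in isolation. Below is a minimal sketch that uses only the standard reflect package; the marker type and the qualify function are hypothetical names, not identifiers from failpoint.

```go
package main

import (
	"fmt"
	"reflect"
)

// marker is an empty type declared in this package solely so that
// reflection can report which package it belongs to.
type marker struct{}

// qualify prefixes name with the import path of the package that
// declares marker, in the same spirit as _curpkg_ above.
func qualify(name string) string {
	return reflect.TypeOf(marker{}).PkgPath() + "/" + name
}

func main() {
	fmt.Println(qualify("testValue"))
}
```

Because the prefix is derived from the package that contains the type, the same failpoint name used in two different packages yields two different keys. This is why the GO_FAILPOINTS examples above address the failpoint as main/testValue rather than just testValue.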

Getting the Code AST Tree

When we call failpoint-ctl to transform code, the rewriting is done by Rewriter. Rewriter is a helper structure that traverses the AST of the code, detects marker functions, and replaces them.

type Rewriter struct {
 rewriteDir    string // Rewrite path
 currentPath   string // File path
 currentFile   *ast.File // File AST tree
 currsetFset   *token.FileSet // FileSet
 failpointName string // Import renaming of failpoint
 rewritten     bool // Whether rewriting is complete
  
 output io.Writer // Redirect output
}

When failpoint-ctl executes, it calls the RewriteFile method for code rewriting:

func (r *Rewriter) RewriteFile(path string) (err error) {
 defer func() {
  if e := recover(); e != nil {
   err = fmt.Errorf("%s %v\n%s", r.currentPath, e, debug.Stack())
  }
 }()
 fset := token.NewFileSet()
 // Get the AST tree of the go file
 file, err := parser.ParseFile(fset, path, nil, parser.ParseComments)
 if err != nil {
  return err
 }
 if len(file.Decls) < 1 {
  return nil
 }
 // File path
 r.currentPath = path
 // File AST tree
 r.currentFile = file
 // File FileSet
 r.currsetFset = fset
 // Mark whether rewriting is complete
 r.rewritten = false
 // Get the failpoint import package
 var failpointImport *ast.ImportSpec
 for _, imp := range file.Imports {
  if strings.Trim(imp.Path.Value, "`\"") == packagePath {
   failpointImport = imp
   break
  }
 }
 if failpointImport == nil {
  panic("import path should be check before rewrite")
 }
 if failpointImport.Name != nil {
  r.failpointName = failpointImport.Name.Name
 } else {
  r.failpointName = packageName
 }
 // Traverse the top-level declarations in the file: such as type, function, import, global constants, etc.
 for _, decl := range file.Decls {
  fn, ok := decl.(*ast.FuncDecl)
  if !ok {
   continue
  }
  // Traverse function declaration nodes and replace failpoint related functions
  if err := r.rewriteFuncDecl(fn); err != nil {
   return err
  }
 }

 if !r.rewritten {
  return nil
 }

 if r.output != nil {
  return format.Node(r.output, fset, file)
 }
 // Generate binding__failpoint_binding__ code
 found, err := isBindingFileExists(path)
 if err != nil {
  return err
 }
 // If binding__failpoint_binding__.go file does not exist, regenerate one
 if !found {
  err := writeBindingFile(path, file.Name.Name)
  if err != nil {
   return err
  }
 }
 // Rename the original file, such as renaming main.go to main.go__failpoint_stash__
 // To be used for restoration
 targetPath := path + failpointStashFileSuffix
 if err := os.Rename(path, targetPath); err != nil {
  return err
 }

 newFile, err := os.OpenFile(path, os.O_TRUNC|os.O_CREATE|os.O_WRONLY, os.ModePerm)
 if err != nil {
  return err
 }
 defer newFile.Close()
 // Regenerate code file from constructed ast tree
 return format.Node(newFile, fset, file)
}

This method first calls parser.ParseFile from the Go standard library to obtain the AST of the file. The AST represents the syntactic structure of the source code as a tree, where each node corresponds to a construct in the source. It then iterates over the top-level declarations of the AST and, for each function declaration, descends into the function body, which amounts to a depth-first traversal from the root of the tree.
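
To get a feel for this first step, here is a small self-contained sketch that parses a Go file held in a string and keeps only its function declarations, the same filter RewriteFile applies to file.Decls. The sample source and the funcNames helper are made up for illustration.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// src is a small in-memory Go file used instead of reading from disk.
const src = `package demo

import "fmt"

var answer = 42

func hello() { fmt.Println("hi") }

func add(a, b int) int { return a + b }
`

// funcNames parses source and returns the names of its top-level functions.
func funcNames(source string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "demo.go", source, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, decl := range file.Decls {
		// Imports, vars, consts and types are *ast.GenDecl; skip them.
		fn, ok := decl.(*ast.FuncDecl)
		if !ok {
			continue
		}
		names = append(names, fn.Name.Name)
	}
	return names, nil
}

func main() {
	names, err := funcNames(src)
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // [hello add]
}
```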

After traversal, it checks the binding__failpoint_binding__ file and backs up the source file before calling format.Node to rewrite the entire file.
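
The backup step can be sketched as a rename-aside followed by a rename-back. This is only an illustration of the idea under stated assumptions: the stash, restore, and roundTrip functions are hypothetical, the suffix is taken from the code comment above, and how failpoint-ctl disable actually restores files is not shown in this article.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// stashSuffix follows the example in the comment above (main.go__failpoint_stash__);
// treat the exact value as an assumption rather than the library's constant.
const stashSuffix = "__failpoint_stash__"

// stash renames the original file aside and writes the rewritten content in its place.
func stash(path string, rewritten []byte) error {
	if err := os.Rename(path, path+stashSuffix); err != nil {
		return err
	}
	return os.WriteFile(path, rewritten, 0o644)
}

// restore moves the stashed original back over the rewritten file.
func restore(path string) error {
	return os.Rename(path+stashSuffix, path)
}

// roundTrip stashes and then restores a temporary file and returns its final content.
func roundTrip() (string, error) {
	dir, err := os.MkdirTemp("", "stash-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "main.go")
	if err := os.WriteFile(path, []byte("original"), 0o644); err != nil {
		return "", err
	}
	if err := stash(path, []byte("rewritten")); err != nil {
		return "", err
	}
	if err := restore(path); err != nil {
		return "", err
	}
	data, err := os.ReadFile(path)
	return string(data), err
}

func main() {
	content, err := roundTrip()
	fmt.Println(content, err)
}
```

Keeping the untouched original next to the rewritten file means the source never has to be regenerated from the AST when the failpoints are turned off again.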

Traversing the AST and Replacing Marker Nodes

func (r *Rewriter) rewriteStmts(stmts []ast.Stmt) error {
 // Traverse function body nodes
 for i, block := range stmts {
  switch v := block.(type) {
  case *ast.DeclStmt:
   ...
  // Standalone expression statements
  case *ast.ExprStmt:
   call, ok := v.X.(*ast.CallExpr)
   if !ok {
    break
   }
   switch expr := call.Fun.(type) {
   // Function literal
   case *ast.FuncLit:
    // Recursively traverse function
    err := r.rewriteFuncLit(expr)
    if err != nil {
     return err
    }
   // Selector expression, such as a.b
   case *ast.SelectorExpr:
    // Get the package name of the function call
    packageName, ok := expr.X.(*ast.Ident)
    // Check if the package name equals the failpoint package name
    if !ok || packageName.Name != r.failpointName {
     break
    }
    // Get the Rewriter for the function through Marker name
    exprRewriter, found := exprRewriters[expr.Sel.Name]
    if !found {
     break
    }
    // Rewrite the function
    rewritten, stmt, err := exprRewriter(r, call)
    if err != nil {
     return err
    }
    if !rewritten {
     continue
    }
    // Get the newly generated if node
    if ifStmt, ok := stmt.(*ast.IfStmt); ok {
     err := r.rewriteIfStmt(ifStmt)
     if err != nil {
      return err
     }
    }
    // Replace the node with the newly generated if node
    stmts[i] = stmt
    r.rewritten = true
   }

  case *ast.AssignStmt:
   ...
  case *ast.GoStmt:
   ...
  case *ast.DeferStmt:
   ...
  case *ast.ReturnStmt:
   ...
  default:
   fmt.Printf("unsupported statement: %T in %s\n", v, r.pos(v.Pos()))
  }
 }
 return nil
}

This walks the statements of each function body in order. When it finds a failpoint marker call, it looks up the corresponding rewriter in exprRewriters by the marker name:

var exprRewriters = map[string]exprRewriter{
 "Inject":        (*Rewriter).rewriteInject,
 "InjectContext": (*Rewriter).rewriteInjectContext,
 "Break":         (*Rewriter).rewriteBreak,
 "Continue":      (*Rewriter).rewriteContinue,
 "Label":         (*Rewriter).rewriteLabel,
 "Goto":          (*Rewriter).rewriteGoto,
 "Fallthrough":   (*Rewriter).rewriteFallthrough,
 "Return":        (*Rewriter).rewriteReturn,
}
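
The detection logic, a selector expression whose package part is failpoint and whose selected name is a key in a dispatch table, can be reproduced in a few lines. The sketch below is a simplification under stated assumptions: it walks the tree with ast.Inspect instead of the hand-written type switch shown above, and the handlers table and findMarkers function are hypothetical.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// src only needs to parse; it is never type-checked or compiled.
const src = `package demo

import "github.com/pingcap/failpoint"

func work() {
	failpoint.Inject("a", func() {})
	other.Inject("b")
	failpoint.Unknown()
	failpoint.Break()
}
`

// describe returns a handler that reports the marker name and argument count.
func describe(name string) func(*ast.CallExpr) string {
	return func(c *ast.CallExpr) string {
		return fmt.Sprintf("%s with %d args", name, len(c.Args))
	}
}

// handlers plays the role of a dispatch table keyed by marker name.
var handlers = map[string]func(*ast.CallExpr) string{
	"Inject": describe("Inject"),
	"Break":  describe("Break"),
}

// findMarkers reports every call of the form pkgName.<Marker>(...) that has a handler.
func findMarkers(source, pkgName string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "demo.go", source, 0)
	if err != nil {
		return nil, err
	}
	var found []string
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok {
			return true
		}
		pkg, ok := sel.X.(*ast.Ident)
		if !ok || pkg.Name != pkgName {
			return true // a call into some other package
		}
		if h, ok := handlers[sel.Sel.Name]; ok {
			found = append(found, h(call))
		}
		return true
	})
	return found, nil
}

func main() {
	found, err := findMarkers(src, "failpoint")
	if err != nil {
		panic(err)
	}
	fmt.Println(found) // [Inject with 2 args Break with 0 args]
}
```

Calls into other packages and failpoint functions without a handler are simply skipped, which mirrors the two break branches in rewriteStmts.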

Rewriter Rewriting

In our example, we use failpoint.Inject, so we will explain using rewriteInject.

Through this method, it will ultimately transform:

 failpoint.Inject("testPanic", func(val failpoint.Value){
  fmt.Println(val)
 })

Into:

 if val, _err_ := failpoint.Eval(_curpkg_("testPanic")); _err_ == nil {
  fmt.Println(val)
 }

Now let’s see how to construct the AST tree:

func (r *Rewriter) rewriteInject(call *ast.CallExpr) (bool, ast.Stmt, error) {
 // Check if the function call failpoint.Inject is valid
 if len(call.Args) != 2 {
  return false, nil, fmt.Errorf("failpoint.Inject: expect 2 arguments but got %v in %s", len(call.Args), r.pos(call.Pos()))
 }
 // Get the first argument "testPanic"
 fpname, ok := call.Args[0].(ast.Expr)
 if !ok {
  return false, nil, fmt.Errorf("failpoint.Inject: first argument expect a valid expression in %s", r.pos(call.Pos()))
 }

 // Get the second argument func(val failpoint.Value){}
 ident, ok := call.Args[1].(*ast.Ident)
 // Check if the second argument is nil
 isNilFunc := ok && ident.Name == "nil"

 // Check if the second argument is a function, as the second argument can be nil
 // failpoint.Inject("failpoint-name", func(){...})
 // failpoint.Inject("failpoint-name", func(val failpoint.Value){...})
 fpbody, isFuncLit := call.Args[1].(*ast.FuncLit)
 if !isNilFunc && !isFuncLit {
  return false, nil, fmt.Errorf("failpoint.Inject: second argument expect closure in %s", r.pos(call.Pos()))
 }

 // The second argument is a function
 if isFuncLit {
  if len(fpbody.Type.Params.List) > 1 {
   return false, nil, fmt.Errorf("failpoint.Inject: closure signature illegal in %s", r.pos(call.Pos()))
  }

  if len(fpbody.Type.Params.List) == 1 && len(fpbody.Type.Params.List[0].Names) > 1 {
   return false, nil, fmt.Errorf("failpoint.Inject: closure signature illegal in %s", r.pos(call.Pos()))
  }
 }
 // Construct the replacement function: _curpkg_("testPanic")
 fpnameExtendCall := &ast.CallExpr{
  Fun:  ast.NewIdent(extendPkgName),
  Args: []ast.Expr{fpname},
 }
 // Construct the function failpoint.Eval
 checkCall := &ast.CallExpr{
  Fun: &ast.SelectorExpr{
   X:   &ast.Ident{NamePos: call.Pos(), Name: r.failpointName},
   Sel: ast.NewIdent(evalFunction),
  },
  Args: []ast.Expr{fpnameExtendCall},
 }
 if isNilFunc || len(fpbody.Body.List) < 1 {
  return true, &ast.ExprStmt{X: checkCall}, nil
 }
 // Construct if code block
 ifBody := &ast.BlockStmt{
  Lbrace: call.Pos(),
  List:   fpbody.Body.List,
  Rbrace: call.End(),
 }

 // Check if the closure function in failpoint contains parameters
 // func(val failpoint.Value) {...}
 // func() {...}
 var argName *ast.Ident
 if len(fpbody.Type.Params.List) > 0 {
  arg := fpbody.Type.Params.List[0]
  selector, ok := arg.Type.(*ast.SelectorExpr)
  if !ok || selector.Sel.Name != "Value" || selector.X.(*ast.Ident).Name != r.failpointName {
   return false, nil, fmt.Errorf("failpoint.Inject: invalid signature in %s", r.pos(call.Pos()))
  }
  argName = arg.Names[0]
 } else {
  argName = ast.NewIdent("_")
 }
 // Construct the return value of failpoint.Eval
 err := ast.NewIdent("_err_")
 init := &ast.AssignStmt{
  Lhs: []ast.Expr{argName, err},
  Rhs: []ast.Expr{checkCall},
  Tok: token.DEFINE,
 }
 // Construct the if statement condition, which is _err_ == nil
 cond := &ast.BinaryExpr{
  X:  err,
  Op: token.EQL,
  Y:  ast.NewIdent("nil"),
 }
 // Construct the complete if code block
 stmt := &ast.IfStmt{
  If:   call.Pos(),
  Init: init,
  Cond: cond,
  Body: ifBody,
 }
 return true, stmt, nil
}

The above comments should be detailed enough to follow along with the code.
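
To see the construction end to end, the following sketch builds the same shape of if statement by hand and renders it with go/format. It is a stripped-down illustration, not the library's code: the buildInjectIf function is hypothetical, the variable name v is fixed, the body is left empty, and no source positions are set.

```go
package main

import (
	"bytes"
	"fmt"
	"go/ast"
	"go/format"
	"go/token"
	"strconv"
)

// buildInjectIf constructs the AST for
//
//	if v, _err_ := failpoint.Eval(_curpkg_("<name>")); _err_ == nil { }
//
// and returns it rendered as Go source.
func buildInjectIf(name string) (string, error) {
	errIdent := ast.NewIdent("_err_")

	// failpoint.Eval(_curpkg_("<name>"))
	evalCall := &ast.CallExpr{
		Fun: &ast.SelectorExpr{
			X:   ast.NewIdent("failpoint"),
			Sel: ast.NewIdent("Eval"),
		},
		Args: []ast.Expr{&ast.CallExpr{
			Fun: ast.NewIdent("_curpkg_"),
			Args: []ast.Expr{
				&ast.BasicLit{Kind: token.STRING, Value: strconv.Quote(name)},
			},
		}},
	}

	stmt := &ast.IfStmt{
		// v, _err_ := <evalCall>
		Init: &ast.AssignStmt{
			Lhs: []ast.Expr{ast.NewIdent("v"), errIdent},
			Tok: token.DEFINE,
			Rhs: []ast.Expr{evalCall},
		},
		// _err_ == nil
		Cond: &ast.BinaryExpr{X: errIdent, Op: token.EQL, Y: ast.NewIdent("nil")},
		// The real rewriter moves the closure body here; this sketch leaves it empty.
		Body: &ast.BlockStmt{},
	}

	var buf bytes.Buffer
	if err := format.Node(&buf, token.NewFileSet(), stmt); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := buildInjectIf("testPanic")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Note that rewriteInject additionally copies positions from the original call (call.Pos() and call.End()) into the new nodes, which is consistent with the goal that line numbers of the functional code do not change.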

Failpoint Execution

Constructing Fault Plans

For example, if we want this fault to have a 5% chance of being triggered, we can do this:

$ GO_FAILPOINTS='main/testValue=5%return("abc")' go run main.go binding__failpoint_binding__.go

The content declared in the GO_FAILPOINTS variable will be read during initialization, and the corresponding mechanism will be registered. During execution, the fault control will be performed based on the registered mechanism.

func init() {
 failpoints.reg = make(map[string]*Failpoint)
 // Get the GO_FAILPOINTS variable
 if s := os.Getenv("GO_FAILPOINTS"); len(s) > 0 {
  // Split multiple values using ;
  for _, fp := range strings.Split(s, ";") {
   fpTerms := strings.Split(fp, "=")
   if len(fpTerms) != 2 {
    fmt.Printf("bad failpoint %q\n", fp)
    os.Exit(1)
   }
   // Register injection plan
   err := Enable(fpTerms[0], fpTerms[1])
   if err != nil {
    fmt.Printf("bad failpoint %s\n", err)
    os.Exit(1)
   }
  }
 }
 if s := os.Getenv("GO_FAILPOINTS_HTTP"); len(s) > 0 {
  if err := serve(s); err != nil {
   fmt.Println(err)
   os.Exit(1)
  }
 }
}
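
The splitting logic in init can be isolated into a small function that is easy to experiment with. This is a minimal sketch: parsePlans is a hypothetical helper that returns an error instead of exiting, and it stores the raw term strings rather than calling Enable.

```go
package main

import (
	"fmt"
	"strings"
)

// parsePlans splits a GO_FAILPOINTS-style string into name -> terms pairs,
// using the same ";" and "=" separators as the init function above.
func parsePlans(s string) (map[string]string, error) {
	plans := make(map[string]string)
	if s == "" {
		return plans, nil
	}
	for _, fp := range strings.Split(s, ";") {
		parts := strings.Split(fp, "=")
		// Exactly one "=" is expected, so a term containing "=" is rejected.
		if len(parts) != 2 {
			return nil, fmt.Errorf("bad failpoint %q", fp)
		}
		plans[parts[0]] = parts[1]
	}
	return plans, nil
}

func main() {
	plans, err := parsePlans(`main/testValue=5%return("abc");main/other=sleep(100)`)
	if err != nil {
		panic(err)
	}
	for name, terms := range plans {
		fmt.Printf("%s -> %s\n", name, terms)
	}
}
```

One consequence of splitting on every "=" is that a single entry must contain exactly one equals sign; anything else is reported as a bad failpoint.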

The Enable function will eventually call the Failpoints structure’s Enable method. Let’s take a look at the Failpoints structure:

type Failpoints struct {
 mu  sync.RWMutex  // Concurrency control
 reg map[string]*Failpoint // Fault plan table
}

type Failpoint struct {
 mu       sync.RWMutex  // Concurrency control
 t        *terms
 waitChan chan struct{} // Used for pausing
}

The Enable function parses a setting such as main/testValue=5%return("abc") into a key-value pair stored in the reg map: the key is the failpoint name, and the value is parsed into a Failpoint structure.

The fault control plan in the Failpoint structure is mainly stored in the term structure:

type term struct {
 desc string // Plan description, here is 5%return("abc")

 mods mod // Plan type, whether it is fault probability control or fault count control, here is 5%
 act  actFunc // Fault behavior, here is return
 val  interface{} // Injected fault value, here is abc

 parent *terms
 fp     *Failpoint
}

We used return to execute the fault, but there are also six other options:

off: Take no action (does not trigger failpoint code)
return: Trigger failpoint with specified argument
sleep: Sleep the specified number of milliseconds
panic: Panic
break: Execute gdb and break into debugger
print: Print failpoint path for inject variable
pause: Pause will pause until the failpoint is disabled

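The overall hierarchy is as follows: Failpoints holds the reg map, which maps each failpoint name to a Failpoint; each Failpoint holds a terms value; and terms holds a chain of term entries, each describing one plan.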

Next, let’s look at Enable:

func (fp *Failpoint) Enable(inTerms string) error {
 t, err := newTerms(inTerms, fp)
 if err != nil {
  return err
 }
 fp.mu.Lock()
 fp.t = t
 fp.waitChan = make(chan struct{})
 fp.mu.Unlock()
 return nil
}

Enable mainly calls newTerms to build the terms structure:

func newTerms(desc string, fp *Failpoint) (*terms, error) {
 // Parse the incoming strategy
 chain, err := parse(desc, fp)
 if err != nil {
  return nil, err
 }
 t := &terms{chain: chain, desc: desc}
 for _, c := range chain {
  c.parent = t
 }
 return t, nil
}

It parses the incoming strategy through parse and constructs terms to return.

Fault Execution

When we run the fault code, we will execute failpoint.Eval, and then determine whether to execute the fault function based on whether it returns an error.

The Eval function will call the Eval method of Failpoints:

func (fps *Failpoints) Eval(failpath string) (Value, error) {
 fps.mu.RLock()
 // Get the registered Failpoint
 fp, found := fps.reg[failpath]
 fps.mu.RUnlock()
 if !found {
  return nil, errors.Wrapf(ErrNotExist, "error on %s", failpath)
 }
 // Execute plan judgment
 val, err := fp.Eval()
 if err != nil {
  return nil, errors.Wrapf(err, "error on %s", failpath)
 }
 return val, nil
}

The reg map used in this Eval method is the one populated by the init function mentioned above. Eval looks up the Failpoint in it and calls that Failpoint's Eval method.

The Eval method will call the eval method of terms to traverse the chain []*term field, obtaining the set plan and calling the allow method to verify if it passes. If so, it will call the do method to execute the corresponding behavior.
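
The lookup-then-evaluate flow described above can be modeled with a few lines of Go. The sketch below is deliberately simplified and every name in it (registry, term, allow, eval, the two error values) is hypothetical: it supports only a count modifier and a return value, and it guards everything with a single mutex instead of the separate locks used by the library.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	errNotExist   = errors.New("failpoint does not exist")
	errNotAllowed = errors.New("no term allowed the failpoint to fire")
)

// term is a simplified, hypothetical plan entry: it may fire a limited
// number of times and yields a fixed value when it does.
type term struct {
	remaining int         // times this term may still fire; -1 means unlimited
	value     interface{} // value handed to the injected code
}

// allow reports whether this term may fire now and consumes one use.
func (t *term) allow() bool {
	if t.remaining == 0 {
		return false
	}
	if t.remaining > 0 {
		t.remaining--
	}
	return true
}

// registry maps failpoint names to their chain of terms.
type registry struct {
	mu  sync.Mutex
	reg map[string][]*term
}

// eval looks up the failpoint and returns the value of the first term that allows it.
func (r *registry) eval(name string) (interface{}, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	chain, found := r.reg[name]
	if !found {
		return nil, errNotExist
	}
	for _, t := range chain {
		if t.allow() {
			return t.value, nil
		}
	}
	return nil, errNotAllowed
}

func main() {
	r := &registry{reg: map[string][]*term{
		"main/testValue": {{remaining: 2, value: "abc"}},
	}}
	for i := 0; i < 5; i++ {
		// Same shape as the generated code: the body runs only when err == nil.
		if v, err := r.eval("main/testValue"); err == nil {
			fmt.Println(v)
		}
	}
}
```

Running it prints abc twice and then nothing, because every later evaluation returns an error. An unregistered name also returns an error, which is why generated code for a failpoint that was never enabled simply skips its body.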

Conclusion

In this article, we first learned how to use failpoint in our code, and then how failpoint implements fault injection through code rewriting, which involves traversing and modifying the Go AST as well as code generation. This also suggests an approach for our own future work: adding extra functionality to a program through code generation.
