Concept of Tail Calls in Linux eBPF

Tail calls

A tail call is a mechanism that allows eBPF developers to split logic into multiple parts and jump from one part to another. Unlike traditional function calls, the control flow never returns to the code that initiated the tail call; it works more like a <span>goto</span> statement.

To use tail calls, developers need to add a <span>BPF_MAP_TYPE_PROG_ARRAY</span> type mapping in the program. Under certain conditions, this mapping can be populated with references to other programs. The program can then use the <span>bpf_tail_call</span> helper to invoke the actual tail call, passing in a reference to the mapping and an index.

A common use case for tail calls is to distribute “complexity” across multiple programs. Each target of a tail call is treated as an independent eBPF program that starts with an empty stack and has only one context located in the R1 register. Therefore, each program must independently pass the verifier’s checks and has its own complexity budget. Now, we can split the program into multiple parts, significantly increasing the program’s complexity.

Another use case is for replacing or extending logic. The contents of the program array can be replaced during its use. For example, updating the program version without interrupting service, or enabling/disabling certain logic.

Limitations

To prevent infinite loops or excessively long-running programs, the kernel limits the number of allowed tail calls per initial call to 32, meaning a total of up to 33 programs can be executed before the tail call helper refuses to continue jumping.

If a program array is associated with a certain program, any program added to that mapping should “match” the original program. This means they must have the same <span>prog_type</span> (program type), <span>expected_attach_type</span> (expected attach type), <span>attach_btf_id</span> (attached BTF ID), etc.

Although tail calls share the same stack frame, the verifier will prevent you from directly using uninitialized stack state or register contents. Therefore, there is no direct way to share state. A common solution to this problem is to use opaque fields in metadata (such as <span>__sk_buff->cb</span> or <span>xdp_md->data_meta</span> memory). Alternatively, a single-element per-CPU mapping can be used to pass data. This approach works because eBPF programs do not migrate to other CPU cores even between tail calls. However, in real-time (RT) kernels, eBPF programs may be interrupted and resumed later, so these mappings should only be shared between tail calls of the same task, not globally.

When tail calls are combined with BPF-to-BPF function calls, the stack size available to each program will be reduced from 512 bytes to 256 bytes. This is to limit the stack allocation required by the kernel, as explained in the following kernel comment:

To prevent stack overflow that may occur when BPF-to-BPF calls are combined with tail calls. In this case, the caller’s stack depth is limited to 256 bytes, so in the worst case, the stack size will reach 8KB (tail call limit of 32 * 256 bytes = 8KB).

To understand what might happen, consider the following example:

func1 -&gt; sub rsp, 128
 subfunc1 -&gt; sub rsp, 256
 tailcall1 -&gt; add rsp, 256  // Release subfunc1's stack, but func1's stack remains
  func2 -&gt; sub rsp, 192     // Total stack = 128 + 192 = 320
  subfunc2 -&gt; sub rsp, 64
  subfunc22 -&gt; sub rsp, 128
  tailcall2 -&gt; add rsp, 128   // Release subfunc22/subfunc2's stack
   func3 -&gt; sub rsp, 32       // Total stack = 128 + 192 + 64 + 32 = 416

As shown above, tail calls release the stack frame of the current function but do not clear the stack space of earlier functions in the call chain, so stack usage may continue to accumulate.

Source

https://docs.ebpf.io/linux/concepts/tail-calls/

Last updated: March 23, 2025

Leave a Comment