yuqi-zheng

Compiler Optimization: Tail Call Optimization


A tail call is a function call that occurs as the very last action of a function —there is nothing left to do after it returns. When a compiler recognizes this pattern, it can replace the call with a jump, reusing the current stack frame rather than allocating a new one. The result is that a recursive function can execute with the same stack depth as a loop.

The Mechanics: GCD

The Euclidean algorithm for greatest common divisor is a clean example:

unsigned int gcd(unsigned int a, unsigned int b) {
    if (b == 0) return a;
    return gcd(b, a % b);
}

The recursive call gcd(b, a % b) is a tail call: gcd does nothing after the call returns, it simply passes the result back. A compiler can recognize that the current frame is no longer needed at the point of that call.

GCC and Clang generate something like this at -O2:

gcd:
    mov eax, edi          ; eax = a
    test esi, esi
    je .done              ; if b == 0, return a
    mov edx, esi
.loop:
    mov ecx, edx          ; ecx = b
    xor edx, edx
    div ecx               ; eax = a / b, edx = a % b
    mov eax, ecx          ; a = b
    test edx, edx
    jne .loop             ; if remainder != 0, continue
    mov eax, ecx
.done:
    ret

There is no call instruction. The recursion has been entirely replaced by a loop. Stack depth stays constant regardless of how many iterations the algorithm requires.

Conditions for TCO

Not every recursive call qualifies. The compiler can only apply tail call optimization when:

  1. The call is the last operation before the function returns.
  2. The return value of the callee is returned directly (not transformed).
  3. The current frame’s local variables are not needed after the call.

The cleaner the structure, the more likely TCO applies. Adding computation after the recursive call —even something as small as return 1 + gcd(b, a % b) —makes the call non-tail, because the addition needs to happen after the recursive call returns.

Tail Calls to Other Functions

TCO applies to calls to other functions, not just self-recursion. If function f ends with return g(x), and g has the same or compatible calling convention, the compiler can generate a jmp to g rather than a call. The caller of f sees g’s return value directly.

This is useful in state machines and coroutine-like patterns, where control transfers between handlers without the stack growing.

Guaranteeing TCO: clang::musttail

Compilers apply TCO as an optional optimization. In debug builds or with optimization disabled, it typically does not happen. If your program’s correctness depends on tail calls being eliminated —for example, a state machine with deeply recursive transfer chains —you need a guarantee.

Clang provides the [[clang::musttail]] attribute:

[[clang::musttail]] return next_handler(state);

If the compiler cannot apply TCO to this call, it emits an error rather than silently generating a regular call. This is a compile-time safety net for code where stack overflow would be a correctness issue, not just a performance one.

Parameter Order Affects Efficiency

When a function tail-calls itself with modified arguments, the compiler needs to update the argument registers before jumping. If the new arguments happen to be in the right registers already, the jump is free. If not, the compiler must shuffle them, which may require temporaries or additional moves.

For the GCD example, the two arguments are b and a % b. After the division, b is in ecx and the remainder is in edx. The compiler needs to put these in edi and esi for the next iteration. This adds a small amount of register movement, but it is still far cheaper than a function call and stack frame allocation.

When designing recursive functions that you expect to be tail-call optimized, consider whether the argument order can be arranged to minimize register shuffling.

Application: Interpreter Dispatch

A more advanced use of TCO appears in high-performance interpreters and CPU emulators. The conventional structure for a bytecode interpreter is:

while (running) {
    switch (fetch_opcode()) {
        case OP_ADD: execute_add(); break;
        case OP_JUMP: execute_jump(); break;
        // ...
    }
}

This has a single indirect branch point —the switch —which must predict among all possible opcodes. A branch predictor sees a stream of wildly varying targets at a single program counter, which is difficult to predict well.

An alternative using TCO:

void execute_add(State* s) {
    // perform add
    [[clang::musttail]] return dispatch(s);
}

void execute_jump(State* s) {
    // perform jump
    [[clang::musttail]] return dispatch(s);
}

Where dispatch fetches the next opcode and tail-calls the appropriate handler. With TCO applied, each handler ends with an indirect jump from its own call site. The branch predictor now sees one indirect branch per opcode handler —and crucially, each has its own prediction history. A CMP instruction that is almost always followed by BNE lets the predictor for the execute_cmp handler specialize for that pattern.

This technique is sometimes called “threaded code” or “computed goto dispatch” and is used in implementations of Python, Ruby, and various emulators for this reason.

Summary

Tail call optimization converts recursive structure into iterative execution, eliminating stack growth and enabling the same CPU-level optimizations available to loops. It applies automatically when the call structure qualifies, and can be enforced with [[clang::musttail]] when correctness demands it. Beyond simple recursion elimination, TCO enables architectural patterns —like per-opcode dispatch in interpreters —that would otherwise require manual assembly or platform-specific tricks.