yuqi-zheng

Compiler Optimization: Inlining, the Master Optimization


Most compiler optimizations work on a single function at a time. They rearrange instructions, eliminate redundant computations, and allocate registers efficiently, but only within the scope they can see. Inlining expands that scope. When a function call is replaced by a copy of the callee’s body, the optimizer gains visibility into what that code actually does, and a whole class of transformations that were previously blocked becomes possible.

This is why inlining is often called the master optimization. It does not merely remove the cost of a call instruction; it is the prerequisite that enables nearly every other high-level optimization to apply.

What Inlining Actually Does

When a compiler inlines a function, it replaces the call with a copy of the function body, substituting the actual arguments for the formal parameters. The call and ret instructions disappear, along with the associated stack frame setup. But this is the least interesting part of the benefit.

The more important effect is that the optimizer can now reason across what was previously a function boundary. Consider:

void process(bool flag) {
    if (flag)
        do_fast_path();
    else
        do_slow_path();
}

If do_fast_path and do_slow_path are not inlined, they are black boxes to the optimizer. The compiler must assume they can do anything: modify global state, call into other functions, read and write memory. It cannot reason about their contents, cannot eliminate computations that become redundant after the call, and cannot propagate the value of flag into either function to simplify their logic.

After inlining, all of that reasoning becomes possible.
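The "black box" problem can be illustrated with a minimal sketch. The names g_total, add_one, and twice are hypothetical and not taken from any real codebase; the comments describe what an optimizer is permitted to do, not what a particular compiler is guaranteed to emit.

```cpp
#include <cassert>

int g_total = 0;  // global state that a call might modify

// Hypothetical helper. Because its body is visible here, the optimizer
// can see that it touches g_total and nothing else.
static void add_one() { g_total += 1; }

int twice() {
    g_total = 10;
    add_one();
    add_one();
    // If add_one() were only declared here and defined in another file,
    // the compiler would have to store g_total before each call and
    // reload it afterward. With both calls inlined, it is free to fold
    // the whole function down to "g_total = 12; return 12;".
    return g_total;
}

// Usage check: the observable result is the same either way.
void check_twice() {
    assert(twice() == 12);
    assert(g_total == 12);
}
```

The behavior is identical with or without inlining; only the amount of work the generated code has to do differs.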

The Chain of Consequences

Inlining tends to unlock a cascade of subsequent optimizations. The typical sequence looks like this:

Constant argument propagation. When the compiler sees the function body with the actual argument substituted in, a parameter that was a variable at the call site may become a literal constant inside the inlined body. An if (n == 0) inside the callee becomes if (4 == 0) after inlining a call with n=4, and the optimizer evaluates it immediately.

Dead code elimination. Once a condition is known to be always true or always false, the branch that cannot be taken is removed entirely. Code that was needed only for the eliminated branch (variables, allocations, loops) is deleted as well.
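The first two steps can be traced by hand. The following sketch uses a hypothetical safe_div helper and writes out each stage as a separate function; a real compiler performs these steps on its intermediate representation rather than on source code.

```cpp
#include <cassert>

// Hypothetical callee with a guard on its parameter.
static int safe_div(int total, int n) {
    if (n == 0)
        return 0;
    return total / n;
}

// Stage 0: the call as written, with a constant second argument.
int average_of_four(int total) { return safe_div(total, 4); }

// Stage 1: the body is substituted and n is replaced by 4.
int stage1(int total) {
    if (4 == 0)          // now a compile-time constant: always false
        return 0;
    return total / 4;
}

// Stage 2: the condition is folded and the dead branch is removed.
int stage2(int total) { return total / 4; }

// Usage check: all three stages compute the same result.
void check_stages() {
    assert(average_of_four(20) == 5);
    assert(stage1(20) == 5);
    assert(stage2(20) == 5);
    assert(stage2(-7) == average_of_four(-7));
}
```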

Loop simplification and vectorization. A loop inside the callee may contain conditional logic that depends on an argument. After constant folding and dead code elimination, that logic may disappear, leaving a simple loop body that the auto-vectorizer can transform into SIMD instructions.
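As a rough illustration, assume a hypothetical scale function whose loop body branches on a flag that is fixed for the whole call. The second version below is what the loop can reduce to once the call is inlined with the flag known to be false; whether a given compiler actually vectorizes it depends on the target and flags.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical callee: the per-element branch depends on a parameter.
static void scale(float* data, std::size_t n, float factor, bool clamp) {
    for (std::size_t i = 0; i < n; ++i) {
        float v = data[i] * factor;
        if (clamp && v > 1.0f)
            v = 1.0f;
        data[i] = v;
    }
}

// Caller passes constants for factor and clamp.
void scale_no_clamp(float* data, std::size_t n) {
    scale(data, n, 0.5f, false);
}

// Hand-written equivalent of the loop after inlining, constant
// propagation, and dead code elimination: a branch-free body.
void scale_no_clamp_simplified(float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= 0.5f;
}

// Usage check: both versions produce the same array.
void check_scale() {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    scale_no_clamp(a, 4);
    scale_no_clamp_simplified(b, 4);
    for (int i = 0; i < 4; ++i)
        assert(a[i] == b[i]);
    assert(a[3] == 2.0f);
}
```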

Register allocation improvement. A function call creates a boundary: the compiler must assume the callee uses and modifies registers according to the calling convention. Values held in caller-saved registers before a call must be spilled to the stack, or moved into callee-saved registers, if they are needed afterward. After inlining, the combined code is one function body, and the register allocator can work across the entire computation without these artificial spill points.

None of these consequences happen if the call is not inlined. The function body is opaque, the arguments are lost, and the optimizer sees only the call instruction.

How the Compiler Decides

Modern compilers use heuristic cost models to decide whether to inline a given call site. The factors typically considered include:

  • Callee size: smaller functions are almost always inlined; very large functions rarely are
  • Call frequency: calls in tight loops are weighted more heavily than calls executed once
  • Whether arguments are constants: inlining is more attractive when it enables constant folding
  • Presence of loops or recursion inside the callee: a recursive function cannot be fully inlined into itself, so compilers expand at most a bounded number of levels
  • Target architecture characteristics: some architectures penalize branch mispredictions more than others, affecting the value of inlining
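To make the shape of such a heuristic concrete, here is a deliberately simplified toy model. Everything in it is an assumption made for illustration: the CallSite fields, the thresholds, and the blanket refusal of recursive callees do not correspond to the cost model of GCC, Clang, MSVC, or any other real compiler.

```cpp
#include <cassert>

// Hypothetical summary of one call site.
struct CallSite {
    int  callee_size;          // rough instruction count of the callee
    bool in_hot_loop;          // is the call inside a frequently run loop?
    int  constant_args;        // how many arguments are compile-time constants
    bool callee_is_recursive;  // does the callee call itself?
};

// Toy decision: compare callee size against a budget that grows with
// call frequency and with the number of constant arguments.
bool should_inline(const CallSite& cs) {
    if (cs.callee_is_recursive)
        return false;                  // simplification: never inline recursion
    int budget = 40;                   // made-up base threshold
    if (cs.in_hot_loop)
        budget *= 2;                   // hot calls get a larger budget
    budget += 10 * cs.constant_args;   // constants promise later folding
    return cs.callee_size <= budget;
}

// Usage check on a few hand-picked call sites.
void check_cost_model() {
    assert(should_inline(CallSite{10, false, 0, false}));    // small callee
    assert(!should_inline(CallSite{100, false, 0, false}));  // too large
    assert(should_inline(CallSite{70, true, 0, false}));     // hot: budget 80
    assert(should_inline(CallSite{55, false, 2, false}));    // constants: budget 60
    assert(!should_inline(CallSite{10, false, 0, true}));    // recursive
}
```

Real cost models weigh many more signals and are tuned empirically, but the basic structure of a size estimate compared against a context-dependent threshold is a reasonable mental model.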

The inline keyword in C++ does not force inlining. Its guaranteed effect concerns the one-definition rule: an inline function may be defined in a header and included in many translation units without causing multiple-definition errors at link time. As a request to inline it is only a hint, which the compiler is free to ignore and will ignore if the cost model says inlining is a bad idea.
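A small sketch of what the keyword does guarantee. The header name math_utils.h and the functions are hypothetical; the example is collapsed into one file so that it compiles on its own.

```cpp
#include <cassert>

// --- contents of a hypothetical header, math_utils.h ---
// `inline` lets this definition appear in every .cpp file that includes
// the header. Without it, two including files would each emit a
// definition of square() and the link would fail with a
// multiple-definition error.
inline int square(int x) { return x * x; }
// --- end of header ---

// A caller. Whether square() is actually expanded here is still the
// optimizer's decision, not something the keyword controls.
int sum_of_squares(int a, int b) { return square(a) + square(b); }

// Usage check
void check_squares() {
    assert(sum_of_squares(3, 4) == 25);
}
```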

A fundamental limit of per-translation-unit compilation is that the compiler cannot inline a call to a function defined in another .cpp file. The function body is compiled separately and unavailable at the call site.

Link-Time Optimization (LTO) removes this restriction. With -flto on GCC or Clang, function bodies from all translation units are preserved in the object files in an intermediate representation, and the linker performs a whole-program optimization pass that can inline across module boundaries.

LTO can have substantial impact on programs structured with many small helper functions spread across multiple files. Enabling it requires no changes to source code, only to the build: the flag must be passed at both the compile and the link step.
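The situation LTO addresses can be sketched as follows. The file names helper.cpp and main.cpp and both functions are hypothetical, and the two "files" are shown in a single block so the example compiles as is. The commands in the comments use only the -flto flag mentioned above, with g++ assumed as the compiler driver.

```cpp
#include <cassert>

// ---- helper.cpp (hypothetical) ----
int clamp_to_byte(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

// ---- main.cpp (hypothetical) ----
// In a real project this file would see only the declaration below, so
// without LTO the call in brighten() stays an out-of-line call.
int clamp_to_byte(int v);

int brighten(int pixel) { return clamp_to_byte(pixel + 16); }

// Build sketch with LTO, assuming the two files are really separate:
//   g++ -O2 -flto -c helper.cpp main.cpp
//   g++ -O2 -flto helper.o main.o -o app
// At the link step the optimizer sees both bodies and may inline
// clamp_to_byte() into brighten().

// Usage check
void check_brighten() {
    assert(brighten(100) == 116);
    assert(brighten(250) == 255);
    assert(brighten(-20) == 0);
}
```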

Template Functions and Headers

C++ template functions are instantiated in each translation unit that uses them, and they are typically defined in header files. This means the compiler always has the function body available at call sites and can inline freely. The same applies to constexpr functions.

This is one reason why C++ standard library implementations put so much code in headers: inline-ability at the call site is a core part of achieving good performance from abstractions like std::vector, std::span, and std::string_view.
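As an illustration, the two helpers below are hypothetical header-style functions, written assuming C++17. Because their bodies are visible wherever they are used, the compiler can see through the call completely; the static_assert lines demonstrate this by having it evaluate the calls at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical header-only template: instantiated wherever it is used.
template <typename T>
constexpr T clamp_between(T v, T lo, T hi) {
    return v < lo ? lo : (hi < v ? hi : v);
}

// Hypothetical constexpr helper built on std::string_view.
constexpr std::size_t count_char(std::string_view s, char c) {
    std::size_t n = 0;
    for (char ch : s)
        if (ch == c)
            ++n;
    return n;
}

// With the bodies visible, the calls can be evaluated by the compiler.
static_assert(clamp_between(15, 0, 10) == 10);
static_assert(count_char("inlining", 'n') == 3);

// Usage check with ordinary runtime calls.
void check_header_helpers() {
    assert(clamp_between(-3, 0, 10) == 0);
    assert(count_char("banana", 'a') == 3);
}
```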

The Cost of Inlining

Inlining is not free. Duplicating function bodies increases code size, which can increase instruction cache pressure. A function called from many places that is inlined everywhere may make the overall binary larger in ways that slow things down more than the inlining helps.

The compiler tries to balance this automatically. Developers should resist manually marking functions __attribute__((always_inline)) or [[msvc::forceinline]] without profiling evidence that the compiler’s default decision is wrong. These annotations are occasionally necessary but more often reflect a misunderstanding of what is limiting performance.
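For the rare case where profiling does justify an override, projects often wrap the compiler-specific spellings in a macro. The sketch below is one possible way to do that; the macro name FORCE_INLINE and the function lerp_fixed are hypothetical, and it uses MSVC's older __forceinline keyword rather than the attribute form.

```cpp
#include <cassert>

// Pick the compiler-specific spelling, falling back to a plain hint.
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline
#endif

// Hypothetical hot-path helper: fixed-point interpolation between a and
// b, where t256 ranges from 0 (all a) to 256 (all b).
FORCE_INLINE int lerp_fixed(int a, int b, int t256) {
    return a + ((b - a) * t256) / 256;
}

// Usage check
void check_lerp() {
    assert(lerp_fixed(10, 20, 0) == 10);
    assert(lerp_fixed(10, 20, 256) == 20);
    assert(lerp_fixed(0, 256, 128) == 128);
}
```

Keeping the annotation behind a single macro also makes it easy to remove later and measure whether it was helping at all.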

Practical Guidelines

Keep functions small and focused. Small functions are easier to inline and the optimizer can do more with them after inlining. A function that does exactly one thing has fewer interactions with its context that could block transformations.

Put template and constexpr functions in headers. This ensures the compiler always has the body available. For non-template functions that are performance-critical and called from many places, consider putting them in headers as well, or enabling LTO to get cross-module inlining.

Enable LTO for production builds. On GCC and Clang, -flto enables whole-program inlining. Compilation and linking take longer, but the resulting binary is often measurably faster than one built one translation unit at a time.

Use Profile-Guided Optimization (PGO) for large programs. PGO feeds runtime frequency data back into the compiler, letting it make better decisions about which call sites are worth inlining. Hot paths get aggressively inlined; cold paths are left as calls to save code size.

Do not trust the inline keyword to force inlining. If you need to verify that a specific function is inlined, look at the assembly output. Compiler Explorer makes this straightforward.
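One way to make the assembly easier to read is to compare an ordinary function against a copy that is explicitly kept out of line. This is a sketch assuming GCC or Clang, since the [[gnu::noinline]] attribute is specific to them; the function names are hypothetical.

```cpp
#include <cassert>

// Ordinary small function: a likely candidate for inlining at -O2.
static int add3(int x) { return x + 3; }

// Same body, but GCC and Clang are told never to inline it.
[[gnu::noinline]] static int add3_kept(int x) { return x + 3; }

int use_both(int x) { return add3(x) + add3_kept(x); }

// Ways to inspect the decision:
//   g++ -O2 -S file.cpp                  writes assembly to file.s
//   clang++ -O2 -Rpass=inline -Rpass-missed=inline -c file.cpp
//   g++ -O2 -fopt-info-inline-missed -c file.cpp
// In the assembly for use_both one would expect a call to add3_kept
// and no call to add3, though the exact output varies by compiler.

// Usage check: inlining never changes the result.
void check_use_both() {
    assert(use_both(1) == 8);
}
```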

Summary

Inlining is best understood not as an optimization in itself but as the enabling condition for many of the others. By copying a function’s body into its call sites, the compiler gains the wider view it needs to fold constants, eliminate dead code, vectorize loops, and allocate registers across what was previously a hard boundary. Almost everything else discussed in a series on compiler optimizations either works better after inlining or only becomes possible because of it.