yuqi-zheng

Ray Async Internals (3): Thread Pool and Periodic Timer


Part 3 of the Ray async infrastructure series. ← Part 2: Observability and chaos testing · Part 4: gRPC and Asio bridging →


Two components underpin almost all background work in Ray: IOServicePool, which manages a fleet of io_context threads, and PeriodicalRunner, which schedules repeating callbacks on top of them. This post covers both.


IOServicePool: Event Loop Thread Pool

Structure

┌─────────────────────────────────────────────┐
│               IOServicePool                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │io_ctx[0] │  │io_ctx[1] │  │io_ctx[N] │   │
│  │ Thread 0 │  │ Thread 1 │  │ Thread N │   │
│  └──────────┘  └──────────┘  └──────────┘   │
└─────────────────────────────────────────────┘

Each io_context runs on its own dedicated thread. Handlers posted to a given io_context execute serially on that thread — so code running inside a handler needs no locks, as long as it only touches state owned by that io_context.
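To make the serialization claim concrete, here is a minimal sketch that uses only the standard library. MiniLoop and CountOnLoop are hypothetical names invented for this illustration; MiniLoop stands in for one io_context plus its dedicated thread and is not how Ray or Boost.Asio implement it. Several producer threads post increments, yet the counter itself needs no lock because only the loop thread ever touches it.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for one io_context plus its dedicated thread.
class MiniLoop {
 public:
  MiniLoop() : worker_([this] { Run(); }) {}
  ~MiniLoop() { Stop(); }

  // Enqueue a handler; it runs later on the loop's own thread.
  void Post(std::function<void()> handler) {
    assert(handler && "handler must be callable");
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push_back(std::move(handler));
    }
    cv_.notify_one();
  }

  // Let the worker drain whatever is queued, then join it.
  void Stop() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stopping_ = true;
    }
    cv_.notify_one();
    if (worker_.joinable()) worker_.join();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> handler;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
        if (queue_.empty()) return;  // stopping and fully drained
        handler = std::move(queue_.front());
        queue_.pop_front();
      }
      handler();  // runs outside the queue lock, one handler at a time
    }
  }

  std::mutex mu_;  // guards the queue only, never the handlers' state
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool stopping_ = false;
  std::thread worker_;  // declared last so the other members exist first
};

// Many producer threads post increments; the counter is deliberately unguarded.
int CountOnLoop(int producers, int posts_per_producer) {
  int counter = 0;  // touched only by handlers running on the loop thread
  {
    MiniLoop loop;
    std::vector<std::thread> threads;
    for (int p = 0; p < producers; ++p) {
      threads.emplace_back([&] {
        for (int i = 0; i < posts_per_producer; ++i) {
          loop.Post([&counter] { ++counter; });
        }
      });
    }
    for (auto& t : threads) t.join();
  }  // ~MiniLoop drains the queue and joins the worker
  return counter;
}
```

The only mutex protects the handoff queue. The state the handlers modify stays lock-free in the informal sense used here, provided nothing outside the loop thread reads or writes it while the loop is running.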

Setup

IOServicePool pool(4);  // create 4 io_contexts, each on its own thread
pool.Run();             // start all threads (work_guard keeps each alive)

Two Dispatch Strategies

Method      Strategy      Use case
Get()       Round-robin   Stateless short tasks; spreads load evenly
Get(hash)   Hash-pinned   Objects requiring thread affinity

The hash-pinned strategy is the important one. When all callbacks for a given connection are routed to the same io_context, they execute in submission order on the same thread. No mutex needed — the io_context itself provides the serialization guarantee.

Example: if a gRPC connection has a 64-bit ID, pool.Get(connection_id) always returns the same io_context for that connection. All incoming requests from that connection are processed serially, which is exactly what you want for stateful protocol handling.
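The two strategies differ only in how a slot index is chosen. The sketch below models that choice in isolation. SlotPicker is a hypothetical class, and the use of a plain modulo for the hash-pinned case is an assumption made for illustration; the post does not show how IOServicePool maps a hash to an io_context.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical index selector. It models only the choice of slot,
// not Ray's IOServicePool.
class SlotPicker {
 public:
  explicit SlotPicker(std::size_t slots) : slots_(slots) { assert(slots > 0); }

  // Round-robin: each call advances to the next slot, wrapping around.
  std::size_t Next() {
    return next_.fetch_add(1, std::memory_order_relaxed) % slots_;
  }

  // Hash-pinned: the same key always maps to the same slot.
  // Assumption: a plain modulo stands in for the real mapping.
  std::size_t Pinned(std::uint64_t key) const { return key % slots_; }

 private:
  std::size_t slots_;
  std::atomic<std::size_t> next_{0};
};
```

With four slots, repeated calls to Next() cycle through 0, 1, 2, 3 and back to 0, while Pinned(42) yields slot 2 on every call. That stability is what gives a connection its thread affinity.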


PeriodicalRunner: Periodic Task Scheduler

Basic Usage

auto runner = PeriodicalRunner::Create(io_context);
runner->RunFnPeriodically([] { SendHeartbeat(); }, /*interval_ms=*/1000, "Heartbeat");

Schedules SendHeartbeat to run every 1000ms on the given io_context.

Lifetime Safety

PeriodicalRunner inherits std::enable_shared_from_this<PeriodicalRunner> and captures weak_from_this() in its timer callback:

timer->async_wait(
    [weak_self = weak_from_this(), fn, period, timer](const auto& error) {
        if (auto self = weak_self.lock()) {
            if (error == boost::asio::error::operation_aborted) return;
            fn();
            self->DoRunFnPeriodically(fn, period, timer);
        }
        // lock() returned null — runner was destroyed, loop stops silently
    });

When the caller drops its shared_ptr<PeriodicalRunner>, the next timer fires, lock() returns null, and the periodic loop terminates — no crash, no leak, no explicit cancellation API needed.

Note that timer is also captured by value (as a shared_ptr). This keeps the timer alive until the final callback sees the runner is gone and exits. Without this, the timer could be destroyed before its last callback fires.

This is the weak_from_this pattern from the C++ lifetime safety series applied to a real production scheduler.
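The lifetime behavior can be reproduced without Asio. In the sketch below, FakeTimerQueue and MiniRunner are hypothetical names: the queue stands in for an io_context with timers, and FireAll() simulates one timer expiry. It omits the timer object and the operation_aborted check from the snippet above, so it demonstrates only the weak-reference part of the pattern. It needs C++17 for weak_from_this().

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an io_context with timers: callbacks wait in a
// list until FireAll() simulates one timer expiry.
class FakeTimerQueue {
 public:
  void Schedule(std::function<void()> cb) { pending_.push_back(std::move(cb)); }

  // Run every callback that was pending when the call started. Callbacks
  // that re-arm themselves land in the fresh pending_ list for the next tick.
  void FireAll() {
    std::vector<std::function<void()>> due;
    due.swap(pending_);
    for (auto& cb : due) cb();
  }

  std::size_t PendingCount() const { return pending_.size(); }

 private:
  std::vector<std::function<void()>> pending_;
};

class MiniRunner : public std::enable_shared_from_this<MiniRunner> {
 public:
  static std::shared_ptr<MiniRunner> Create(FakeTimerQueue& queue) {
    return std::shared_ptr<MiniRunner>(new MiniRunner(queue));
  }

  void RunFnPeriodically(std::function<void()> fn) {
    // The callback holds only a weak reference, so a pending callback
    // cannot keep the runner alive on its own.
    queue_.Schedule([weak_self = weak_from_this(), fn = std::move(fn)] {
      if (auto self = weak_self.lock()) {
        fn();
        self->RunFnPeriodically(fn);  // re-arm for the next tick
      }
      // lock() failed: the runner is gone, so nothing is re-armed.
    });
  }

 private:
  explicit MiniRunner(FakeTimerQueue& queue) : queue_(queue) {}
  FakeTimerQueue& queue_;  // must outlive the runner
};
```

A caller that creates a runner, fires the queue twice, resets its shared_ptr, and fires once more sees the function run exactly twice. The third expiry finds the runner gone and leaves the queue empty.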

Instrumented Execution

When the global event_stats flag is enabled, PeriodicalRunner records the queue wait time between timer expiry and actual callback execution — adding another observability data point on top of the lag metric from Part 2.
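The measurement itself is simple to sketch: remember when the callback was due, and compare against the clock when it actually starts. WithQueueWait and its record parameter are hypothetical names for this illustration. How Ray stores or reports the value is not shown in this post, so nothing here should be read as Ray's API.

```cpp
#include <chrono>
#include <functional>
#include <utility>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: wraps a callback so that, when it finally runs, it
// first reports how long it waited past its intended deadline.
std::function<void()> WithQueueWait(
    Clock::time_point deadline,
    std::function<void()> fn,
    std::function<void(std::chrono::nanoseconds)> record) {
  return [deadline, fn = std::move(fn), record = std::move(record)] {
    auto waited = Clock::now() - deadline;
    // A callback that starts early has waited zero time, not negative time.
    if (waited < Clock::duration::zero()) waited = Clock::duration::zero();
    record(std::chrono::duration_cast<std::chrono::nanoseconds>(waited));
    fn();
  };
}
```

A steady clock is used so that wall-clock adjustments cannot distort the measurement. A consistently large value suggests that other handlers on the same event loop are running long.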


How They Compose

IOServicePool::Get(hash) → instrumented_io_context*
                       ↓
                       PeriodicalRunner::Create(*io_context)
                       ↓
                       runner->RunFnPeriodically(...)

IOServicePool provides the threads; PeriodicalRunner is the clock that runs on one of them. Nearly all background periodic work in Ray — heartbeating, metric reporting, garbage collection — runs through this combination.


Summary

  • IOServicePool provides concurrency without locks on per-context state: multiple threads, each running an independent io_context, with hash-pinning for thread affinity.
  • PeriodicalRunner uses weak_from_this() for safe lifetime management: once the owner releases the runner, the next timer callback finds it gone and the loop stops.
  • Together they are the scheduling substrate for Ray’s background infrastructure.

Next: gRPC and Asio bridging →