Ray Async Internals (1): Asio's Role and `instrumented_io_context`

Part 1 of the Ray async infrastructure series. ← Series intro: Ray Core Architecture

The Non-Obvious Role of Asio in Ray

Most people reach for Boost.Asio when they need async network I/O. Ray uses it for something different: serializing business logic onto a single thread, eliminating locks from the most performance-critical paths.

The actual layering in Ray:

┌─────────────────────────────────┐
│ Business logic (GcsNodeManager) │ ← lock-free, single-threaded
├─────────────────────────────────┤
│ instrumented_io_context (Asio)  │ ← event queue + single-thread driver
├─────────────────────────────────┤
│ gRPC (CompletionQueue threads)  │ ← multi-threaded network I/O
└─────────────────────────────────┘

gRPC handles concurrent network I/O across many threads. When a request arrives, the gRPC thread doesn’t process it — it posts a lambda to an Asio io_context. A single dedicated thread drives that io_context, executing handlers one at a time. Business logic never runs concurrently, so it needs no locks.
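The hand-off described above can be sketched without Asio or gRPC. What follows is a toy model using only the C++ standard library, not Ray's code: `SerialLoop`, `Post`, `Shutdown`, and `CountRequests` are hypothetical names, the worker threads stand in for gRPC completion-queue threads, and the mutex-protected queue stands in for the io_context. The point is only that the state touched by handlers (`handled`) needs no lock, because a single thread ever runs them.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Toy stand-in for an io_context: a locked queue drained by one thread.
class SerialLoop {
 public:
  // Safe to call from any thread; only the queue itself is locked.
  void Post(std::function<void()> handler) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push_back(std::move(handler));
    }
    cv_.notify_one();
  }

  // Ask Run() to return once everything already queued has executed.
  void Shutdown() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      shutdown_ = true;
    }
    cv_.notify_one();
  }

  // Runs handlers one at a time on the calling thread.
  void Run() {
    for (;;) {
      std::function<void()> handler;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return shutdown_ || !queue_.empty(); });
        if (queue_.empty()) return;  // shutdown requested and queue drained
        handler = std::move(queue_.front());
        queue_.pop_front();
      }
      handler();  // executed outside the queue lock, never concurrently
    }
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool shutdown_ = false;
};

// Hypothetical "business logic": a plain counter with no mutex or atomic.
int CountRequests(int num_threads, int requests_per_thread) {
  assert(num_threads >= 0 && requests_per_thread >= 0);
  SerialLoop loop;
  int handled = 0;  // touched only by the loop thread while it runs
  std::thread loop_thread([&loop] { loop.Run(); });

  std::vector<std::thread> network_threads;  // stand-ins for gRPC threads
  for (int t = 0; t < num_threads; ++t) {
    network_threads.emplace_back([&loop, &handled, requests_per_thread] {
      for (int i = 0; i < requests_per_thread; ++i) {
        loop.Post([&handled] { ++handled; });  // hand off, do not process
      }
    });
  }
  for (auto &thread : network_threads) thread.join();
  loop.Shutdown();
  loop_thread.join();  // after this, reading `handled` is safe
  return handled;
}
```

In this sketch the only synchronization is around the queue. The increment itself is unguarded, yet `CountRequests(4, 1000)` still counts every request, because all increments happen on the loop thread.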

`instrumented_io_context`

Ray wraps boost::asio::io_context to add observability:

class instrumented_io_context : public boost::asio::io_context {
public:
  instrumented_io_context(bool emit_metrics = false,
                          bool running_on_single_thread = false,
                          std::optional<std::string> context_name = std::nullopt);
  void post(std::function<void()> handler, std::string name, int64_t delay_us = 0);
  void dispatch(std::function<void()> handler, std::string name);
  std::shared_ptr<EventTracker> stats() const;
  ray::stats::Gauge io_context_event_loop_lag_ms_gauge_metric;
};

The additions over plain io_context:

A name for each context, shown in dashboards (e.g. "node_manager_io_context")
A lag gauge metric — the time between posting a handler and executing it
An event tracker for per-handler execution statistics
Support for the chaos delay injection covered in Part 2
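To make the lag metric and the event tracker concrete, here is a minimal standard-library sketch of the underlying idea: stamp each handler when it is queued, then record its queue delay, run time, and call count under its name when it executes. `TrackedQueue`, `HandlerStats`, `DemoTrackedQueue`, and the handler names are hypothetical; Ray's actual `EventTracker` and gauge are not reproduced here, and this toy is single-threaded.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <deque>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Per-handler-name statistics (hypothetical layout).
struct HandlerStats {
  int64_t count = 0;
  int64_t total_queue_us = 0;  // summed time spent waiting in the queue
  int64_t total_run_us = 0;    // summed time spent executing
};

class TrackedQueue {
 public:
  using Clock = std::chrono::steady_clock;

  // Record the enqueue time next to the handler and its name.
  void Post(std::function<void()> handler, std::string name) {
    queue_.push_back({std::move(handler), std::move(name), Clock::now()});
  }

  // Drain the queue on the calling thread; returns how many handlers ran.
  int RunAll() {
    int ran = 0;
    while (!queue_.empty()) {
      Entry entry = std::move(queue_.front());
      queue_.pop_front();
      const auto start = Clock::now();
      entry.handler();
      const auto end = Clock::now();
      HandlerStats &stats = stats_[entry.name];
      stats.count += 1;
      stats.total_queue_us += Micros(start - entry.posted_at);  // the "lag"
      stats.total_run_us += Micros(end - start);
      ++ran;
    }
    return ran;
  }

  const std::map<std::string, HandlerStats> &stats() const { return stats_; }

 private:
  struct Entry {
    std::function<void()> handler;
    std::string name;
    Clock::time_point posted_at;
  };

  static int64_t Micros(Clock::duration d) {
    return std::chrono::duration_cast<std::chrono::microseconds>(d).count();
  }

  std::deque<Entry> queue_;
  std::map<std::string, HandlerStats> stats_;
};

// Usage: three handlers under two hypothetical names.
std::map<std::string, HandlerStats> DemoTrackedQueue() {
  TrackedQueue queue;
  int work_done = 0;
  queue.Post([&work_done] { ++work_done; }, "heartbeat");
  queue.Post([&work_done] { ++work_done; }, "heartbeat");
  queue.Post([&work_done] { work_done += 10; }, "register_node");
  const int ran = queue.RunAll();
  assert(ran == 3 && work_done == 12);
  return queue.stats();
}
```

Requiring a name at every `Post` call is what lets the statistics be grouped per handler, which is presumably why the wrapper's `post` and `dispatch` take a `name` argument.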

Constructor Parameters

Parameter	Effect
`emit_metrics`	Enable lag probe and metric reporting
`running_on_single_thread`	Concurrency hint for Asio: signals that only one thread will run the loop, so Asio can relax its internal locking
`context_name`	Name for metrics and dashboards

`post` vs `dispatch`

Both schedule a handler on the io_context, but with different timing guarantees:

Method	When it runs	Thread safety
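`post`	Always enqueued; runs after the current handler returns	Safe from any thread
`dispatch`	Runs immediately if already on the `io_context` thread; otherwise enqueues	Safe to call from any thread, but inline execution re-enters the caller: the handler runs inside the caller's stack frame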

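// Assume this code is already running on the io_context thread.
// The second argument is the handler name that instrumented_io_context requires;
// "Example.DoSomething" is a placeholder.
io_context.post([] { DoSomething(); }, "Example.DoSomething");      // queued: runs after the current handler returns
io_context.dispatch([] { DoSomething(); }, "Example.DoSomething");  // runs immediately, inline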

Use post from any thread — it’s always safe. Use dispatch as an optimization when you know you’re already on the right thread and want to avoid the queuing overhead.

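Calling dispatch from outside the io_context thread is safe but behaves identically to post. The risk lies on the io_context thread itself: the handler runs inline, ahead of everything already queued, so recursive dispatch calls deepen the stack and can starve other handlers.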
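The ordering difference is easiest to see in a trace. The sketch below is a single-threaded toy loop built on the standard library, not Asio itself: `MiniLoop`, `Post`, `Dispatch`, and `TraceOrder` are hypothetical names, and the loop only mimics the rule stated above (inline execution when called from the thread that is currently running the loop, queuing otherwise). It is not thread-safe and is meant purely to show ordering.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <thread>
#include <utility>
#include <vector>

class MiniLoop {
 public:
  // Always queue; the handler runs after the current handler returns.
  void Post(std::function<void()> handler) {
    queue_.push_back(std::move(handler));
  }

  // Run inline if called from the thread that is running the loop,
  // otherwise fall back to Post.
  void Dispatch(std::function<void()> handler) {
    if (running_ && std::this_thread::get_id() == loop_thread_) {
      handler();
    } else {
      Post(std::move(handler));
    }
  }

  // Drain the queue on the calling thread, one handler at a time.
  void Run() {
    loop_thread_ = std::this_thread::get_id();
    running_ = true;
    while (!queue_.empty()) {
      std::function<void()> handler = std::move(queue_.front());
      queue_.pop_front();
      handler();
    }
    running_ = false;
  }

 private:
  std::deque<std::function<void()>> queue_;
  std::thread::id loop_thread_;
  bool running_ = false;
};

std::vector<std::string> TraceOrder() {
  MiniLoop loop;
  std::vector<std::string> trace;

  // The loop is not running yet, so this Dispatch is queued like a Post.
  loop.Dispatch([&trace] { trace.push_back("dispatched-from-outside"); });

  loop.Post([&loop, &trace] {
    trace.push_back("outer-start");
    loop.Post([&trace] { trace.push_back("posted"); });          // queued
    loop.Dispatch([&trace] { trace.push_back("dispatched"); });  // inline
    trace.push_back("outer-end");
  });

  loop.Run();
  assert(trace.size() == 5);
  return trace;
}
```

The resulting trace is `dispatched-from-outside`, `outer-start`, `dispatched`, `outer-end`, `posted`: the inner dispatch lands in the middle of the outer handler, while the inner post waits until the outer handler has returned.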

Keeping the Event Loop Alive: `executor_work_guard`

io_context::run() returns as soon as there are no pending handlers. For long-running services (GCS, Raylet, CoreWorker), that’s a problem — the event loop needs to stay alive indefinitely.

The solution is executor_work_guard:

boost::asio::executor_work_guard<boost::asio::io_context::executor_type> work_guard(
    io_context.get_executor());
io_context.run();  // blocks until work_guard is destroyed or io_context::stop() is called

work_guard injects a permanent “outstanding work item” into the io_context. Until the guard is destroyed, run() never returns voluntarily. All long-lived Ray services use this pattern.

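To shut down cleanly: destroy the work_guard (or call its reset()), so that run() returns once the remaining handlers have finished, then wait for run() to return. Calling io_context.stop() also makes run() return, but as soon as possible, leaving any still-queued handlers unexecuted.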
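The keepalive mechanism can be modeled as a counter of outstanding work: the loop returns only when its queue is empty and the counter is zero. The sketch below uses only the standard library and hypothetical names (`GuardedLoop`, `WorkGuard`, `RunGuardedDemo`); it models the behavior described above rather than Asio's implementation.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

class GuardedLoop {
 public:
  void Post(std::function<void()> handler) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push_back(std::move(handler));
    }
    cv_.notify_one();
  }

  void AddWork() {
    std::lock_guard<std::mutex> lock(mu_);
    ++outstanding_;
  }

  void RemoveWork() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      --outstanding_;
    }
    cv_.notify_one();
  }

  // Returns only when there are no handlers left and no guard is alive.
  void Run() {
    for (;;) {
      std::function<void()> handler;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return !queue_.empty() || outstanding_ == 0; });
        if (queue_.empty()) return;  // nothing queued and no guard
        handler = std::move(queue_.front());
        queue_.pop_front();
      }
      handler();
    }
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  int outstanding_ = 0;
};

// RAII keepalive: counts as outstanding work for as long as it exists.
class WorkGuard {
 public:
  explicit WorkGuard(GuardedLoop &loop) : loop_(loop) { loop_.AddWork(); }
  ~WorkGuard() { loop_.RemoveWork(); }
  WorkGuard(const WorkGuard &) = delete;
  WorkGuard &operator=(const WorkGuard &) = delete;

 private:
  GuardedLoop &loop_;
};

int RunGuardedDemo() {
  GuardedLoop idle;
  idle.Run();  // returns at once: nothing queued, no guard

  GuardedLoop loop;
  int ran = 0;
  auto guard = std::make_unique<WorkGuard>(loop);
  std::thread loop_thread([&loop] { loop.Run(); });  // stays alive via guard
  for (int i = 0; i < 3; ++i) loop.Post([&ran] { ++ran; });
  guard.reset();       // drop the keepalive; Run() drains the queue, then returns
  loop_thread.join();  // wait for Run() to return before reading `ran`
  assert(ran == 3);
  return ran;
}
```

Because the guard is created before the loop thread starts, `Run()` cannot return early, and because the guard is released rather than the loop being stopped, all three queued handlers still execute before shutdown.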

Summary

Ray uses Asio as a serialization mechanism, not a network engine.
instrumented_io_context adds metrics and naming without changing Asio semantics.
post is the safe default for cross-thread dispatch; dispatch avoids queuing overhead when already on the right thread.
work_guard is the keepalive pattern used by every long-running Ray service.

Next: Event loop observability and chaos testing →
