yuqi-zheng

Ray Async Internals (1): Asio's Role and `instrumented_io_context`


Part 1 of the Ray async infrastructure series. ← Series intro: Ray Core Architecture


The Non-Obvious Role of Asio in Ray

Most people reach for Boost.Asio when they need async network I/O. Ray uses it for something different: serializing business logic onto a single thread, eliminating locks from the most performance-critical paths.

The actual layering in Ray:

┌─────────────────────────────────┐
│ Business logic (GcsNodeManager) │ ← lock-free, single-threaded
├─────────────────────────────────┤
│ instrumented_io_context (Asio)  │ ← event queue + single-thread driver
├─────────────────────────────────┤
│ gRPC (CompletionQueue threads)  │ ← multi-threaded network I/O
└─────────────────────────────────┘

gRPC handles concurrent network I/O across many threads. When a request arrives, the gRPC thread doesn’t process it — it posts a lambda to an Asio io_context. A single dedicated thread drives that io_context, executing handlers one at a time. Business logic never runs concurrently, so it needs no locks.


instrumented_io_context

Ray wraps boost::asio::io_context to add observability:

class instrumented_io_context : public boost::asio::io_context {
public:
  instrumented_io_context(bool emit_metrics = false,
                          bool running_on_single_thread = false,
                          std::optional<std::string> context_name = std::nullopt);
  void post(std::function<void()> handler, std::string name, int64_t delay_us = 0);
  void dispatch(std::function<void()> handler, std::string name);
  std::shared_ptr<EventTracker> stats() const;
  ray::stats::Gauge io_context_event_loop_lag_ms_gauge_metric;
};

The additions over plain io_context:

  • A name for each context, shown in dashboards (e.g. "node_manager_io_context")
  • A lag gauge metric — the time between posting a handler and executing it
  • An event tracker for per-handler execution statistics
  • Support for the chaos delay injection covered in Part 2
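To make the lag idea concrete, here is one way the time between posting and executing could be measured. This is a sketch under assumptions, not Ray's implementation: `WrapWithLag` and the `report_lag` callback are hypothetical, and Ray's actual probe may work differently. The wrapper stamps the handler when it is created (the moment it would be posted) and reports the elapsed time just before the handler runs.

```cpp
#include <chrono>
#include <functional>
#include <utility>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: wraps `handler` so that, when it finally runs, it
// first reports how long it waited since the wrapper was created.
std::function<void()> WrapWithLag(
    std::function<void()> handler,
    std::function<void(std::chrono::microseconds)> report_lag) {
  const auto enqueued_at = Clock::now();  // taken at "post" time
  return [handler = std::move(handler),
          report_lag = std::move(report_lag),
          enqueued_at] {
    // Taken at "execute" time; the difference is the queueing delay.
    const auto lag = std::chrono::duration_cast<std::chrono::microseconds>(
        Clock::now() - enqueued_at);
    report_lag(lag);
    handler();
  };
}
```

The wrapped function can be handed to any scheduler. A consistently large reported lag suggests the event-loop thread is busy with long-running handlers.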

Constructor Parameters

Parameter                | Effect
emit_metrics             | Enable the lag probe and metric reporting
running_on_single_thread | Tells Asio that a single thread drives the context, so it can reduce internal locking
context_name             | Name used in metrics and dashboards

post vs dispatch

Both schedule a handler on the io_context, but with different timing guarantees:

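Method   | When it runs                                                             | Thread safety
post     | Always enqueued; runs on the next event loop iteration                   | Thread-safe from any thread
dispatch | Runs immediately if already on the io_context thread; otherwise enqueues | Thread-safe, but inline execution can deadlock if the caller holds a lock the handler needs

// Assume we're currently executing inside the io_context thread.
// Both methods take a handler name, per the signatures shown above:
io_context.post([] { DoSomething(); }, "DoSomething");     // queued: runs after the current handler returns
io_context.dispatch([] { DoSomething(); }, "DoSomething"); // runs immediately, inline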

Use post from any thread — it’s always safe. Use dispatch as an optimization when you know you’re already on the right thread and want to avoid the queuing overhead.

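Calling dispatch from outside the io_context thread is safe but behaves identically to post. The risk lies on the io_context thread itself: a handler that keeps dispatching more work inline never returns to the event loop, so handlers already in the queue are starved.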


Keeping the Event Loop Alive: executor_work_guard

io_context::run() returns as soon as there are no pending handlers. For long-running services (GCS, Raylet, CoreWorker), that’s a problem — the event loop needs to stay alive indefinitely.

The solution is executor_work_guard:

boost::asio::executor_work_guard<boost::asio::io_context::executor_type> work(
    io_context.get_executor());
io_context.run();  // blocks until the guard is released and pending handlers finish, or io_context::stop() is called

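The guard registers a standing "outstanding work item" with the io_context. Until the guard is reset or destroyed, run() does not return on its own, even when the handler queue is empty. All long-lived Ray services use this pattern.

To shut down cleanly: release the guard (call reset() or destroy it) so that run() returns once the remaining handlers have finished, then wait for run() to return. Calling io_context.stop() instead makes run() return as soon as possible, which can leave queued handlers unexecuted.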


Summary

  • Ray uses Asio as a serialization mechanism, not a network engine.
  • instrumented_io_context adds metrics and naming without changing Asio semantics.
  • post is the safe default for cross-thread dispatch; dispatch avoids queuing overhead when already on the right thread.
  • work_guard is the keepalive pattern used by every long-running Ray service.

Next: Event loop observability and chaos testing →