Ray Async Internals (2): Event Loop Observability and Chaos Testing

Part 2 of the Ray async infrastructure series. ← Part 1: Asio’s role and instrumented_io_context · Part 3: Thread pool and periodic timer →

instrumented_io_context adds two capabilities on top of plain Asio: lag monitoring (passive health check) and chaos delay injection (active fault simulation). This post covers both.

Lag Probing: `LagProbeLoop`

How It Works

LagProbeLoop measures event loop health by posting a probe task and recording how long it waits in the queue before executing:

void LagProbeLoop(instrumented_io_context &io_context, int64_t interval_ms, ...) {
  auto begin = std::chrono::steady_clock::now();
  io_context.post([begin, ...] {
    auto end = std::chrono::steady_clock::now();
    auto lag = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
    io_context.io_context_event_loop_lag_ms_gauge_metric.Record(lag.count(), ...);
    // if we're already behind, probe again immediately
    auto delay = interval_ms - lag.count();
    if (delay <= 0) {
      LagProbeLoop(io_context, interval_ms, ...);
    } else {
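      // capture by value (and io_context by explicit reference): this lambda runs
      // after the enclosing callback has returned, so [&] would dangle
      execute_after(io_context, [&io_context, interval_ms, ...] { LagProbeLoop(...); }, delay);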
    }
  });
}

The elegance here: begin is captured at post time, end is measured inside the callback. The difference is the time the probe spent in the queue, which is the time the loop needed to work through the handlers ahead of it. The probe's own execution time is not part of the measurement.

If the measured lag already exceeds interval_ms, the next probe fires immediately rather than waiting. This prevents the monitoring cadence from being swallowed by the very overload it’s trying to detect.
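To make the measurement concrete, here is a minimal sketch of the same idea without Boost or Ray. ToyLoop, PostLagProbe, and MeasureLagBehindBusyHandler are hypothetical names invented for this illustration; ToyLoop is a plain FIFO of handlers standing in for an io_context, not Ray's implementation.

```cpp
#include <chrono>
#include <cstdint>
#include <deque>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for an io_context: a FIFO of handlers drained by one thread.
class ToyLoop {
 public:
  void Post(std::function<void()> handler) { queue_.push_back(std::move(handler)); }

  // Runs handlers in FIFO order until the queue is empty.
  void Run() {
    while (!queue_.empty()) {
      auto handler = std::move(queue_.front());
      queue_.pop_front();
      handler();
    }
  }

 private:
  std::deque<std::function<void()>> queue_;
};

// Posts one probe; when the probe runs, it appends its queue wait (ms) to `samples`.
void PostLagProbe(ToyLoop &loop, std::vector<int64_t> &samples) {
  const auto begin = std::chrono::steady_clock::now();  // taken at post time
  loop.Post([begin, &samples] {
    const auto end = std::chrono::steady_clock::now();  // taken when the probe runs
    samples.push_back(
        std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count());
  });
}

// Queues a handler that blocks for `busy_ms`, queues a probe behind it,
// drains the loop, and returns the lag the probe observed.
int64_t MeasureLagBehindBusyHandler(int64_t busy_ms) {
  ToyLoop loop;
  std::vector<int64_t> samples;
  loop.Post([busy_ms] { std::this_thread::sleep_for(std::chrono::milliseconds(busy_ms)); });
  PostLagProbe(loop, samples);
  loop.Run();
  return samples.front();
}
```

Calling MeasureLagBehindBusyHandler(30) should report a lag of roughly 30 ms or more: the probe did no work itself, it only waited behind the blocking handler.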

When It Starts

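During instrumented_io_context construction, if emit_metrics == true, the first probe is posted rather than dispatched, so its handler cannot execute until io_context::run() is actually running. Because begin is captured at post time, the first sample also includes any time that passes between construction and the start of run().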

The Metric

Name: io_context_event_loop_lag_ms
Type: Gauge (instantaneous value)
Alert interpretation: Sustained high lag means the event loop is overloaded. A heartbeat task delayed by seconds looks like a dead node to GCS — triggering spurious failure recovery that cascades.
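Since the gauge is an instantaneous value, a single spike is less interesting than a run of high samples. One way to express "sustained" is sketched below. SustainedLagDetector is a hypothetical helper written for this post, not part of Ray, and the threshold and window size are assumptions you would tune for your own cluster.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical helper: flags overload only when the most recent `window`
// gauge samples are all above `threshold_ms`.
class SustainedLagDetector {
 public:
  SustainedLagDetector(int64_t threshold_ms, std::size_t window)
      : threshold_ms_(threshold_ms), window_(window) {
    assert(window_ > 0);
  }

  // Records one sample. Returns true if the last `window` samples,
  // including this one, all exceeded the threshold.
  bool Record(int64_t lag_ms) {
    recent_.push_back(lag_ms);
    if (recent_.size() > window_) recent_.pop_front();
    if (recent_.size() < window_) return false;  // not enough history yet
    for (int64_t sample : recent_) {
      if (sample <= threshold_ms_) return false;
    }
    return true;
  }

 private:
  int64_t threshold_ms_;
  std::size_t window_;
  std::deque<int64_t> recent_;
};
```

With a threshold of 100 ms and a window of 3, three consecutive 500 ms samples trip the detector, while a single healthy sample in between resets it until the window fills with high values again.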

Chaos Delay Injection: `asio_chaos`

The Problem It Solves

A class of distributed system bugs only manifests under specific timing conditions that are hard to reproduce in tests: message A arrives slightly before message B, a timer fires while a lock is held, a callback executes just after an object is freed. Normal test runs don’t exercise these paths.

asio_chaos solves this by injecting random delays into post calls, artificially creating the timing jitter that exposes race conditions.

Configuration

export RAY_testing_asio_delay_us="Heartbeat=1000:2000,ObjectPull=50000:100000,*=0:100"

Format: a comma-separated list of MethodName=min_us:max_us entries. * is a wildcard matching any method not explicitly listed.

The example above:

Heartbeat callbacks: random 1–2ms delay
ObjectPull callbacks: random 50–100ms delay (simulates network congestion)
Everything else: random 0–100μs delay
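The format is simple enough to parse in a few lines. The sketch below shows one way to do it; ParseDelaySpec and DelayRangeUs are hypothetical names for this illustration, and the validation rules (non-negative values, min no greater than max) are assumptions rather than a description of Ray's own parser.

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Inclusive delay range in microseconds: {min_us, max_us}.
using DelayRangeUs = std::pair<int64_t, int64_t>;

// Parses "Name=min:max,Name2=min:max" into a map keyed by method name.
// Throws std::invalid_argument on a malformed entry.
std::map<std::string, DelayRangeUs> ParseDelaySpec(const std::string &spec) {
  std::map<std::string, DelayRangeUs> ranges;
  std::istringstream entries(spec);
  std::string entry;
  while (std::getline(entries, entry, ',')) {
    if (entry.empty()) continue;  // tolerate a trailing comma
    const auto eq = entry.find('=');
    const auto colon = entry.find(':', eq == std::string::npos ? 0 : eq + 1);
    if (eq == std::string::npos || eq == 0 || colon == std::string::npos) {
      throw std::invalid_argument("expected Name=min_us:max_us, got: " + entry);
    }
    // std::stoll itself throws std::invalid_argument if a number is missing.
    const int64_t min_us = std::stoll(entry.substr(eq + 1, colon - eq - 1));
    const int64_t max_us = std::stoll(entry.substr(colon + 1));
    if (min_us < 0 || max_us < min_us) {
      throw std::invalid_argument("invalid range in: " + entry);
    }
    ranges[entry.substr(0, eq)] = {min_us, max_us};
  }
  return ranges;
}
```

Feeding it the example string above yields three entries, with the wildcard stored under the literal key "*".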

Implementation

Inside instrumented_io_context::post:

int64_t extra_delay = ray::asio::testing::GetDelayUs(name);
// instead of posting immediately, schedule after the delay
execute_after(io_context, std::move(handler), extra_delay);

GetDelayUs reads from a global DelayManager singleton initialized from the environment variable. If the variable is unset, the function returns zero immediately, so the cost in production is a single cheap check per post.
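The lookup itself can be pictured as: exact name first, wildcard second, zero otherwise. The sketch below illustrates that order with a hypothetical DelayTable class and SampleDelayUs method. It is not Ray's DelayManager, and both the uniform distribution and the explicit seed are assumptions made so the example is reproducible.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <random>
#include <string>
#include <utility>

// Method name -> inclusive {min_us, max_us} range.
using DelayRanges = std::map<std::string, std::pair<int64_t, int64_t>>;

// Hypothetical stand-in for a per-method delay table.
class DelayTable {
 public:
  explicit DelayTable(DelayRanges ranges, uint64_t seed = std::random_device{}())
      : ranges_(std::move(ranges)), rng_(seed) {
    for (const auto &kv : ranges_) {
      assert(kv.second.first <= kv.second.second);  // min must not exceed max
    }
  }

  // Returns a random delay for `name`, falling back to the "*" entry,
  // or 0 when nothing applies.
  int64_t SampleDelayUs(const std::string &name) {
    if (ranges_.empty()) return 0;  // nothing configured: fast path
    auto it = ranges_.find(name);
    if (it == ranges_.end()) it = ranges_.find("*");
    if (it == ranges_.end()) return 0;
    std::uniform_int_distribution<int64_t> dist(it->second.first, it->second.second);
    return dist(rng_);
  }

 private:
  DelayRanges ranges_;
  std::mt19937_64 rng_;
};
```

A caller would then post the handler directly when the sampled delay is zero and schedule it on a timer otherwise.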

When active, Ray logs an ERROR-level message: "Delaying method ..." — intentionally loud so you don’t forget to disable it before a production deployment.

The Two Together

LagProbeLoop   → detects event loop overload passively
asio_chaos     → injects timing jitter deliberately to test behavior under it

One observes; the other perturbs. Together they answer: “Is our async runtime healthy?” and “Does our code stay correct when it isn’t?”

Summary

LagProbeLoop is Ray’s built-in event loop health probe. Sustained high lag is an early warning of overload, not just a performance metric.
asio_chaos injects per-method random delays via environment variable. It’s the right tool for reproducing distributed timing bugs that only appear in production.
Both are opt-in and cost almost nothing when disabled: the lag probe only runs when emit_metrics=true; chaos injection only activates when the environment variable is set.

Next: Thread pool and periodic timer →