Ray Async Internals (2): Event Loop Observability and Chaos Testing
Part 2 of the Ray async infrastructure series. ← Part 1: Asio’s role and instrumented_io_context · Part 3: Thread pool and periodic timer →
instrumented_io_context adds two capabilities on top of plain Asio: lag monitoring (passive health check) and chaos delay injection (active fault simulation). This post covers both.
Lag Probing: LagProbeLoop
How It Works
LagProbeLoop measures event loop health by posting a probe task and recording how long it waits in the queue before executing:
void LagProbeLoop(instrumented_io_context &io_context, int64_t interval_ms, ...) {
  // Timestamp taken when the probe is enqueued.
  auto begin = std::chrono::steady_clock::now();
  io_context.post([begin, ...] {
    // Timestamp taken when the probe finally runs.
    auto end = std::chrono::steady_clock::now();
    auto lag = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
    io_context.io_context_event_loop_lag_ms_gauge_metric.Record(lag.count(), ...);
    // If we're already behind, probe again immediately.
    auto delay = interval_ms - lag.count();
    if (delay <= 0) {
      LagProbeLoop(io_context, interval_ms, ...);
    } else {
      // Capture by value, not [&]: this lambda runs after the enclosing handler has returned.
      execute_after(io_context, [&io_context, interval_ms, ...] { LagProbeLoop(...); }, delay);
    }
  });
}
The elegance here: begin is captured at post time and end is measured inside the callback, so the difference is the time the probe spent sitting in the queue. It reflects how long the handlers ahead of it took to run, not the cost of the probe itself.
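To make the measurement concrete, here is a minimal standard-library sketch. ToyLoop, PostWork, and ProbeOnce are hypothetical names invented for illustration, and the clock is simulated so the result is deterministic; Ray's real probe runs on Asio and reads std::chrono::steady_clock.

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an event loop: a FIFO of handlers plus a simulated clock.
struct ToyLoop {
  int64_t now_ms = 0;
  std::deque<std::function<void()>> queue;

  void Post(std::function<void()> handler) { queue.push_back(std::move(handler)); }

  // Runs handlers in FIFO order until the queue is empty.
  void Run() {
    while (!queue.empty()) {
      auto handler = std::move(queue.front());
      queue.pop_front();
      handler();
    }
  }
};

// Enqueues a handler that consumes cost_ms of simulated time when it runs.
void PostWork(ToyLoop &loop, int64_t cost_ms) {
  loop.Post([&loop, cost_ms] { loop.now_ms += cost_ms; });
}

// Enqueues one probe: begin is read at post time, end when the probe runs.
void ProbeOnce(ToyLoop &loop, std::vector<int64_t> &lags_ms) {
  const int64_t begin = loop.now_ms;
  loop.Post([&loop, &lags_ms, begin] { lags_ms.push_back(loop.now_ms - begin); });
}
```

If two handlers costing 30 ms and 20 ms are queued ahead of the probe, the recorded lag is 50 ms: the probe pays for everything in front of it.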
If the measured lag already exceeds interval_ms, the next probe fires immediately rather than waiting. This prevents the monitoring cadence from being swallowed by the very overload it’s trying to detect.
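The rescheduling rule boils down to one subtraction with a floor at zero. A small sketch, where NextProbeDelayMs is a hypothetical helper name rather than anything in Ray:

```cpp
#include <cstdint>

// Returns how long to wait before the next probe.
// A result of 0 means "probe again immediately" because the loop is already behind.
int64_t NextProbeDelayMs(int64_t interval_ms, int64_t lag_ms) {
  const int64_t delay = interval_ms - lag_ms;
  return delay <= 0 ? 0 : delay;
}
```

With a 100 ms interval, a 30 ms lag leaves 70 ms to wait, while a 250 ms lag yields 0 and the next probe goes out right away.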
When It Starts
During instrumented_io_context construction, if emit_metrics == true, the first probe is posted (not dispatched — ensuring measurement starts only after io_context::run() is actually running).
The Metric
- Name: io_context_event_loop_lag_ms
- Type: Gauge (instantaneous value)
- Alert interpretation: Sustained high lag means the event loop is overloaded. A heartbeat task delayed by seconds looks like a dead node to GCS — triggering spurious failure recovery that cascades.
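Because a gauge only reports the latest value, "sustained" has to be decided by whoever consumes the metric. One possible check, sketched here with a hypothetical IsSustainedLag helper and arbitrary thresholds, is to alert only when the last few samples are all above a limit:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true only if the most recent `window` samples all exceed threshold_ms.
// A single spike surrounded by healthy samples does not trigger.
bool IsSustainedLag(const std::vector<int64_t> &samples_ms,
                    std::size_t window,
                    int64_t threshold_ms) {
  if (window == 0 || samples_ms.size() < window) {
    return false;  // not enough data to call anything "sustained"
  }
  for (std::size_t i = samples_ms.size() - window; i < samples_ms.size(); ++i) {
    if (samples_ms[i] <= threshold_ms) {
      return false;
    }
  }
  return true;
}
```

The window size and threshold are assumptions to tune per deployment; the point is only that one slow handler should not page anyone, while a loop that stays behind should.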
Chaos Delay Injection: asio_chaos
The Problem It Solves
A class of distributed system bugs only manifests under specific timing conditions that are hard to reproduce in tests: message A arrives slightly before message B, a timer fires while a lock is held, a callback executes just after an object is freed. Normal test runs don’t exercise these paths.
asio_chaos solves this by injecting random delays into post calls, artificially creating the timing jitter that exposes race conditions.
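A toy model shows why delaying post calls is enough to surface ordering bugs. ToyScheduler and RunOrder below are hypothetical and use fixed delays on a simulated clock so the outcome is deterministic; asio_chaos draws random delays instead.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical scheduler: handlers run in order of due time, ties broken by posting order.
class ToyScheduler {
 public:
  void PostAfter(int64_t delay_us, std::function<void()> handler) {
    pending_.emplace(std::make_pair(now_us_ + delay_us, seq_++), std::move(handler));
  }

  void Run() {
    while (!pending_.empty()) {
      auto it = pending_.begin();
      now_us_ = it->first.first;  // advance simulated time to the handler's due time
      auto handler = std::move(it->second);
      pending_.erase(it);
      handler();
    }
  }

 private:
  int64_t now_us_ = 0;
  uint64_t seq_ = 0;
  std::map<std::pair<int64_t, uint64_t>, std::function<void()>> pending_;
};

// Posts A then B with the given injected delays and reports the execution order.
std::vector<std::string> RunOrder(int64_t delay_a_us, int64_t delay_b_us) {
  ToyScheduler sched;
  std::vector<std::string> order;
  sched.PostAfter(delay_a_us, [&order] { order.push_back("A"); });
  sched.PostAfter(delay_b_us, [&order] { order.push_back("B"); });
  sched.Run();
  return order;
}
```

With no injected delay, A runs before B. Delay A by 500 μs and B overtakes it, which is the kind of reordering that code silently relying on "A always runs first" will not survive.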
Configuration
export RAY_testing_asio_delay_us="Heartbeat=1000:2000,ObjectPull=50000:100000,*=0:100"
Format: MethodName=min_us:max_us. * is a wildcard matching anything not explicitly listed.
The example above:
- Heartbeat callbacks: random 1–2ms delay
- ObjectPull callbacks: random 50–100ms delay (simulates network congestion)
- Everything else: random 0–100μs delay
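For readers who want to see the format taken apart, here is a rough parser for a string of that shape. ParseDelaySpec and DelayTable are hypothetical names, and the error handling is an assumption; Ray's own parsing code may behave differently on malformed input.

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Maps a method name (or "*") to its {min_us, max_us} range.
using DelayTable = std::map<std::string, std::pair<int64_t, int64_t>>;

// Parses "Name=min_us:max_us,Other=min_us:max_us" into a DelayTable.
// Throws std::invalid_argument on entries that do not match the format.
DelayTable ParseDelaySpec(const std::string &spec) {
  DelayTable table;
  std::istringstream entries(spec);
  std::string entry;
  while (std::getline(entries, entry, ',')) {
    if (entry.empty()) {
      continue;  // tolerate stray commas
    }
    const auto eq = entry.find('=');
    if (eq == std::string::npos || eq == 0) {
      throw std::invalid_argument("expected Name=min_us:max_us, got: " + entry);
    }
    const auto colon = entry.find(':', eq + 1);
    if (colon == std::string::npos) {
      throw std::invalid_argument("expected Name=min_us:max_us, got: " + entry);
    }
    const int64_t min_us = std::stoll(entry.substr(eq + 1, colon - eq - 1));
    const int64_t max_us = std::stoll(entry.substr(colon + 1));
    if (min_us < 0 || max_us < min_us) {
      throw std::invalid_argument("bad range in: " + entry);
    }
    table[entry.substr(0, eq)] = std::make_pair(min_us, max_us);
  }
  return table;
}
```

Feeding it the example value above produces three entries: Heartbeat, ObjectPull, and the * wildcard.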
Implementation
Inside instrumented_io_context::post:
int64_t extra_delay = ray::asio::testing::GetDelayUs(name);
// instead of posting immediately, schedule after the delay
execute_after(io_context, std::move(handler), extra_delay);
GetDelayUs reads from a global DelayManager singleton initialized from the environment variable. If the variable is unset, the function returns zero immediately, so the overhead in production is negligible.
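The lookup itself can be pictured as "exact name first, then the wildcard, else zero". The sketch below is an assumption about the shape of that logic, not Ray's DelayManager; PickDelayUs and DelayTable are hypothetical names.

```cpp
#include <cstdint>
#include <map>
#include <random>
#include <string>
#include <utility>

// Maps a method name (or "*") to its {min_us, max_us} range.
using DelayTable = std::map<std::string, std::pair<int64_t, int64_t>>;

// Returns a delay in microseconds for `name`, drawn uniformly from its configured range.
int64_t PickDelayUs(const DelayTable &table, const std::string &name, std::mt19937_64 &rng) {
  if (table.empty()) {
    return 0;  // fast path: nothing configured, so no delay and no random draw
  }
  auto it = table.find(name);
  if (it == table.end()) {
    it = table.find("*");  // fall back to the wildcard entry
  }
  if (it == table.end()) {
    return 0;  // name not listed and no wildcard present
  }
  std::uniform_int_distribution<int64_t> dist(it->second.first, it->second.second);
  return dist(rng);
}
```

Passing the generator in explicitly keeps the sketch testable with a fixed seed; a process-wide singleton would own its generator instead.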
When active, Ray logs an ERROR-level message: "Delaying method ..." — intentionally loud so you don’t forget to disable it before a production deployment.
The Two Together
LagProbeLoop → detects event loop overload passively
asio_chaos → creates overload deliberately to test behavior under it
One observes; the other perturbs. Together they answer: “Is our async runtime healthy?” and “Does our code stay correct when it isn’t?”
Summary
- LagProbeLoop is Ray’s built-in event loop health probe. Sustained high lag is an early warning of overload, not just a performance metric.
- asio_chaos injects per-method random delays via environment variable. It’s the right tool for reproducing distributed timing bugs that only appear in production.
- Both are zero-cost in production: the lag probe only runs when emit_metrics=true; chaos injection only activates when the environment variable is set.