feat: add gateway stall diagnostics

This commit is contained in:
Peter Steinberger
2026-05-04 22:34:59 +01:00
parent 358cd87ff3
commit e84d4b27f4
13 changed files with 401 additions and 14 deletions

View File

@@ -117,12 +117,19 @@ diagnostics are enabled. It is for operational facts, not content.
The same diagnostic heartbeat records liveness samples when the Gateway keeps
running but the Node.js event loop or CPU looks saturated. These
`diagnostic.liveness.warning` events include event-loop delay, event-loop
utilization, CPU-core ratio, and active/waiting/queued session counts. Idle
samples stay in telemetry at `info` level. Liveness samples become Gateway
warnings only when work is waiting or queued, or when active work overlaps with
sustained event-loop delay. Transient max-delay spikes during otherwise healthy
background work stay in debug logs. They do not restart the Gateway by
themselves.
utilization, CPU-core ratio, active/waiting/queued session counts, the current
startup/runtime phase when known, recent phase spans, and bounded active/queued
work labels. Idle samples stay in telemetry at `info` level. Liveness samples
become Gateway warnings only when work is waiting or queued, or when active work
overlaps with sustained event-loop delay. Transient max-delay spikes during
otherwise healthy background work stay in debug logs. They do not restart the
Gateway by themselves.
Startup phases also emit `diagnostic.phase.completed` events with wall-clock and
CPU timing. Stalled embedded-run diagnostics mark `terminalProgressStale=true`
when the last bridge progress looked terminal, such as a raw response item or
response completion event, but the Gateway still considers the embedded run
active.
Inspect the live recorder:

View File

@@ -89,6 +89,13 @@ OPENCLAW_RUN_NODE_CPU_PROF_DIR=.artifacts/cli-cpu pnpm openclaw status
The source runner adds Node CPU profile flags and writes a `.cpuprofile` for the
command. Use this before adding temporary instrumentation to command code.
For startup stalls that look like synchronous filesystem or module-loader work,
add Node's sync I/O trace flag through the source runner:
```bash
OPENCLAW_TRACE_SYNC_IO=1 pnpm openclaw gateway --force
```
## Gateway watch mode
For fast iteration, run the gateway under the file watcher: