feat: add gateway stall diagnostics

This commit is contained in:
Peter Steinberger
2026-05-04 22:34:59 +01:00
parent 358cd87ff3
commit e84d4b27f4
13 changed files with 401 additions and 14 deletions

View File

@@ -117,12 +117,19 @@ diagnostics are enabled. It is for operational facts, not content.
The same diagnostic heartbeat records liveness samples when the Gateway keeps
running but the Node.js event loop or CPU looks saturated. These
`diagnostic.liveness.warning` events include event-loop delay, event-loop
utilization, CPU-core ratio, and active/waiting/queued session counts. Idle
samples stay in telemetry at `info` level. Liveness samples become Gateway
warnings only when work is waiting or queued, or when active work overlaps with
sustained event-loop delay. Transient max-delay spikes during otherwise healthy
background work stay in debug logs. They do not restart the Gateway by
themselves.
utilization, CPU-core ratio, active/waiting/queued session counts, the current
startup/runtime phase when known, recent phase spans, and bounded active/queued
work labels. Idle samples stay in telemetry at `info` level. Liveness samples
become Gateway warnings only when work is waiting or queued, or when active work
overlaps with sustained event-loop delay. Transient max-delay spikes during
otherwise healthy background work stay in debug logs. They do not restart the
Gateway by themselves.
Startup phases also emit `diagnostic.phase.completed` events with wall-clock and
CPU timing. Stalled embedded-run diagnostics mark `terminalProgressStale=true`
when the last bridge progress looked terminal, such as a raw response item or
response completion event, but the Gateway still considers the embedded run
active.
Inspect the live recorder: