fix: recover stuck diagnostic sessions safely

This commit is contained in:
Peter Steinberger
2026-05-05 04:01:22 +01:00
parent c739088d62
commit 761e668acf
20 changed files with 825 additions and 79 deletions

View File

@@ -920,6 +920,7 @@ Notes:
enabled: true,
flags: ["telegram.*"],
stuckSessionWarnMs: 30000,
stuckSessionAbortMs: 600000,
otel: {
enabled: false,
@@ -959,6 +960,7 @@ Notes:
- `enabled`: master toggle for instrumentation output (default: `true`).
- `flags`: array of flag strings enabling targeted log output (supports wildcards like `"telegram.*"` or `"*"`).
- `stuckSessionWarnMs`: no-progress age threshold in ms for classifying long-running processing sessions as `session.long_running`, `session.stalled`, or `session.stuck`. Reply, tool, status, block, and ACP progress reset the timer; repeated `session.stuck` diagnostics back off while unchanged.
- `stuckSessionAbortMs`: no-progress age threshold in ms before eligible stalled active work may be abort-drained for recovery. When unset, OpenClaw uses the safer extended embedded-run window of at least 10 minutes and 5x `stuckSessionWarnMs`.
- `otel.enabled`: enables the OpenTelemetry export pipeline (default: `false`). For the full configuration, signal catalog, and privacy model, see [OpenTelemetry export](/gateway/opentelemetry).
- `otel.endpoint`: collector URL for OTel export.
- `otel.tracesEndpoint` / `otel.metricsEndpoint` / `otel.logsEndpoint`: optional signal-specific OTLP endpoints. When set, they override `otel.endpoint` for that signal only.

View File

@@ -216,11 +216,18 @@ OpenClaw classifies sessions by the work it can still observe:
still making progress.
- `session.stalled`: active work exists, but the active run has not reported
recent progress. Stalled embedded runs stay observe-only at first, then
abort-drain after at least 10 minutes and 5x `diagnostics.stuckSessionWarnMs`
with no progress so queued turns behind the lane can resume.
abort-drain after `diagnostics.stuckSessionAbortMs` with no progress so queued
turns behind the lane can resume. When unset, the abort threshold defaults to
the safer extended window of at least 10 minutes and 5x
`diagnostics.stuckSessionWarnMs`.
- `session.stuck`: stale session bookkeeping with no active work. This releases
the affected session lane immediately.
Recovery emits structured `session.recovery.requested` and
`session.recovery.completed` events. Diagnostic session state is marked idle
only after a mutating recovery outcome (`aborted` or `released`) and only if the
same processing generation is still current.
Only `session.stuck` emits the `openclaw.session.stuck` counter, the
`openclaw.session.stuck_age_ms` histogram, and the `openclaw.session.stuck`
span. Repeated `session.stuck` diagnostics back off while the session remains