mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-06 11:40:42 +00:00
fix: recover stuck diagnostic sessions safely
This commit is contained in:
@@ -1,4 +1,4 @@
|
||||
b4491e9b8ea5606cad18c1acf06f03d35301ebec1974d201ec9ee7582d2f6001 config-baseline.json
|
||||
9c0c9369d49c2001f91ec030e3852ccdc2ac9084229f335804aa9141c13b4795 config-baseline.core.json
|
||||
657060e80f3dc4b7d992e8625d2a8b0ff9b1b408960148d3f5f6a381d602359a config-baseline.json
|
||||
92cbb12ca382f7424e7bd52df21798b10a57621f5c266909fa74e23f6cb973d7 config-baseline.core.json
|
||||
cd7c0c7fb1435bc7e59099e9ac334462d5ad444016e9ab4512aae63a238f78dc config-baseline.channel.json
|
||||
9832b30a696930a3da7efccf38073137571e1b66cae84e54d747b733fdafcc54 config-baseline.plugin.json
|
||||
|
||||
@@ -165,7 +165,7 @@ surfaces, while Codex native hooks remain a separate lower-level Codex mechanism
|
||||
- `agent.wait` default: 30s (just the wait). `timeoutMs` param overrides.
|
||||
- Agent runtime: `agents.defaults.timeoutSeconds` default 172800s (48 hours); enforced in `runEmbeddedPiAgent` abort timer.
|
||||
- Cron runtime: isolated agent-turn `timeoutSeconds` is owned by cron. The scheduler starts that timer when execution begins, aborts the underlying run at the configured deadline, then runs bounded cleanup before recording the timeout so a stale child session cannot keep the lane stuck.
|
||||
- Session liveness diagnostics: with diagnostics enabled, `diagnostics.stuckSessionWarnMs` classifies long `processing` sessions that have no observed reply, tool, status, block, or ACP progress. Active embedded runs, model calls, and tool calls report as `session.long_running`; active work with no recent progress reports as `session.stalled`; `session.stuck` is reserved for stale session bookkeeping with no active work. Stale session bookkeeping releases the affected session lane immediately; stalled embedded runs are abort-drained only after an extended no-progress window (at least 10 minutes and 5x the warning threshold) so queued work can resume without cutting off merely slow runs. Repeated `session.stuck` diagnostics back off while the session remains unchanged.
|
||||
- Session liveness diagnostics: with diagnostics enabled, `diagnostics.stuckSessionWarnMs` classifies long `processing` sessions that have no observed reply, tool, status, block, or ACP progress. Active embedded runs, model calls, and tool calls report as `session.long_running`; active work with no recent progress reports as `session.stalled`; `session.stuck` is reserved for stale session bookkeeping with no active work. Stale session bookkeeping releases the affected session lane immediately; stalled embedded runs are abort-drained only after `diagnostics.stuckSessionAbortMs` (default: at least 10 minutes and 5x the warning threshold) so queued work can resume without cutting off merely slow runs. Recovery emits structured requested/completed outcomes, and diagnostic state is marked idle only if the same processing generation is still current. Repeated `session.stuck` diagnostics back off while the session remains unchanged.
|
||||
- Model idle timeout: OpenClaw aborts a model request when no response chunks arrive before the idle window. `models.providers.<id>.timeoutSeconds` extends this idle watchdog for slow local/self-hosted providers; otherwise OpenClaw uses `agents.defaults.timeoutSeconds` when configured, capped at 120s by default. Cron-triggered runs with no explicit model or agent timeout disable the idle watchdog and rely on the cron outer timeout.
|
||||
- Provider HTTP request timeout: `models.providers.<id>.timeoutSeconds` applies to that provider's model HTTP fetches, including connect, headers, body, SDK request timeout, total guarded-fetch abort handling, and model stream idle watchdog. Use this for slow local/self-hosted providers such as Ollama before raising the whole agent runtime timeout.
|
||||
|
||||
|
||||
@@ -920,6 +920,7 @@ Notes:
|
||||
enabled: true,
|
||||
flags: ["telegram.*"],
|
||||
stuckSessionWarnMs: 30000,
|
||||
stuckSessionAbortMs: 600000,
|
||||
|
||||
otel: {
|
||||
enabled: false,
|
||||
@@ -959,6 +960,7 @@ Notes:
|
||||
- `enabled`: master toggle for instrumentation output (default: `true`).
|
||||
- `flags`: array of flag strings enabling targeted log output (supports wildcards like `"telegram.*"` or `"*"`).
|
||||
- `stuckSessionWarnMs`: no-progress age threshold in ms for classifying long-running processing sessions as `session.long_running`, `session.stalled`, or `session.stuck`. Reply, tool, status, block, and ACP progress reset the timer; repeated `session.stuck` diagnostics back off while unchanged.
|
||||
- `stuckSessionAbortMs`: no-progress age threshold in ms before eligible stalled active work may be abort-drained for recovery. When unset, OpenClaw uses the safer extended embedded-run window of at least 10 minutes and 5x `stuckSessionWarnMs`.
|
||||
- `otel.enabled`: enables the OpenTelemetry export pipeline (default: `false`). For the full configuration, signal catalog, and privacy model, see [OpenTelemetry export](/gateway/opentelemetry).
|
||||
- `otel.endpoint`: collector URL for OTel export.
|
||||
- `otel.tracesEndpoint` / `otel.metricsEndpoint` / `otel.logsEndpoint`: optional signal-specific OTLP endpoints. When set, they override `otel.endpoint` for that signal only.
|
||||
|
||||
@@ -216,11 +216,18 @@ OpenClaw classifies sessions by the work it can still observe:
|
||||
still making progress.
|
||||
- `session.stalled`: active work exists, but the active run has not reported
|
||||
recent progress. Stalled embedded runs stay observe-only at first, then
|
||||
abort-drain after at least 10 minutes and 5x `diagnostics.stuckSessionWarnMs`
|
||||
with no progress so queued turns behind the lane can resume.
|
||||
abort-drain after `diagnostics.stuckSessionAbortMs` with no progress so queued
|
||||
turns behind the lane can resume. When unset, the abort threshold defaults to
|
||||
the safer extended window of at least 10 minutes and 5x
|
||||
`diagnostics.stuckSessionWarnMs`.
|
||||
- `session.stuck`: stale session bookkeeping with no active work. This releases
|
||||
the affected session lane immediately.
|
||||
|
||||
Recovery emits structured `session.recovery.requested` and
|
||||
`session.recovery.completed` events. Diagnostic session state is marked idle
|
||||
only after a mutating recovery outcome (`aborted` or `released`) and only if the
|
||||
same processing generation is still current.
|
||||
|
||||
Only `session.stuck` emits the `openclaw.session.stuck` counter, the
|
||||
`openclaw.session.stuck_age_ms` histogram, and the `openclaw.session.stuck`
|
||||
span. Repeated `session.stuck` diagnostics back off while the session remains
|
||||
|
||||
Reference in New Issue
Block a user