mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-06 07:50:43 +00:00
fix: classify session liveness diagnostics
This commit is contained in:
@@ -164,7 +164,7 @@ surfaces, while Codex native hooks remain a separate lower-level Codex mechanism
|
||||
- `agent.wait` default: 30s (just the wait). `timeoutMs` param overrides.
|
||||
- Agent runtime: `agents.defaults.timeoutSeconds` default 172800s (48 hours); enforced in `runEmbeddedPiAgent` abort timer.
|
||||
- Cron runtime: isolated agent-turn `timeoutSeconds` is owned by cron. The scheduler starts that timer when execution begins, aborts the underlying run at the configured deadline, then runs bounded cleanup before recording the timeout so a stale child session cannot keep the lane stuck.
|
||||
- Stuck-session recovery: with diagnostics enabled, `diagnostics.stuckSessionWarnMs` detects long `processing` sessions. Active embedded runs, active reply operations, and active session-lane tasks remain warning-only by default; if diagnostics show no active work for the session, the watchdog releases the affected session lane so queued startup work can drain.
|
||||
- Session liveness diagnostics: with diagnostics enabled, `diagnostics.stuckSessionWarnMs` classifies long `processing` sessions. Active embedded runs, model calls, and tool calls report as `session.long_running`; active work with no recent progress reports as `session.stalled`; `session.stuck` is reserved for stale session bookkeeping with no active work, and only that path releases the affected session lane so queued startup work can drain.
|
||||
- Model idle timeout: OpenClaw aborts a model request when no response chunks arrive before the idle window. `models.providers.<id>.timeoutSeconds` extends this idle watchdog for slow local/self-hosted providers; otherwise OpenClaw uses `agents.defaults.timeoutSeconds` when configured, capped at 120s by default. Cron-triggered runs with no explicit model or agent timeout disable the idle watchdog and rely on the cron outer timeout.
|
||||
- Provider HTTP request timeout: `models.providers.<id>.timeoutSeconds` applies to that provider's model HTTP fetches, including connect, headers, body, SDK request timeout, total guarded-fetch abort handling, and model stream idle watchdog. Use this for slow local/self-hosted providers such as Ollama before raising the whole agent runtime timeout.
|
||||
|
||||
|
||||
@@ -115,7 +115,7 @@ keys.
|
||||
- If commands seem stuck, enable verbose logs and look for “queued for …ms” lines to confirm the queue is draining.
|
||||
- If you need queue depth, enable verbose logs and watch for queue timing lines.
|
||||
- Codex app-server runs that accept a turn and then stop emitting progress are interrupted by the Codex adapter so the active session lane can release instead of waiting for the outer run timeout.
|
||||
- When diagnostics are enabled, sessions that remain in `processing` past `diagnostics.stuckSessionWarnMs` log a stuck-session warning. Active embedded runs, active reply operations, and active lane tasks remain warning-only by default; stale startup bookkeeping with no active session work can release the affected session lane so queued work drains.
|
||||
- When diagnostics are enabled, sessions that remain in `processing` past `diagnostics.stuckSessionWarnMs` are classified by current activity. Active work logs as `session.long_running`; active work with no recent progress logs as `session.stalled`; `session.stuck` is reserved for stale session bookkeeping with no active work, and only that path can release the affected session lane so queued work drains.
|
||||
|
||||
## Related
|
||||
|
||||
|
||||
@@ -937,7 +937,7 @@ Notes:
|
||||
|
||||
- `enabled`: master toggle for instrumentation output (default: `true`).
|
||||
- `flags`: array of flag strings enabling targeted log output (supports wildcards like `"telegram.*"` or `"*"`).
|
||||
- `stuckSessionWarnMs`: age threshold in ms for emitting stuck-session warnings while a session remains in processing state.
|
||||
- `stuckSessionWarnMs`: age threshold in ms for classifying long-running processing sessions as `session.long_running`, `session.stalled`, or `session.stuck`.
|
||||
- `otel.enabled`: enables the OpenTelemetry export pipeline (default: `false`). For the full configuration, signal catalog, and privacy model, see [OpenTelemetry export](/gateway/opentelemetry).
|
||||
- `otel.endpoint`: collector URL for OTel export.
|
||||
- `otel.tracesEndpoint` / `otel.metricsEndpoint` / `otel.logsEndpoint`: optional signal-specific OTLP endpoints. When set, they override `otel.endpoint` for that signal only.
|
||||
|
||||
@@ -192,8 +192,8 @@ When any subkey is enabled, model and tool spans get bounded, redacted
|
||||
- `openclaw.queue.depth` (histogram, attrs: `openclaw.lane` or `openclaw.channel=heartbeat`)
|
||||
- `openclaw.queue.wait_ms` (histogram, attrs: `openclaw.lane`)
|
||||
- `openclaw.session.state` (counter, attrs: `openclaw.state`, `openclaw.reason`)
|
||||
- `openclaw.session.stuck` (counter, attrs: `openclaw.state`)
|
||||
- `openclaw.session.stuck_age_ms` (histogram, attrs: `openclaw.state`)
|
||||
- `openclaw.session.stuck` (counter, attrs: `openclaw.state`; emitted only for stale session bookkeeping with no active work)
|
||||
- `openclaw.session.stuck_age_ms` (histogram, attrs: `openclaw.state`; emitted only for stale session bookkeeping with no active work)
|
||||
- `openclaw.run.attempt` (counter, attrs: `openclaw.attempt`)
|
||||
|
||||
### Harness lifecycle
|
||||
@@ -277,8 +277,8 @@ to them directly without OTLP export.
|
||||
**Queue and session**
|
||||
|
||||
- `queue.lane.enqueue` / `queue.lane.dequeue`
|
||||
- `session.state` / `session.stuck`
|
||||
- `run.attempt`
|
||||
- `session.state` / `session.long_running` / `session.stalled` / `session.stuck`
|
||||
- `run.attempt` / `run.progress`
|
||||
- `diagnostic.heartbeat` (aggregate counters: webhooks/queue/session)
|
||||
|
||||
**Harness lifecycle**
|
||||
|
||||
Reference in New Issue
Block a user