* fix(diagnostics): clear embedded-run activity when recovery declares lane idle
Stuck-session recovery transitions a lane to idle via the recovery
coordinator, but only mutated the session-state store. When an aborted
embedded run was removed without markDiagnosticEmbeddedRunEnded, the
activity store kept hasActiveEmbeddedRun set, so the liveness sweep
reported idle/embedded_run and isIdleQueuedRecoverableSessionStall
re-triggered recovery indefinitely.
Reconcile the activity store from the authoritative idle declaration by
clearing the session's embedded-run owners. The existing generation
guard already excludes any newer run that re-armed activity, so a live
requeued run is preserved.
* fix(diagnostics): reconcile tool/model activity on authoritative idle cleanup
clearDiagnosticEmbeddedRunActivityForSession (renamed from
clearDiagnosticEmbeddedRunsForSession) now clears the aborted run's tool and
model markers alongside the embedded-run owners, matching the default
markDiagnosticEmbeddedRunEnded teardown. Clearing only the owner set left the
lane as idle + orphaned tool/model activity, which
isIdleQueuedRecoverableSessionStall still treats as recoverable while work is
queued, so the liveness sweep kept re-triggering recovery instead of converging.
Adds regression cases with stale tool and model markers plus queued work.
* test(phone-control): align service mocks with keyed store API
* fix(diagnostics): preserve rearmed recovery activity
* fix(diagnostics): clear recovered owner markers
* fix(diagnostics): clear recovered embedded work keys
* fix(diagnostics): ignore stale same-key recovery owners
* fix(diagnostics): preserve same-session recovery rearm
* fix(diagnostics): ignore stale queued activity starts
* fix(diagnostics): record recovery cutoffs for empty activity
* fix(diagnostics): preserve fresh recovery markers
* fix(diagnostics): prune stale activity before fresh recovery block
---------
Co-authored-by: Peter Steinberger <steipete@gmail.com>
Fix JSONL file-log hostnames getting pinned to `unknown` when the first hostname read returns an empty value. The logger now retries empty hostname reads and caches the first non-empty value, keeping the top-level `hostname` and `_meta.hostname` fields aligned.
Fixes#87258.
Thanks @lonexreb for the fix.
Verification:
- `node scripts/run-vitest.mjs src/logging/logger-redaction-behavior.test.ts src/logger.test.ts`
- `node_modules/.bin/oxfmt --check --threads=1 src/logging/logger.ts src/logging/logger-redaction-behavior.test.ts`
- `.agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main`
- `gh pr checks 88131 --watch=false`
Co-authored-by: lonexreb <reach2shubhankar@gmail.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
Align diagnostic stuck-session recovery in-flight dedup with the runtime recovery key. The coordinator now dedups by logical session ref only, so a mid-flight generation bump cannot emit a phantom `session.recovery.requested` event that runtime recovery skips as already in flight.
Adds a regression test for the idle-queued stall path where a queued message bumps generation while recovery is pending.
Fixes#88010
Extract shared normalization/coercion helpers into private @openclaw/normalization-core workspace package while preserving existing plugin SDK helper subpaths.\n\nAlso keeps direct normalization-core imports internal, wires UI/build/loader resolution, and replaces the slow PR network CodeQL lane with a fast added-line boundary scan while retaining full CodeQL for scheduled/manual runs.\n\nVerification: local moved tests, plugin SDK boundary tests, extension loader tests, agents-support shard, UI build/test, build artifacts, lint, workflow guards, autoreview, and GitHub CI passed on PR head 963d893715.
Reduce repeated gateway warning noise in startup/auth retry paths while preserving credential mismatch and rate-limit audit visibility.
Also hardens empty embedded-assistant retry handling by carrying lifecycle state through the missing-assistant guard, and keeps the relevant regression coverage in gateway and agent tests.
Recover idle queued sessions whose diagnostic activity retained stale ownerless model or tool calls by classifying them as recoverable session.stuck after the usual recovery gates. Yield the event loop before stale session-lock process inspection so sync process lookup cannot monopolize lock contention paths.
Docs now describe the widened session.stuck telemetry contract for recoverable stale bookkeeping, including ownerless activity. Thanks @samuelsoaress.
Refs #84903.
Co-authored-by: samuelsoaress <samuelsoares177778@gmail.com>
Summary:
- The PR updates diagnostics to mark streamed model chunks as run progress, keeps silent model calls abortable after the stuck-session timeout, and adds regression coverage for stream progress and recovery behavior.
- PR surface: Source +54, Tests +229. Total +283 across 6 files.
- Reproducibility: yes. at source level: current main tracks model-call start/end activity but streamed chunks ... covery keys on stale lastProgressAgeMs. I did not run a live local-provider repro in this read-only review.
Automerge notes:
- PR branch already contained follow-up commit before automerge: fix(diagnostics): track model stream progress
- PR branch already contained follow-up commit before automerge: test(diagnostics): cover silent local model aborts
- PR branch already contained follow-up commit before automerge: fix(diagnostics): skip stream progress when disabled
Validation:
- ClawSweeper review passed for head fcc74d9869.
- Required merge gates passed before the squash merge.
Prepared head SHA: fcc74d9869
Review: https://github.com/openclaw/openclaw/pull/86757#issuecomment-4540111930
Co-authored-by: Onur Solmaz <2453968+osolmaz@users.noreply.github.com>
Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Approved-by: osolmaz
Co-authored-by: osolmaz <2453968+osolmaz@users.noreply.github.com>
When the gateway process is orphaned after a systemd service restart,
the parent's journal pipe closes and every write to stdout/stderr returns
EPIPE. The previous handler swallowed it with a bare return, so background
loops (config file watcher, etc.) kept firing and the process spun at
100% CPU indefinitely.
Exit cleanly with code 0 instead — a process whose own output streams
are broken has nowhere to log and no reason to keep running.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Refactor diagnostic queued/state/processed emission into a shared helper used by dispatch and isolated cron turns.
Preserve dispatch processed-event behavior, cron queue-depth symmetry, and final cron session-id adoption while adding focused helper coverage and reviewer comments for the non-obvious invariants.
Reverts the diagnostic queue-pressure suppression of non-terminal session tool mirrors from PR 84846 while keeping PR 86503 recipient dedupe intact. Session-only Control UI subscribers keep receiving tool lifecycle mirrors; overlapping run and session subscribers still receive one canonical run-scoped frame. Verification: focused gateway and diagnostic tests, diff check, changed check, and autoreview all passed.
* fix(diagnostics): reclaim wedged session lanes with a stale leaked active run
A group session lane could wedge permanently (#85639): an embedded run that dies
abnormally leaves a stale ACTIVE_EMBEDDED_RUNS handle, so the diagnostic heartbeat
classifies the lane stale_session_state (recoveryEligible without allowActiveAbort)
while stuck-session recovery reads the leaked isEmbeddedPiRunActive flag and skips
with active_reply_work — a tautology that keeps the lane forever. The age-based
escape never fires because ageMs (last-activity) resets on every incoming queued
message.
Make the active-run skip a liveness check: before keeping the lane, consult the
run's real forward-progress age (lastProgressAgeMs, not refreshed by incoming
messages). If a run flagged active has made no forward progress past the resolved
diagnostics.stuckSessionAbortMs threshold (threaded through the recovery request;
falls back to a 5-minute floor) with queued work waiting, treat it as a
leaked/dead handle and reclaim it (abort + drain + force-clear) instead of
skipping. A genuinely progressing run, or one within an operator-raised
threshold, is kept.
Fixes#85639
* test(diagnostics): cover stale active run recovery
---------
Co-authored-by: Peter Steinberger <steipete@gmail.com>
Summary:
- This replacement PR adds inbound delivery diagnostic events, gateway status counters and warnings, transport ... ut, Prometheus/OpenTelemetry metrics, docs, changelog, and regression coverage for gateway delivery health.
- Reproducibility: no. high-confidence live reproduction of the original Feishu failure was run here. Source i ... ch/turn telemetry, and the source PR supplies after-fix live output for the connected WebChat gateway path.
Automerge notes:
- PR branch already contained follow-up commit before automerge: fix(types): restore PR conflict resolution type checks
Validation:
- ClawSweeper review passed for head 6ffe08a9c7.
- Required merge gates passed before the squash merge.
Prepared head SHA: 6ffe08a9c7
Review: https://github.com/openclaw/openclaw/pull/85016#issuecomment-4510224436
Co-authored-by: Andi Liao <liaoandi95@gmail.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Approved-by: takhoffman
Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com>
Summary:
- This PR filters partial skill snapshot entries in trajectory support metadata, accepts nullish support-redaction paths, adds regression tests, and records the fix in the changelog.
- Reproducibility: yes. Source inspection on current main shows undefined skill path/name values can reach str ... and the related source PR provides redacted live before/after gateway logs for the symlink-escape scenario.
Automerge notes:
- PR branch already contained follow-up commit before automerge: fix(trajectory): tighten test types for partial skill entries
- PR branch already contained follow-up commit before automerge: fix(trajectory): tolerate partial skill snapshot entries in support c…
Validation:
- ClawSweeper review passed for head ecb3df6c08.
- Required merge gates passed before the squash merge.
Prepared head SHA: ecb3df6c08
Review: https://github.com/openclaw/openclaw/pull/84797#issuecomment-4504703074
Co-authored-by: Luke Boyett <46942646+lukeboyett@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Approved-by: takhoffman
Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com>
Fix diagnostics/session usage limit handling and voice-call numeric CLI validation.
- Treat explicit zero, negative, and non-finite diagnostics/session limits as empty results instead of falling back to defaults.
- Reject invalid, non-finite, and fractional voice-call numeric flags.
- Add focused tests and a live repro proof for the canonical edge cases.
Fixes#82646, #82650, #82651, #82653.
Co-authored-by: wuyangfan <1102042793@qq.com>