* fix(sessions): stop doctor OOM on large session stores and reclaim stale store temps
`openclaw doctor` loaded the full sessions.json via loadSessionStore with the
default cache-write plus return clone, materializing a multi-hundred-MB
monolithic store several times and exhausting the heap (#56827). The read-only
doctor checks (state integrity, heartbeat target, codex route scan) now load
with { skipCache: true, clone: false } so the store is materialized once.
Orphaned session-store atomic-write temps were also never reclaimed: the store
write went through the generic atomic writer, staging a shared
.fs-safe-replace.<pid>.<uuid>.tmp not identifiable as a store temp. Give the
store write a store-specific tempPrefix so its temps stage as
sessions.json.<pid>.<uuid>.tmp, classify them (isSessionStoreTempArtifactName),
and reclaim stale ones via the disk-budget sweep and the unreferenced-artifact
prune on a short staleness window so in-flight temps are preserved.
Fixes#56827
* docs(changelog): note large session store doctor fix
* test(qa): preserve WhatsApp RTT source literal
---------
Co-authored-by: Peter Steinberger <steipete@gmail.com>
Summary:
- This PR adds an Ollama Kimi-cloud visible-content sanitizer for streamed and final assistant replies, updates stream handling and regression tests, and adds a changelog entry.
- PR surface: Source +183, Tests +473, Docs +1. Total +657 across 7 files.
- Reproducibility: yes. from source and the linked report: current main appends Ollama `message.content` direc ... payload described in the issue would be shown. I did not run a live vendor repro in this read-only review.
Automerge notes:
- PR branch already contained follow-up commit before automerge: fix(ollama): sanitize kimi inline reasoning in stream events
- PR branch already contained follow-up commit before automerge: fix(ollama): buffer kimi cloud stream reasoning
- PR branch already contained follow-up commit before automerge: fix(ollama): cover kimi inline boundary variants
- PR branch already contained follow-up commit before automerge: fix(ollama): preserve text start partial state
- PR branch already contained follow-up commit before automerge: fix(ollama): bound kimi stream sanitizer hold
- PR branch already contained follow-up commit before automerge: fix(ollama): keep kimi sanitizer deltas append-only
Validation:
- ClawSweeper review passed for head b709229157.
- Required merge gates passed before the squash merge.
Prepared head SHA: b709229157
Review: https://github.com/openclaw/openclaw/pull/86515#issuecomment-4534945393
Co-authored-by: Jason O'Neal <jason.allen.oneal@gmail.com>
Co-authored-by: Onur Solmaz <2453968+osolmaz@users.noreply.github.com>
Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Approved-by: osolmaz
Co-authored-by: osolmaz <2453968+osolmaz@users.noreply.github.com>
Summary:
- This PR changes the shared block reply coalescer/pipeline so compatible buffered visible text is merged into a following media payload, adds focused regression tests, and records a Discord changelog fix.
- PR surface: Source +50, Tests +175, Docs +1. Total +226 across 6 files.
- Reproducibility: yes. Current main has a clear source reproduction path: media enqueue forces a text flush and then sends the media payload separately, and the PR adds focused tests for the corrected merge path.
Automerge notes:
- PR branch already contained follow-up commit before automerge: fix: route streamed media through reply coalescer
- PR branch already contained follow-up commit before automerge: fix(discord): merge media captions into one message
- PR branch already contained follow-up commit before automerge: fix(clawsweeper): address review for automerge-openclaw-openclaw-8648…
Validation:
- ClawSweeper review passed for head ceafbeaf3c.
- Required merge gates passed before the squash merge.
Prepared head SHA: ceafbeaf3c
Review: https://github.com/openclaw/openclaw/pull/86487#issuecomment-4534402219
Co-authored-by: Neerav Makwana <261249544+neeravmakwana@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Approved-by: takhoffman
Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com>
* fix(memory): prevent silent vector index degradation when embedding provider temporarily unavailable
Two related bugs cause complete loss of semantic vector data:
1. Promise cache deadlock in ensureProviderInitialized():
When the embedding provider (e.g. local MLX server on port 8123) is
temporarily unreachable at Gateway startup, loadProviderResult() throws
and providerInitPromise becomes a permanently-cached Rejected Promise.
The block only clears it on success (providerInitialized=true),
so the stale rejection blocks all future init attempts until Gateway restart.
2. Silent fts-only overwrite in runSync():
With the provider stuck at null, shouldRunFullMemoryReindex() compares
the stored meta.model (e.g. 'jina-embeddings-v5-text-small') against the
runtime provider model, and since provider is null, falls through to the
'meta.model !== fts-only' check — returning true. This triggers a full
reindex where every file is written as fts-only, silently erasing all
existing 11k+ semantic vectors.
Fix 1: Clear providerInitPromise in the catch block so the next call can
retry initialization (self-healing when the provider comes back online).
Fix 2: Guard runSync() — if requestedProvider is set and not 'none', but
the runtime provider is null, throw an error instead of silently degrading
to fts-only. This protects existing vector data by failing loudly.
Tested on production: 11,715 chunks + 1024-dim vectors fully preserved
after Gateway restart with the fix applied. The guard correctly blocks
sync when MLX is offline and allows normal operation when it recovers.
* fix: use this.settings.provider instead of private requestedProvider
The guard clause in runSync() was referencing this.requestedProvider
which is a private property on the MemoryIndexManager subclass and not
accessible from MemoryManagerSyncOps. Use this.settings.provider
instead, which is the same value and is accessible via the protected
abstract settings property.
* fix(memory): narrow degradation guard to only protect existing semantic indexes
The previous guard was too broad — it blocked sync for ALL non-none
provider configurations when provider was null, including the default
'auto' path where users without embedding credentials legitimately
build FTS-only indexes.
Narrow the guard to only abort when:
1. provider is null (embedding unavailable)
2. existing index metadata has a semantic model (not 'fts-only')
3. settings.provider is configured and not 'none'
This preserves the legitimate FTS-only fallback for auto/no-provider
users while still protecting existing semantic vector indexes from
silent degradation.
Reported-by: ClawSweeper (PR #85704 review)
* test: cover memory semantic index outage guard
* fix: protect semantic memory index fallback paths
* test: update memory sync harnesses
---------
Co-authored-by: Bo Yan <yaaboo-gif@users.noreply.github.com>
Co-authored-by: Yan Bo <yanbo@Mac.lan>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
Summary:
- The branch replaces QQBot's hardcoded outbound response watchdog with a resolver based on existing agent/provider `timeoutSeconds` settings, adds regression tests, and updates the changelog.
- PR surface: Source +113, Tests +116, Docs +1. Total +230 across 5 files.
- Reproducibility: yes. at source level: current main and the latest release use a hardcoded 300000 ms QQBot o ... s an 1800s provider timeout. I did not run the reporter's live QQBot/Ollama setup in this read-only review.
Automerge notes:
- PR branch already contained follow-up commit before automerge: test(qqbot): cover slow provider response watchdog
- PR branch already contained follow-up commit before automerge: fix(qqbot): derive outbound watchdog from configured timeouts (#85267)
- PR branch already contained follow-up commit before automerge: fix(clawsweeper): address review for automerge-openclaw-openclaw-8527…
Validation:
- ClawSweeper review passed for head 7bd829292a.
- Required merge gates passed before the squash merge.
Prepared head SHA: 7bd829292a
Review: https://github.com/openclaw/openclaw/pull/86500#issuecomment-4534669816
Co-authored-by: SymbolStar <symbolstar@users.noreply.github.com>
Co-authored-by: Onur Solmaz <2453968+osolmaz@users.noreply.github.com>
Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Approved-by: osolmaz
Co-authored-by: osolmaz <2453968+osolmaz@users.noreply.github.com>
Summary:
- Adds a scoped ModelStudio/DashScope OpenAI-compatible guard for chat payloads with no non-empty user or assi ... turn, shared turn-detection helper coverage, prompt-skip handling, regression tests, and a changelog entry.
- PR surface: Source +83, Tests +298, Docs +1. Total +382 across 10 files.
- Reproducibility: yes. source-reproducible for the OpenClaw-side malformed payload shape: current main has no ... he exact qwen-long/qwen3-coder-plus provider error was not reproduced with the available DashScope account.
Automerge notes:
- PR branch already contained follow-up commit before automerge: fix: make OpenAI payload guard content-aware
- PR branch already contained follow-up commit before automerge: fix: scope openai payload turn guard
- PR branch already contained follow-up commit before automerge: Guard OpenAI chat payload turns
Validation:
- ClawSweeper review passed for head e16a3fe9f2.
- Required merge gates passed before the squash merge.
Prepared head SHA: e16a3fe9f2
Review: https://github.com/openclaw/openclaw/pull/86497#issuecomment-4534668405
Co-authored-by: Andy Ye <35905412+TurboTheTurtle@users.noreply.github.com>
Co-authored-by: Onur Solmaz <2453968+osolmaz@users.noreply.github.com>
Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Approved-by: osolmaz
Co-authored-by: osolmaz <2453968+osolmaz@users.noreply.github.com>
Reverts the diagnostic queue-pressure suppression of non-terminal session tool mirrors from PR 84846 while keeping PR 86503 recipient dedupe intact. Session-only Control UI subscribers keep receiving tool lifecycle mirrors; overlapping run and session subscribers still receive one canonical run-scoped frame. Verification: focused gateway and diagnostic tests, diff check, changed check, and autoreview all passed.
* fix(agents): release embedded-attempt session lock on every exit path
The embedded run controller acquires its session write lock eagerly at
creation and released it only inside the post-run cleanup block. An
exception thrown in post-prompt processing skipped that block, so the lock
leaked to the live gateway process until the watchdog reclaimed it and
later requests to the session failed with SessionWriteLockTimeoutError.
Add an idempotent dispose() to the lock controller and call it from the
run's outer finally so the eagerly-held lock is released on every exit
path. Normal/aborted/timed-out runs still hand the lock to
acquireForCleanup first, so dispose() is a no-op then (no double release).
Fixes#86014
* fix: keep session lock teardown comment lean
* docs(changelog): note embedded session lock fix
---------
Co-authored-by: Peter Steinberger <steipete@gmail.com>
Dedupe gateway tool-event fanout so connections subscribed by both run and session receive the canonical run-scoped agent event only, while session-only subscribers keep the compatibility session.tool mirror.\n\nVerification:\n- node scripts/run-vitest.mjs src/gateway/server-chat.agent-events.test.ts\n- git diff --check\n- env -u OPENCLAW_TESTBOX pnpm check:changed\n- .agents/skills/autoreview/scripts/autoreview --mode local
Summary:
- The PR expands security audit, CLI docs, and tests so `hooks.token` reuse of active Gateway token/password auth is reported while password-mode Gateway startup remains compatible.
- PR surface: Source +178, Tests +311, Docs +14. Total +503 across 14 files.
- Reproducibility: yes. from source inspection: current main forwards a bearer token as both token and passwor ... ecause this review was read-only, but the linked issue and code path make the reproduction high confidence.
Automerge notes:
- PR branch already contained follow-up commit before automerge: fix(cr-fmi-hook-ingress-token-unlocks-password-mode-gateway-auth): ap…
- PR branch already contained follow-up commit before automerge: fix: include trusted proxy password in hooks token reuse check
- PR branch already contained follow-up commit before automerge: fix(gateway): audit hooks password reuse without blocking startup
- PR branch already contained follow-up commit before automerge: fix: Hook ingress token unlocks password-mode gateway auth
Validation:
- ClawSweeper review passed for head 7c796b22ec.
- Required merge gates passed before the squash merge.
Prepared head SHA: 7c796b22ec
Review: https://github.com/openclaw/openclaw/pull/86453#issuecomment-4533831028
Co-authored-by: Coy Geek <65363919+coygeek@users.noreply.github.com>
Co-authored-by: jesse-merhi <79823012+jesse-merhi@users.noreply.github.com>
Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Approved-by: jesse-merhi
* fix(diagnostics): reclaim wedged session lanes with a stale leaked active run
A group session lane could wedge permanently (#85639): an embedded run that dies
abnormally leaves a stale ACTIVE_EMBEDDED_RUNS handle, so the diagnostic heartbeat
classifies the lane stale_session_state (recoveryEligible without allowActiveAbort)
while stuck-session recovery reads the leaked isEmbeddedPiRunActive flag and skips
with active_reply_work — a tautology that keeps the lane forever. The age-based
escape never fires because ageMs (last-activity) resets on every incoming queued
message.
Make the active-run skip a liveness check: before keeping the lane, consult the
run's real forward-progress age (lastProgressAgeMs, not refreshed by incoming
messages). If a run flagged active has made no forward progress past the resolved
diagnostics.stuckSessionAbortMs threshold (threaded through the recovery request;
falls back to a 5-minute floor) with queued work waiting, treat it as a
leaked/dead handle and reclaim it (abort + drain + force-clear) instead of
skipping. A genuinely progressing run, or one within an operator-raised
threshold, is kept.
Fixes#85639
* test(diagnostics): cover stale active run recovery
---------
Co-authored-by: Peter Steinberger <steipete@gmail.com>
Summary:
- The PR adds HEIC/HEIF-to-JPEG normalization before media-understanding image description providers run, with regression tests and a changelog entry.
- PR surface: Source +58, Tests +82, Docs +1. Total +141 across 6 files.
- Reproducibility: yes. at source level: current main forwards HEIC buffers to `describeImage` without normali ... ody includes a red HEIC regression test before the patch. I did not execute tests in this read-only review.
Automerge notes:
- PR branch already contained follow-up commit before automerge: fix(media-understanding): normalize HEIC before image descriptions
Validation:
- ClawSweeper review passed for head ed34620bd7.
- Required merge gates passed before the squash merge.
Prepared head SHA: ed34620bd7
Review: https://github.com/openclaw/openclaw/pull/86037#issuecomment-4528578874
Co-authored-by: luoyanglang <hanwanlonga@gmail.com>
Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Approved-by: takhoffman
Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com>
- Rotate OpenAI Realtime voice sessions on provider max-duration events without surfacing the expected expiry as a Discord voice error.
- Add lifecycle logging for Realtime rotation/reconnect and regression coverage for max-duration reconnect.
- Allowlist the existing Control UI chunking helper for the optional Knip unused-file guard so the dependency shard stays green on the current base.
Catch non-ENOENT load failures inside maybeRepairLegacyCronStore so an
unreadable ~/.openclaw/cron/jobs.json (e.g. root-owned 0600 inside
Docker) no longer aborts the rest of the doctor health checks. The
scheduler-side loadCronStore keeps its strict throw-on-read-failure
contract.
Closes#86102
Co-authored-by: 1052326311 <1052326311@users.noreply.github.com>
After config.patch writes new values to openclaw.json, a subsequent
SIGUSR1 in-process restart could overwrite them with a stale snapshot.
Root cause: run-loop's onIteration hook resets lanes and task registry,
but leaves the runtimeConfigSnapshot intact. loadConfig() then returns
the old snapshot via loadPinnedRuntimeConfig() instead of re-reading disk.
Fix: clearRuntimeConfigSnapshot() in the restart iteration hook so the
next startup reads fresh config from disk.
Refs #86350