* fix(llm): collapse cumulative openai-responses message snapshots instead of concatenating
Some openai-responses providers (observed: Bedrock Mantle with GPT-5.x
reasoning enabled, confirmed server-side via raw curl) re-emit the
assistant message as many cumulative snapshot items — each a
prefix-superset of the previous one — instead of a single final message
item. Both stream consumers appended one text block per item, so the
final visible reply, transcript, and replay context repeated the answer
once per snapshot (observed 49-80x).
Treat a same-phase message item whose text extends the immediately
preceding text block as a replacement: the prior block takes the longer
text, the duplicate block is dropped, and the first item's signature is
kept so replay and stream-item identity stay stable. Shrinking or
identical adjacent snapshots are dropped. Any non-message output item
(reasoning, tool call) is a real boundary that resets the collapse, so
distinct post-tool messages and reasoning replay pairing are untouched,
as are different-phase (commentary/final_answer) items. Applies to the
agent transport stream, the shared LLM consumer, and completed-response
backfill.
Fixes #91959. Reported by @phoenixyy with server-side evidence from
@DaiMingNJ.
* test(llm): drop redundant stream drains from responses snapshot tests
* fix(llm): collapse only strict snapshot extensions and keep newest item signature
Address ClawSweeper P1 review findings on #92399: text-prefix relation
alone was broader than the observed corruption. Equal or shrinking
adjacent same-phase message items are now always kept as distinct blocks
(the Responses protocol allows multiple message items per response —
verified against the sibling Codex parser, codex-rs/codex-api/src/sse/
responses.rs, which emits every output_item.done message as an
independent item). With extension-only collapse a false positive can
only merge rendering of two messages; it can never remove text.
The merged block now carries the newest item's signature instead of the
first one's, so replay associates the final content with the item that
actually produced it.
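A minimal sketch of the collapse rule as described above, not the actual consumer code. The `OutputItem` and `TextBlock` shapes and the `collapseSnapshots` name are assumptions made for illustration; only the rule itself (same phase, strict prefix extension, newest signature wins, any non-message item resets the anchor) comes from this message.

```typescript
// Hypothetical item and block shapes; the real types are not shown in this message.
type OutputItem =
  | { kind: "message"; phase: string; text: string; signature: string }
  | { kind: "other" }; // reasoning, tool call, or any other non-message item

interface TextBlock {
  phase: string;
  text: string;
  signature: string;
}

function collapseSnapshots(items: OutputItem[]): TextBlock[] {
  const blocks: TextBlock[] = [];
  let anchor: TextBlock | null = null; // the immediately preceding text block, if any

  for (const item of items) {
    if (item.kind !== "message") {
      // Any non-message item is a real boundary: stop collapsing across it.
      anchor = null;
      continue;
    }
    const isStrictExtension =
      anchor !== null &&
      anchor.phase === item.phase &&
      item.text.length > anchor.text.length &&
      item.text.startsWith(anchor.text);

    if (anchor !== null && isStrictExtension) {
      // Snapshot: grow the prior block and adopt the newest item's signature.
      anchor.text = item.text;
      anchor.signature = item.signature;
    } else {
      // Equal, shrinking, diverging, or different-phase items stay distinct.
      anchor = { phase: item.phase, text: item.text, signature: item.signature };
      blocks.push(anchor);
    }
  }
  return blocks;
}
```

Because only strict extensions merge, the merged block always contains every character of the items it absorbed, which is why a false positive can change how two messages render but cannot drop text.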
* fix(llm): defer snapshot-candidate message blocks to keep the event lifecycle balanced
Address the remaining ClawSweeper P1 on #92399: collapsing a snapshot
used to pop a block whose text_start had already been emitted, leaving
per-index stream subscribers tracking a phantom block.
A message item that follows a finalized text block now defers its public
block: no text_start is emitted and deltas are withheld until the item
either diverges from the prior text (then the block opens and the
withheld prefix replays as one delta) or completes. A collapsed snapshot
therefore never starts a block — it only re-ends the prior index with
grown content, the documented resend shape — and a distinct deferred
item opens and closes its own block normally. No block is ever removed,
so every text_start has exactly one matching text_end at a live index.
Tests now assert the complete ordered event sequence for the collapse,
distinct-item, and divergence cases in both consumers.
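The deferral described above can be illustrated with a small single-phase model. This is a sketch under assumptions: the `DeferringConsumer` class, its method names, and the exact `StreamEvent` payloads are hypothetical, and the real consumers also handle phases, signatures, and non-message boundaries that are left out here.

```typescript
// Hypothetical event shapes modeled on the text_start / text_end names used above.
type StreamEvent =
  | { type: "text_start"; index: number }
  | { type: "text_delta"; index: number; delta: string }
  | { type: "text_end"; index: number; content: string };

class DeferringConsumer {
  readonly events: StreamEvent[] = [];
  private readonly blocks: string[] = []; // finalized block texts
  private withheld: string | null = null; // non-null while the current item is deferred
  private current = ""; // text of the currently open block

  beginItem(): void {
    if (this.blocks.length > 0) {
      this.withheld = ""; // follows a finalized block: defer the public block
    } else {
      this.openBlock("");
    }
  }

  delta(text: string): void {
    if (this.withheld === null) {
      this.current += text;
      this.events.push({ type: "text_delta", index: this.blocks.length, delta: text });
      return;
    }
    this.withheld += text;
    const prior = this.blocks[this.blocks.length - 1];
    // Still a snapshot candidate while one text is a prefix of the other.
    const candidate = prior.startsWith(this.withheld) || this.withheld.startsWith(prior);
    if (!candidate) this.openBlock(this.withheld); // diverged: open and replay once
  }

  endItem(): void {
    if (this.withheld !== null) {
      const priorIndex = this.blocks.length - 1;
      const prior = this.blocks[priorIndex];
      if (this.withheld.length > prior.length && this.withheld.startsWith(prior)) {
        // Collapsed snapshot: never starts a block, only re-ends the prior index.
        this.blocks[priorIndex] = this.withheld;
        this.events.push({ type: "text_end", index: priorIndex, content: this.withheld });
        this.withheld = null;
        return;
      }
      this.openBlock(this.withheld); // distinct item: open its own block now
    }
    this.events.push({ type: "text_end", index: this.blocks.length, content: this.current });
    this.blocks.push(this.current);
    this.current = "";
  }

  private openBlock(replay: string): void {
    const index = this.blocks.length;
    this.withheld = null;
    this.current = replay;
    this.events.push({ type: "text_start", index });
    if (replay.length > 0) this.events.push({ type: "text_delta", index, delta: replay });
  }
}
```

In this model a block is only ever appended, never popped, so a subscriber tracking indices sees each `text_start` closed at the same index, plus an extra `text_end` on the prior index when a snapshot grows it.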
* fix(llm): treat any non-message item as a collapse boundary in completed-response backfill
The streaming consumer resets the snapshot-collapse anchor on every
non-message output item ("any other item is a real boundary"), but the
transport's completed-response backfill only dispatched message and
function_call items, so a reasoning item between two strict-prefix
message items did not reset the anchor and the later message could
collapse across it — an asymmetry with the streaming path's documented
invariant. Reset lastTextBlock for every non-message item in the backfill
loop (one canonical place; the per-tool-call reset is now redundant and
removed). Covered by a backfill reasoning-boundary regression test.
Since #85341 the per-model visibility probes behind the chat /models command
(isCliRuntimeProvider({ includeSetupRegistry: true }) in commands-models.ts)
rebuild the plugin setup registry on every call: a synchronous ~65ms manifest
re-scan plus plugin setup module re-execution, issued hundreds of times per
listing. On the stock bundled plugin set this pins a CPU core for ~49s per
workflow step (list -> pick provider -> pick model), in every chat channel.
Cache the manifest scan and the resolved registry in bounded PluginLruCaches
keyed by the control-plane fingerprint, discovery-env fingerprint, metadata
snapshot identity, cwd, and pluginIds scope, with clone-on-store/clone-on-hit
isolation; invalidation rides the existing plugin-metadata lifecycle clear.
Output is identical; the /models data build drops from ~49s to ~150ms and the
per-model probe from ~65ms to ~0.2ms.
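For illustration, a bounded cache with clone-on-store/clone-on-hit isolation might look like the sketch below. `BoundedCloneCache`, `registryCacheKey`, and the key field names are hypothetical stand-ins, not the actual PluginLruCache API, and the JSON round trip assumes the cached values are plain serializable data.

```typescript
// Assumption: cached values are plain JSON-serializable data, so a JSON
// round trip is an adequate deep copy for this sketch.
function deepCopy<T>(value: T): T {
  return JSON.parse(JSON.stringify(value)) as T;
}

// Hypothetical stand-in for a bounded LRU cache with copy isolation.
class BoundedCloneCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    // Map keeps insertion order, so re-inserting marks the key most recent.
    this.entries.delete(key);
    this.entries.set(key, hit);
    return deepCopy(hit); // clone-on-hit: callers cannot mutate the cached value
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, deepCopy(value)); // clone-on-store
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  clear(): void {
    this.entries.clear(); // would be called from the metadata lifecycle clear
  }
}

// Hypothetical key parts mirroring the inputs listed above.
interface RegistryCacheKeyParts {
  controlPlaneFingerprint: string;
  discoveryEnvFingerprint: string;
  metadataSnapshotId: string;
  cwd: string;
  pluginIds: string[];
}

function registryCacheKey(parts: RegistryCacheKeyParts): string {
  return JSON.stringify([
    parts.controlPlaneFingerprint,
    parts.discoveryEnvFingerprint,
    parts.metadataSnapshotId,
    parts.cwd,
    [...parts.pluginIds].sort(), // scope order should not change the key
  ]);
}
```

Copying on both store and hit costs a little per lookup but means a caller that mutates a returned registry cannot poison later hits, which matters when the result was previously rebuilt fresh on every call.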
Add optional directUserId field to ChannelModelOverrideParams so the
shared channels.modelByChannel resolver can match DM-specific config
entries. Callers pass sessionEntry.origin?.nativeDirectUserId.
Closes #53638
Co-authored-by: Thomas Zhengtao <thomas.zhengtao@gmail.com>
A concurrent atomic rewrite (write-temp + rename) of a memory-wiki source
page by the bridge re-export made fs-safe's opened-fd identity check fail
with `path-mismatch`, which the page write rethrew as a fatal "Refusing to
write" error and aborted the whole wiki_status / source-sync call. The race
is transient and benign: the file is replaced under the open handle and the
concurrent writer lands equivalent content.
Retry briefly on `path-mismatch` (the rename window closes sub-ms) and
rethrow unchanged on exhaustion, so persistent failures (directory
collision, not-file) and symlink/path-alias swaps still hard-fail exactly
as before. The identity guard is untouched; only the benign rename race is
retried, matching the sibling read path that already treats path-mismatch
as transient.
Extracts the guarded-write logic duplicated by source-page-shared.ts and
okf.ts into one writeGuardedVaultPage helper so both write paths get the
fix and the copy is removed.
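A rough sketch of the retry shape described above. The `retryOnPathMismatch` name, the attempt count, the delay, and the duck-typed `code` check are assumptions for illustration; the real writeGuardedVaultPage helper and fs-safe's error type are not shown in this message.

```typescript
// Assumption: the guarded write rejects with an error carrying a string `code`.
function isPathMismatch(error: unknown): boolean {
  return (
    typeof error === "object" &&
    error !== null &&
    (error as { code?: unknown }).code === "path-mismatch"
  );
}

async function retryOnPathMismatch<T>(
  write: () => Promise<T>,
  attempts = 3, // hypothetical budget
  delayMs = 2, // hypothetical; the rename window is described as sub-ms
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await write();
    } catch (error) {
      // Anything other than the benign rename race fails immediately.
      if (!isPathMismatch(error)) throw error;
      lastError = error;
      if (attempt < attempts - 1) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Exhausted: rethrow the original error unchanged so persistent
  // mismatches still hard-fail.
  throw lastError;
}
```

Only the retry loop wraps the write; the identity check inside `write` runs in full on every attempt, so a persistent swap keeps failing the same way it did before.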
Closes #92134
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
cron.add recomputed every job's next-run time via recomputeNextRuns after
appending the new job. recomputeNextRuns advances nextRunAtMs whenever
now >= nextRun, so an unrelated add advanced any sibling recurring job whose
slot was due but had not yet fired, discarding that occurrence with no error
and no log. lastRunAtMs stayed unchanged while nextRunAtMs jumped one interval
forward, so the run was silently lost.
Switch add and remove onto recomputeNextRunsForMaintenance plus
ensureLoaded(state, { skipRecompute: true }), matching every other ops.ts
caller (read ops, update, finalize, reload, startup). Maintenance recompute
backfills missing next-run times but never advances a present past-due slot,
preserving the invariant introduced for the timer/read/startup paths in
#13992 / #16156 / #17852.
Adds a regression test that fails on main (the due slot advances a full
interval) and passes with the fix.
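The difference between the two recompute modes can be sketched as follows. `advanceDueSlots` and `backfillMissingSlots` are simplified, hypothetical stand-ins for recomputeNextRuns and recomputeNextRunsForMaintenance, and the fixed-interval `CronJob` shape is an assumption; real schedules are richer.

```typescript
// Hypothetical, simplified job shape.
interface CronJob {
  id: string;
  intervalMs: number;
  nextRunAtMs?: number;
}

// Advancing recompute: any slot that is already due is pushed forward.
// Calling this from an unrelated add is what discarded the due occurrence.
function advanceDueSlots(jobs: CronJob[], now: number): void {
  for (const job of jobs) {
    if (job.nextRunAtMs === undefined) {
      job.nextRunAtMs = now + job.intervalMs;
      continue;
    }
    while (now >= job.nextRunAtMs) {
      job.nextRunAtMs += job.intervalMs;
    }
  }
}

// Maintenance recompute: fill in missing slots only; a present past-due
// slot is left alone so the timer can still fire it.
function backfillMissingSlots(jobs: CronJob[], now: number): void {
  for (const job of jobs) {
    if (job.nextRunAtMs === undefined) {
      job.nextRunAtMs = now + job.intervalMs;
    }
  }
}
```

With a sibling job due at `1000` and an add arriving at `1500`, the advancing variant moves the slot to `61000` for a one-minute interval, while the maintenance variant leaves it at `1000`.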
A completed session (status: done/success) whose abort controller expires
during maintenance was incorrectly matched by markRestartAbortedMainSessions.
The matched activeRun's lifecycleGeneration matched the current generation
(no restart occurred), but entry.updatedAt < run.observedAt allowed the
entry to be marked as running+aborted, triggering a false restart recovery.
Fix: require that the timing condition (updatedAt < observedAt) only applies
for stale-generation runs (provenance: pre-restart). Current-generation runs
with observedAt after the session's updatedAt are maintenance-expired abort
controllers and must not reopen completed sessions.
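Reduced to a predicate, the corrected condition reads roughly as below. This is a sketch only: `isRestartAborted` and the two interfaces are hypothetical, and markRestartAbortedMainSessions presumably checks more than these two fields.

```typescript
// Hypothetical, trimmed-down shapes for illustration.
interface SessionEntry {
  updatedAt: number;
}

interface ActiveRun {
  lifecycleGeneration: number;
  observedAt: number;
}

function isRestartAborted(
  entry: SessionEntry,
  run: ActiveRun,
  currentGeneration: number,
): boolean {
  // Only a run observed under an earlier lifecycle generation can have been
  // interrupted by a restart.
  const isStaleGeneration = run.lifecycleGeneration !== currentGeneration;
  // The timing condition applies to stale-generation runs only; a
  // current-generation run with a later observedAt is a maintenance-expired
  // abort controller, not a restart casualty.
  return isStaleGeneration && entry.updatedAt < run.observedAt;
}
```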
Related to #95443
* fix(agents): restore model-fetch info logs
* docs(logging): document [model-fetch] default info-level visibility
[model-fetch] response metadata is always emitted at info level
regardless of OPENCLAW_DEBUG_MODEL_TRANSPORT, so users see basic
model transport hygiene (provider, API, model, status, latency)
without needing debug flags.
* docs(logging): clarify model-fetch start metadata visibility
normalizeAgentEventType checked the `phase:"end" || status==="completed"`
branch before the `failed/blocked` branch, but terminal tool/item events are
emitted with phase:"end" AND the real status, so failed and blocked tools were
normalized to tool.call.completed and the tool.call.failed branch was dead for
the item stream. SDK consumers filtering on tool.call.failed never saw tool
failures (they looked like successes). Reorder so failed/blocked is classified
before end/completed.
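The ordering fix can be sketched like this. `classifyToolEvent` and the `ToolEvent` shape are hypothetical simplifications of normalizeAgentEventType; only the two event names and the branch order come from the description above.

```typescript
// Hypothetical, simplified event shape.
interface ToolEvent {
  phase?: string;
  status?: string;
}

function classifyToolEvent(event: ToolEvent): string | undefined {
  // Terminal events carry phase "end" together with the real status, so the
  // failure check has to run first or it can never match.
  if (event.status === "failed" || event.status === "blocked") {
    return "tool.call.failed";
  }
  if (event.phase === "end" || event.status === "completed") {
    return "tool.call.completed";
  }
  return undefined; // other event kinds are out of scope for this sketch
}
```

With the branches in the original order, an event such as `{ phase: "end", status: "failed" }` matches the end/completed check first and is reported as a success.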
Co-authored-by: ly-wang19 <ly-wang19@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(diagnostics-otel): keep full model id on spans (was collapsing to "unknown")
* test(diagnostics-otel): cover slash model span attribution
---------
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>