* fix: recover terminal session status on visible inbound turns (#86827)
When a group chat session enters a terminal status (failed/timeout/killed),
subsequent visible inbound messages now automatically recover the session by
clearing stale lifecycle fields while preserving the session ID and transcript
continuity.
Changes:
- session.ts: detect terminal status on visible turns and clear status/
startedAt/endedAt/runtimeMs/abortedLastRun without rotating sessionId
- dispatch-from-config.ts: force-clear stale active reply operations for
terminal sessions and retry admission once
- agent.ts: mirror terminal recovery in the agent API dispatch path
- kernel.ts: add zero-count visible dispatch warning diagnostic
- types.ts: add 'warning' to ChannelTurnLogEvent event union
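The session.ts recovery step can be pictured with a minimal sketch. The `SessionEntry` shape, the non-terminal status names, and the `recoverOnVisibleTurn` helper are assumptions made for illustration; only the cleared field names and the terminal statuses come from the commit text above.

```typescript
// Hypothetical session shape; only the lifecycle field names are taken from the commit text.
type SessionStatus = "running" | "done" | "failed" | "timeout" | "killed";

interface SessionEntry {
  sessionId: string;
  status?: SessionStatus;
  startedAt?: number;
  endedAt?: number;
  runtimeMs?: number;
  abortedLastRun?: boolean;
}

const TERMINAL_STATUSES: ReadonlySet<SessionStatus> = new Set<SessionStatus>([
  "failed",
  "timeout",
  "killed",
]);

function isTerminal(entry: SessionEntry): boolean {
  return entry.status !== undefined && TERMINAL_STATUSES.has(entry.status);
}

// Clears stale lifecycle fields on a visible turn and keeps sessionId untouched,
// so transcript continuity is preserved.
function recoverOnVisibleTurn(entry: SessionEntry, isVisibleTurn: boolean): SessionEntry {
  if (!isVisibleTurn || !isTerminal(entry)) {
    return entry;
  }
  const next: SessionEntry = { ...entry };
  delete next.status;
  delete next.startedAt;
  delete next.endedAt;
  delete next.runtimeMs;
  delete next.abortedLastRun;
  return next;
}
```

Non-visible turns (heartbeat/control) return the entry unchanged in this sketch, which mirrors the later gating of recovery to visible reply turns.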
* fix: guard terminal recovery from concurrent force-clear race
When two visible messages arrive against the same terminal session
snapshot, the second turn could force-fail the first turn's freshly
admitted recovery operation, dropping the very message the recovery
path exists to save.
Add a terminalRecovery flag on ReplyOperation that is set after a
recovery turn clears the proven stale leftover and admits its own
operation. The force-clear branch now skips operations marked as
in-flight terminal recoveries, letting concurrent turns fall through
to normal busy/wait handling instead.
Add a two-turn regression test that gates the first recovery turn
open, races a second visible turn against the same terminal snapshot,
and asserts neither turn's operation is incorrectly killed.
Also fix missing FinalizedMsgContext import in kernel.ts.
* fix: avoid return value in Promise executor to satisfy lint
* fix: gate terminal force-clear to visible reply turns
A heartbeat/control turn can pass the early active-run short-circuit, reach
the terminal force-clear branch, and abort an in-flight visible recovery
operation that a concurrent visible turn just admitted (before that op is
marked terminalRecovery). Gate the force-clear to visible reply turns so
non-visible turns fall through to normal busy/skip handling instead of
killing the recovery they are meant to protect.
Adds a focused regression test exercising a heartbeat turn that reaches the
force-clear branch against a terminal session snapshot with an active
recovery operation present; it must leave that operation intact and skip.
* fix: verify session identity before terminal force-clear
* fix: mark clean no-stale terminal recovery to survive concurrent visible turn
The same-session terminal-recovery race could still drop a reply when two
visible turns raced the same failed snapshot. The terminalRecovery marker was
set only on the post-force-clear re-admission path, never on the clean no-stale
admission path. So when the first turn admitted cleanly (no stale op to clear),
its recovery op stayed terminalRecovery=false and a second concurrent visible
turn force-cleared it, dropping the first reply (#86827).
Consolidate marking to the single owned-admission choke point that both the
clean no-stale admission and the re-admission-after-force-clear flow through.
Genuine stale leftovers from the original failed run never pass through this
admission, so they stay unmarked and remain force-clearable.
Add a no-stale regression twin to the existing stale-race test.
* fix: suppress zero-count visible-dispatch warning for observed-delivery turns
maybeWarnZeroCountVisibleDispatch re-implemented a partial visibility
check that omitted observedReplyDelivery, so visible turns delivered via
the observed-delivery path (queuedFinal=false, zero counts,
observedReplyDelivery=true) falsely tripped the silent-drop sentinel and
emitted a bogus zero-count-visible-dispatch warning/event.
Use the canonical hasVisibleChannelTurnDispatch helper for the warning
suppression so all non-count delivery paths (observedReplyDelivery,
fallback, summary, queuedFinal) are honored. Add regression tests
covering the observed-delivery case (no warning) and a genuinely empty
visible dispatch (still warns).
* test: read persisted session store in failed-group recovery test
---------
Co-authored-by: 忻役 <xinyi@mininglamp.com>
* fix(cron): guard against undefined sourceDelivery in isolated executor
* fix: narrow legacy sourceReplyDeliveryMode to closed union
Validate legacy sourceReplyDeliveryMode values against the
SourceReplyDeliveryMode union before forwarding them to typed runner
APIs. Unrecognized legacy values now fall back to the sourceDelivery
plan with a warning log.
* fix: align test imports with upstream renames
- SkillSnapshot moved from agents/skills.js to skills/types.js
- runEmbeddedPiAgentMock renamed to runEmbeddedAgentMock
* fix(cron): align sourceDelivery fallback defaults with current cron semantics
* refactor(cron): extract shared source-delivery fallback helper aligned with current main semantics
Replace inline legacy field sniffing in both createCronPromptExecutor and
executeCronRun with a shared resolveFallbackCronSourceDeliveryPlan helper
that derives webhook/none/announce plans from resolveCronDeliveryPlan(job)
and resolvedDelivery, matching current main's resolveCronSourceDeliveryPlan
exactly.
Changes:
- Created source-delivery-fallback.ts with shared helper
- Removed all legacy toolPolicy/sourceReplyDeliveryMode/messageChannel reads
- Updated guard tests to drive fallback via job delivery config
- Added tests verifying stale legacy fields are ignored
- All 12 guard tests pass, 99 cron files / 1002 tests pass
* refactor(cron): unify source-delivery planner into single shared helper
Extract resolveCronSourceDeliveryPlan as the single canonical cron
source-delivery planner in source-delivery-fallback.ts. Both the normal
cron execution path (run.ts) and the version-skew fallback path
(run-executor.ts) now use this shared helper, eliminating the duplicate
implementation that ClawSweeper flagged as P1.
- Remove private resolveCronSourceDeliveryPlan from run.ts
- Export shared resolveCronSourceDeliveryPlan from source-delivery-fallback.ts
- Keep resolveFallbackCronSourceDeliveryPlan as thin wrapper for executor
- Add CronSourceDeliveryResolvedTarget type for shared contract
- Add parity tests verifying normal path matches fallback wrapper
- All 1000 cron tests pass, autoreview clean
* fix: preserve announce fallback skip state when resolvedDelivery.ok is absent
When a stale executor artifact passes resolvedDelivery without the ok
field, the announce fallback path set skipFallbackWhenMessageToolSentToTarget
to undefined (falsy), causing the direct fallback to fire even when the
message tool already delivered — resulting in double-posted messages.
Default ok to true (skip fallback when message tool sent) so stale callers
preserve duplicate suppression. Callers that explicitly set ok=false still
get fallback delivery as intended.
Add regression test for the missing-ok announce fallback path.
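A minimal sketch of the defaulting rule, assuming a simplified `ResolvedDelivery` shape and a hypothetical helper name; the real planner carries more fields than this.

```typescript
// Hypothetical, reduced shape: stale executor artifacts may omit `ok` entirely.
interface ResolvedDelivery {
  ok?: boolean;
}

// A missing `ok` defaults to true, so duplicate suppression stays on.
// An explicit `ok: false` still lets the direct fallback fire.
function skipFallbackWhenMessageToolSent(resolved: ResolvedDelivery): boolean {
  return resolved.ok ?? true;
}
```

Using `??` rather than `||` matters here: only an absent value is defaulted, while an explicit `false` is kept as-is.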
---------
Co-authored-by: 忻役 <xinyi@mininglamp.com>
The multi-account resolver had two bugs that prevented webhook routes
from registering:
1. `accounts.default` was ignored because `resolveLineAccount` short-
circuited the account lookup whenever `accountId` resolved to
`DEFAULT_ACCOUNT_ID`. Credentials placed under
`channels.line.accounts.default` therefore never reached the
gateway, and the default `/line/webhook` route never registered.
2. Named accounts defaulted to disabled when they did not explicitly
set `enabled: true`. A configured second account
(`channels.line.accounts.<name>`) was treated as disabled before
`startAccount` ran, so its webhook never registered either.
Both behaviours diverged from the Telegram plugin, which uses
`baseEnabled && accountEnabled` with each defaulting to
`enabled !== false`. Align the LINE resolver with that idiom and add
seven regression tests covering the credential lookup, the default-
enabled semantics, and channel-level disable propagation.
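The default-enabled idiom can be sketched as follows. The `EnabledFlag` type and the `isAccountEnabled` name are illustrative assumptions, not the plugin's actual API.

```typescript
// Hypothetical config fragment: `enabled` may be omitted at either level.
interface EnabledFlag {
  enabled?: boolean;
}

// Both levels default to enabled; only an explicit `enabled: false` disables.
function isAccountEnabled(
  channel: EnabledFlag | undefined,
  account: EnabledFlag | undefined,
): boolean {
  const baseEnabled = channel?.enabled !== false;
  const accountEnabled = account?.enabled !== false;
  return baseEnabled && accountEnabled;
}
```

With this rule a named account that omits `enabled` is treated as enabled, and a channel-level `enabled: false` still disables every account under it.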
* fix(cli): make models aliases remove honest about built-in aliases
`openclaw models aliases list` shows aliases materialized by
`applyModelDefaults` (the DEFAULT_MODEL_ALIASES table), but
`modelsAliasesRemoveCommand` only mutates the user's source config —
so any built-in alias the user sees in `list` cannot be removed and
the user gets `Error: Alias not found: <X>. Run openclaw models
aliases list to see configured aliases.` despite `<X>` being right
there in the output.
Detect this case in the remove handler and throw an actionable error
pointing the user at `aliases add <name> <model>` as the override path.
The set of visible aliases, `--json` output, and `--plain` output are
unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cli): treat explicit alias: "" as user opt-out in models aliases remove
Tighten the built-in-alias guard in modelsAliasesRemoveCommand to match
applyModelDefaults's materialization contract (entry.alias === undefined,
src/config/defaults.ts:337). The previous falsy check misrouted users who
explicitly disabled a default alias via alias: "" to the 'is a built-in'
error path; list correctly omits such aliases, so remove should follow
suit and return the plain 'Alias not found' message.
Adds a regression test mirroring src/config/model-alias-defaults.test.ts:106.
Addresses clawsweeper review on #81641 (P2 at aliases.ts:106).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cli): normalize source map in models aliases remove built-in guard
The prior built-in-alias guard checked the un-normalized source map, but
applyModelDefaults materializes default aliases against the *normalized*
model map (provider ids and retired Google preview keys canonicalized via
normalizeAgentModelMapForConfig). So a config holding only a retired key
like google/gemini-3-pro-preview surfaces the `gemini` alias in `list`,
yet `remove gemini` fell through to the misleading "Alias not found" path
instead of the actionable built-in error.
Normalize nextModels before the guard lookup to match the materialization
contract. The write-back path still uses the un-normalized map, so user
config keys are not silently rewritten on save.
Adds a regression test mirroring src/config/model-alias-defaults.test.ts:135-144.
Addresses clawsweeper re-review on #81641 (P2 normalized-key gap).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(cli): handle unknown alias removal errors
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Vincent Koc <25068+vincentkoc@users.noreply.github.com>
* fix(usage): fill missing calendar days with zero entries in daily summary
* fix(usage): cap zero-fill at 366 days for all-time / unbounded ranges
Addresses clawsweeper P2 review on #81467:
The usage page's 'All' range filter sends startDate: 1970-01-01 through
the same cost-summary path, and the gateway also accepts range=all with
startMs=0. Unconditionally zero-filling that span would synthesize ~20k
empty daily buckets per call (one per calendar day since 1970), bloating
the payload and DOM without any user value.
Gate fillMissingDays() on a 366-day window (full year + leap cushion).
Bounded picker ranges (7d / 30d / 90d) keep the dense, contiguous
behavior the original PR added; all-time and otherwise unbounded ranges
fall back to the prior sparse (activity-only) shape.
Added test 'falls back to sparse output for all-time / unbounded
ranges' covering the startMs=0 case. All 41 tests in the file pass;
oxlint clean.
* fix(usage): iterate fillMissingDays by calendar-day keys (DST-safe)
Addresses ClawSweeper P2 review on PR #81467.
The prior fillMissingDays helper advanced its cursor by a fixed 24h in
milliseconds. In a bounded range whose startMs lands late in the local
evening before a spring-forward DST transition (e.g. 2026-03-07 23:30
America/Denver), a 24h ms step skips past the DST-shortened day (23h
on the local clock) and lands two calendar days ahead. The
end-key fallback only inserts the final day, so the chart can still
miss an interior calendar day with no zero-fill entry.
Switches to calendar-day iteration:
- New parseDayKeyToLocalNoon(): parses a YYYY-MM-DD day key (as
produced by formatDayKey) into a Date anchored at local noon on
that date. Noon anchoring gives a +/- 12h cushion so the resulting
Date always formats back to the same key via formatDayKey, even
across the +/- 1h DST shift.
- fillMissingDays now derives startKey/endKey via formatDayKey, anchors
a cursor at local noon of startKey, and advances via
setDate(getDate() + 1). setDate handles month/year rollover; the
local-noon anchor neutralises DST cliffs. Loop cap kept as a
hard upper bound; end-key fallback retained as a defensive belt.
- All-time / unbounded behavior unchanged (still capped at
MAX_ZERO_FILL_DAYS = 366 -> sparse fallback above that).
New regression test:
- 'fills every calendar day in a bounded range that spans a
spring-forward DST transition' constructs a 7-day window straddling
2026-03-08 (US/Mountain spring-forward) and asserts every calendar
day from 2026-03-07 through 2026-03-13 is present, with no skipped
day. process.env.TZ can't be flipped at runtime in vitest workers
(V8 caches the system timezone for
Intl.DateTimeFormat().resolvedOptions() at process startup), so the
test stubs Intl.DateTimeFormat to report America/Denver via
vi.stubGlobal, which is exactly what formatDayKey consumes. The old
helper fails this test (skips 2026-03-08); the new one passes.
42/42 tests in src/infra/session-cost-usage.test.ts pass.
oxlint clean. tsgo:core clean for the touched file.
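The calendar-day iteration can be sketched as below. This is a simplified stand-in, not the shipped helper: `formatDayKey` here uses the process-local timezone through plain `Date` getters (the real one consumes `Intl.DateTimeFormat`), and `listDayKeys` is a hypothetical name that returns `null` to signal the sparse fallback.

```typescript
const MAX_ZERO_FILL_DAYS = 366;

// Simplified formatDayKey: local-time YYYY-MM-DD.
function formatDayKey(date: Date): string {
  const y = date.getFullYear();
  const m = String(date.getMonth() + 1).padStart(2, "0");
  const d = String(date.getDate()).padStart(2, "0");
  return `${y}-${m}-${d}`;
}

// Anchor at local noon so a +/- 1h DST shift cannot change the calendar day.
function parseDayKeyToLocalNoon(key: string): Date {
  const y = Number(key.slice(0, 4));
  const m = Number(key.slice(5, 7));
  const d = Number(key.slice(8, 10));
  return new Date(y, m - 1, d, 12, 0, 0, 0);
}

// Returns every day key from startMs to endMs inclusive, or null when the
// range is wider than the cap (or inverted) and the caller should stay sparse.
function listDayKeys(startMs: number, endMs: number): string[] | null {
  const endKey = formatDayKey(new Date(endMs));
  const cursor = parseDayKeyToLocalNoon(formatDayKey(new Date(startMs)));
  const keys: string[] = [];
  while (keys.length < MAX_ZERO_FILL_DAYS) {
    const key = formatDayKey(cursor);
    keys.push(key);
    if (key === endKey) {
      return keys;
    }
    // setDate handles month/year rollover; no fixed 24h millisecond step.
    cursor.setDate(cursor.getDate() + 1);
  }
  return null;
}
```

A caller would zero-fill any returned key that has no activity entry and leave the summary sparse when the result is `null`.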
---------
Co-authored-by: Ada Sandpaw <ada@sandpaw.ai>
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
isStaggeredCronRunAtMs probed the cron library at runAtMs + 1 to decide
whether a persisted nextRunAtMs was a real schedule slot. Croner-style
second-granular schedules normalize that 1ms probe back to the candidate
second, so previousRuns(1, runAtMs + 1) returns the slot before the
candidate instead of the candidate itself. shouldRepairFutureCronNextRunAtMs
then classified valid exact-second slots two-plus intervals out as stale
and rebased them.
Probe at runAtMs + 1_000 instead so the previous-run lookup lands past the
candidate second, matching the +1s cursor step used elsewhere in this file.
Fixes #81691
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
The previous implementation split on ':subagent:' which requires
colons on both sides. Legacy session keys like 'subagent:worker'
(starting without a colon) were missed and returned depth 0.
Fix: normalize key, match (^|:)subagent: pattern to correctly count
nested subagent levels for both legacy and current key formats.
- Add test cases for legacy key formats
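A minimal sketch of the counting rule, assuming that normalization means trim plus lowercase and using a hypothetical function name. It uses a lookahead for the trailing colon so that adjacent `subagent:subagent:` segments are each counted, which is a choice made for this sketch rather than a statement about the shipped regex.

```typescript
// Counts `subagent` segments that start the key or follow a colon.
function getSubagentDepthSketch(sessionKey: string | undefined): number {
  const key = (sessionKey ?? "").trim().toLowerCase();
  if (key === "") {
    return 0;
  }
  // (?:^|:) anchors the segment start; (?=:) requires a trailing colon without consuming it.
  const matches = key.match(/(?:^|:)subagent(?=:)/g);
  return matches === null ? 0 : matches.length;
}
```

Both the legacy `subagent:worker` form and the current `...:subagent:...` form yield a non-zero depth under this rule, while a segment such as `mysubagent:` does not match.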
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
* fix(cli): narrow config hint branch
* Plan config hints from actual changes
* Plan direct unset hints from actual changes
* Expand broad unset hint paths
* fix(cli): respect reload mode in config hints
---------
Co-authored-by: Kiran Magic <kiran@Alices-Laptop.local>
Co-authored-by: kiranmagic7 <262980978+kiranmagic7@users.noreply.github.com>
* fix(cron): truncate failure alert error text on UTF-16 boundary
* test(cron): add regression test for UTF-16 safe failure alert truncation
Verify that emitFailureAlert does not produce dangling surrogates when
truncating a cron failure error message with a non-BMP character (emoji)
straddling the 200-code-unit truncation boundary.
* fix(test): cover last code unit in surrogate scan
The dangling surrogate check loop stopped at length-1, missing a
high surrogate at the final position. Extend to i < alertText.length
so charCodeAt(i+1) returns NaN for the last char, correctly failing
the high-surrogate pair assertion.
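The scan can be sketched as a standalone helper. `hasDanglingSurrogate` is a hypothetical name; the actual test inlines its loop.

```typescript
// Returns true when the string contains an unpaired UTF-16 surrogate.
function hasDanglingSurrogate(text: string): boolean {
  // Iterate through the final code unit: charCodeAt(i + 1) is NaN there,
  // so a trailing high surrogate fails the pair check below.
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code >= 0xd800 && code <= 0xdbff) {
      const next = text.charCodeAt(i + 1);
      if (!(next >= 0xdc00 && next <= 0xdfff)) {
        return true;
      }
      i++; // skip the paired low surrogate
    } else if (code >= 0xdc00 && code <= 0xdfff) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}
```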
* fix(test): address lint issues in regression test
- Use template literal instead of string concatenation
- Add braces to type guard if-statement
* fix(shared): use UTF-16 safe truncation in subagent line display
truncateLine sliced by code units instead of preserving surrogate pairs,
causing emoji / CJK Extension B characters at the truncation boundary
to display as broken replacement characters.
* fix(shared): move import above export to satisfy import/first lint rule
* hash: use SHA-256 for bundle MCP fingerprints
* Replace SHA-1 with SHA-256 for config fingerprinting
* Replace SHA-1 with SHA-256 for config fingerprinting
---------
Co-authored-by: openclaw-clownfish[bot] <280122609+openclaw-clownfish[bot]@users.noreply.github.com>
* fix(web-fetch): decode HTML entities via the shared canonical decoder
web_fetch's hand-rolled decodeEntities used String.fromCharCode (truncating
astral entities like emoji to garbage) and decoded &amp; first (double-decoding
"&amp;#39;" into "'"). Route entity decoding through the shared
decodeHtmlEntityAt in agents/utils/html.ts so web_fetch and the renderer share
one entity contract; the divergence is what produced the bug. A single
left-to-right pass also avoids the double-decode (the "&amp;" is consumed before
its trailing "#39;" is seen as an entity). &nbsp; is mapped to a space, which
the shared decoder does not cover.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(web-fetch): preserve entity contract in shared HTML entity decoder
Addresses the ClawSweeper review: decodeHtmlEntityAt matched only lowercase
named entities and used a lenient Number.parseInt, so routing web_fetch through
it changed the prior contract: uppercase forms like &AMP; were left escaped,
and a malformed numeric entity like &#39x; was consumed as "'" (parseInt stops
at the first non-digit). Match named entities case-insensitively and require
numeric references to be fully valid digit/hex tokens, restoring web_fetch's
/gi-style named matching and strict numeric handling while keeping the
single-pass shared-decoder boundary. Strictly more correct for the other caller
(syntax-highlight), which only emits well-formed lowercase entities. Adds
regression tests for uppercase named entities and malformed numeric input.
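The resulting contract (single pass, case-insensitive names, strict numeric tokens, full code point support) can be sketched as below. This is an illustrative stand-in, not decodeHtmlEntityAt itself: the function name, the small entity table, and the regex-based scan are all assumptions.

```typescript
// Small illustrative table; the real decoder covers more names.
const NAMED_ENTITIES = new Map<string, string>([
  ["amp", "&"],
  ["lt", "<"],
  ["gt", ">"],
  ["quot", '"'],
  ["apos", "'"],
  ["nbsp", " "],
]);

function decodeEntitiesSinglePass(input: string): string {
  // A global replace scans left to right and never rescans its own output,
  // so "&amp;#39;" becomes "&#39;" and is not decoded a second time.
  return input.replace(
    /&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi,
    (match: string, body: string): string => {
      const lower = body.toLowerCase();
      if (lower.charAt(0) !== "#") {
        return NAMED_ENTITIES.get(lower) ?? match;
      }
      const isHex = lower.charAt(1) === "x";
      const codePoint = isHex
        ? parseInt(lower.slice(2), 16)
        : parseInt(lower.slice(1), 10);
      const isSurrogate = codePoint >= 0xd800 && codePoint <= 0xdfff;
      if (codePoint === 0 || codePoint > 0x10ffff || isSurrogate) {
        return match; // leave invalid references escaped
      }
      // fromCodePoint (not fromCharCode) keeps astral characters intact.
      return String.fromCodePoint(codePoint);
    },
  );
}
```

Because the pattern requires the whole token before the semicolon to be digits or hex digits, a malformed reference such as `&#39x;` never reaches parseInt and is left untouched.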
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* ci: retry — cross-os socket-close flake is main-branch, unrelated to web-fetch entity decode
---------
Co-authored-by: ly-wang19 <ly-wang19@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(agent-tools): resolve workspace-scoped tool fs root lazily
Workspace-scoped edit/write tools resolved their fs-safe root eagerly at
construction. Doctor's active-tool schema projection builds the full coding
toolset just to read tool schemas; when an agent workspace dir is absent at
that time (e.g. an unresolved ${ENV} placeholder in the authored config the
legacy-migration path operates on), the eager fsRoot orphaned a rejecting
promise as 'Unhandled promise rejection: FsSafeError: root dir not found'.
Resolve the root lazily and memoized so construction never opens an fs handle;
the root is only opened on the first real read/write/access operation.
* fix(agent-tools): defer fs-safe root start until workspace write validation succeeds
A workspace-only write/edit against an absent root started fsRoot(root) eagerly
(passed as getRoot() into writeWorkspaceFile) before toCanonicalRelativeWorkspacePath
ran. When validation failed (realpath on the missing root), the rejecting root promise
was left unawaited and surfaced as an unhandled rejection — the readFile/access paths
already defer getRoot() the same way.
writeWorkspaceFile now takes a getRoot thunk and calls it only after validation; the
write and edit-write callers pass the thunk instead of a started promise. Adds a
regression that a missing-root write/edit rejects without starting the fs-safe root or
emitting an unhandled rejection.
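The two deferrals (memoized lazy root, and a thunk invoked only after validation) can be sketched as follows. `lazyOnce`, `writeWithDeferredRoot`, and the string-typed root are hypothetical stand-ins for the real fs-safe types.

```typescript
// Memoized lazy starter: construction never calls `start`; the first call does,
// and later calls reuse the same promise.
function lazyOnce<T>(start: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return (): Promise<T> => {
    if (cached === undefined) {
      cached = start();
    }
    return cached;
  };
}

// Takes a thunk instead of a started promise, and only calls it once the
// relative path has passed validation.
async function writeWithDeferredRoot(
  relPath: string,
  getRoot: () => Promise<string>,
): Promise<string> {
  if (relPath.startsWith("/") || relPath.split("/").includes("..")) {
    throw new Error(`path escapes workspace: ${relPath}`);
  }
  const root = await getRoot();
  return `${root}/${relPath}`;
}
```

When validation throws, the thunk is never invoked, so no rejecting root promise exists to be orphaned.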
---------
Co-authored-by: Sasan <sasan.sotoodehfar@gmail.com>
Co-authored-by: Gio Della-Libera <giodl73@gmail.com>
truncateLine could cut a surrogate pair when the maxChars boundary
fell between a high surrogate and its paired low surrogate, producing
a broken unpaired surrogate in grep tool output.
Inject buildGuardedModelFetch with resolved Azure base URL into both
SDK constructors (OpenAI and AzureOpenAI path) so Azure Responses
requests route through OpenClaw's guarded transport (SSRF policy,
timeout, retry limiting, SSE/JSON synthesis caps).
The non-OK response body cap is applied lazily inside the shared
sanitizeOpenAISdkSseResponse guard via TransformStream — closes the
unbounded response.text() OOM path for hostile 4xx/5xx responses
while preserving the OpenAI SDK's ability to cancel and retry.
- src/llm/providers/azure-openai-responses.ts: pass guardedFetch into
both OpenAI and AzureOpenAI SDK constructors (no per-provider wrapper)
- src/agents/provider-transport-fetch.ts: add capNonOkResponseBodyLazily
helper and wire into sanitizeOpenAISdkSseResponse for !response.ok
- src/agents/provider-transport-fetch.test.ts: 2 inline tests
(cap fires + SDK cancel preserves source)
Co-authored-by: Claude <noreply@anthropic.com>
Wrap the signal outbound sanitizeText hook with sanitizeAssistantVisibleText so assistant internal tool-trace scaffolding is stripped before delivery, matching the sibling channel fixes under #90684 (Telegram #95774, Google Chat #95084, IRC #97214).
Wrap the slack outbound sanitizeText hook with sanitizeAssistantVisibleText so assistant internal tool-trace scaffolding is stripped before delivery, matching the sibling channel fixes under #90684 (Telegram #95774, Google Chat #95084, IRC #97214).
truncateSummary used clean.slice(0, max - 3), which can cut between the
two UTF-16 halves of a surrogate pair (emoji / astral char) straddling
the limit. The serialized card summary then carries a lone high
surrogate that Feishu renders as the replacement char.
Slice with the surrogate-safe sliceUtf16Safe helper instead, matching
the pattern already used in extensions/slack/src/truncate.ts, so a
straddling code point is dropped whole.
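A minimal sketch of the surrogate-safe slice. `sliceUtf16SafeSketch` is a hypothetical stand-in written for illustration, not the shared sliceUtf16Safe helper, and the `...` suffix is inferred from the `max - 3` budget.

```typescript
// Slice to `end` code units, backing off by one if the cut would land between
// a high surrogate and its paired low surrogate.
function sliceUtf16SafeSketch(text: string, end: number): string {
  let cut = Math.max(0, Math.min(end, text.length));
  if (cut > 0 && cut < text.length) {
    const prev = text.charCodeAt(cut - 1);
    const next = text.charCodeAt(cut);
    const splitsPair =
      prev >= 0xd800 && prev <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    if (splitsPair) {
      cut -= 1; // drop the straddling code point whole
    }
  }
  return text.slice(0, cut);
}

function truncateSummarySketch(clean: string, max: number): string {
  if (clean.length <= max) {
    return clean;
  }
  return `${sliceUtf16SafeSketch(clean, max - 3)}...`;
}
```

The truncated result can be one code unit shorter than the budget when an astral character straddles the limit, which is the intended trade-off.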
Co-authored-by: ly-wang19 <ly-wang19@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
splitTableCells/splitPartialTableCells split on every '|', including a GFM
backslash-escaped pipe ('\|'), which is literal cell content rather than a
column delimiter. A cell containing '\|' was therefore mis-counted as multiple
columns, so the oversized-row fallback (renderTableRowAsFields) rendered the
trailing content under the wrong header.
Split via an escape-aware scan that treats '\|' as a literal '|' (and '\\' as
a literal backslash so a following '|' still delimits). Behavior is byte-for-byte
unchanged for any row without an escaped pipe, so existing chunking is preserved.
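The escape-aware scan can be sketched as below. `splitCellsEscapeAware` is a hypothetical name, and unlike the real splitters this sketch does not strip a row's leading or trailing pipe.

```typescript
// Splits a table row on unescaped pipes only.
function splitCellsEscapeAware(row: string): string[] {
  const cells: string[] = [];
  let current = "";
  for (let i = 0; i < row.length; i++) {
    const ch = row.charAt(i);
    const next = row.charAt(i + 1);
    if (ch === "\\" && (next === "|" || next === "\\")) {
      // "\|" is a literal pipe; "\\" is a literal backslash, so a pipe
      // following an escaped backslash still delimits.
      current += next;
      i++;
    } else if (ch === "|") {
      cells.push(current.trim());
      current = "";
    } else {
      current += ch;
    }
  }
  cells.push(current.trim());
  return cells;
}
```

For a row with no backslash the loop reduces to a plain split on `|` followed by a trim of each cell.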
Co-authored-by: ly-wang19 <ly-wang19@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(line): truncate template title/altText on grapheme boundaries, not raw UTF-16
createConfirmTemplate/createButtonTemplate/createTemplateCarousel/createCarouselColumn/
createImageCarousel truncated title and altText with a raw `.slice(0, N)`, so an
emoji straddling a LINE field limit (e.g. a 40-char button title) was cut in half,
leaving a lone high surrogate that LINE renders as the replacement char or rejects.
Route those fields through the file's existing grapheme-safe truncateTemplateText
(already used for the text body) via a small truncateOptionalTemplateText wrapper.
Byte-identical for all-BMP input; only straddling-emoji truncation changes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: retry OpenGrep scan (HTTP 502 infra flake)
* test(line): cover grapheme-safe template fields
---------
Co-authored-by: ly-wang19 <ly-wang19@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
Wrap the matrix outbound sanitizeText hook with sanitizeAssistantVisibleText so assistant internal tool-trace scaffolding is stripped before delivery, matching the sibling channel fixes under #90684 (Telegram #95774, Google Chat #95084, IRC #97214).