Reduce repeated gateway warning noise in startup/auth retry paths while preserving credential mismatch and rate-limit audit visibility.
Also hardens empty embedded-assistant retry handling by carrying lifecycle state through the missing-assistant guard, and keeps the relevant regression coverage in gateway and agent tests.
Wire QA fallback models into live gateway config, fix Slack allowlist-block coverage, and keep WhatsApp live artifacts useful while redacting raw credential metadata.\n\nVerification: focused QA Vitest; autoreview clean; AWS Crabbox pnpm check:changed run_0207de7d47aa; QA-Lab branch-defined transport run 26565521272 with Matrix transport 56/56 and Slack/Discord/Telegram/parity clear. WhatsApp remains blocked by stale shared Convex WhatsApp Web credentials returning Baileys 401 before scenarios.
* fix(telegram): enable TCP keepalive on getUpdates connections to prevent NAT timeout stalls
Long-polling connections to api.telegram.org stay idle for up to the
getUpdates timeout (~900 s). Most home/office NAT tables expire idle TCP
entries after 60–1800 s (commonly ~1000 s). When the NAT entry is
silently dropped the connection hangs rather than returning an error,
leaving the grammY runner stuck until the 90 s stall watchdog fires and
forces a restart cycle.
Fix: unconditionally set `keepAlive: true` and
`keepAliveInitialDelay: 30_000` (30 s) on the undici Agent `connect`
options built in `buildTelegramConnectOptions`. OS-level TCP keepalive
probes sent every ~75 s (OS default) will:
1. Refresh the NAT table entry before it expires.
2. Surface dead connections immediately with ETIMEDOUT instead of
hanging forever.
The `return Object.keys(connect).length > 0 ? connect : null` guard is
also removed; `connect` is now always non-empty so it always returns the
object.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 92e454c0614256201cdf6f0f73c7897d006616d4)
* fix(telegram): stop self-flagging disconnected on poll-cycle start; widen channel connect grace to 300s
(cherry picked from commit 1ca963a05dac0d9d605e9a15dc97fced9cf7725e)
* fix(telegram): catch hung polling startups that preserve inherited connected:true
The widened 300s channel connect grace and the removal of connected:false from
notePollingStart left a path where a polling restart could hang forever
looking healthy. notePollingStart clears lastConnectedAt, lastEventAt, and
lastTransportActivityAt but deliberately omits connected, so server-channels'
patch-merge inherits a connected:true from the previous lifecycle. After grace,
evaluateChannelHealth's stale-socket branch requires lastTransportActivityAt
to be non-null and the connected:false branch is masked, so the channel sits
healthy with no first getUpdates.
Add a post-grace branch to evaluateChannelHealth that flags polling channels
as stale-socket when connected:true is paired with null lastConnectedAt and
null lastTransportActivityAt and a non-null lastStartAt. Scoped to mode:polling
so webhook channels and channels without continuous transport tracking are
not falsely flagged. Align TELEGRAM_POLLING_CONNECT_GRACE_MS in the Telegram
status diagnostic with DEFAULT_CHANNEL_CONNECT_GRACE_MS so openclaw channels
status agrees with the shared health monitor on the grace window. Refresh
the notePollingStart comment to point at the new evaluateChannelHealth branch.
Addresses clawsweeper review on #83304 (P1 connect-grace startup-hang, P2
diagnostic grace drift). Tests cover the new flagged path, the in-grace happy
path, and the prior-successful-connect happy path.
* fix(telegram): clear polling connected state on startup
* fix(gateway): add defense-in-depth health-policy branch for hung polling startups
Defense in depth on top of 87db46c576's notePollingStart connected:false fix.
The primary path (notePollingStart writes connected:false explicitly so
evaluateChannelHealth's existing connected===false branch catches a hung
restart) is unchanged. This adds a defensive post-grace branch that catches
the same hang via a different signature -- inherited connected:true paired
with null lastConnectedAt and null lastTransportActivityAt -- in case a
future code path forgets to clear the inherited connected flag on lifecycle
start. Scoped to mode:polling so webhook channels and channels without
continuous transport tracking are not falsely flagged.
Also bump lastStartAt: Date.now() - 121_000 to 301_000 in the spool-handler
timeout test added by upstream #83505 so it falls past the widened 300s
TELEGRAM_POLLING_CONNECT_GRACE_MS suppression window (mirroring the same
fixup already applied to the two adjacent polling-startup tests).
* revert(telegram,gateway): keep connect grace at 120s
Drop the 120s -> 300s widening from this PR after maintainer feedback that
the extra grace masks real startup bugs. The defense-in-depth checks added
in earlier commits (notePollingStart clearing inherited connected state,
the stale-socket policy branch, the per-snapshot startup grace test) all
work fine at 120s and remain valuable on their own.
Reverts in:
- src/gateway/channel-health-policy.ts: DEFAULT_CHANNEL_CONNECT_GRACE_MS 300 -> 120
- extensions/telegram/src/status-issues.ts: TELEGRAM_POLLING_CONNECT_GRACE_MS 300 -> 120
- extensions/telegram/src/status.test.ts: lastStartAt 301_000 -> 121_000 (3 cases)
The new channel-health-policy.test.ts cases use explicit channelConnectGraceMs:
10_000 in the policy, so they are unaffected by the default constant change.
* fix(telegram): narrow polling keepalive fix
---------
Co-authored-by: Yibei Ou <yibeiou@Yibeis-Mac-mini.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Ayaan Zaidi <hi@obviy.us>
Harden QQBot direct media URL uploads by downloading through the local SSRF guard before QQ upload, disabling redirects, bounding fetch/setup and body reads, and routing downloaded buffers through the existing one-shot/chunked size gate.
Co-authored-by: Agustin Rivera <agustin@rivera-web.com>
Default `openclaw status --json` stays on the lean health-probe path while preserving the JSON task summary, local update/install metadata, explicit probe timeouts, and configured gateway handshake timeouts. Deeper memory, registry, remote git, and local status-RPC diagnostics remain behind `status --json --all`.
Also keeps generated diffs viewer output in its built form and ignores it in oxfmt so `pnpm build` leaves a clean tree.
Proof:
- `node scripts/run-vitest.mjs src/commands/status.scan.fast-json.test.ts src/commands/status-json-payload.test.ts src/commands/status.scan.shared.test.ts`
- `OPENCLAW_LOCAL_CHECK=0 node scripts/run-oxlint-shards.mjs --threads=8`
- `node scripts/run-tsgo.mjs -p test/tsconfig/tsconfig.core.test.json --incremental --tsBuildInfoFile .artifacts/tsgo-cache/core-test.tsbuildinfo`
- `node scripts/run-tsgo.mjs -p test/tsconfig/tsconfig.extensions.test.json --incremental --tsBuildInfoFile .artifacts/tsgo-cache/extensions-test.tsbuildinfo`
- `.agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main`
- GitHub checks green for head `47a63f87ea7c2351994fdb71e8cc18041aa0b64e`
Thanks @andyylin.
Co-authored-by: Andy <andyylin@users.noreply.github.com>
Stabilize isolated cron prompt cache affinity by deriving a stable prompt cache key per cron job/session/model and forwarding it separately from the rotating run session id.
Thread the key through embedded runs, stream resolution, provider options, proxy forwarding, custom streams, and prompt-cache observability. Keep OpenAI-compatible payloads valid by using hyphen-safe keys, clamping upstream prompt_cache_key values, and omitting affinity when cache retention is disabled.
Thanks @ferminquant.
Co-authored-by: Fermin Quant <ferminquant@hotmail.com>
* fix(sessions): preserve Matrix room-id case in session keys (#75670)
Matrix room IDs (and thread event IDs) are opaque, case-sensitive per the
Matrix spec, but session-key canonicalization lowercased them. That forked
one room into duplicate sessions and produced 403 M_FORBIDDEN on recovery /
delivery paths that reconstruct the target from the (lowercased) session key,
even though deliveryContext.to stayed correct.
Introduce a generic, opt-in case-preservation registry (CASE_PRESERVING_PEERS)
consulted at all three lowercasing sites:
- construction: normalizeSessionPeerId
- store canonicalization: normalizeSessionKeyPreservingOpaquePeerIds
- gateway send: explicit request.sessionKey
Signal group preservation is encoded to match prior behavior exactly (segment
span, unscoped, thread suffix still lowercased). Matrix channel/group enrolls
the opaque tail (room id with embedded :server + any 🧵<event> suffix).
Exact mixed-case keys now win over folded legacy aliases in
resolveSessionStoreEntry and delivery-info lookup; existing lowercased rows
collapse on the next write. Matrix DM/MXID and non-enrolled channels keep the
default lowercase behavior.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(sessions): guard Matrix folded alias delivery proof
* test(agents): cover cold OpenAI gpt-5.5 fallback
* fix(sessions): preserve non-opaque alias freshness
* fix(sessions): prevent Matrix cross-room thread recovery
* build(protocol): refresh tools effective Swift models
* test(codex): include effective cwd in startup fixture
* test(codex): align startup failure cleanup expectation
* fix(sessions): keep Signal folded aliases fresh
* fix(sessions): preserve unscoped Matrix room keys
* fix(sessions): recover legacy Matrix thread aliases
* fix(sessions): preserve Matrix keys in state migrations
* fix(sessions): keep Matrix structural alias freshness
* fix(sessions): preserve unscoped Matrix migration keys
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
Fix iMessage native exec approval routing so approval prompts bind to the sent GUID without duplicate sends after RPC timeout. Also keeps chat.db GUID recovery on the local imsg path while avoiding local DB recovery for configured or detected SSH wrappers.
Thanks @kevinslin.
Fixes#87191. Keeps Brave and Gemini runtime-injected web search provider config readable by providers without re-exposing legacy tools.web.search provider objects to config validation.
Fix Slack draft cleanup after final-visible delivery.
Track when Slack has already delivered a visible final reply and stop reusing the draft finalizer for later same-turn final/error payloads. This keeps the first fallback cleanup for transient previews while preventing late cleanup from deleting a visible answer.
Fixes#87363
Co-authored-by: tianxiaochannel-oss88 <tianxiaochannel@gmail.com>