Address two additional review concerns:
1. Remove separate 'active' counter from LaneState; derive it from
activeTaskIds.size instead. This makes negative-underflow impossible
— the Set is the single source of truth for active task count.
Previously, a double-reset scenario could drive 'active' negative,
violating the concurrency check in pump().
2. Replace unbounded 'ps -axo pid=,command=' with targeted pgrep
pre-filter in orphan scanner. Only fetches full command info for
candidate PIDs matching 'codex|claude', avoiding O(all-processes)
overhead on large hosts.
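A minimal sketch of point 1, assuming a simplified lane shape. Only `LaneState`, `activeTaskIds`, and the idea of a concurrency check come from the notes above; `maxConcurrent`, `activeCount`, `startTask`, `finishTask`, and `resetLane` are hypothetical names used for illustration.

```typescript
// Hypothetical, simplified lane state: the Set is the only record of active work.
interface LaneState {
  maxConcurrent: number; // assumed field name, not taken from the source
  activeTaskIds: Set<number>;
}

// The active count is derived, never stored, so it cannot drift or go negative.
function activeCount(lane: LaneState): number {
  return lane.activeTaskIds.size;
}

// Admit a task only while the derived count is below the limit.
function startTask(lane: LaneState, taskId: number): boolean {
  if (activeCount(lane) >= lane.maxConcurrent) return false;
  lane.activeTaskIds.add(taskId);
  return true;
}

// Deleting an id that is already gone is a no-op, so a late completion
// after a reset does not underflow anything.
function finishTask(lane: LaneState, taskId: number): void {
  lane.activeTaskIds.delete(taskId);
}

// Resetting twice is harmless: clearing an empty Set leaves size at 0.
function resetLane(lane: LaneState): void {
  lane.activeTaskIds.clear();
}
```

With a separate numeric counter, the reset and the late completion would each decrement; with the Set, both operations are idempotent.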
Addresses review concern that setHeartbeatWakeHandler() had a surprising
cross-cutting side effect by calling resetAllLanes(), coupling heartbeat
handler registration to command-queue global state.
The lane reset now lives in the restart loop (run-loop.ts and
gateway-daemon.ts), which is the correct abstraction level — only
in-process restart coordinators need to know about stale lane state.
setHeartbeatWakeHandler() still resets its own module-level state
(running, scheduled, timer) which is properly scoped.
* fix(gateway): normalize session key casing to prevent ghost sessions on Linux
On case-sensitive filesystems (Linux), mixed-case session keys like
agent:ops:MySession and agent:ops:mysession resolve to different store
entries, creating ghost duplicates that never converge.
Core changes in session-utils.ts:
- resolveSessionStoreKey: lowercase all session key components
- canonicalizeSpawnedByForAgent: accept cfg, resolve main-alias references
via canonicalizeMainSessionAlias after lowercasing
- loadSessionEntry: return legacyKey only when it differs from canonicalKey
- resolveGatewaySessionStoreTarget: scan store for case-insensitive matches;
add optional scanLegacyKeys param to skip disk reads for read-only callers
- Export findStoreKeysIgnoreCase for use by write-path consumers
- Compare global/unknown sentinels case-insensitively in all canonicalization
functions
sessions-resolve.ts:
- Make resolveSessionKeyFromResolveParams async for inline migration
- Check canonical key first (fast path), then fall back to legacy scan
- Delete ALL legacy case-variant keys in a single updateSessionStore pass
Fixes #12603
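The canonicalize-then-clean-up idea can be sketched roughly as follows. This is an illustration under assumptions, not the code in session-utils.ts: `canonicalizeKey`, `findKeysIgnoreCase`, and `collapseCaseVariants` are hypothetical helpers, the store is modelled as a plain object, and main-alias and sentinel handling are omitted.

```typescript
// Hypothetical helper: lowercase the whole key so every case variant
// maps to the same canonical store entry.
function canonicalizeKey(rawKey: string): string {
  return rawKey.trim().toLowerCase();
}

// Hypothetical helper: list every key in the store that equals `key`
// when case is ignored.
function findKeysIgnoreCase(store: Record<string, unknown>, key: string): string[] {
  const wanted = key.toLowerCase();
  return Object.keys(store).filter((k) => k.toLowerCase() === wanted);
}

// Collapse all case variants into the canonical key in a single pass.
// An existing canonical entry wins; otherwise the first legacy variant is kept.
function collapseCaseVariants<T>(store: Record<string, T>, rawKey: string): string {
  const canonical = canonicalizeKey(rawKey);
  const variants = findKeysIgnoreCase(store, canonical);
  const hasCanonical = Object.prototype.hasOwnProperty.call(store, canonical);
  const [firstVariant] = variants;
  if (!hasCanonical && firstVariant !== undefined) {
    store[canonical] = store[firstVariant] as T;
  }
  for (const k of variants) {
    if (k !== canonical) delete store[k];
  }
  return canonical;
}
```

Which legacy entry should win when several variants exist is a design choice; keeping the first one is only the simplest option for this sketch.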
* fix(gateway): propagate canonical keys and clean up all case variants on write paths
- agent.ts: use canonicalizeSpawnedByForAgent (with cfg) instead of raw
toLowerCase; use findStoreKeysIgnoreCase to delete all legacy variants
on store write; pass canonicalKey to addChatRun, registerAgentRunContext,
resolveSendPolicy, and agentCommand
- sessions.ts: replace single-key migration with full case-variant cleanup
via findStoreKeysIgnoreCase in patch/reset/delete/compact handlers; add
case-insensitive fallback in preview (store already loaded); make
sessions.resolve handler async; pass scanLegacyKeys: false in preview
- server-node-events.ts: use findStoreKeysIgnoreCase to clean all legacy
variants on voice.transcript and agent.request write paths; pass
canonicalKey to addChatRun and agentCommand
* test(gateway): add session key case-normalization tests
Cover the case-insensitive session key canonicalization logic:
- resolveSessionStoreKey normalizes mixed-case bare and prefixed keys
- resolveSessionStoreKey resolves mixed-case main aliases (MAIN, Main)
- resolveGatewaySessionStoreTarget includes legacy mixed-case store keys
- resolveGatewaySessionStoreTarget collects all case-variant duplicates
- resolveGatewaySessionStoreTarget finds legacy main alias keys with
customized mainKey configuration
All 5 tests fail before the production changes, pass after.
* fix: clean legacy session alias cleanup gaps (openclaw#12846) thanks @mcaxtr
---------
Co-authored-by: Peter Steinberger <steipete@gmail.com>
* fix(agents): wait for agent idle before flushing pending tool results
When pi-agent-core's auto-retry mechanism handles overloaded/rate-limit
errors, it resolves waitForRetry() on assistant message receipt — before
tool execution completes in the retried agent loop. This causes the
attempt's finally block to call flushPendingToolResults() while tools
are still executing, inserting synthetic 'missing tool result' errors
and causing silent agent failures.
The fix adds a waitForIdle() call before the flush to ensure the agent's
retry loop (including tool execution) has fully completed.
Evidence from real session: tool call and synthetic error were only 53ms
apart — the tool never had a chance to execute before being flushed.
Root cause is in pi-agent-core's _resolveRetry() firing on message_end
instead of agent_end, but this workaround in OpenClaw prevents the
symptom without requiring an upstream fix.
Fixes #8643
Fixes #13351
Refs #6682, #12595
* test: add tests for tool result flush race condition
Validates that:
- Real tool results are not replaced by synthetic errors when they arrive in time
- Flush correctly inserts synthetic errors for genuinely orphaned tool calls
- Flush is a no-op after real tool results have already been received
Refs #8643, #13748
* fix(agents): add waitForIdle to all flushPendingToolResults call sites
The original fix only covered the main run finally block, but there are
two additional call sites that can trigger flushPendingToolResults while
tools are still executing:
1. The catch block in attempt.ts (session setup error handler)
2. The finally block in compact.ts (compaction teardown)
Both now await agent.waitForIdle() with a 30s timeout before flushing,
matching the pattern already applied to the main finally block.
Production testing on VPS with debug logging confirmed these additional
paths can fire during sub-agent runs, producing spurious synthetic
'missing tool result' errors.
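The wait-then-flush pattern described above might look roughly like this. It is a sketch under assumptions: only `waitForIdle()` and the 30s timeout come from the notes; `IdleAgent`, `startTimeout`, and `flushAfterIdle` are hypothetical names, and the flush is passed in as a callback instead of calling the real flushPendingToolResults().

```typescript
// Hypothetical minimal agent shape; only waitForIdle() is named in the notes.
interface IdleAgent {
  waitForIdle(): Promise<void>;
}

// Hypothetical helper: a timeout promise plus a way to clear its timer handle.
function startTimeout(ms: number): { done: Promise<"timeout">; cancel: () => void } {
  let handle: ReturnType<typeof setTimeout>;
  const done = new Promise<"timeout">((resolve) => {
    handle = setTimeout(() => resolve("timeout"), ms);
  });
  return { done, cancel: () => clearTimeout(handle) };
}

// Wait until the agent is idle (or the timeout elapses), then flush.
// The timer is always cleared so it cannot keep the process alive.
async function flushAfterIdle(
  agent: IdleAgent,
  flush: () => void,
  timeoutMs = 30_000,
): Promise<"idle" | "timeout"> {
  const timeout = startTimeout(timeoutMs);
  try {
    return await Promise.race([
      agent.waitForIdle().then(() => "idle" as const),
      timeout.done,
    ]);
  } finally {
    timeout.cancel();
    flush();
  }
}
```

Putting the flush in `finally` means it still runs if the idle wait rejects, which matches the intent that pending tool calls are never left unflushed.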
* fix(agents): centralize idle-wait flush and clear timeout handle
---------
Co-authored-by: Renue Development <dev@renuebyscience.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
- Adds `activity`, `status`, `activityType`, and `activityUrl` to Discord provider config schema.
- Implements a `ReadyListener` in `DiscordProvider` to apply these settings on connection.
- Solves the issue where `@buape/carbon` ignores initial presence options in constructor.
- Validated manually and via existing test suite.
* Browser/Security: constrain trace and download output paths to temp roots
* Changelog: remove advisory ID from pre-public security note
* test(bluebubbles): align timeout status expectation to 408
* test(discord): remove unused race-condition counter in threading test
Fixes #8278
When autoThread is enabled and a thread already exists (user continues
conversation in thread), replies were sometimes routing to the root
channel instead of the thread. This happened because the reply delivery
plan only explicitly set the thread target when a NEW thread was created
(createdThreadId), but not when the message was in an existing thread.
The fix adds a fallback case: when threadChannel is set (we're in an
existing thread) but no new thread was created, explicitly route to
the thread's channel ID. This ensures all thread replies go to the
correct destination.
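The routing order described above can be sketched as a small pure function. `createdThreadId` and `threadChannel` are named in the description; the `ReplyContext` shape, the `rootChannelId` and `threadChannelId` fields, and `resolveReplyChannelId` are assumptions made for this illustration.

```typescript
// Hypothetical, flattened view of what the reply delivery plan knows.
interface ReplyContext {
  rootChannelId: string;     // assumed: channel the conversation started in
  createdThreadId?: string;  // set only when a NEW thread was just created
  threadChannelId?: string;  // assumed: id of the existing thread, if any
}

function resolveReplyChannelId(ctx: ReplyContext): string {
  // 1. A thread created for this message takes priority.
  if (ctx.createdThreadId) return ctx.createdThreadId;
  // 2. Fallback added by the fix: already inside an existing thread.
  if (ctx.threadChannelId) return ctx.threadChannelId;
  // 3. Otherwise reply in the root channel.
  return ctx.rootChannelId;
}
```

Before the fix, the second branch was effectively missing, so a message inside an existing thread could fall through to the root channel.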
* feat(gateway): add register and awaitDecision methods to ExecApprovalManager
Separates registration (synchronous) from waiting (async) to allow callers
to confirm registration before the decision is made. Adds grace period for
resolved entries to prevent race conditions.
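A rough sketch of the register / awaitDecision split, under stated assumptions. This is not the ExecApprovalManager implementation: `ApprovalRegistry`, the `Decision` values, the idempotent handling of duplicate IDs, and the timer details are hypothetical choices made to keep the example self-contained.

```typescript
type Decision = "allow" | "deny" | "expired"; // assumed set of outcomes

interface PendingApproval {
  promise: Promise<Decision>;
  settle: (decision: Decision) => void;
  settled: boolean;
  timer: ReturnType<typeof setTimeout>;
}

class ApprovalRegistry {
  private readonly entries = new Map<string, PendingApproval>();
  private readonly graceMs: number;

  constructor(graceMs = 15_000) {
    this.graceMs = graceMs;
  }

  // Phase 1 (synchronous): the ID is known as soon as this returns.
  register(id: string, timeoutMs: number): void {
    if (this.entries.has(id)) return; // duplicate IDs are ignored in this sketch
    let settle: (decision: Decision) => void = () => {};
    const promise = new Promise<Decision>((resolve) => {
      settle = resolve;
    });
    const timer = setTimeout(() => this.resolve(id, "expired"), timeoutMs);
    this.entries.set(id, { promise, settle, settled: false, timer });
  }

  // Phase 2 (async): wait for a decision on an already-registered ID.
  awaitDecision(id: string): Promise<Decision> | undefined {
    return this.entries.get(id)?.promise;
  }

  // Returns false for unknown or already-settled IDs, so a decision
  // arriving after a timeout cannot resolve the entry a second time.
  resolve(id: string, decision: Decision): boolean {
    const entry = this.entries.get(id);
    if (!entry || entry.settled) return false;
    entry.settled = true;
    clearTimeout(entry.timer);
    entry.settle(decision);
    // Keep the settled entry for a grace period so a late awaitDecision
    // call still sees the outcome instead of an unknown ID.
    setTimeout(() => this.entries.delete(id), this.graceMs);
    return true;
  }
}
```

A caller would register first, confirm the ID to the requester, and only then await the decision; the grace period covers the window where the decision lands before the waiter attaches.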
* feat(gateway): add two-phase response and waitDecision handler for exec approvals
Send immediate 'accepted' response after registration so callers can confirm
the approval ID is valid. Add exec.approval.waitDecision endpoint to wait for
decision on already-registered approvals.
* fix(exec): await approval registration before returning approval-pending
Ensures the approval ID is registered in the gateway before the tool returns.
Uses exec.approval.request with expectFinal:false for registration, then
fire-and-forget exec.approval.waitDecision for the decision phase.
Fixes #2402
* test(gateway): update exec-approval test for two-phase response
Add assertion for immediate 'accepted' response before final decision.
* test(exec): update approval-id test mocks for new two-phase flow
Mock both exec.approval.request (registration) and exec.approval.waitDecision
(decision) calls to match the new internal implementation.
* fix(lint): add cause to errors, use generics instead of type assertions
* fix(exec-approval): guard register() against duplicate IDs
* fix: remove unused timeoutMs param, guard register() against duplicates
* fix(exec-approval): throw on duplicate ID, capture entry in closure
* fix: return error on timeout, remove stale test mock branch
* fix: wrap register() in try/catch, make timeout handling consistent
* fix: update snapshot on timeout, make two-phase response opt-in
* fix: extend grace period to 15s, return 'expired' status
* fix: prevent double-resolve after timeout
* fix: make register() idempotent, capture snapshot before await
* fix(gateway): complete two-phase exec approval wiring
* fix: finalize exec approval race fix (openclaw#3357) thanks @ramin-shirali
* fix(protocol): regenerate exec approval request models (openclaw#3357) thanks @ramin-shirali
* fix(test): remove unused callCount in discord threading test
---------
Co-authored-by: rshirali <rshirali@rshirali-haga.local>
Co-authored-by: rshirali <rshirali@rshirali-haga-1.home>
Co-authored-by: Peter Steinberger <steipete@gmail.com>