# Async Exec Duplicate Completion Investigation

## Scope

- Session: `agent:main:telegram:group:-1003774691294:topic:1`
- Symptom: the same async exec completion for session/run `keen-nexus` was recorded twice in LCM as user turns.
- Goal: identify whether this is most likely duplicate session injection or plain outbound delivery retry.

## Conclusion

Most likely this is **duplicate session injection**, not a pure outbound delivery retry.

The strongest gateway-side gap is in the **node exec completion path**:

1. A node-side exec finish emits `exec.finished` with the full `runId`.
2. Gateway `server-node-events` converts that into a system event and requests a heartbeat.
3. The heartbeat run injects the drained system event block into the agent prompt.
4. The embedded runner persists that prompt as a new user turn in the session transcript.

If the same `exec.finished` reaches the gateway twice for the same `runId` for any reason (replay, reconnect duplicate, upstream resend, duplicated producer), OpenClaw currently has **no idempotency check keyed by `runId`/`contextKey`** on this path. The second copy will become a second user message with the same content.

## Exact Code Path

### 1. Producer: node exec completion event

- `src/node-host/invoke.ts:340-360`
  - `sendExecFinishedEvent(...)` emits `node.event` with event `exec.finished`.
  - Payload includes `sessionKey` and full `runId`.

### 2. Gateway event ingestion

- `src/gateway/server-node-events.ts:574-640`
  - Handles `exec.finished`.
  - Builds text:
    - `Exec finished (node=..., id=, code ...)`
  - Enqueues it via:
    - ``enqueueSystemEvent(text, { sessionKey, contextKey: runId ? `exec:${runId}` : "exec", trusted: false })``
  - Immediately requests a wake:
    - `requestHeartbeatNow(scopedHeartbeatWakeOptions(sessionKey, { reason: "exec-event" }))`

### 3. System event dedupe weakness

- `src/infra/system-events.ts:90-115`
  - `enqueueSystemEvent(...)` only suppresses **consecutive duplicate text**:
    - `if (entry.lastText === cleaned) return false`
  - It stores `contextKey`, but does **not** use `contextKey` for idempotency.
  - After drain, duplicate suppression resets.

This means a replayed `exec.finished` with the same `runId` can be accepted again later, even though the code already had a stable idempotency candidate (`exec:`).

### 4. Wake handling is not the primary duplicator

- `src/infra/heartbeat-wake.ts:79-117`
  - Wakes are coalesced by `(agentId, sessionKey)`.
  - Duplicate wake requests for the same target collapse to one pending wake entry.

This makes **duplicate wake handling alone** a weaker explanation than duplicate event ingestion.

### 5. Heartbeat consumes the event and turns it into prompt input

- `src/infra/heartbeat-runner.ts:535-574`
  - Preflight peeks pending system events and classifies exec-event runs.
- `src/auto-reply/reply/session-system-events.ts:86-90`
  - `drainFormattedSystemEvents(...)` drains the queue for the session.
- `src/auto-reply/reply/get-reply-run.ts:400-427`
  - The drained system event block is prepended into the agent prompt body.

### 6. Transcript injection point

- `src/agents/pi-embedded-runner/run/attempt.ts:2000-2017`
  - `activeSession.prompt(effectivePrompt)` submits the full prompt to the embedded PI session.
  - That is the point where the completion-derived prompt becomes a persisted user turn.

So once the same system event is rebuilt into the prompt twice, duplicate LCM user messages are expected.

## Why plain outbound delivery retry is less likely

There is a real outbound failure path in the heartbeat runner:

- `src/infra/heartbeat-runner.ts:1194-1242`
  - The reply is generated first.
  - Outbound delivery happens later via `deliverOutboundPayloads(...)`.
  - Failure there returns `{ status: "failed" }`.

However, for the same system event queue entry, this alone is **not sufficient** to explain the duplicate user turns:

- `src/auto-reply/reply/session-system-events.ts:86-90`
  - The system event queue is already drained before outbound delivery.

So a channel send retry by itself would not recreate the exact same queued event. It could explain missing/failed external delivery, but not by itself a second identical session user message.

## Secondary, lower-confidence possibility

There is a full-run retry loop in the agent runner:

- `src/auto-reply/reply/agent-runner-execution.ts:741-1473`
  - Certain transient failures can retry the whole run and resubmit the same `commandBody`.

That can duplicate a persisted user prompt **within the same reply execution** if the prompt was already appended before the retry condition triggered.

I rank this lower than duplicate `exec.finished` ingestion because:

- the observed gap was around 51 seconds, which looks more like a second wake/turn than an in-process retry;
- the report already mentions repeated message send failures, which points more toward a separate later turn than an immediate model/runtime retry.

## Root Cause Hypothesis

Highest-confidence hypothesis:

- The `keen-nexus` completion came through the **node exec event path**.
- The same `exec.finished` was delivered to `server-node-events` twice.
- Gateway accepted both because `enqueueSystemEvent(...)` does not dedupe by `contextKey` / `runId`.
- Each accepted event triggered a heartbeat and was injected as a user turn into the PI transcript.

## Proposed Tiny Surgical Fix

If a fix is wanted, the smallest high-value change is:

- make exec/system-event idempotency honor `contextKey` for a short horizon, at least for exact `(sessionKey, contextKey, text)` repeats;
- or add a dedicated dedupe in `server-node-events` for `exec.finished` keyed by `(sessionKey, runId, event kind)`.

That would directly block replayed `exec.finished` duplicates before they become session turns.
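
The first option can be illustrated with a minimal, self-contained sketch. It is not the OpenClaw implementation: `SketchSystemEventQueue`, `horizonMs`, the injected `now` clock, and `replayDemo` are hypothetical names introduced only for illustration, and the queue models just the behavior described above (consecutive-text suppression that resets on drain). One queue instance stands in for one `sessionKey`, so the dedupe key is effectively `(sessionKey, contextKey, text)`.

```typescript
// Hypothetical sketch only. None of these names come from the OpenClaw source.
// One instance models the system event queue of a single session.
class SketchSystemEventQueue {
  private pending: string[] = [];
  private lastText: string | undefined = undefined;
  // Dedupe key (contextKey + text) -> timestamp at which it was accepted.
  private seen: Record<string, number> = {};
  private horizonMs: number;
  private now: () => number;

  // horizonMs = 0 disables contextKey dedupe (models the behavior described above).
  constructor(horizonMs: number, now: () => number = () => Date.now()) {
    this.horizonMs = horizonMs;
    this.now = now;
  }

  enqueue(text: string, contextKey?: string): boolean {
    const cleaned = text.trim();
    if (!cleaned) return false;

    // Behavior described in the report: only consecutive duplicate text is suppressed.
    if (this.lastText === cleaned) return false;

    // Assumed addition: idempotency by (contextKey, text) within a short horizon.
    if (contextKey && this.horizonMs > 0) {
      const t = this.now();
      // Forget entries older than the horizon so the map stays small.
      for (const k of Object.keys(this.seen)) {
        if (t - this.seen[k] >= this.horizonMs) delete this.seen[k];
      }
      const key = JSON.stringify([contextKey, cleaned]);
      if (this.seen[key] !== undefined) return false;
      this.seen[key] = t;
    }

    this.pending.push(cleaned);
    this.lastText = cleaned;
    return true;
  }

  drain(): string[] {
    const out = this.pending;
    this.pending = [];
    // Consecutive-duplicate suppression resets on drain; `seen` deliberately survives it.
    this.lastText = undefined;
    return out;
  }
}

// Replays the same completion 51 seconds after the first copy was drained.
// Returns [firstAccepted, replayAccepted].
function replayDemo(horizonMs: number): boolean[] {
  let clock = 0;
  const queue = new SketchSystemEventQueue(horizonMs, () => clock);
  const text = "Exec finished (run keen-nexus)";
  const first = queue.enqueue(text, "exec:keen-nexus");
  queue.drain(); // the heartbeat consumes the event
  clock = 51000; // the replayed copy arrives later
  const replay = queue.enqueue(text, "exec:keen-nexus");
  return [first, replay];
}
```

With `replayDemo(0)` both copies are accepted, which matches the duplicate-turn symptom; with `replayDemo(60000)` the replay is rejected. The 60-second horizon is an arbitrary illustrative value chosen only because it covers the roughly 51-second gap noted above; a real horizon would need to be chosen against actual replay behavior.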