fix: dedupe replayed exec.finished node events (#67281)

* docs: add async exec duplicate completion investigation

Add an internal refactor note tracing the node exec completion to system event to heartbeat to transcript path for duplicate async exec injections. Document the most likely gateway-side gap as missing idempotency for replayed exec.finished events, and note why plain outbound delivery retry is a weaker fit for duplicate user turns.

Regeneration-Prompt: |
  Investigate a live duplicate async exec completion that appeared as two identical user turns in an OpenClaw session. Trace the completion path from exec producers into enqueueSystemEvent, heartbeat wake scheduling, prompt assembly, and embedded transcript persistence. Decide whether duplicate wake handling, outbound delivery retry, or duplicate completion event ingestion is the more likely cause, cite the exact code locations, and capture the smallest plausible fix seam without making runtime changes.

* fix: dedupe replayed exec finished node events

Add a narrow idempotency guard in the gateway node-event handler for repeated exec.finished events with the same canonical session key and runId. This blocks replayed async exec completions from being enqueued and heartbeated twice into the parent session. Also only request a heartbeat when the system event was actually queued, and add a regression test for duplicate runId injection.

Regeneration-Prompt: |
  Prevent duplicate async exec completion events from being injected twice into the parent session. Keep the scope tight around the highest-confidence path: node exec.finished events entering gateway server-node-events and becoming system-event-driven heartbeat prompts. Add a small idempotency guard keyed by canonical session plus exec runId, avoid broader delivery or retry changes unless needed, and add regression coverage that fails if the same exec.finished replay is enqueued and woken twice.

* fix: note exec finished replay dedupe
This commit is contained in:
Josh Lehman
2026-04-15 13:06:18 -07:00
committed by GitHub
parent a5dafa27b6
commit 5dcf526a43
4 changed files with 219 additions and 6 deletions

View File

@@ -0,0 +1,122 @@
# Async Exec Duplicate Completion Investigation
## Scope
- Session: `agent:main:telegram:group:-1003774691294:topic:1`
- Symptom: the same async exec completion for session/run `keen-nexus` was recorded twice in LCM as user turns.
- Goal: identify whether this is most likely duplicate session injection or plain outbound delivery retry.
## Conclusion
Most likely this is **duplicate session injection**, not a pure outbound delivery retry.
The strongest gateway-side gap is in the **node exec completion path**:
1. A node-side exec finish emits `exec.finished` with the full `runId`.
2. Gateway `server-node-events` converts that into a system event and requests a heartbeat.
3. The heartbeat run injects the drained system event block into the agent prompt.
4. The embedded runner persists that prompt as a new user turn in the session transcript.
If the same `exec.finished` reaches the gateway twice for the same `runId` for any reason (replay, reconnect duplicate, upstream resend, duplicated producer), OpenClaw currently has **no idempotency check keyed by `runId`/`contextKey`** on this path. The second copy will become a second user message with the same content.
## Exact Code Path
### 1. Producer: node exec completion event
- `src/node-host/invoke.ts:340-360`
- `sendExecFinishedEvent(...)` emits `node.event` with event `exec.finished`.
- Payload includes `sessionKey` and full `runId`.
### 2. Gateway event ingestion
- `src/gateway/server-node-events.ts:574-640`
- Handles `exec.finished`.
- Builds text:
- `Exec finished (node=..., id=<runId>, code ...)`
- Enqueues it via:
- `enqueueSystemEvent(text, { sessionKey, contextKey: runId ? \`exec:${runId}\` : "exec", trusted: false })`
- Immediately requests a wake:
- `requestHeartbeatNow(scopedHeartbeatWakeOptions(sessionKey, { reason: "exec-event" }))`
### 3. System event dedupe weakness
- `src/infra/system-events.ts:90-115`
- `enqueueSystemEvent(...)` only suppresses **consecutive duplicate text**:
- `if (entry.lastText === cleaned) return false`
- It stores `contextKey`, but does **not** use `contextKey` for idempotency.
- After drain, duplicate suppression resets.
This means a replayed `exec.finished` with the same `runId` can be accepted again later, even though the code already had a stable idempotency candidate (`exec:<runId>`).
### 4. Wake handling is not the primary duplicator
- `src/infra/heartbeat-wake.ts:79-117`
- Wakes are coalesced by `(agentId, sessionKey)`.
- Duplicate wake requests for the same target collapse to one pending wake entry.
This makes **duplicate wake handling alone** a weaker explanation than duplicate event ingestion.
### 5. Heartbeat consumes the event and turns it into prompt input
- `src/infra/heartbeat-runner.ts:535-574`
- Preflight peeks pending system events and classifies exec-event runs.
- `src/auto-reply/reply/session-system-events.ts:86-90`
- `drainFormattedSystemEvents(...)` drains the queue for the session.
- `src/auto-reply/reply/get-reply-run.ts:400-427`
- The drained system event block is prepended into the agent prompt body.
### 6. Transcript injection point
- `src/agents/pi-embedded-runner/run/attempt.ts:2000-2017`
- `activeSession.prompt(effectivePrompt)` submits the full prompt to the embedded PI session.
- That is the point where the completion-derived prompt becomes a persisted user turn.
So once the same system event is rebuilt into the prompt twice, duplicate LCM user messages are expected.
## Why plain outbound delivery retry is less likely
There is a real outbound failure path in the heartbeat runner:
- `src/infra/heartbeat-runner.ts:1194-1242`
- The reply is generated first.
- Outbound delivery happens later via `deliverOutboundPayloads(...)`.
- Failure there returns `{ status: "failed" }`.
However, for the same system event queue entry, this alone is **not sufficient** to explain the duplicate user turns:
- `src/auto-reply/reply/session-system-events.ts:86-90`
- The system event queue is already drained before outbound delivery.
So a channel send retry by itself would not recreate the exact same queued event. It could explain missing/failed external delivery, but not by itself a second identical session user message.
## Secondary, lower-confidence possibility
There is a full-run retry loop in the agent runner:
- `src/auto-reply/reply/agent-runner-execution.ts:741-1473`
- Certain transient failures can retry the whole run and resubmit the same `commandBody`.
That can duplicate a persisted user prompt **within the same reply execution** if the prompt was already appended before the retry condition triggered.
I rank this lower than duplicate `exec.finished` ingestion because:
- the observed gap was around 51 seconds, which looks more like a second wake/turn than an in-process retry;
- the report already mentions repeated message send failures, which points more toward a separate later turn than an immediate model/runtime retry.
## Root Cause Hypothesis
Highest-confidence hypothesis:
- The `keen-nexus` completion came through the **node exec event path**.
- The same `exec.finished` was delivered to `server-node-events` twice.
- Gateway accepted both because `enqueueSystemEvent(...)` does not dedupe by `contextKey` / `runId`.
- Each accepted event triggered a heartbeat and was injected as a user turn into the PI transcript.
## Proposed Tiny Surgical Fix
If a fix is wanted, the smallest high-value change is:
- make exec/system-event idempotency honor `contextKey` for a short horizon, at least for exact `(sessionKey, contextKey, text)` repeats;
- or add a dedicated dedupe in `server-node-events` for `exec.finished` keyed by `(sessionKey, runId, event kind)`.
That would directly block replayed `exec.finished` duplicates before they become session turns.