Files
openclaw/docs/refactor/async-exec-duplicate-completion-investigation.md
Josh Lehman 5dcf526a43 fix: dedupe replayed exec.finished node events (#67281)
* docs: add async exec duplicate completion investigation

Add an internal refactor note tracing the node exec completion to system event to heartbeat to transcript path for duplicate async exec injections. Document the most likely gateway-side gap as missing idempotency for replayed exec.finished events, and note why plain outbound delivery retry is a weaker fit for duplicate user turns.

Regeneration-Prompt: |
  Investigate a live duplicate async exec completion that appeared as two identical user turns in an OpenClaw session. Trace the completion path from exec producers into enqueueSystemEvent, heartbeat wake scheduling, prompt assembly, and embedded transcript persistence. Decide whether duplicate wake handling, outbound delivery retry, or duplicate completion event ingestion is the more likely cause, cite the exact code locations, and capture the smallest plausible fix seam without making runtime changes.

* fix: dedupe replayed exec finished node events

Add a narrow idempotency guard in the gateway node-event handler for repeated exec.finished events with the same canonical session key and runId. This blocks replayed async exec completions from being enqueued and heartbeated twice into the parent session. Also only request a heartbeat when the system event was actually queued, and add a regression test for duplicate runId injection.

Regeneration-Prompt: |
  Prevent duplicate async exec completion events from being injected twice into the parent session. Keep the scope tight around the highest-confidence path: node exec.finished events entering gateway server-node-events and becoming system-event-driven heartbeat prompts. Add a small idempotency guard keyed by canonical session plus exec runId, avoid broader delivery or retry changes unless needed, and add regression coverage that fails if the same exec.finished replay is enqueued and woken twice.

* fix: note exec finished replay dedupe
2026-04-15 13:06:18 -07:00

5.7 KiB

Async Exec Duplicate Completion Investigation

Scope

  • Session: agent:main:telegram:group:-1003774691294:topic:1
  • Symptom: the same async exec completion for session/run keen-nexus was recorded twice in LCM as user turns.
  • Goal: identify whether this is most likely duplicate session injection or plain outbound delivery retry.

Conclusion

Most likely this is duplicate session injection, not a pure outbound delivery retry.

The strongest gateway-side gap is in the node exec completion path:

  1. A node-side exec finish emits exec.finished with the full runId.
  2. Gateway server-node-events converts that into a system event and requests a heartbeat.
  3. The heartbeat run injects the drained system event block into the agent prompt.
  4. The embedded runner persists that prompt as a new user turn in the session transcript.

If the same exec.finished reaches the gateway twice for the same runId for any reason (replay, reconnect duplicate, upstream resend, duplicated producer), OpenClaw currently has no idempotency check keyed by runId/contextKey on this path. The second copy will become a second user message with the same content.

Exact Code Path

1. Producer: node exec completion event

  • src/node-host/invoke.ts:340-360
    • sendExecFinishedEvent(...) emits node.event with event exec.finished.
    • Payload includes sessionKey and full runId.

2. Gateway event ingestion

  • src/gateway/server-node-events.ts:574-640
    • Handles exec.finished.
    • Builds text:
      • Exec finished (node=..., id=<runId>, code ...)
    • Enqueues it via:
      • enqueueSystemEvent(text, { sessionKey, contextKey: runId ? \exec:${runId}` : "exec", trusted: false })`
    • Immediately requests a wake:
      • requestHeartbeatNow(scopedHeartbeatWakeOptions(sessionKey, { reason: "exec-event" }))

3. System event dedupe weakness

  • src/infra/system-events.ts:90-115
    • enqueueSystemEvent(...) only suppresses consecutive duplicate text:
      • if (entry.lastText === cleaned) return false
    • It stores contextKey, but does not use contextKey for idempotency.
    • After drain, duplicate suppression resets.

This means a replayed exec.finished with the same runId can be accepted again later, even though the code already had a stable idempotency candidate (exec:<runId>).

4. Wake handling is not the primary duplicator

  • src/infra/heartbeat-wake.ts:79-117
    • Wakes are coalesced by (agentId, sessionKey).
    • Duplicate wake requests for the same target collapse to one pending wake entry.

This makes duplicate wake handling alone a weaker explanation than duplicate event ingestion.

5. Heartbeat consumes the event and turns it into prompt input

  • src/infra/heartbeat-runner.ts:535-574
    • Preflight peeks pending system events and classifies exec-event runs.
  • src/auto-reply/reply/session-system-events.ts:86-90
    • drainFormattedSystemEvents(...) drains the queue for the session.
  • src/auto-reply/reply/get-reply-run.ts:400-427
    • The drained system event block is prepended into the agent prompt body.

6. Transcript injection point

  • src/agents/pi-embedded-runner/run/attempt.ts:2000-2017
    • activeSession.prompt(effectivePrompt) submits the full prompt to the embedded PI session.
    • That is the point where the completion-derived prompt becomes a persisted user turn.

So once the same system event is rebuilt into the prompt twice, duplicate LCM user messages are expected.

Why plain outbound delivery retry is less likely

There is a real outbound failure path in the heartbeat runner:

  • src/infra/heartbeat-runner.ts:1194-1242
    • The reply is generated first.
    • Outbound delivery happens later via deliverOutboundPayloads(...).
    • Failure there returns { status: "failed" }.

However, for the same system event queue entry, this alone is not sufficient to explain the duplicate user turns:

  • src/auto-reply/reply/session-system-events.ts:86-90
    • The system event queue is already drained before outbound delivery.

So a channel send retry by itself would not recreate the exact same queued event. It could explain missing/failed external delivery, but not by itself a second identical session user message.

Secondary, lower-confidence possibility

There is a full-run retry loop in the agent runner:

  • src/auto-reply/reply/agent-runner-execution.ts:741-1473
    • Certain transient failures can retry the whole run and resubmit the same commandBody.

That can duplicate a persisted user prompt within the same reply execution if the prompt was already appended before the retry condition triggered.

I rank this lower than duplicate exec.finished ingestion because:

  • the observed gap was around 51 seconds, which looks more like a second wake/turn than an in-process retry;
  • the report already mentions repeated message send failures, which points more toward a separate later turn than an immediate model/runtime retry.

Root Cause Hypothesis

Highest-confidence hypothesis:

  • The keen-nexus completion came through the node exec event path.
  • The same exec.finished was delivered to server-node-events twice.
  • Gateway accepted both because enqueueSystemEvent(...) does not dedupe by contextKey / runId.
  • Each accepted event triggered a heartbeat and was injected as a user turn into the PI transcript.

Proposed Tiny Surgical Fix

If a fix is wanted, the smallest high-value change is:

  • make exec/system-event idempotency honor contextKey for a short horizon, at least for exact (sessionKey, contextKey, text) repeats;
  • or add a dedicated dedupe in server-node-events for exec.finished keyed by (sessionKey, runId, event kind).

That would directly block replayed exec.finished duplicates before they become session turns.