Files
openclaw/docs/refactor/async-exec-duplicate-completion-investigation.md

6.0 KiB

summary, read_when, title
summary read_when title
Investigation notes for duplicate async exec completion injection
Debugging repeated node exec completion events
Working on heartbeat/system-event dedupe
Async exec duplicate completion investigation

Scope

  • Session: agent:main:telegram:group:-1003774691294:topic:1
  • Symptom: the same async exec completion for session/run keen-nexus was recorded twice in LCM as user turns.
  • Goal: identify whether this is most likely duplicate session injection or plain outbound delivery retry.

Conclusion

Most likely this is duplicate session injection, not a pure outbound delivery retry.

The strongest gateway-side gap is in the node exec completion path:

  1. A node-side exec finish emits exec.finished with the full runId.
  2. Gateway server-node-events converts that into a system event and requests a heartbeat.
  3. The heartbeat run injects the drained system event block into the agent prompt.
  4. The embedded runner persists that prompt as a new user turn in the session transcript.

If the same exec.finished reaches the gateway twice for the same runId for any reason (replay, reconnect duplicate, upstream resend, duplicated producer), OpenClaw currently has no idempotency check keyed by runId/contextKey on this path. The second copy will become a second user message with the same content.

Exact Code Path

1. Producer: node exec completion event

  • src/node-host/invoke.ts:340-360
    • sendExecFinishedEvent(...) emits node.event with event exec.finished.
    • Payload includes sessionKey and full runId.

2. Gateway event ingestion

  • src/gateway/server-node-events.ts:574-640
    • Handles exec.finished.
    • Builds text:
      • Exec finished (node=..., id=<runId>, code ...)
    • Enqueues it via:
      • enqueueSystemEvent(text, { sessionKey, contextKey: runId ? \exec:${runId}` : "exec", trusted: false })`
    • Immediately requests a wake:
      • requestHeartbeatNow(scopedHeartbeatWakeOptions(sessionKey, { reason: "exec-event" }))

3. System event dedupe weakness

  • src/infra/system-events.ts:90-115
    • enqueueSystemEvent(...) only suppresses consecutive duplicate text:
      • if (entry.lastText === cleaned) return false
    • It stores contextKey, but does not use contextKey for idempotency.
    • After drain, duplicate suppression resets.

This means a replayed exec.finished with the same runId can be accepted again later, even though the code already had a stable idempotency candidate (exec:<runId>).

4. Wake handling is not the primary duplicator

  • src/infra/heartbeat-wake.ts:79-117
    • Wakes are coalesced by (agentId, sessionKey).
    • Duplicate wake requests for the same target collapse to one pending wake entry.

This makes duplicate wake handling alone a weaker explanation than duplicate event ingestion.

5. Heartbeat consumes the event and turns it into prompt input

  • src/infra/heartbeat-runner.ts:535-574
    • Preflight peeks pending system events and classifies exec-event runs.
  • src/auto-reply/reply/session-system-events.ts:86-90
    • drainFormattedSystemEvents(...) drains the queue for the session.
  • src/auto-reply/reply/get-reply-run.ts:400-427
    • The drained system event block is prepended into the agent prompt body.

6. Transcript injection point

  • src/agents/pi-embedded-runner/run/attempt.ts:2000-2017
    • activeSession.prompt(effectivePrompt) submits the full prompt to the embedded PI session.
    • That is the point where the completion-derived prompt becomes a persisted user turn.

So once the same system event is rebuilt into the prompt twice, duplicate LCM user messages are expected.

Why plain outbound delivery retry is less likely

There is a real outbound failure path in the heartbeat runner:

  • src/infra/heartbeat-runner.ts:1194-1242
    • The reply is generated first.
    • Outbound delivery happens later via deliverOutboundPayloads(...).
    • Failure there returns { status: "failed" }.

However, for the same system event queue entry, this alone is not sufficient to explain the duplicate user turns:

  • src/auto-reply/reply/session-system-events.ts:86-90
    • The system event queue is already drained before outbound delivery.

So a channel send retry by itself would not recreate the exact same queued event. It could explain missing/failed external delivery, but not by itself a second identical session user message.

Secondary, lower-confidence possibility

There is a full-run retry loop in the agent runner:

  • src/auto-reply/reply/agent-runner-execution.ts:741-1473
    • Certain transient failures can retry the whole run and resubmit the same commandBody.

That can duplicate a persisted user prompt within the same reply execution if the prompt was already appended before the retry condition triggered.

I rank this lower than duplicate exec.finished ingestion because:

  • the observed gap was around 51 seconds, which looks more like a second wake/turn than an in-process retry;
  • the report already mentions repeated message send failures, which points more toward a separate later turn than an immediate model/runtime retry.

Root Cause Hypothesis

Highest-confidence hypothesis:

  • The keen-nexus completion came through the node exec event path.
  • The same exec.finished was delivered to server-node-events twice.
  • Gateway accepted both because enqueueSystemEvent(...) does not dedupe by contextKey / runId.
  • Each accepted event triggered a heartbeat and was injected as a user turn into the PI transcript.

Proposed Tiny Surgical Fix

If a fix is wanted, the smallest high-value change is:

  • make exec/system-event idempotency honor contextKey for a short horizon, at least for exact (sessionKey, contextKey, text) repeats;
  • or add a dedicated dedupe in server-node-events for exec.finished keyed by (sessionKey, runId, event kind).

That would directly block replayed exec.finished duplicates before they become session turns.