6.0 KiB
summary, read_when, title
| summary | read_when | title | ||
|---|---|---|---|---|
| Investigation notes for duplicate async exec completion injection |
|
Async exec duplicate completion investigation |
Scope
- Session:
agent:main:telegram:group:-1003774691294:topic:1 - Symptom: the same async exec completion for session/run
keen-nexuswas recorded twice in LCM as user turns. - Goal: identify whether this is most likely duplicate session injection or plain outbound delivery retry.
Conclusion
Most likely this is duplicate session injection, not a pure outbound delivery retry.
The strongest gateway-side gap is in the node exec completion path:
- A node-side exec finish emits
exec.finishedwith the fullrunId. - Gateway
server-node-eventsconverts that into a system event and requests a heartbeat. - The heartbeat run injects the drained system event block into the agent prompt.
- The embedded runner persists that prompt as a new user turn in the session transcript.
If the same exec.finished reaches the gateway twice for the same runId for any reason (replay, reconnect duplicate, upstream resend, duplicated producer), OpenClaw currently has no idempotency check keyed by runId/contextKey on this path. The second copy will become a second user message with the same content.
Exact Code Path
1. Producer: node exec completion event
src/node-host/invoke.ts:340-360sendExecFinishedEvent(...)emitsnode.eventwith eventexec.finished.- Payload includes
sessionKeyand fullrunId.
2. Gateway event ingestion
src/gateway/server-node-events.ts:574-640- Handles
exec.finished. - Builds text:
Exec finished (node=..., id=<runId>, code ...)
- Enqueues it via:
enqueueSystemEvent(text, { sessionKey, contextKey: runId ? \exec:${runId}` : "exec", trusted: false })`
- Immediately requests a wake:
requestHeartbeatNow(scopedHeartbeatWakeOptions(sessionKey, { reason: "exec-event" }))
- Handles
3. System event dedupe weakness
src/infra/system-events.ts:90-115enqueueSystemEvent(...)only suppresses consecutive duplicate text:if (entry.lastText === cleaned) return false
- It stores
contextKey, but does not usecontextKeyfor idempotency. - After drain, duplicate suppression resets.
This means a replayed exec.finished with the same runId can be accepted again later, even though the code already had a stable idempotency candidate (exec:<runId>).
4. Wake handling is not the primary duplicator
src/infra/heartbeat-wake.ts:79-117- Wakes are coalesced by
(agentId, sessionKey). - Duplicate wake requests for the same target collapse to one pending wake entry.
- Wakes are coalesced by
This makes duplicate wake handling alone a weaker explanation than duplicate event ingestion.
5. Heartbeat consumes the event and turns it into prompt input
src/infra/heartbeat-runner.ts:535-574- Preflight peeks pending system events and classifies exec-event runs.
src/auto-reply/reply/session-system-events.ts:86-90drainFormattedSystemEvents(...)drains the queue for the session.
src/auto-reply/reply/get-reply-run.ts:400-427- The drained system event block is prepended into the agent prompt body.
6. Transcript injection point
src/agents/pi-embedded-runner/run/attempt.ts:2000-2017activeSession.prompt(effectivePrompt)submits the full prompt to the embedded PI session.- That is the point where the completion-derived prompt becomes a persisted user turn.
So once the same system event is rebuilt into the prompt twice, duplicate LCM user messages are expected.
Why plain outbound delivery retry is less likely
There is a real outbound failure path in the heartbeat runner:
src/infra/heartbeat-runner.ts:1194-1242- The reply is generated first.
- Outbound delivery happens later via
deliverOutboundPayloads(...). - Failure there returns
{ status: "failed" }.
However, for the same system event queue entry, this alone is not sufficient to explain the duplicate user turns:
src/auto-reply/reply/session-system-events.ts:86-90- The system event queue is already drained before outbound delivery.
So a channel send retry by itself would not recreate the exact same queued event. It could explain missing/failed external delivery, but not by itself a second identical session user message.
Secondary, lower-confidence possibility
There is a full-run retry loop in the agent runner:
src/auto-reply/reply/agent-runner-execution.ts:741-1473- Certain transient failures can retry the whole run and resubmit the same
commandBody.
- Certain transient failures can retry the whole run and resubmit the same
That can duplicate a persisted user prompt within the same reply execution if the prompt was already appended before the retry condition triggered.
I rank this lower than duplicate exec.finished ingestion because:
- the observed gap was around 51 seconds, which looks more like a second wake/turn than an in-process retry;
- the report already mentions repeated message send failures, which points more toward a separate later turn than an immediate model/runtime retry.
Root Cause Hypothesis
Highest-confidence hypothesis:
- The
keen-nexuscompletion came through the node exec event path. - The same
exec.finishedwas delivered toserver-node-eventstwice. - Gateway accepted both because
enqueueSystemEvent(...)does not dedupe bycontextKey/runId. - Each accepted event triggered a heartbeat and was injected as a user turn into the PI transcript.
Proposed Tiny Surgical Fix
If a fix is wanted, the smallest high-value change is:
- make exec/system-event idempotency honor
contextKeyfor a short horizon, at least for exact(sessionKey, contextKey, text)repeats; - or add a dedicated dedupe in
server-node-eventsforexec.finishedkeyed by(sessionKey, runId, event kind).
That would directly block replayed exec.finished duplicates before they become session turns.