mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-08 18:20:42 +00:00
134 lines
6.0 KiB
Markdown
134 lines
6.0 KiB
Markdown
---
|
|
summary: "Investigation notes for duplicate async exec completion injection"
|
|
read_when:
|
|
- Debugging repeated node exec completion events
|
|
- Working on heartbeat/system-event dedupe
|
|
title: "Async exec duplicate completion investigation"
|
|
---
|
|
|
|
## Scope
|
|
|
|
- Session: `agent:main:telegram:group:-1003774691294:topic:1`
|
|
- Symptom: the same async exec completion for session/run `keen-nexus` was recorded twice in LCM as user turns.
|
|
- Goal: identify whether this is most likely duplicate session injection or plain outbound delivery retry.
|
|
|
|
## Conclusion
|
|
|
|
Most likely this is **duplicate session injection**, not a pure outbound delivery retry.
|
|
|
|
The strongest gateway-side gap is in the **node exec completion path**:
|
|
|
|
1. A node-side exec finish emits `exec.finished` with the full `runId`.
|
|
2. Gateway `server-node-events` converts that into a system event and requests a heartbeat.
|
|
3. The heartbeat run injects the drained system event block into the agent prompt.
|
|
4. The embedded runner persists that prompt as a new user turn in the session transcript.
|
|
|
|
If the same `exec.finished` reaches the gateway twice for the same `runId` for any reason (replay, reconnect duplicate, upstream resend, duplicated producer), OpenClaw currently has **no idempotency check keyed by `runId`/`contextKey`** on this path. The second copy will become a second user message with the same content.
|
|
|
|
## Exact Code Path
|
|
|
|
### 1. Producer: node exec completion event
|
|
|
|
- `src/node-host/invoke.ts:340-360`
|
|
- `sendExecFinishedEvent(...)` emits `node.event` with event `exec.finished`.
|
|
- Payload includes `sessionKey` and full `runId`.
|
|
|
|
### 2. Gateway event ingestion
|
|
|
|
- `src/gateway/server-node-events.ts:574-640`
|
|
- Handles `exec.finished`.
|
|
- Builds text:
|
|
- `Exec finished (node=..., id=<runId>, code ...)`
|
|
- Enqueues it via:
|
|
- `enqueueSystemEvent(text, { sessionKey, contextKey: runId ? \`exec:${runId}\` : "exec", trusted: false })`
|
|
- Immediately requests a wake:
|
|
- `requestHeartbeatNow(scopedHeartbeatWakeOptions(sessionKey, { reason: "exec-event" }))`
|
|
|
|
### 3. System event dedupe weakness
|
|
|
|
- `src/infra/system-events.ts:90-115`
|
|
- `enqueueSystemEvent(...)` only suppresses **consecutive duplicate text**:
|
|
- `if (entry.lastText === cleaned) return false`
|
|
- It stores `contextKey`, but does **not** use `contextKey` for idempotency.
|
|
- After drain, duplicate suppression resets.
|
|
|
|
This means a replayed `exec.finished` with the same `runId` can be accepted again later, even though the code already had a stable idempotency candidate (`exec:<runId>`).
|
|
|
|
### 4. Wake handling is not the primary duplicator
|
|
|
|
- `src/infra/heartbeat-wake.ts:79-117`
|
|
- Wakes are coalesced by `(agentId, sessionKey)`.
|
|
- Duplicate wake requests for the same target collapse to one pending wake entry.
|
|
|
|
This makes **duplicate wake handling alone** a weaker explanation than duplicate event ingestion.
|
|
|
|
### 5. Heartbeat consumes the event and turns it into prompt input
|
|
|
|
- `src/infra/heartbeat-runner.ts:535-574`
|
|
- Preflight peeks pending system events and classifies exec-event runs.
|
|
- `src/auto-reply/reply/session-system-events.ts:86-90`
|
|
- `drainFormattedSystemEvents(...)` drains the queue for the session.
|
|
- `src/auto-reply/reply/get-reply-run.ts:400-427`
|
|
- The drained system event block is prepended into the agent prompt body.
|
|
|
|
### 6. Transcript injection point
|
|
|
|
- `src/agents/pi-embedded-runner/run/attempt.ts:2000-2017`
|
|
- `activeSession.prompt(effectivePrompt)` submits the full prompt to the embedded PI session.
|
|
- That is the point where the completion-derived prompt becomes a persisted user turn.
|
|
|
|
So once the same system event is rebuilt into the prompt twice, duplicate LCM user messages are expected.
|
|
|
|
## Why plain outbound delivery retry is less likely
|
|
|
|
There is a real outbound failure path in the heartbeat runner:
|
|
|
|
- `src/infra/heartbeat-runner.ts:1194-1242`
|
|
- The reply is generated first.
|
|
- Outbound delivery happens later via `deliverOutboundPayloads(...)`.
|
|
- Failure there returns `{ status: "failed" }`.
|
|
|
|
However, for the same system event queue entry, this alone is **not sufficient** to explain the duplicate user turns:
|
|
|
|
- `src/auto-reply/reply/session-system-events.ts:86-90`
|
|
- The system event queue is already drained before outbound delivery.
|
|
|
|
So a channel send retry by itself would not recreate the exact same queued event. It could explain missing/failed external delivery, but not by itself a second identical session user message.
|
|
|
|
## Secondary, lower-confidence possibility
|
|
|
|
There is a full-run retry loop in the agent runner:
|
|
|
|
- `src/auto-reply/reply/agent-runner-execution.ts:741-1473`
|
|
- Certain transient failures can retry the whole run and resubmit the same `commandBody`.
|
|
|
|
That can duplicate a persisted user prompt **within the same reply execution** if the prompt was already appended before the retry condition triggered.
|
|
|
|
I rank this lower than duplicate `exec.finished` ingestion because:
|
|
|
|
- the observed gap was around 51 seconds, which looks more like a second wake/turn than an in-process retry;
|
|
- the report already mentions repeated message send failures, which points more toward a separate later turn than an immediate model/runtime retry.
|
|
|
|
## Root Cause Hypothesis
|
|
|
|
Highest-confidence hypothesis:
|
|
|
|
- The `keen-nexus` completion came through the **node exec event path**.
|
|
- The same `exec.finished` was delivered to `server-node-events` twice.
|
|
- Gateway accepted both because `enqueueSystemEvent(...)` does not dedupe by `contextKey` / `runId`.
|
|
- Each accepted event triggered a heartbeat and was injected as a user turn into the PI transcript.
|
|
|
|
## Proposed Tiny Surgical Fix
|
|
|
|
If a fix is wanted, the smallest high-value change is:
|
|
|
|
- make exec/system-event idempotency honor `contextKey` for a short horizon, at least for exact `(sessionKey, contextKey, text)` repeats;
|
|
- or add a dedicated dedupe in `server-node-events` for `exec.finished` keyed by `(sessionKey, runId, event kind)`.
|
|
|
|
That would directly block replayed `exec.finished` duplicates before they become session turns.
|
|
|
|
## Related
|
|
|
|
- [Exec tool](/tools/exec)
|
|
- [Session management](/concepts/session)
|