From 9be8d43c3182c2b773bbb25a79a08895320addab Mon Sep 17 00:00:00 2001
From: Peter Steinberger <steipete@gmail.com>
Date: Mon, 27 Apr 2026 00:25:56 +0100
Subject: [PATCH] docs: document installer recovery cleanup

Add a recovery note to docs/install/updating.md: when `openclaw update`
fails after the npm package install phase, re-run the installer with
`--install-method npm`, optionally pinned with `--version`.

Also delete docs/refactor/async-exec-duplicate-completion-investigation.md
and docs/refactor/qa.md.
---
 docs/install/updating.md                      |  14 +
 ...exec-duplicate-completion-investigation.md | 133 -----
 docs/refactor/qa.md                           | 540 ------------------
 3 files changed, 14 insertions(+), 673 deletions(-)
 delete mode 100644 docs/refactor/async-exec-duplicate-completion-investigation.md
 delete mode 100644 docs/refactor/qa.md
diff --git a/docs/install/updating.md b/docs/install/updating.md
index 56af3187ebd..e5384bf450b 100644
--- a/docs/install/updating.md
+++ b/docs/install/updating.md
@@ -67,6 +67,20 @@ Add `--no-onboard` to skip onboarding. To force a specific install type through
 the installer, pass `--install-method git --no-onboard` or
 `--install-method npm --no-onboard`.
 
+If `openclaw update` fails after the npm package install phase, re-run the
+installer. It does not call the old updater; it runs the global package
+install directly and can recover a partially updated npm install.
+
+```bash
+curl -fsSL https://openclaw.ai/install.sh | bash -s -- --install-method npm
+```
+
+To pin the recovery to a specific version or dist-tag, add `--version`:
+
+```bash
+curl -fsSL https://openclaw.ai/install.sh | bash -s -- --install-method npm --version <version-or-dist-tag>
+```
+
 ## Alternative: manual npm, pnpm, or bun
 
 ```bash
diff --git a/docs/refactor/async-exec-duplicate-completion-investigation.md b/docs/refactor/async-exec-duplicate-completion-investigation.md
deleted file mode 100644
index 8f92ae3ed0c..00000000000
--- a/docs/refactor/async-exec-duplicate-completion-investigation.md
+++ /dev/null
@@ -1,133 +0,0 @@
----
-summary: "Investigation notes for duplicate async exec completion injection"
-read_when:
-  - Debugging repeated node exec completion events
-  - Working on heartbeat/system-event dedupe
-title: "Async exec duplicate completion investigation"
----
-
-## Scope
-
-- Session: `agent:main:telegram:group:-1003774691294:topic:1`
-- Symptom: the same async exec completion for session/run `keen-nexus` was recorded twice in LCM as user turns.
-- Goal: identify whether this is most likely duplicate session injection or plain outbound delivery retry.
-
-## Conclusion
-
-Most likely this is **duplicate session injection**, not a pure outbound delivery retry.
-
-The strongest gateway-side gap is in the **node exec completion path**:
-
-1. A node-side exec finish emits `exec.finished` with the full `runId`.
-2. Gateway `server-node-events` converts that into a system event and requests a heartbeat.
-3. The heartbeat run injects the drained system event block into the agent prompt.
-4. The embedded runner persists that prompt as a new user turn in the session transcript.
-
-If the same `exec.finished` reaches the gateway twice for the same `runId` for any reason (replay, reconnect duplicate, upstream resend, duplicated producer), OpenClaw currently has **no idempotency check keyed by `runId`/`contextKey`** on this path. The second copy will become a second user message with the same content.
-
-## Exact Code Path
-
-### 1. Producer: node exec completion event
-
-- `src/node-host/invoke.ts:340-360`
-  - `sendExecFinishedEvent(...)` emits `node.event` with event `exec.finished`.
-  - Payload includes `sessionKey` and full `runId`.
-
-### 2. Gateway event ingestion
-
-- `src/gateway/server-node-events.ts:574-640`
-  - Handles `exec.finished`.
-  - Builds text:
-    - `Exec finished (node=..., id=<runId>, code ...)`
-  - Enqueues it via:
-    - `enqueueSystemEvent(text, { sessionKey, contextKey: runId ? \`exec:${runId}\` : "exec", trusted: false })`
-  - Immediately requests a wake:
-    - `requestHeartbeatNow(scopedHeartbeatWakeOptions(sessionKey, { reason: "exec-event" }))`
-
-### 3. System event dedupe weakness
-
-- `src/infra/system-events.ts:90-115`
-  - `enqueueSystemEvent(...)` only suppresses **consecutive duplicate text**:
-    - `if (entry.lastText === cleaned) return false`
-  - It stores `contextKey`, but does **not** use `contextKey` for idempotency.
-  - After drain, duplicate suppression resets.
-
-This means a replayed `exec.finished` with the same `runId` can be accepted again later, even though the code already had a stable idempotency candidate (`exec:<runId>`).
-
-### 4. Wake handling is not the primary duplicator
-
-- `src/infra/heartbeat-wake.ts:79-117`
-  - Wakes are coalesced by `(agentId, sessionKey)`.
-  - Duplicate wake requests for the same target collapse to one pending wake entry.
-
-This makes **duplicate wake handling alone** a weaker explanation than duplicate event ingestion.
-
-### 5. Heartbeat consumes the event and turns it into prompt input
-
-- `src/infra/heartbeat-runner.ts:535-574`
-  - Preflight peeks pending system events and classifies exec-event runs.
-- `src/auto-reply/reply/session-system-events.ts:86-90`
-  - `drainFormattedSystemEvents(...)` drains the queue for the session.
-- `src/auto-reply/reply/get-reply-run.ts:400-427`
-  - The drained system event block is prepended into the agent prompt body.
-
-### 6. Transcript injection point
-
-- `src/agents/pi-embedded-runner/run/attempt.ts:2000-2017`
-  - `activeSession.prompt(effectivePrompt)` submits the full prompt to the embedded PI session.
-  - That is the point where the completion-derived prompt becomes a persisted user turn.
-
-So once the same system event is rebuilt into the prompt twice, duplicate LCM user messages are expected.
-
-## Why plain outbound delivery retry is less likely
-
-There is a real outbound failure path in the heartbeat runner:
-
-- `src/infra/heartbeat-runner.ts:1194-1242`
-  - The reply is generated first.
-  - Outbound delivery happens later via `deliverOutboundPayloads(...)`.
-  - Failure there returns `{ status: "failed" }`.
-
-However, for the same system event queue entry, this alone is **not sufficient** to explain the duplicate user turns:
-
-- `src/auto-reply/reply/session-system-events.ts:86-90`
-  - The system event queue is already drained before outbound delivery.
-
-So a channel send retry by itself would not recreate the exact same queued event. It could explain missing/failed external delivery, but not by itself a second identical session user message.
-
-## Secondary, lower-confidence possibility
-
-There is a full-run retry loop in the agent runner:
-
-- `src/auto-reply/reply/agent-runner-execution.ts:741-1473`
-  - Certain transient failures can retry the whole run and resubmit the same `commandBody`.
-
-That can duplicate a persisted user prompt **within the same reply execution** if the prompt was already appended before the retry condition triggered.
-
-I rank this lower than duplicate `exec.finished` ingestion because:
-
-- the observed gap was around 51 seconds, which looks more like a second wake/turn than an in-process retry;
-- the report already mentions repeated message send failures, which points more toward a separate later turn than an immediate model/runtime retry.
-
-## Root Cause Hypothesis
-
-Highest-confidence hypothesis:
-
-- The `keen-nexus` completion came through the **node exec event path**.
-- The same `exec.finished` was delivered to `server-node-events` twice.
-- Gateway accepted both because `enqueueSystemEvent(...)` does not dedupe by `contextKey` / `runId`.
-- Each accepted event triggered a heartbeat and was injected as a user turn into the PI transcript.
-
-## Proposed Tiny Surgical Fix
-
-If a fix is wanted, the smallest high-value change is:
-
-- make exec/system-event idempotency honor `contextKey` for a short horizon, at least for exact `(sessionKey, contextKey, text)` repeats;
-- or add a dedicated dedupe in `server-node-events` for `exec.finished` keyed by `(sessionKey, runId, event kind)`.
-
-That would directly block replayed `exec.finished` duplicates before they become session turns.
-
-## Related
-
-- [Exec tool](/tools/exec)
-- [Session management](/concepts/session)
diff --git a/docs/refactor/qa.md b/docs/refactor/qa.md
deleted file mode 100644
index 4770aeafe7a..00000000000
--- a/docs/refactor/qa.md
+++ /dev/null
@@ -1,540 +0,0 @@
----
-summary: "QA refactor plan for scenario catalog and harness consolidation"
-read_when:
-  - Refactoring QA scenario definitions or qa-lab harness code
-  - Moving QA behavior between markdown scenarios and TypeScript harness logic
-title: "QA refactor"
----
-
-Status: foundational migration landed.
-
-## Goal
-
-Move OpenClaw QA from a split-definition model to a single source of truth:
-
-- scenario metadata
-- prompts sent to the model
-- setup and teardown
-- harness logic
-- assertions and success criteria
-- artifacts and report hints
-
-The desired end state is a generic QA harness that loads powerful scenario definition files instead of hardcoding most behavior in TypeScript.
-
-## Current State
-
-Primary source of truth now lives in `qa/scenarios/index.md` plus one file per
-scenario under `qa/scenarios/<theme>/*.md`.
-
-Implemented:
-
-- `qa/scenarios/index.md`
-  - canonical QA pack metadata
-  - operator identity
-  - kickoff mission
-- `qa/scenarios/<theme>/*.md`
-  - one markdown file per scenario
-  - scenario metadata
-  - handler bindings
-  - scenario-specific execution config
-- `extensions/qa-lab/src/scenario-catalog.ts`
-  - markdown pack parser + zod validation
-- `extensions/qa-lab/src/qa-agent-bootstrap.ts`
-  - plan rendering from the markdown pack
-- `extensions/qa-lab/src/qa-agent-workspace.ts`
-  - seeds generated compatibility files plus `QA_SCENARIOS.md`
-- `extensions/qa-lab/src/suite.ts`
-  - selects executable scenarios through markdown-defined handler bindings
-- QA bus protocol + UI
-  - generic inline attachments for image/video/audio/file rendering
-
-Remaining split surfaces:
-
-- `extensions/qa-lab/src/suite.ts`
-  - still owns most executable custom handler logic
-- `extensions/qa-lab/src/report.ts`
-  - still derives report structure from runtime outputs
-
-So the source-of-truth split is fixed, but execution is still mostly handler-backed rather than fully declarative.
-
-## What The Real Scenario Surface Looks Like
-
-Reading the current suite shows a few distinct scenario classes.
-
-### Simple interaction
-
-- channel baseline
-- DM baseline
-- threaded follow-up
-- model switch
-- approval followthrough
-- reaction/edit/delete
-
-### Config and runtime mutation
-
-- config patch skill disable
-- config apply restart wake-up
-- config restart capability flip
-- runtime inventory drift check
-
-### Filesystem and repo assertions
-
-- source/docs discovery report
-- build Lobster Invaders
-- generated image artifact lookup
-
-### Memory orchestration
-
-- memory recall
-- memory tools in channel context
-- memory failure fallback
-- session memory ranking
-- thread memory isolation
-- memory dreaming sweep
-
-### Tool and plugin integration
-
-- MCP plugin-tools call
-- skill visibility
-- skill hot install
-- native image generation
-- image roundtrip
-- image understanding from attachment
-
-### Multi-turn and multi-actor
-
-- subagent handoff
-- subagent fanout synthesis
-- restart recovery style flows
-
-These categories matter because they drive DSL requirements. A flat list of prompt + expected text is not enough.
-
-## Direction
-
-### Single source of truth
-
-Use `qa/scenarios/index.md` plus `qa/scenarios/<theme>/*.md` as the authored
-source of truth.
-
-The pack should stay:
-
-- human-readable in review
-- machine-parseable
-- rich enough to drive:
-  - suite execution
-  - QA workspace bootstrap
-  - QA Lab UI metadata
-  - docs/discovery prompts
-  - report generation
-
-### Preferred authoring format
-
-Use markdown as the top-level format, with structured YAML inside it.
-
-Recommended shape:
-
-- YAML frontmatter
-  - id
-  - title
-  - surface
-  - tags
-  - docs refs
-  - code refs
-  - model/provider overrides
-  - prerequisites
-- prose sections
-  - objective
-  - notes
-  - debugging hints
-- fenced YAML blocks
-  - setup
-  - steps
-  - assertions
-  - cleanup
-
-This gives:
-
-- better PR readability than giant JSON
-- richer context than pure YAML
-- strict parsing and zod validation
-
-Raw JSON is acceptable only as an intermediate generated form.
-
-## Proposed Scenario File Shape
-
-Example:
-
-````md
----
-id: image-generation-roundtrip
-title: Image generation roundtrip
-surface: image
-tags: [media, image, roundtrip]
-models:
-  primary: openai/gpt-5.4
-requires:
-  tools: [image_generate]
-  plugins: [openai, qa-channel]
-docsRefs:
-  - docs/help/testing.md
-  - docs/concepts/model-providers.md
-codeRefs:
-  - extensions/qa-lab/src/suite.ts
-  - src/gateway/chat-attachments.ts
----
-
-# Objective
-
-Verify generated media is reattached on the follow-up turn.
-
-# Setup
-
-```yaml scenario.setup
-- action: config.patch
-  patch:
-    agents:
-      defaults:
-        imageGenerationModel:
-          primary: openai/gpt-image-1
-- action: session.create
-  key: agent:qa:image-roundtrip
-```
-
-# Steps
-
-```yaml scenario.steps
-- action: agent.send
-  session: agent:qa:image-roundtrip
-  message: |
-    Image generation check: generate a QA lighthouse image and summarize it in one short sentence.
-- action: artifact.capture
-  kind: generated-image
-  promptSnippet: Image generation check
-  saveAs: lighthouseImage
-- action: agent.send
-  session: agent:qa:image-roundtrip
-  message: |
-    Roundtrip image inspection check: describe the generated lighthouse attachment in one short sentence.
-  attachments:
-    - fromArtifact: lighthouseImage
-```
-
-# Expect
-
-```yaml scenario.expect
-- assert: outbound.textIncludes
-  value: lighthouse
-- assert: requestLog.matches
-  where:
-    promptIncludes: Roundtrip image inspection check
-  imageInputCountGte: 1
-- assert: artifact.exists
-  ref: lighthouseImage
-```
-````
-
-## Runner Capabilities The DSL Must Cover
-
-Based on the current suite, the generic runner needs more than prompt execution.
-
-### Environment and setup actions
-
-- `bus.reset`
-- `gateway.waitHealthy`
-- `channel.waitReady`
-- `session.create`
-- `thread.create`
-- `workspace.writeSkill`
-
-### Agent turn actions
-
-- `agent.send`
-- `agent.wait`
-- `bus.injectInbound`
-- `bus.injectOutbound`
-
-### Config and runtime actions
-
-- `config.get`
-- `config.patch`
-- `config.apply`
-- `gateway.restart`
-- `tools.effective`
-- `skills.status`
-
-### File and artifact actions
-
-- `file.write`
-- `file.read`
-- `file.delete`
-- `file.touchTime`
-- `artifact.captureGeneratedImage`
-- `artifact.capturePath`
-
-### Memory and cron actions
-
-- `memory.indexForce`
-- `memory.searchCli`
-- `doctor.memory.status`
-- `cron.list`
-- `cron.run`
-- `cron.waitCompletion`
-- `sessionTranscript.write`
-
-### MCP actions
-
-- `mcp.callTool`
-
-### Assertions
-
-- `outbound.textIncludes`
-- `outbound.inThread`
-- `outbound.notInRoot`
-- `tool.called`
-- `tool.notPresent`
-- `skill.visible`
-- `skill.disabled`
-- `file.contains`
-- `memory.contains`
-- `requestLog.matches`
-- `sessionStore.matches`
-- `cron.managedPresent`
-- `artifact.exists`
-
-## Variables and Artifact References
-
-The DSL must support saved outputs and later references.
-
-Examples from the current suite:
-
-- create a thread, then reuse `threadId`
-- create a session, then reuse `sessionKey`
-- generate an image, then attach the file on the next turn
-- generate a wake marker string, then assert that it appears later
-
-Needed capabilities:
-
-- `saveAs`
-- `${vars.name}`
-- `${artifacts.name}`
-- typed references for paths, session keys, thread ids, markers, tool outputs
-
-Without variable support, the harness will keep leaking scenario logic back into TypeScript.
-
-## What Should Stay As Escape Hatches
-
-A fully pure declarative runner is not realistic in phase 1.
-
-Some scenarios are inherently orchestration-heavy:
-
-- memory dreaming sweep
-- config apply restart wake-up
-- config restart capability flip
-- generated image artifact resolution by timestamp/path
-- discovery-report evaluation
-
-These should use explicit custom handlers for now.
-
-Recommended rule:
-
-- 85-90% declarative
-- explicit `customHandler` steps for the hard remainder
-- named and documented custom handlers only
-- no anonymous inline code in the scenario file
-
-That keeps the generic engine clean while still allowing progress.
-
-## Architecture Change
-
-### Current
-
-Scenario markdown already is the source of truth for:
-
-- suite execution
-- workspace bootstrap files
-- QA Lab UI scenario catalog
-- report metadata
-- discovery prompts
-
-Generated compatibility:
-
-- seeded workspace still includes `QA_KICKOFF_TASK.md`
-- seeded workspace still includes `QA_SCENARIO_PLAN.md`
-- seeded workspace now also includes `QA_SCENARIOS.md`
-
-## Refactor Plan
-
-### Phase 1: loader and schema
-
-Done.
-
-- added `qa/scenarios/index.md`
-- split scenarios into `qa/scenarios/<theme>/*.md`
-- added parser for named markdown YAML pack content
-- validated with zod
-- switched consumers to the parsed pack
-- removed repo-level `qa/seed-scenarios.json` and `qa/QA_KICKOFF_TASK.md`
-
-### Phase 2: generic engine
-
-- split `extensions/qa-lab/src/suite.ts` into:
-  - loader
-  - engine
-  - action registry
-  - assertion registry
-  - custom handlers
-- keep existing helper functions as engine operations
-
-Deliverable:
-
-- engine executes simple declarative scenarios
-
-Start with scenarios that are mostly prompt + wait + assert:
-
-- threaded follow-up
-- image understanding from attachment
-- skill visibility and invocation
-- channel baseline
-
-Deliverable:
-
-- first real markdown-defined scenarios shipping through the generic engine
-
-### Phase 4: migrate medium scenarios
-
-- image generation roundtrip
-- memory tools in channel context
-- session memory ranking
-- subagent handoff
-- subagent fanout synthesis
-
-Deliverable:
-
-- variables, artifacts, tool assertions, request-log assertions proven out
-
-### Phase 5: keep hard scenarios on custom handlers
-
-- memory dreaming sweep
-- config apply restart wake-up
-- config restart capability flip
-- runtime inventory drift
-
-Deliverable:
-
-- same authoring format, but with explicit custom-step blocks where needed
-
-### Phase 6: delete hardcoded scenario map
-
-Once the pack coverage is good enough:
-
-- remove most scenario-specific TypeScript branching from `extensions/qa-lab/src/suite.ts`
-
-## Fake Slack / Rich Media Support
-
-The current QA bus is text-first.
-
-Relevant files:
-
-- `extensions/qa-channel/src/protocol.ts`
-- `extensions/qa-lab/src/bus-state.ts`
-- `extensions/qa-lab/src/bus-queries.ts`
-- `extensions/qa-lab/src/bus-server.ts`
-- `extensions/qa-lab/web/src/ui-render.ts`
-
-Today the QA bus supports:
-
-- text
-- reactions
-- threads
-
-It does not yet model inline media attachments.
-
-### Needed transport contract
-
-Add a generic QA bus attachment model:
-
-```ts
-type QaBusAttachment = {
-  id: string;
-  kind: "image" | "video" | "audio" | "file";
-  mimeType: string;
-  fileName?: string;
-  inline?: boolean;
-  url?: string;
-  contentBase64?: string;
-  width?: number;
-  height?: number;
-  durationMs?: number;
-  altText?: string;
-  transcript?: string;
-};
-```
-
-Then add `attachments?: QaBusAttachment[]` to:
-
-- `QaBusMessage`
-- `QaBusInboundMessageInput`
-- `QaBusOutboundMessageInput`
-
-### Why generic first
-
-Do not build a Slack-only media model.
-
-Instead:
-
-- one generic QA transport model
-- multiple renderers on top of it
-  - current QA Lab chat
-  - future fake Slack web
-  - any other fake transport views
-
-This prevents duplicate logic and lets media scenarios stay transport-agnostic.
-
-### UI work needed
-
-Update the QA UI to render:
-
-- inline image preview
-- inline audio player
-- inline video player
-- file attachment chip
-
-The current UI can already render threads and reactions, so attachment rendering should layer onto the same message card model.
-
-### Scenario work enabled by media transport
-
-Once attachments flow through QA bus, we can add richer fake-chat scenarios:
-
-- inline image reply in fake Slack
-- audio attachment understanding
-- video attachment understanding
-- mixed attachment ordering
-- thread reply with media retained
-
-## Recommendation
-
-The next implementation chunk should be:
-
-1. add markdown scenario loader + zod schema
-2. generate the current catalog from markdown
-3. migrate a few simple scenarios first
-4. add generic QA bus attachment support
-5. render inline image in the QA UI
-6. then expand to audio and video
-
-This is the smallest path that proves both goals:
-
-- generic markdown-defined QA
-- richer fake messaging surfaces
-
-## Open Questions
-
-- whether scenario files should allow embedded markdown prompt templates with variable interpolation
-- whether setup/cleanup should be named sections or just ordered action lists
-- whether artifact references should be strongly typed in schema or string-based
-- whether custom handlers should live in one registry or per-surface registries
-- whether the generated JSON compatibility file should remain checked in during migration
-
-## Related
-
-- [QA E2E automation](/concepts/qa-e2e-automation)