From e131eaecb502cbb8afd3008023db721df087fad5 Mon Sep 17 00:00:00 2001 From: Peter Steinberger Date: Fri, 1 May 2026 09:25:20 +0100 Subject: [PATCH] fix: force package update restart handoff --- CHANGELOG.md | 1 + docs/cli/update.md | 8 +- docs/gateway/protocol.md | 2 +- docs/install/updating.md | 7 ++ .../update-run-package-self-upgrade.md | 119 ++++++++++++++++++ scripts/openclaw-cross-os-release-checks.ts | 30 +---- src/gateway/server-methods/update.test.ts | 29 ++++- src/gateway/server-methods/update.ts | 4 +- src/infra/infra-runtime.test.ts | 79 ++++++++++++ src/infra/restart.ts | 34 ++++- .../openclaw-cross-os-release-checks.test.ts | 4 +- 11 files changed, 279 insertions(+), 38 deletions(-) create mode 100644 qa/scenarios/runtime/update-run-package-self-upgrade.md diff --git a/CHANGELOG.md b/CHANGELOG.md index a29b00f9eed..c117d6ca146 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -53,6 +53,7 @@ Docs: https://docs.openclaw.ai - Plugins/runtime-deps: recover interrupted bundled runtime-dependency installs whose package sentinels exist but generated materialization is incomplete, forcing npm/pnpm repair in Gateway startup, doctor, and lazy plugin loads instead of leaving channels crash-looping on missing packages. Fixes #75309; refs #75310, #75296, and #75304. Thanks @scottgl9. - Plugins/runtime-deps: treat no-main and export-map package sentinels without reachable entry files as incomplete, so Gateway startup, doctor, and lazy plugin loads repair interrupted bundled dependency installs instead of accepting package.json-only partial installs. Fixes #75309; refs #75183. Thanks @shakkernerd. - Plugins/runtime-deps: keep runtime inspection and channel maintenance commands from downloading bundled plugin dependencies, route explicit repairs through `openclaw plugins deps --repair`, and still allow Gateway/DO paths to repair missing deps before import. Refs #75069. Thanks @xiaohuaxi. 
+- Updates: force non-deferred update restarts after package-manager updates requested through the live Gateway control plane and fail release validation on post-swap stale chunk import crashes, so Telegram/Discord imports do not stay pointed at removed dist files. Fixes #75206. Thanks @xonaman and @faux123. - Agents/tool-result guard: use the resolved runtime context token budget for non-context-engine tool-result overflow checks, so long tool-heavy sessions no longer compact early when `contextTokens` is larger than native `contextWindow`. Fixes #74917. Thanks @kAIborg24. - Gateway/systemd: exit with sysexits 78 for supervised lock and `EADDRINUSE` conflicts so `RestartPreventExitStatus=78` stops `Restart=always` restart loops instead of repeatedly reloading plugins against an occupied port. Fixes #75115. Thanks @yhyatt. - Agents/runtime: skip blank visible user prompts at the embedded-runner boundary before provider submission while still allowing internal runtime-only turns and media-only prompts, so Telegram/group sessions no longer leak raw empty-input provider errors when replay history exists. Fixes #74137. Thanks @yelog, @Gracker, and @nhaener. diff --git a/docs/cli/update.md b/docs/cli/update.md index 0e29c06aac0..e3ed3764cf8 100644 --- a/docs/cli/update.md +++ b/docs/cli/update.md @@ -82,7 +82,11 @@ install method aligned: - `beta` → prefers npm dist-tag `beta`, but falls back to `latest` when beta is missing or older than the current stable release. -The Gateway core auto-updater (when enabled via config) reuses this same update path. +The Gateway core auto-updater (when enabled via config) launches the CLI update path +outside the live Gateway request handler. Control-plane `update.run` package-manager +updates force a non-deferred update restart after the package swap, because the old +Gateway process may still have in-memory chunks that point at files removed by the +new package. 
For package-manager installs, `openclaw update` resolves the target package version before invoking the package manager. npm global installs use a staged @@ -151,7 +155,7 @@ If an exact pinned npm plugin update resolves to an artifact whose integrity dif Post-update plugin sync failures fail the update result and stop restart follow-up work. Fix the plugin install or update error, then rerun `openclaw update`. -When the updated Gateway starts, enabled bundled plugin runtime dependencies are staged before plugin activation. Update-triggered restarts drain any active runtime-dependency staging before closing the Gateway, so service-manager restarts do not interrupt an in-flight npm install. +When the updated Gateway starts, enabled bundled plugin runtime dependencies are staged before plugin activation. Package-manager `update.run` restarts bypass the normal idle deferral after the package tree has been swapped, so the old process cannot keep lazy-loading removed chunks. Service-manager restarts still drain runtime-dependency staging before closing the Gateway. If pnpm bootstrap still fails, the updater stops early with a package-manager-specific error instead of trying `npm run build` inside the checkout. diff --git a/docs/gateway/protocol.md b/docs/gateway/protocol.md index e243f250941..35459dace87 100644 --- a/docs/gateway/protocol.md +++ b/docs/gateway/protocol.md @@ -378,7 +378,7 @@ enumeration of `src/gateway/server-methods/*.ts`. - `config.apply` validates + replaces the full config payload. - `config.schema` returns the live config schema payload used by Control UI and CLI tooling: schema, `uiHints`, version, and generation metadata, including plugin + channel schema metadata when the runtime can load it. 
The schema includes field `title` / `description` metadata derived from the same labels and help text used by the UI, including nested object, wildcard, array-item, and `anyOf` / `oneOf` / `allOf` composition branches when matching field documentation exists. - `config.schema.lookup` returns a path-scoped lookup payload for one config path: normalized path, a shallow schema node, matched hint + `hintPath`, and immediate child summaries for UI/CLI drill-down. Lookup schema nodes keep the user-facing docs and common validation fields (`title`, `description`, `type`, `enum`, `const`, `format`, `pattern`, numeric/string/array/object bounds, and flags like `additionalProperties`, `deprecated`, `readOnly`, `writeOnly`). Child summaries expose `key`, normalized `path`, `type`, `required`, `hasChildren`, plus the matched `hint` / `hintPath`. - - `update.run` runs the gateway update flow and schedules a restart only when the update itself succeeded. + - `update.run` runs the gateway update flow and schedules a restart only when the update itself succeeded. Package-manager updates force a non-deferred update restart after the package swap so the old Gateway process does not keep lazy-loading from a replaced `dist` tree. - `update.status` returns the latest cached update restart sentinel, including the post-restart running version when available. - `wizard.start`, `wizard.next`, `wizard.status`, and `wizard.cancel` expose the onboarding wizard over WS RPC. diff --git a/docs/install/updating.md b/docs/install/updating.md index e967c5e6224..bbc8d187848 100644 --- a/docs/install/updating.md +++ b/docs/install/updating.md @@ -168,6 +168,13 @@ The auto-updater is off by default. Enable it in `~/.openclaw/openclaw.json`: The gateway also logs an update hint on startup (disable with `update.checkOnStart: false`). For downgrade or incident recovery, set `OPENCLAW_NO_AUTO_UPDATE=1` in the gateway environment to block automatic applies even when `update.auto.enabled` is configured. 
Startup update hints can still run unless `update.checkOnStart` is also disabled. +Package-manager updates requested through the live Gateway control-plane handler +force a non-deferred update restart after the package swap. That avoids leaving +an old in-memory process around long enough to lazy-load chunks from a package +tree that has already been replaced. Shell `openclaw update` remains the +preferred path for supervised installs because it can stop and restart the +service around the update. + ## After updating diff --git a/qa/scenarios/runtime/update-run-package-self-upgrade.md b/qa/scenarios/runtime/update-run-package-self-upgrade.md new file mode 100644 index 00000000000..f04499840f2 --- /dev/null +++ b/qa/scenarios/runtime/update-run-package-self-upgrade.md @@ -0,0 +1,119 @@ +# Update run package self-upgrade + +```yaml qa-scenario +id: update-run-package-self-upgrade +title: Update run package self-upgrade +surface: runtime +coverage: + primary: + - runtime.update-run + secondary: + - runtime.gateway-restart + - runtime.package-update +objective: Verify an agent can self-update an installed OpenClaw package from 2026.4.26 to latest by using the gateway update.run action, then recover through the forced restart. +successCriteria: + - The agent is explicitly instructed to use the gateway tool action update.run instead of shell package-manager commands. + - The update request carries a restart note marker that can be observed after the gateway restart. + - Gateway and qa-channel return healthy after update.run restarts the process. +docsRefs: + - docs/cli/update.md + - docs/install/updating.md + - docs/gateway/protocol.md +codeRefs: + - src/agents/tools/gateway-tool.ts + - src/gateway/server-methods/update.ts + - src/infra/restart.ts +execution: + kind: flow + summary: "Opt-in destructive package-update lane: ask the agent to update a 2026.4.26 install to latest via gateway action update.run and verify the restart marker after recovery." 
+ config: + requiredProviderMode: live-frontier + sourceVersion: "2026.4.26" + targetTag: latest + allowEnv: OPENCLAW_QA_ALLOW_UPDATE_RUN_SELF + channelId: qa-room +``` + +```yaml qa-flow +steps: + - name: asks the agent to self-update through update.run + actions: + - if: + expr: "env.gateway.runtimeEnv[config.allowEnv] !== '1'" + then: + - assert: "true" + else: + - call: waitForGatewayHealthy + args: + - ref: env + - 60000 + - call: waitForQaChannelReady + args: + - ref: env + - 60000 + - call: reset + - set: sessionKey + value: + expr: "buildAgentSessionKey({ agentId: 'qa', channel: 'qa-channel', peer: { kind: 'channel', id: config.channelId } })" + - call: createSession + args: + - ref: env + - Update run package self-upgrade + - ref: sessionKey + - call: readEffectiveTools + saveAs: tools + args: + - ref: env + - ref: sessionKey + - assert: + expr: "tools.has('gateway')" + message: gateway tool not present for update.run self-upgrade scenario + - set: startIndex + value: + expr: state.getSnapshot().messages.length + - set: marker + value: + expr: "`QA-UPDATE-RUN-${randomUUID().slice(0, 8)}`" + - call: startAgentRun + saveAs: started + args: + - ref: env + - sessionKey: + ref: sessionKey + to: + expr: "`channel:${config.channelId}`" + message: + expr: |- + `Update-run self-upgrade QA check. The OpenClaw package under test was installed from openclaw@${config.sourceVersion} and must update itself to openclaw@${config.targetTag}. Use the gateway tool with action=update.run. Do not run npm, pnpm, bun, git pull, or shell package-manager commands yourself. 
Set note exactly to "${marker} update.run complete" and restartDelayMs to 0 so the post-restart channel message proves recovery.` + timeoutMs: + expr: liveTurnTimeoutMs(env, 180000) + - call: waitForGatewayHealthy + args: + - ref: env + - 180000 + - call: waitForQaChannelReady + args: + - ref: env + - 180000 + - call: waitForOutboundMessage + saveAs: outbound + args: + - ref: state + - lambda: + params: [candidate] + expr: "candidate.text.includes(marker)" + - expr: liveTurnTimeoutMs(env, 180000) + - sinceIndex: + ref: startIndex + - call: env.gateway.call + saveAs: updateStatus + args: + - update.status + - {} + - timeoutMs: 30000 + - assert: + expr: "Boolean(updateStatus?.sentinel)" + message: + expr: "`update.status did not report a restart sentinel after update.run: ${JSON.stringify(updateStatus)}`" + detailsExpr: "env.gateway.runtimeEnv[config.allowEnv] !== '1' ? `skipped destructive package self-update; set ${config.allowEnv}=1 to run` : `runId=${started.runId} marker=${marker} outbound=${outbound.text}`" +``` diff --git a/scripts/openclaw-cross-os-release-checks.ts b/scripts/openclaw-cross-os-release-checks.ts index 9ca3773162e..f9c10ba652c 100644 --- a/scripts/openclaw-cross-os-release-checks.ts +++ b/scripts/openclaw-cross-os-release-checks.ts @@ -1256,30 +1256,11 @@ export function buildRealUpdateEnv(env) { return updateEnv; } -export function verifyPackagedUpgradeUpdateResult(result, options) { +export function verifyPackagedUpgradeUpdateResult(result, _options) { if (result.exitCode === 0) { return; } - let payload = null; - try { - payload = JSON.parse(result.stdout); - } catch { - payload = null; - } - - const steps = Array.isArray(payload?.steps) ? payload.steps : []; - const allStepsSucceeded = steps.every((step) => step?.exitCode === 0); - const afterVersion = typeof payload?.after?.version === "string" ? 
payload.after.version : ""; - if ( - payload?.status === "ok" && - afterVersion === options.candidateVersion && - allStepsSucceeded && - isSelfSwappedPackageProcessExit(result.stderr) - ) { - return; - } - throw new Error( `Packaged upgrade failed (${result.exitCode}): ${trimForSummary( `${result.stdout}\n${result.stderr}`, @@ -1287,15 +1268,6 @@ export function verifyPackagedUpgradeUpdateResult(result, options) { ); } -function isSelfSwappedPackageProcessExit(stderr) { - return ( - typeof stderr === "string" && - stderr.includes("[openclaw] Failed to start CLI:") && - stderr.includes("ERR_MODULE_NOT_FOUND") && - /[\\/]node_modules[\\/]openclaw[\\/]dist[\\/]/u.test(stderr) - ); -} - export function resolveExplicitBaselineVersion(baselineSpec) { const trimmed = baselineSpec.trim(); if (!trimmed || trimmed === "openclaw@latest") { diff --git a/src/gateway/server-methods/update.test.ts b/src/gateway/server-methods/update.test.ts index eacac005bdb..3dd237ffae8 100644 --- a/src/gateway/server-methods/update.test.ts +++ b/src/gateway/server-methods/update.test.ts @@ -276,7 +276,34 @@ describe("update.run restart scheduling", () => { ); }); - it("blocks unmanaged global installs before package mutation when restart is unavailable", async () => { + it("forces an immediate restart after successful package-manager updates", async () => { + resolveUpdateInstallSurfaceMock.mockResolvedValueOnce({ + kind: "global", + mode: "npm", + root: "/tmp/openclaw-global", + packageRoot: "/tmp/openclaw-global", + }); + + let payload: + | { ok: boolean; result?: { status?: string; reason?: string; mode?: string } } + | undefined; + + await invokeUpdateRun({}, (_ok: boolean, response: unknown) => { + payload = response as typeof payload; + }); + + expect(runGatewayUpdateMock).toHaveBeenCalledTimes(1); + expect(scheduleGatewaySigusr1RestartMock).toHaveBeenCalledWith( + expect.objectContaining({ + delayMs: 0, + reason: "update.run", + skipDeferral: true, + }), + ); + 
expect(payload?.ok).toBe(true); + }); + + it("blocks global package installs when the gateway cannot restart afterward", async () => { isRestartEnabledMock.mockReturnValue(false); detectRespawnSupervisorMock.mockReturnValue(null); resolveUpdateInstallSurfaceMock.mockResolvedValueOnce({ diff --git a/src/gateway/server-methods/update.ts b/src/gateway/server-methods/update.ts index 7c134206f94..dd1cc717320 100644 --- a/src/gateway/server-methods/update.ts +++ b/src/gateway/server-methods/update.ts @@ -140,11 +140,13 @@ export const updateHandlers: GatewayRequestHandlers = { // Only restart the gateway when the update actually succeeded. // Restarting after a failed update leaves the process in a broken state // (corrupted node_modules, partial builds) and causes a crash loop. + const updateWasPackageSwap = result.status === "ok" && result.mode !== "git"; const restart = result.status === "ok" ? scheduleGatewaySigusr1Restart({ - delayMs: restartDelayMs, + delayMs: updateWasPackageSwap ? 0 : restartDelayMs, reason: "update.run", + skipDeferral: updateWasPackageSwap, audit: { actor: actor.actor, deviceId: actor.deviceId, diff --git a/src/infra/infra-runtime.test.ts b/src/infra/infra-runtime.test.ts index 3b508192b2a..c5c67f4d4a1 100644 --- a/src/infra/infra-runtime.test.ts +++ b/src/infra/infra-runtime.test.ts @@ -483,6 +483,85 @@ describe("infra runtime", () => { } }); + it("bypasses the pre-restart deferral check when requested", async () => { + const emitSpy = vi.spyOn(process, "emit"); + const pendingCheck = vi.fn(() => 5); + const handler = () => {}; + process.on("SIGUSR1", handler); + try { + setPreRestartDeferralCheck(pendingCheck); + scheduleGatewaySigusr1Restart({ + delayMs: 0, + reason: "update.run", + skipDeferral: true, + }); + + await vi.advanceTimersByTimeAsync(0); + + expect(pendingCheck).not.toHaveBeenCalled(); + expect(emitSpy).toHaveBeenCalledWith("SIGUSR1"); + expect(peekGatewaySigusr1RestartReason()).toBe("update.run"); + } finally { + 
process.removeListener("SIGUSR1", handler); + } + }); + + it("upgrades an already scheduled restart to bypass deferral", async () => { + const emitSpy = vi.spyOn(process, "emit"); + const pendingCheck = vi.fn(() => 5); + const handler = () => {}; + process.on("SIGUSR1", handler); + try { + setPreRestartDeferralCheck(pendingCheck); + scheduleGatewaySigusr1Restart({ delayMs: 1_000, reason: "config.patch" }); + const forced = scheduleGatewaySigusr1Restart({ + delayMs: 1_000, + reason: "update.run", + skipDeferral: true, + }); + + expect(forced.coalesced).toBe(false); + + await vi.advanceTimersByTimeAsync(1_000); + + expect(pendingCheck).not.toHaveBeenCalled(); + expect(emitSpy).toHaveBeenCalledWith("SIGUSR1"); + expect(peekGatewaySigusr1RestartReason()).toBe("update.run"); + } finally { + process.removeListener("SIGUSR1", handler); + } + }); + + it("bypasses an active restart deferral when a forced restart arrives", async () => { + const emitSpy = vi.spyOn(process, "emit"); + const staleBeforeEmit = vi.fn(async () => {}); + const handler = () => {}; + process.on("SIGUSR1", handler); + try { + setPreRestartDeferralCheck(() => 5); + scheduleGatewaySigusr1Restart({ + delayMs: 0, + reason: "config.patch", + emitHooks: { beforeEmit: staleBeforeEmit }, + }); + await vi.advanceTimersByTimeAsync(0); + expect(emitSpy).not.toHaveBeenCalledWith("SIGUSR1"); + + const forced = scheduleGatewaySigusr1Restart({ + delayMs: 0, + reason: "update.run", + skipDeferral: true, + }); + + expect(forced.coalesced).toBe(false); + expect(emitSpy).toHaveBeenCalledWith("SIGUSR1"); + expect(staleBeforeEmit).not.toHaveBeenCalled(); + expect(peekGatewaySigusr1RestartReason()).toBe("update.run"); + } finally { + process.removeListener("SIGUSR1", handler); + } + }); + it("emits SIGUSR1 after the default deferral timeout while work is still pending", async () => { const emitSpy = vi.spyOn(process, "emit"); const handler = () => {}; diff --git a/src/infra/restart.ts b/src/infra/restart.ts index 
56b6a906f66..953fcbc3503 100644 --- a/src/infra/restart.ts +++ b/src/infra/restart.ts @@ -44,6 +44,7 @@ let pendingRestartTimer: ReturnType<typeof setTimeout> | null = null; let pendingRestartDueAt = 0; let pendingRestartReason: string | undefined; let pendingRestartEmitHooks: RestartEmitHooks | undefined; +let pendingRestartSkipDeferral = false; let pendingRestartPreparing = false; const activeDeferralPolls = new Set>(); @@ -63,6 +64,7 @@ function clearPendingScheduledRestart(): void { pendingRestartDueAt = 0; pendingRestartReason = undefined; pendingRestartEmitHooks = undefined; + pendingRestartSkipDeferral = false; pendingRestartPreparing = false; } @@ -658,6 +660,7 @@ export function scheduleGatewaySigusr1Restart(opts?: { reason?: string; audit?: RestartAuditInfo; emitHooks?: RestartEmitHooks; + skipDeferral?: boolean; }): ScheduledRestart { const delayMsRaw = typeof opts?.delayMs === "number" && Number.isFinite(opts.delayMs) @@ -673,6 +676,7 @@ export function scheduleGatewaySigusr1Restart(opts?: { const nowMs = Date.now(); const cooldownMsApplied = Math.max(0, lastRestartEmittedAt + RESTART_COOLDOWN_MS - nowMs); const requestedDueAt = nowMs + delayMs + cooldownMsApplied; + const skipDeferral = opts?.skipDeferral === true; if (hasUnconsumedRestartSignal()) { if (shouldPreferRestartReason(reason, emittedRestartReason)) { @@ -695,7 +699,29 @@ export function scheduleGatewaySigusr1Restart(opts?: { if (pendingRestartTimer || pendingRestartPreparing) { const remainingMs = pendingRestartPreparing ? 0 : Math.max(0, pendingRestartDueAt - nowMs); - const shouldPullEarlier = !pendingRestartPreparing && requestedDueAt < pendingRestartDueAt; + if (pendingRestartPreparing && skipDeferral && activeDeferralPolls.size > 0) { + restartLog.warn( + `restart request bypassed active deferral reason=${reason ?? "unspecified"} pendingReason=${pendingRestartReason ?? 
"unspecified"} ${formatRestartAudit(opts?.audit)}`, + ); + clearActiveDeferralPolls(); + pendingRestartReason = reason; + pendingRestartEmitHooks = opts?.emitHooks; + void emitPreparedGatewayRestart(undefined, reason); + return { + ok: true, + pid: process.pid, + signal: "SIGUSR1", + delayMs: 0, + reason, + mode, + coalesced: false, + cooldownMsApplied, + }; + } + const shouldUpgradeToSkipDeferral = skipDeferral && !pendingRestartSkipDeferral; + const shouldPullEarlier = + !pendingRestartPreparing && + (requestedDueAt < pendingRestartDueAt || shouldUpgradeToSkipDeferral); if (shouldPullEarlier) { restartLog.warn( `restart request rescheduled earlier reason=${reason ?? "unspecified"} pendingReason=${pendingRestartReason ?? "unspecified"} oldDelayMs=${remainingMs} newDelayMs=${Math.max(0, requestedDueAt - nowMs)} ${formatRestartAudit(opts?.audit)}`, @@ -705,6 +731,7 @@ export function scheduleGatewaySigusr1Restart(opts?: { if (shouldPreferRestartReason(reason, pendingRestartReason)) { pendingRestartReason = reason; } + pendingRestartSkipDeferral = pendingRestartSkipDeferral || skipDeferral; restartLog.warn( `restart request coalesced (already scheduled) reason=${reason ?? "unspecified"} pendingReason=${pendingRestartReason ?? 
"unspecified"} delayMs=${remainingMs} ${formatRestartAudit(opts?.audit)}`, ); @@ -725,15 +752,18 @@ export function scheduleGatewaySigusr1Restart(opts?: { pendingRestartDueAt = requestedDueAt; pendingRestartReason = reason; pendingRestartEmitHooks = opts?.emitHooks; + pendingRestartSkipDeferral = skipDeferral; pendingRestartTimer = setTimeout( () => { const scheduledReason = pendingRestartReason; + const scheduledSkipDeferral = pendingRestartSkipDeferral; pendingRestartTimer = null; pendingRestartDueAt = 0; pendingRestartReason = undefined; + pendingRestartSkipDeferral = false; pendingRestartPreparing = true; const pendingCheck = preRestartCheck; - if (!pendingCheck) { + if (scheduledSkipDeferral || !pendingCheck) { void emitPreparedGatewayRestart(undefined, scheduledReason); return; } diff --git a/test/scripts/openclaw-cross-os-release-checks.test.ts b/test/scripts/openclaw-cross-os-release-checks.test.ts index 8c6f214143b..a38f6b71150 100644 --- a/test/scripts/openclaw-cross-os-release-checks.test.ts +++ b/test/scripts/openclaw-cross-os-release-checks.test.ts @@ -566,7 +566,7 @@ describe("scripts/openclaw-cross-os-release-checks", () => { }); }); - it("accepts a successful packaged update followed by the old self-swapped process import miss", () => { + it("rejects a successful packaged update followed by an old self-swapped process import miss", () => { expect(() => verifyPackagedUpgradeUpdateResult( { @@ -581,7 +581,7 @@ describe("scripts/openclaw-cross-os-release-checks", () => { }, { candidateVersion: "2026.4.27" }, ), - ).not.toThrow(); + ).toThrow(/Packaged upgrade failed/u); }); it("rejects packaged update failures before the candidate package lands", () => {