fix: force package update restart handoff

2026-05-06 05:40:44 +00:00 · 2026-05-01 09:25:20 +01:00
parent 6efb44944c
commit e131eaecb5
11 changed files with 279 additions and 38 deletions
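
The core decision this commit adds to the `update.run` handler can be sketched in isolation. This is a minimal sketch, not the handler itself: `planUpdateRestart`, `shouldEmitNow`, `UpdateResult`, and `RestartPlan` are hypothetical names introduced for illustration. Only the `status === "ok"` and `mode !== "git"` checks and the `delayMs` / `reason` / `skipDeferral` fields come from the diff below. The pending-work count is a simplified stand-in for the pre-restart deferral check.

```typescript
// Hypothetical types: only the fields used by the diff are modeled here.
interface UpdateResult {
  status: string; // "ok" on success, as checked by the handler
  mode: string; // "git" for checkout updates; anything else is treated as a package swap
}

interface RestartPlan {
  delayMs: number;
  reason: string;
  skipDeferral: boolean;
}

// Returns null when no restart should be scheduled (the update did not succeed).
function planUpdateRestart(
  result: UpdateResult,
  requestedDelayMs: number,
): RestartPlan | null {
  if (result.status !== "ok") {
    return null;
  }
  // A successful non-git update replaced the package tree on disk, so the
  // caller-supplied delay is ignored and the idle deferral is skipped.
  const updateWasPackageSwap = result.mode !== "git";
  return {
    delayMs: updateWasPackageSwap ? 0 : requestedDelayMs,
    reason: "update.run",
    skipDeferral: updateWasPackageSwap,
  };
}

// Assumed simplification of the deferral gate: a forced plan emits the restart
// signal even while work is pending; a normal plan waits for pending work to drain.
function shouldEmitNow(plan: RestartPlan, pendingWork: number): boolean {
  return plan.skipDeferral || pendingWork === 0;
}
```

Under these assumptions, an npm-mode success yields `delayMs: 0` with `skipDeferral: true` regardless of the requested delay, while a git-mode success keeps the requested delay and still honors deferral.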
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -53,6 +53,7 @@ Docs: https://docs.openclaw.ai
 - Plugins/runtime-deps: recover interrupted bundled runtime-dependency installs whose package sentinels exist but generated materialization is incomplete, forcing npm/pnpm repair in Gateway startup, doctor, and lazy plugin loads instead of leaving channels crash-looping on missing packages. Fixes #75309; refs #75310, #75296, and #75304. Thanks @scottgl9.
 - Plugins/runtime-deps: treat no-main and export-map package sentinels without reachable entry files as incomplete, so Gateway startup, doctor, and lazy plugin loads repair interrupted bundled dependency installs instead of accepting package.json-only partial installs. Fixes #75309; refs #75183. Thanks @shakkernerd.
 - Plugins/runtime-deps: keep runtime inspection and channel maintenance commands from downloading bundled plugin dependencies, route explicit repairs through `openclaw plugins deps --repair`, and still allow Gateway/DO paths to repair missing deps before import. Refs #75069. Thanks @xiaohuaxi.
+- Updates: force non-deferred update restarts after package-manager updates requested through the live Gateway control plane and fail release validation on post-swap stale chunk import crashes, so Telegram/Discord imports do not stay pointed at removed dist files. Fixes #75206. Thanks @xonaman and @faux123.
 - Agents/tool-result guard: use the resolved runtime context token budget for non-context-engine tool-result overflow checks, so long tool-heavy sessions no longer compact early when `contextTokens` is larger than native `contextWindow`. Fixes #74917. Thanks @kAIborg24.
 - Gateway/systemd: exit with sysexits 78 for supervised lock and `EADDRINUSE` conflicts so `RestartPreventExitStatus=78` stops `Restart=always` restart loops instead of repeatedly reloading plugins against an occupied port. Fixes #75115. Thanks @yhyatt.
 - Agents/runtime: skip blank visible user prompts at the embedded-runner boundary before provider submission while still allowing internal runtime-only turns and media-only prompts, so Telegram/group sessions no longer leak raw empty-input provider errors when replay history exists. Fixes #74137. Thanks @yelog, @Gracker, and @nhaener.
--- a/docs/cli/update.md
+++ b/docs/cli/update.md
@@ -82,7 +82,11 @@ install method aligned:
 - `beta` → prefers npm dist-tag `beta`, but falls back to `latest` when beta is
  missing or older than the current stable release.

-The Gateway core auto-updater (when enabled via config) reuses this same update path.
+The Gateway core auto-updater (when enabled via config) launches the CLI update path
+outside the live Gateway request handler. Control-plane `update.run` package-manager
+updates force a non-deferred update restart after the package swap, because the old
+Gateway process may still have in-memory chunks that point at files removed by the
+new package.

 For package-manager installs, `openclaw update` resolves the target package
 version before invoking the package manager. npm global installs use a staged
@@ -151,7 +155,7 @@ If an exact pinned npm plugin update resolves to an artifact whose integrity dif
 <Note>
 Post-update plugin sync failures fail the update result and stop restart follow-up work. Fix the plugin install or update error, then rerun `openclaw update`.

-When the updated Gateway starts, enabled bundled plugin runtime dependencies are staged before plugin activation. Update-triggered restarts drain any active runtime-dependency staging before closing the Gateway, so service-manager restarts do not interrupt an in-flight npm install.
+When the updated Gateway starts, enabled bundled plugin runtime dependencies are staged before plugin activation. Package-manager `update.run` restarts bypass the normal idle deferral after the package tree has been swapped, so the old process cannot keep lazy-loading removed chunks. Service-manager restarts still drain runtime-dependency staging before closing the Gateway.

 If pnpm bootstrap still fails, the updater stops early with a package-manager-specific error instead of trying `npm run build` inside the checkout.
 </Note>
--- a/docs/gateway/protocol.md
+++ b/docs/gateway/protocol.md
@@ -378,7 +378,7 @@ enumeration of `src/gateway/server-methods/*.ts`.
    - `config.apply` validates + replaces the full config payload.
    - `config.schema` returns the live config schema payload used by Control UI and CLI tooling: schema, `uiHints`, version, and generation metadata, including plugin + channel schema metadata when the runtime can load it. The schema includes field `title` / `description` metadata derived from the same labels and help text used by the UI, including nested object, wildcard, array-item, and `anyOf` / `oneOf` / `allOf` composition branches when matching field documentation exists.
    - `config.schema.lookup` returns a path-scoped lookup payload for one config path: normalized path, a shallow schema node, matched hint + `hintPath`, and immediate child summaries for UI/CLI drill-down. Lookup schema nodes keep the user-facing docs and common validation fields (`title`, `description`, `type`, `enum`, `const`, `format`, `pattern`, numeric/string/array/object bounds, and flags like `additionalProperties`, `deprecated`, `readOnly`, `writeOnly`). Child summaries expose `key`, normalized `path`, `type`, `required`, `hasChildren`, plus the matched `hint` / `hintPath`.
-    - `update.run` runs the gateway update flow and schedules a restart only when the update itself succeeded.
+    - `update.run` runs the gateway update flow and schedules a restart only when the update itself succeeded. Package-manager updates force a non-deferred update restart after the package swap so the old Gateway process does not keep lazy-loading from a replaced `dist` tree.
    - `update.status` returns the latest cached update restart sentinel, including the post-restart running version when available.
    - `wizard.start`, `wizard.next`, `wizard.status`, and `wizard.cancel` expose the onboarding wizard over WS RPC.

--- a/docs/install/updating.md
+++ b/docs/install/updating.md
@@ -168,6 +168,13 @@ The auto-updater is off by default. Enable it in `~/.openclaw/openclaw.json`:
 The gateway also logs an update hint on startup (disable with `update.checkOnStart: false`).
 For downgrade or incident recovery, set `OPENCLAW_NO_AUTO_UPDATE=1` in the gateway environment to block automatic applies even when `update.auto.enabled` is configured. Startup update hints can still run unless `update.checkOnStart` is also disabled.

+Package-manager updates requested through the live Gateway control-plane handler
+force a non-deferred update restart after the package swap. That keeps the old
+process from running long enough to lazy-load chunks from a package tree that
+has already been replaced. Shell `openclaw update` remains the
+preferred path for supervised installs because it can stop and restart the
+service around the update.
+
 ## After updating

 <Steps>
--- a/qa/scenarios/runtime/update-run-package-self-upgrade.md
+++ b/qa/scenarios/runtime/update-run-package-self-upgrade.md
@@ -0,0 +1,119 @@
+# Update run package self-upgrade
+
+```yaml qa-scenario
+id: update-run-package-self-upgrade
+title: Update run package self-upgrade
+surface: runtime
+coverage:
+  primary:
+    - runtime.update-run
+  secondary:
+    - runtime.gateway-restart
+    - runtime.package-update
+objective: Verify an agent can self-update an installed OpenClaw package from 2026.4.26 to latest by using the gateway update.run action, then recover through the forced restart.
+successCriteria:
+  - The agent is explicitly instructed to use the gateway tool action update.run instead of shell package-manager commands.
+  - The update request carries a restart note marker that can be observed after the gateway restart.
+  - Gateway and qa-channel return healthy after update.run restarts the process.
+docsRefs:
+  - docs/cli/update.md
+  - docs/install/updating.md
+  - docs/gateway/protocol.md
+codeRefs:
+  - src/agents/tools/gateway-tool.ts
+  - src/gateway/server-methods/update.ts
+  - src/infra/restart.ts
+execution:
+  kind: flow
+  summary: "Opt-in destructive package-update lane: ask the agent to update a 2026.4.26 install to latest via gateway action update.run and verify the restart marker after recovery."
+  config:
+    requiredProviderMode: live-frontier
+    sourceVersion: "2026.4.26"
+    targetTag: latest
+    allowEnv: OPENCLAW_QA_ALLOW_UPDATE_RUN_SELF
+    channelId: qa-room
+```
+
+```yaml qa-flow
+steps:
+  - name: asks the agent to self-update through update.run
+    actions:
+      - if:
+          expr: "env.gateway.runtimeEnv[config.allowEnv] !== '1'"
+          then:
+            - assert: "true"
+          else:
+            - call: waitForGatewayHealthy
+              args:
+                - ref: env
+                - 60000
+            - call: waitForQaChannelReady
+              args:
+                - ref: env
+                - 60000
+            - call: reset
+            - set: sessionKey
+              value:
+                expr: "buildAgentSessionKey({ agentId: 'qa', channel: 'qa-channel', peer: { kind: 'channel', id: config.channelId } })"
+            - call: createSession
+              args:
+                - ref: env
+                - Update run package self-upgrade
+                - ref: sessionKey
+            - call: readEffectiveTools
+              saveAs: tools
+              args:
+                - ref: env
+                - ref: sessionKey
+            - assert:
+                expr: "tools.has('gateway')"
+                message: gateway tool not present for update.run self-upgrade scenario
+            - set: startIndex
+              value:
+                expr: state.getSnapshot().messages.length
+            - set: marker
+              value:
+                expr: "`QA-UPDATE-RUN-${randomUUID().slice(0, 8)}`"
+            - call: startAgentRun
+              saveAs: started
+              args:
+                - ref: env
+                - sessionKey:
+                    ref: sessionKey
+                  to:
+                    expr: "`channel:${config.channelId}`"
+                  message:
+                    expr: |-
+                      `Update-run self-upgrade QA check. The OpenClaw package under test was installed from openclaw@${config.sourceVersion} and must update itself to openclaw@${config.targetTag}. Use the gateway tool with action=update.run. Do not run npm, pnpm, bun, git pull, or shell package-manager commands yourself. Set note exactly to "${marker} update.run complete" and restartDelayMs to 0 so the post-restart channel message proves recovery.`
+                  timeoutMs:
+                    expr: liveTurnTimeoutMs(env, 180000)
+            - call: waitForGatewayHealthy
+              args:
+                - ref: env
+                - 180000
+            - call: waitForQaChannelReady
+              args:
+                - ref: env
+                - 180000
+            - call: waitForOutboundMessage
+              saveAs: outbound
+              args:
+                - ref: state
+                - lambda:
+                    params: [candidate]
+                    expr: "candidate.text.includes(marker)"
+                - expr: liveTurnTimeoutMs(env, 180000)
+                - sinceIndex:
+                    ref: startIndex
+            - call: env.gateway.call
+              saveAs: updateStatus
+              args:
+                - update.status
+                - {}
+                - timeoutMs: 30000
+            - assert:
+                expr: "Boolean(updateStatus?.sentinel)"
+                message:
+                  expr: "`update.status did not report a restart sentinel after update.run: ${JSON.stringify(updateStatus)}`"
+    detailsExpr: "env.gateway.runtimeEnv[config.allowEnv] !== '1' ? `skipped destructive package self-update; set ${config.allowEnv}=1 to run` : `runId=${started.runId} marker=${marker} outbound=${outbound.text}`"
+```
--- a/scripts/openclaw-cross-os-release-checks.ts
+++ b/scripts/openclaw-cross-os-release-checks.ts
@@ -1256,30 +1256,11 @@ export function buildRealUpdateEnv(env) {
  return updateEnv;
 }

-export function verifyPackagedUpgradeUpdateResult(result, options) {
+export function verifyPackagedUpgradeUpdateResult(result, _options) {
  if (result.exitCode === 0) {
    return;
  }

-  let payload = null;
-  try {
-    payload = JSON.parse(result.stdout);
-  } catch {
-    payload = null;
-  }
-
-  const steps = Array.isArray(payload?.steps) ? payload.steps : [];
-  const allStepsSucceeded = steps.every((step) => step?.exitCode === 0);
-  const afterVersion = typeof payload?.after?.version === "string" ? payload.after.version : "";
-  if (
-    payload?.status === "ok" &&
-    afterVersion === options.candidateVersion &&
-    allStepsSucceeded &&
-    isSelfSwappedPackageProcessExit(result.stderr)
-  ) {
-    return;
-  }
-
  throw new Error(
    `Packaged upgrade failed (${result.exitCode}): ${trimForSummary(
      `${result.stdout}\n${result.stderr}`,
@@ -1287,15 +1268,6 @@ export function verifyPackagedUpgradeUpdateResult(result, options) {
  );
 }

-function isSelfSwappedPackageProcessExit(stderr) {
-  return (
-    typeof stderr === "string" &&
-    stderr.includes("[openclaw] Failed to start CLI:") &&
-    stderr.includes("ERR_MODULE_NOT_FOUND") &&
-    /[\\/]node_modules[\\/]openclaw[\\/]dist[\\/]/u.test(stderr)
-  );
-}
-
 export function resolveExplicitBaselineVersion(baselineSpec) {
  const trimmed = baselineSpec.trim();
  if (!trimmed || trimmed === "openclaw@latest") {
--- a/src/gateway/server-methods/update.test.ts
+++ b/src/gateway/server-methods/update.test.ts
@@ -276,7 +276,34 @@ describe("update.run restart scheduling", () => {
    );
  });

-  it("blocks unmanaged global installs before package mutation when restart is unavailable", async () => {
+  it("forces an immediate restart after successful package-manager updates", async () => {
+    resolveUpdateInstallSurfaceMock.mockResolvedValueOnce({
+      kind: "global",
+      mode: "npm",
+      root: "/tmp/openclaw-global",
+      packageRoot: "/tmp/openclaw-global",
+    });
+
+    let payload:
+      | { ok: boolean; result?: { status?: string; reason?: string; mode?: string } }
+      | undefined;
+
+    await invokeUpdateRun({}, (_ok: boolean, response: unknown) => {
+      payload = response as typeof payload;
+    });
+
+    expect(runGatewayUpdateMock).toHaveBeenCalledTimes(1);
+    expect(scheduleGatewaySigusr1RestartMock).toHaveBeenCalledWith(
+      expect.objectContaining({
+        delayMs: 0,
+        reason: "update.run",
+        skipDeferral: true,
+      }),
+    );
+    expect(payload?.ok).toBe(true);
+  });
+
+  it("blocks global package installs when the gateway cannot restart afterward", async () => {
    isRestartEnabledMock.mockReturnValue(false);
    detectRespawnSupervisorMock.mockReturnValue(null);
    resolveUpdateInstallSurfaceMock.mockResolvedValueOnce({
--- a/src/gateway/server-methods/update.ts
+++ b/src/gateway/server-methods/update.ts
@@ -140,11 +140,13 @@ export const updateHandlers: GatewayRequestHandlers = {
    // Only restart the gateway when the update actually succeeded.
    // Restarting after a failed update leaves the process in a broken state
    // (corrupted node_modules, partial builds) and causes a crash loop.
+    const updateWasPackageSwap = result.status === "ok" && result.mode !== "git";
    const restart =
      result.status === "ok"
        ? scheduleGatewaySigusr1Restart({
-            delayMs: restartDelayMs,
+            delayMs: updateWasPackageSwap ? 0 : restartDelayMs,
            reason: "update.run",
+            skipDeferral: updateWasPackageSwap,
            audit: {
              actor: actor.actor,
              deviceId: actor.deviceId,
--- a/src/infra/infra-runtime.test.ts
+++ b/src/infra/infra-runtime.test.ts
@@ -483,6 +483,85 @@ describe("infra runtime", () => {
      }
    });

+    it("bypasses the pre-restart deferral check when requested", async () => {
+      const emitSpy = vi.spyOn(process, "emit");
+      const pendingCheck = vi.fn(() => 5);
+      const handler = () => {};
+      process.on("SIGUSR1", handler);
+      try {
+        setPreRestartDeferralCheck(pendingCheck);
+        scheduleGatewaySigusr1Restart({
+          delayMs: 0,
+          reason: "update.run",
+          skipDeferral: true,
+        });
+
+        await vi.advanceTimersByTimeAsync(0);
+
+        expect(pendingCheck).not.toHaveBeenCalled();
+        expect(emitSpy).toHaveBeenCalledWith("SIGUSR1");
+        expect(peekGatewaySigusr1RestartReason()).toBe("update.run");
+      } finally {
+        process.removeListener("SIGUSR1", handler);
+      }
+    });
+
+    it("upgrades an already scheduled restart to bypass deferral", async () => {
+      const emitSpy = vi.spyOn(process, "emit");
+      const pendingCheck = vi.fn(() => 5);
+      const handler = () => {};
+      process.on("SIGUSR1", handler);
+      try {
+        setPreRestartDeferralCheck(pendingCheck);
+        scheduleGatewaySigusr1Restart({ delayMs: 1_000, reason: "config.patch" });
+        const forced = scheduleGatewaySigusr1Restart({
+          delayMs: 1_000,
+          reason: "update.run",
+          skipDeferral: true,
+        });
+
+        expect(forced.coalesced).toBe(false);
+
+        await vi.advanceTimersByTimeAsync(1_000);
+
+        expect(pendingCheck).not.toHaveBeenCalled();
+        expect(emitSpy).toHaveBeenCalledWith("SIGUSR1");
+        expect(peekGatewaySigusr1RestartReason()).toBe("update.run");
+      } finally {
+        process.removeListener("SIGUSR1", handler);
+      }
+    });
+
+    it("bypasses an active restart deferral when a forced restart arrives", async () => {
+      const emitSpy = vi.spyOn(process, "emit");
+      const staleBeforeEmit = vi.fn(async () => {});
+      const handler = () => {};
+      process.on("SIGUSR1", handler);
+      try {
+        setPreRestartDeferralCheck(() => 5);
+        scheduleGatewaySigusr1Restart({
+          delayMs: 0,
+          reason: "config.patch",
+          emitHooks: { beforeEmit: staleBeforeEmit },
+        });
+        await vi.advanceTimersByTimeAsync(0);
+        expect(emitSpy).not.toHaveBeenCalledWith("SIGUSR1");
+
+        const forced = scheduleGatewaySigusr1Restart({
+          delayMs: 0,
+          reason: "update.run",
+          skipDeferral: true,
+        });
+
+        expect(forced.coalesced).toBe(false);
+        expect(emitSpy).toHaveBeenCalledWith("SIGUSR1");
+        expect(staleBeforeEmit).not.toHaveBeenCalled();
+        expect(peekGatewaySigusr1RestartReason()).toBe("update.run");
+      } finally {
+        process.removeListener("SIGUSR1", handler);
+      }
+    });
+
    it("emits SIGUSR1 after the default deferral timeout while work is still pending", async () => {
      const emitSpy = vi.spyOn(process, "emit");
      const handler = () => {};
--- a/src/infra/restart.ts
+++ b/src/infra/restart.ts
@@ -44,6 +44,7 @@ let pendingRestartTimer: ReturnType<typeof setTimeout> | null = null;
 let pendingRestartDueAt = 0;
 let pendingRestartReason: string | undefined;
 let pendingRestartEmitHooks: RestartEmitHooks | undefined;
+let pendingRestartSkipDeferral = false;
 let pendingRestartPreparing = false;
 const activeDeferralPolls = new Set<ReturnType<typeof setInterval>>();

@@ -63,6 +64,7 @@ function clearPendingScheduledRestart(): void {
  pendingRestartDueAt = 0;
  pendingRestartReason = undefined;
  pendingRestartEmitHooks = undefined;
+  pendingRestartSkipDeferral = false;
  pendingRestartPreparing = false;
 }

@@ -658,6 +660,7 @@ export function scheduleGatewaySigusr1Restart(opts?: {
  reason?: string;
  audit?: RestartAuditInfo;
  emitHooks?: RestartEmitHooks;
+  skipDeferral?: boolean;
 }): ScheduledRestart {
  const delayMsRaw =
    typeof opts?.delayMs === "number" && Number.isFinite(opts.delayMs)
@@ -673,6 +676,7 @@ export function scheduleGatewaySigusr1Restart(opts?: {
  const nowMs = Date.now();
  const cooldownMsApplied = Math.max(0, lastRestartEmittedAt + RESTART_COOLDOWN_MS - nowMs);
  const requestedDueAt = nowMs + delayMs + cooldownMsApplied;
+  const skipDeferral = opts?.skipDeferral === true;

  if (hasUnconsumedRestartSignal()) {
    if (shouldPreferRestartReason(reason, emittedRestartReason)) {
@@ -695,7 +699,29 @@ export function scheduleGatewaySigusr1Restart(opts?: {

  if (pendingRestartTimer || pendingRestartPreparing) {
    const remainingMs = pendingRestartPreparing ? 0 : Math.max(0, pendingRestartDueAt - nowMs);
-    const shouldPullEarlier = !pendingRestartPreparing && requestedDueAt < pendingRestartDueAt;
+    if (pendingRestartPreparing && skipDeferral && activeDeferralPolls.size > 0) {
+      restartLog.warn(
+        `restart request bypassed active deferral reason=${reason ?? "unspecified"} pendingReason=${pendingRestartReason ?? "unspecified"} ${formatRestartAudit(opts?.audit)}`,
+      );
+      clearActiveDeferralPolls();
+      pendingRestartReason = reason;
+      pendingRestartEmitHooks = opts?.emitHooks;
+      void emitPreparedGatewayRestart(undefined, reason);
+      return {
+        ok: true,
+        pid: process.pid,
+        signal: "SIGUSR1",
+        delayMs: 0,
+        reason,
+        mode,
+        coalesced: false,
+        cooldownMsApplied,
+      };
+    }
+    const shouldUpgradeToSkipDeferral = skipDeferral && !pendingRestartSkipDeferral;
+    const shouldPullEarlier =
+      !pendingRestartPreparing &&
+      (requestedDueAt < pendingRestartDueAt || shouldUpgradeToSkipDeferral);
    if (shouldPullEarlier) {
      restartLog.warn(
        `restart request rescheduled earlier reason=${reason ?? "unspecified"} pendingReason=${pendingRestartReason ?? "unspecified"} oldDelayMs=${remainingMs} newDelayMs=${Math.max(0, requestedDueAt - nowMs)} ${formatRestartAudit(opts?.audit)}`,
@@ -705,6 +731,7 @@ export function scheduleGatewaySigusr1Restart(opts?: {
      if (shouldPreferRestartReason(reason, pendingRestartReason)) {
        pendingRestartReason = reason;
      }
+      pendingRestartSkipDeferral = pendingRestartSkipDeferral || skipDeferral;
      restartLog.warn(
        `restart request coalesced (already scheduled) reason=${reason ?? "unspecified"} pendingReason=${pendingRestartReason ?? "unspecified"} delayMs=${remainingMs} ${formatRestartAudit(opts?.audit)}`,
      );
@@ -725,15 +752,18 @@ export function scheduleGatewaySigusr1Restart(opts?: {
  pendingRestartDueAt = requestedDueAt;
  pendingRestartReason = reason;
  pendingRestartEmitHooks = opts?.emitHooks;
+  pendingRestartSkipDeferral = skipDeferral;
  pendingRestartTimer = setTimeout(
    () => {
      const scheduledReason = pendingRestartReason;
+      const scheduledSkipDeferral = pendingRestartSkipDeferral;
      pendingRestartTimer = null;
      pendingRestartDueAt = 0;
      pendingRestartReason = undefined;
+      pendingRestartSkipDeferral = false;
      pendingRestartPreparing = true;
      const pendingCheck = preRestartCheck;
-      if (!pendingCheck) {
+      if (scheduledSkipDeferral || !pendingCheck) {
        void emitPreparedGatewayRestart(undefined, scheduledReason);
        return;
      }
--- a/test/scripts/openclaw-cross-os-release-checks.test.ts
+++ b/test/scripts/openclaw-cross-os-release-checks.test.ts
@@ -566,7 +566,7 @@ describe("scripts/openclaw-cross-os-release-checks", () => {
    });
  });

-  it("accepts a successful packaged update followed by the old self-swapped process import miss", () => {
+  it("rejects a successful packaged update followed by an old self-swapped process import miss", () => {
    expect(() =>
      verifyPackagedUpgradeUpdateResult(
        {
@@ -581,7 +581,7 @@ describe("scripts/openclaw-cross-os-release-checks", () => {
        },
        { candidateVersion: "2026.4.27" },
      ),
-    ).not.toThrow();
+    ).toThrow(/Packaged upgrade failed/u);
  });

  it("rejects packaged update failures before the candidate package lands", () => {