* Harden Windows command wrapper resolution
* clawsweeper: route Windows cmd.exe wrapper through getWindowsInstallRoots
Replace the local SystemRoot/windir/SYSTEMROOT/WINDIR scan in
resolveTrustedWindowsCmdExe with the shared getWindowsInstallRoots()
resolver from src/infra/windows-install-roots.ts. The shared resolver
already rejects UNC paths, root-relative values, semicolon-delimited
path-lists, and missing-drive-letter roots, and prefers registry-derived
roots over env, so the wrapper-launch trust boundary now matches the
existing Windows install-root boundary on main.
Tests:
- _resetWindowsInstallRootsForTests in beforeEach so cached roots track
per-test process.env mutations
- expectedTrustedCmdExe helper now joins the resolved systemRoot, so the
expected wrapper executable matches the production resolver on Linux
CI (where it falls back to DEFAULT_WINDOWS_SYSTEM_ROOT)
- new "rejects unsafe Windows root values" test covers UNC,
semicolon-delimited path-list, root-relative, and bare-relative
SystemRoot inputs
* Add CHANGELOG entry for #77472 Windows command wrapper hardening
* clawsweeper: stub registry probe in Windows wrapper tests
On real Windows CI runners getWindowsInstallRoots() reads the canonical
SystemRoot from the registry (e.g. C:\WINDOWS) before falling back to
process.env, which shadowed the env-only setup in the ComSpec-poisoning
and unsafe-root tests and produced casing mismatches like
"C:\WINDOWS\System32\cmd.exe" vs the expected "C:\Windows\...". Pass a
queryRegistryValue stub returning null in beforeEach (and inside the
unsafe-root loop) so install-root resolution is fully driven by the
test's process.env setup on every platform.
* clawsweeper: overwrite WINDIR alongside SystemRoot in unsafe-root test
Real Windows runners did not honor `delete process.env.windir`, so the
unsafe-root iteration's WINDIR fallback still resolved to the canonical
`C:\WINDOWS` and produced a casing mismatch against the expected default
`C:\Windows\System32\cmd.exe`. Set both `SystemRoot` and `WINDIR` to the
unsafe payload so every install-root env source is rejected by
`normalizeWindowsInstallRoot` and the resolver falls through to
`DEFAULT_WINDOWS_SYSTEM_ROOT`.
Fix three child-process stdin write paths that let async EPIPE errors
escape to uncaughtException and crash the gateway.
extensions/imessage/src/client.ts (the actual #75438 crash path):
- Add child.stdin.on('error') listener in start() to catch async EPIPE
and reject all pending requests via failAll().
- Add write callback to request() stdin.write() that rejects the
specific pending request on error, instead of leaving it hanging
until timeout.
src/agents/mcp-stdio-transport.ts:
- Fix write callback race in send(): previously resolved the promise
immediately when write() returned true, then the write callback with
EPIPE would fire after the promise was already fulfilled. Now always
settles the promise from the write callback so the outcome is known
before resolving.
src/process/exec.ts:
- Add stdin.on('error') before writing input so EPIPE from a
prematurely-exited child is swallowed — the process exit handler
reports the real status.
One reporter observed a gateway crash after 10.5 hours of stable
uptime — a single EPIPE on an iMessage RPC child process stdin write
killed the gateway with code 1.
Fixes: #75438
* fix(process): skip kill-tree group kill when child wasn't detached (#71662)
When the supervisor spawns a child with detached:false (service-managed
runtime under launchd/systemd), the child shares the gateway's process
group. On session abort or SIGKILL, killProcessTree was unconditionally
issuing process.kill(-pid, 'SIGTERM') — which targets the entire process
GROUP (negative pid is POSIX group-kill semantics) and therefore
SIGTERMs the gateway parent along with the child.
Reporter saw this on macOS (LaunchAgent + KeepAlive=true): aborting a
claude-cli/claude-opus-4-7 session caused the gateway to receive
SIGTERM, then auto-restart, dropping all in-flight sessions. Switching
the primary model to a non-cli provider eliminated it because the
non-cli paths don't go through this kill-tree call. Did not occur on
Linux VPS where the gateway runs detached, because there
useDetached === true and the child got its own process group.
Fix:
- killProcessTree now accepts opts.detached?: boolean. When detached:false,
killProcessTreeUnix skips the `-pid` group-kill and goes straight to
direct-pid SIGTERM/SIGKILL. Group-kill default (detached:true) is
preserved so all existing callers behave exactly as before.
- supervisor/adapters/child.ts:286 now threads the spawn-time `useDetached`
flag into killProcessTree, so the kill-tree path matches the spawn-time
detachment decision (line 45 of the same file already computes
useDetached = process.platform !== 'win32' && !isServiceManagedRuntime()).
Tests:
- new: detached:false skips group kill and uses direct pid SIGTERM only.
- new: default behaviour (detached:true) still uses group kill (regression
guard so the existing test case isn't accidentally weakened).
Existing tests still pass (6/6 in kill-tree.test.ts). Lint clean.
Out of scope: other killProcessTree callers (mcp-stdio-transport,
bash-tools.process, etc.) keep the default group-kill behaviour because
those processes are typically detached from the gateway. Only the
supervisor/adapters/child.ts path threads `detached` through, since it's
the path that knows whether the child was actually spawned detached.
* fixup(process): also gate kill-tree group-kill on the no-detach spawn fallback (#71662)
Greptile review on the original PR caught a P1 gap: when
spawnWithFallback's initial detached spawn fails and it retries with the
no-detach fallback (label: "no-detach", options.detached: false), the
child runs detached:false but my variable useDetached was still true.
The kill closure then passed `detached: useDetached` = true to
killProcessTree, which still group-killed the gateway — same bug, just
on the fallback path.
Compute the actual detachment as
`useDetached && !spawned.usedFallback` after spawn returns, and pass
that through. This closes the gap: the kill path now correctly skips
group-kill in BOTH:
1. Service-managed runtime (useDetached=false from the start, original case)
2. Detached-spawn fallback to no-detach (useDetached=true at intent
time but spawned.usedFallback=true)
Tests:
- existing 'uses process-tree kill for default SIGKILL' updated to
assert the new {detached} option is forwarded.
- new: passes detached:false to killProcessTree when spawn fell back.
- new: passes detached:false in service-managed mode (regression guard
for the original fix).
11/11 tests pass in child.test.ts. 6/6 in kill-tree.test.ts.
Raise eligible Linux child processes own oom_score_adj from a child-side /bin/sh exec shim so cgroup memory pressure prefers transient workers over the long-lived gateway. Cover supervisor children, PTY shells, MCP stdio servers, and OpenClaw-launched browser processes through the shared process runtime seam.
Harden the wrapper for distroless images, shell startup env, per-child and process-level opt-outs, dash-compatible exec, and leading-dash command names. Document Linux verification and OOM behavior.
Fixes#70404.
Co-authored-by: Neerav Makwana <261249544+neeravmakwana@users.noreply.github.com>
After a SIGUSR1 in-process restart following an npm upgrade from v2026.4.2
to v2026.4.5, the globalThis singleton created by the old code version
lacks the activeTaskWaiters field added in v2026.4.5. resolveGlobalSingleton
returns the stale object as-is, causing notifyActiveTaskWaiters() to call
Array.from(undefined) and crash the gateway in a loop.
Add a schema migration step in getQueueState() that patches the missing
field on legacy singleton objects. Add a regression test that plants a
v2026.4.2-shaped state object and verifies resetAllLanes() and
waitForActiveTasks() succeed without throwing.
Fixes#61905