Fix three child-process stdin write paths that let async EPIPE errors
escape to uncaughtException and crash the gateway.
extensions/imessage/src/client.ts (the actual #75438 crash path):
- Add child.stdin.on('error') listener in start() to catch async EPIPE
and reject all pending requests via failAll().
- Add write callback to request() stdin.write() that rejects the
specific pending request on error, instead of leaving it hanging
until timeout.
src/agents/mcp-stdio-transport.ts:
- Fix write callback race in send(): previously resolved the promise
immediately when write() returned true, then the write callback with
EPIPE would fire after the promise was already fulfilled. Now always
settles the promise from the write callback so the outcome is known
before resolving.
src/process/exec.ts:
- Add stdin.on('error') before writing input so EPIPE from a
prematurely-exited child is swallowed — the process exit handler
reports the real status.
One reporter observed a gateway crash after 10.5 hours of stable
uptime — a single EPIPE on an iMessage RPC child process stdin write
killed the gateway with code 1.
Fixes: #75438
* fix(process): skip kill-tree group kill when child wasn't detached (#71662)
When the supervisor spawns a child with detached:false (service-managed
runtime under launchd/systemd), the child shares the gateway's process
group. On session abort or SIGKILL, killProcessTree was unconditionally
issuing process.kill(-pid, 'SIGTERM') — which targets the entire process
GROUP (negative pid is POSIX group-kill semantics) and therefore
SIGTERMs the gateway parent along with the child.
Reporter saw this on macOS (LaunchAgent + KeepAlive=true): aborting a
claude-cli/claude-opus-4-7 session caused the gateway to receive
SIGTERM, then auto-restart, dropping all in-flight sessions. Switching
the primary model to a non-cli provider eliminated it because the
non-cli paths don't go through this kill-tree call. Did not occur on
Linux VPS where the gateway runs detached, because there
useDetached === true and the child got its own process group.
Fix:
- killProcessTree now accepts opts.detached?: boolean. When detached:false,
killProcessTreeUnix skips the `-pid` group-kill and goes straight to
direct-pid SIGTERM/SIGKILL. Group-kill default (detached:true) is
preserved so all existing callers behave exactly as before.
- supervisor/adapters/child.ts:286 now threads the spawn-time `useDetached`
flag into killProcessTree, so the kill-tree path matches the spawn-time
detachment decision (line 45 of the same file already computes
useDetached = process.platform !== 'win32' && !isServiceManagedRuntime()).
Tests:
- new: detached:false skips group kill and uses direct pid SIGTERM only.
- new: default behaviour (detached:true) still uses group kill (regression
guard so the existing test case isn't accidentally weakened).
Existing tests still pass (6/6 in kill-tree.test.ts). Lint clean.
Out of scope: other killProcessTree callers (mcp-stdio-transport,
bash-tools.process, etc.) keep the default group-kill behaviour because
those processes are typically detached from the gateway. Only the
supervisor/adapters/child.ts path threads `detached` through, since it's
the path that knows whether the child was actually spawned detached.
* fixup(process): also gate kill-tree group-kill on the no-detach spawn fallback (#71662)
Greptile review on the original PR caught a P1 gap: when
spawnWithFallback's initial detached spawn fails and it retries with the
no-detach fallback (label: "no-detach", options.detached: false), the
child runs detached:false but my variable useDetached was still true.
The kill closure then passed `detached: useDetached` = true to
killProcessTree, which still group-killed the gateway — same bug, just
on the fallback path.
Compute the actual detachment as
`useDetached && !spawned.usedFallback` after spawn returns, and pass
that through. This closes the gap: the kill path now correctly skips
group-kill in BOTH:
1. Service-managed runtime (useDetached=false from the start, original case)
2. Detached-spawn fallback to no-detach (useDetached=true at intent
time but spawned.usedFallback=true)
Tests:
- existing 'uses process-tree kill for default SIGKILL' updated to
assert the new {detached} option is forwarded.
- new: passes detached:false to killProcessTree when spawn fell back.
- new: passes detached:false in service-managed mode (regression guard
for the original fix).
11/11 tests pass in child.test.ts. 6/6 in kill-tree.test.ts.
Raise eligible Linux child processes own oom_score_adj from a child-side /bin/sh exec shim so cgroup memory pressure prefers transient workers over the long-lived gateway. Cover supervisor children, PTY shells, MCP stdio servers, and OpenClaw-launched browser processes through the shared process runtime seam.
Harden the wrapper for distroless images, shell startup env, per-child and process-level opt-outs, dash-compatible exec, and leading-dash command names. Document Linux verification and OOM behavior.
Fixes#70404.
Co-authored-by: Neerav Makwana <261249544+neeravmakwana@users.noreply.github.com>
After a SIGUSR1 in-process restart following an npm upgrade from v2026.4.2
to v2026.4.5, the globalThis singleton created by the old code version
lacks the activeTaskWaiters field added in v2026.4.5. resolveGlobalSingleton
returns the stale object as-is, causing notifyActiveTaskWaiters() to call
Array.from(undefined) and crash the gateway in a loop.
Add a schema migration step in getQueueState() that patches the missing
field on legacy singleton objects. Add a regression test that plants a
v2026.4.2-shaped state object and verifies resetAllLanes() and
waitForActiveTasks() succeed without throwing.
Fixes#61905