fix(process): skip kill-tree group kill when child wasn't detached (#71662) (#71681)

* fix(process): skip kill-tree group kill when child wasn't detached (#71662)

When the supervisor spawns a child with detached:false (service-managed
runtime under launchd/systemd), the child shares the gateway's process
group. On session abort or SIGKILL, killProcessTree was unconditionally
issuing process.kill(-pid, 'SIGTERM') — which targets the entire process
GROUP (negative pid is POSIX group-kill semantics) and therefore
SIGTERMs the gateway parent along with the child.

Reporter saw this on macOS (LaunchAgent + KeepAlive=true): aborting a
claude-cli/claude-opus-4-7 session caused the gateway to receive
SIGTERM, then auto-restart, dropping all in-flight sessions. Switching
the primary model to a non-cli provider eliminated it because the
non-cli paths don't go through this kill-tree call. Did not occur on
Linux VPS where the gateway runs detached, because there
useDetached === true and the child got its own process group.

Fix:
- killProcessTree now accepts opts.detached?: boolean. When detached:false,
  killProcessTreeUnix skips the `-pid` group-kill and goes straight to
  direct-pid SIGTERM/SIGKILL. Group-kill default (detached:true) is
  preserved so all existing callers behave exactly as before.
- supervisor/adapters/child.ts:286 now threads the spawn-time `useDetached`
  flag into killProcessTree, so the kill-tree path matches the spawn-time
  detachment decision (line 45 of the same file already computes
  useDetached = process.platform !== 'win32' && !isServiceManagedRuntime()).

Tests:
- new: detached:false skips group kill and uses direct pid SIGTERM only.
- new: default behaviour (detached:true) still uses group kill (regression
  guard so the existing test case isn't accidentally weakened).

Existing tests still pass (6/6 in kill-tree.test.ts). Lint clean.

Out of scope: other killProcessTree callers (mcp-stdio-transport,
bash-tools.process, etc.) keep the default group-kill behaviour because
those processes are typically detached from the gateway. Only the
supervisor/adapters/child.ts path threads `detached` through, since it's
the path that knows whether the child was actually spawned detached.

* fixup(process): also gate kill-tree group-kill on the no-detach spawn fallback (#71662)

Greptile review on the original PR caught a P1 gap: when
spawnWithFallback's initial detached spawn fails and it retries with the
no-detach fallback (label: "no-detach", options.detached: false), the
child runs detached:false but my variable useDetached was still true.
The kill closure then passed `detached: useDetached` = true to
killProcessTree, which still group-killed the gateway — same bug, just
on the fallback path.

Compute the actual detachment as
`useDetached && !spawned.usedFallback` after spawn returns, and pass
that through. This closes the gap: the kill path now correctly skips
group-kill in BOTH:
1. Service-managed runtime (useDetached=false from the start, original case)
2. Detached-spawn fallback to no-detach (useDetached=true at intent
   time but spawned.usedFallback=true)

Tests:
- existing 'uses process-tree kill for default SIGKILL' updated to
  assert the new {detached} option is forwarded.
- new: passes detached:false to killProcessTree when spawn fell back.
- new: passes detached:false in service-managed mode (regression guard
  for the original fix).

11/11 tests pass in child.test.ts. 6/6 in kill-tree.test.ts.
This commit is contained in:
hcl
2026-04-26 05:08:53 +08:00
committed by GitHub
parent 1841dd9977
commit 4a72e1b990
4 changed files with 115 additions and 11 deletions

View File

@@ -139,4 +139,44 @@ describe("killProcessTree", () => {
expect(killSpy).toHaveBeenCalledWith(-4444, "SIGKILL");
});
});
it("on Unix skips group kill when detached:false to avoid SIGTERMing the parent's own process group (#71662)", async () => {
killSpy.mockImplementation(((pid: number, signal?: NodeJS.Signals | number) => {
if (pid === 5555 && signal === 0) {
throw new Error("ESRCH");
}
return true;
}) as typeof process.kill);
await withPlatform("linux", async () => {
killProcessTree(5555, { graceMs: 10, detached: false });
await vi.advanceTimersByTimeAsync(10);
// Direct pid kill is fine. Group kill (`-pid`) is FORBIDDEN here because
// when the child wasn't spawned detached, its process group is the
// gateway's group — `-pid` would SIGTERM the gateway itself.
expect(killSpy).toHaveBeenCalledWith(5555, "SIGTERM");
expect(killSpy).not.toHaveBeenCalledWith(-5555, "SIGTERM");
expect(killSpy).not.toHaveBeenCalledWith(-5555, "SIGKILL");
});
});
it("on Unix uses group kill by default (detached:true preserved as the existing behavior)", async () => {
killSpy.mockImplementation(((pid: number, signal?: NodeJS.Signals | number) => {
if (pid === -6666 && signal === 0) {
throw new Error("ESRCH");
}
if (pid === 6666 && signal === 0) {
throw new Error("ESRCH");
}
return true;
}) as typeof process.kill);
await withPlatform("linux", async () => {
killProcessTree(6666, { graceMs: 10 });
await vi.advanceTimersByTimeAsync(10);
expect(killSpy).toHaveBeenCalledWith(-6666, "SIGTERM");
});
});
});

View File

@@ -11,8 +11,16 @@ const MAX_GRACE_MS = 60_000;
*
* This gives child processes a chance to clean up (close connections, remove
* temp files, terminate their own children) before being hard-killed.
*
* When the child was spawned with `detached: false` (e.g. service-managed
* runtime under launchd/systemd), pass `detached: false` to skip the Unix
* `process.kill(-pid, ...)` group-kill — it would otherwise target the
* gateway's own process group and SIGTERM the gateway itself. (#71662)
*/
export function killProcessTree(pid: number, opts?: { graceMs?: number }): void {
export function killProcessTree(
pid: number,
opts?: { graceMs?: number; detached?: boolean },
): void {
if (!Number.isFinite(pid) || pid <= 0) {
return;
}
@@ -24,7 +32,7 @@ export function killProcessTree(pid: number, opts?: { graceMs?: number }): void
return;
}
killProcessTreeUnix(pid, graceMs);
killProcessTreeUnix(pid, graceMs, opts?.detached !== false);
}
function normalizeGraceMs(value?: number): number {
@@ -43,12 +51,23 @@ function isProcessAlive(pid: number): boolean {
}
}
function killProcessTreeUnix(pid: number, graceMs: number): void {
// Step 1: Try graceful SIGTERM to process group
try {
process.kill(-pid, "SIGTERM");
} catch {
// Process group doesn't exist or we lack permission - try direct
function killProcessTreeUnix(pid: number, graceMs: number, useGroupKill: boolean): void {
// Step 1: Try graceful SIGTERM. Prefer process-group kill (`-pid`) when the
// child was spawned detached so it has its own group; otherwise stick to the
// direct pid to avoid SIGTERMing our own process group (the gateway). (#71662)
if (useGroupKill) {
try {
process.kill(-pid, "SIGTERM");
} catch {
// Process group doesn't exist or we lack permission - try direct
try {
process.kill(pid, "SIGTERM");
} catch {
// Already gone
return;
}
}
} else {
try {
process.kill(pid, "SIGTERM");
} catch {
@@ -59,7 +78,7 @@ function killProcessTreeUnix(pid: number, graceMs: number): void {
// Step 2: Wait grace period, then SIGKILL if still alive
setTimeout(() => {
if (isProcessAlive(-pid)) {
if (useGroupKill && isProcessAlive(-pid)) {
try {
process.kill(-pid, "SIGKILL");
return;

View File

@@ -122,10 +122,46 @@ describe("createChildAdapter", () => {
adapter.kill();
expect(killProcessTreeMock).toHaveBeenCalledWith(4321);
// Detachment flag is now passed to killProcessTree so it knows whether
// it can safely group-kill via -pid. (#71662)
const expectedDetached = process.platform !== "win32" && !process.env.OPENCLAW_SERVICE_MARKER;
expect(killProcessTreeMock).toHaveBeenCalledWith(4321, { detached: expectedDetached });
expect(killMock).toHaveBeenCalledWith("SIGKILL");
});
it("passes detached:false to killProcessTree when spawn fell back to no-detach (#71662 follow-up)", async () => {
// Simulate the fallback scenario: spawnWithFallback retried with
// detached:false because the initial detached spawn failed. The kill
// closure must NOT group-kill since the child shares the gateway's group.
const { child, killMock } = createStubChild(8888);
spawnWithFallbackMock.mockResolvedValue({
child,
usedFallback: true,
fallbackLabel: "no-detach",
});
const adapter = await createChildAdapter({
argv: ["node", "-e", "setTimeout(() => {}, 1000)"],
stdinMode: "pipe-open",
});
adapter.kill();
expect(killProcessTreeMock).toHaveBeenCalledWith(8888, { detached: false });
expect(killMock).toHaveBeenCalledWith("SIGKILL");
});
it("passes detached:false in service-managed mode where useDetached is false from the start (#71662)", async () => {
process.env.OPENCLAW_SERVICE_MARKER = "1";
try {
const { adapter, killMock } = await createAdapterHarness({ pid: 9999 });
adapter.kill();
expect(killProcessTreeMock).toHaveBeenCalledWith(9999, { detached: false });
expect(killMock).toHaveBeenCalledWith("SIGKILL");
} finally {
delete process.env.OPENCLAW_SERVICE_MARKER;
}
});
it("uses direct child.kill for non-SIGKILL signals", async () => {
const { adapter, killMock } = await createAdapterHarness({ pid: 7654 });

View File

@@ -283,11 +283,20 @@ export async function createChildAdapter(params: {
return waitPromise;
};
// The actual detachment of the spawned child can differ from `useDetached`:
// when the detached spawn fails, `spawnWithFallback` retries with the
// `no-detach` fallback (detached:false). In that case the child shares the
// gateway's process group regardless of intent, so the kill must avoid
// group-kill. (#71662 follow-up — caught by Greptile review)
const childIsDetached = useDetached && !spawned.usedFallback;
const kill = (signal?: NodeJS.Signals) => {
const pid = child.pid ?? undefined;
if (signal === undefined || signal === "SIGKILL") {
if (pid) {
killProcessTree(pid);
// Pass through whether the child is actually detached. Without this,
// `killProcessTree` group-kills via `-pid` and takes out the gateway's
// own process group along with the child. (#71662)
killProcessTree(pid, { detached: childIsDetached });
}
try {
child.kill("SIGKILL");