mirror of
https://github.com/openclaw/openclaw.git
synced 2026-05-06 10:20:42 +00:00
* fix(process): skip kill-tree group kill when child wasn't detached (#71662) When the supervisor spawns a child with detached:false (service-managed runtime under launchd/systemd), the child shares the gateway's process group. On session abort or SIGKILL, killProcessTree was unconditionally issuing process.kill(-pid, 'SIGTERM') — which targets the entire process GROUP (negative pid is POSIX group-kill semantics) and therefore SIGTERMs the gateway parent along with the child. Reporter saw this on macOS (LaunchAgent + KeepAlive=true): aborting a claude-cli/claude-opus-4-7 session caused the gateway to receive SIGTERM, then auto-restart, dropping all in-flight sessions. Switching the primary model to a non-cli provider eliminated it because the non-cli paths don't go through this kill-tree call. Did not occur on Linux VPS where the gateway runs detached, because there useDetached === true and the child got its own process group. Fix: - killProcessTree now accepts opts.detached?: boolean. When detached:false, killProcessTreeUnix skips the `-pid` group-kill and goes straight to direct-pid SIGTERM/SIGKILL. Group-kill default (detached:true) is preserved so all existing callers behave exactly as before. - supervisor/adapters/child.ts:286 now threads the spawn-time `useDetached` flag into killProcessTree, so the kill-tree path matches the spawn-time detachment decision (line 45 of the same file already computes useDetached = process.platform !== 'win32' && !isServiceManagedRuntime()). Tests: - new: detached:false skips group kill and uses direct pid SIGTERM only. - new: default behaviour (detached:true) still uses group kill (regression guard so the existing test case isn't accidentally weakened). Existing tests still pass (6/6 in kill-tree.test.ts). Lint clean. Out of scope: other killProcessTree callers (mcp-stdio-transport, bash-tools.process, etc.) keep the default group-kill behaviour because those processes are typically detached from the gateway. Only the supervisor/adapters/child.ts path threads `detached` through, since it's the path that knows whether the child was actually spawned detached. * fixup(process): also gate kill-tree group-kill on the no-detach spawn fallback (#71662) Greptile review on the original PR caught a P1 gap: when spawnWithFallback's initial detached spawn fails and it retries with the no-detach fallback (label: "no-detach", options.detached: false), the child runs detached:false but my variable useDetached was still true. The kill closure then passed `detached: useDetached` = true to killProcessTree, which still group-killed the gateway — same bug, just on the fallback path. Compute the actual detachment as `useDetached && !spawned.usedFallback` after spawn returns, and pass that through. This closes the gap: the kill path now correctly skips group-kill in BOTH: 1. Service-managed runtime (useDetached=false from the start, original case) 2. Detached-spawn fallback to no-detach (useDetached=true at intent time but spawned.usedFallback=true) Tests: - existing 'uses process-tree kill for default SIGKILL' updated to assert the new {detached} option is forwarded. - new: passes detached:false to killProcessTree when spawn fell back. - new: passes detached:false in service-managed mode (regression guard for the original fix). 11/11 tests pass in child.test.ts. 6/6 in kill-tree.test.ts.
This commit is contained in:
@@ -139,4 +139,44 @@ describe("killProcessTree", () => {
|
||||
expect(killSpy).toHaveBeenCalledWith(-4444, "SIGKILL");
|
||||
});
|
||||
});
|
||||
|
||||
it("on Unix skips group kill when detached:false to avoid SIGTERMing the parent's own process group (#71662)", async () => {
|
||||
killSpy.mockImplementation(((pid: number, signal?: NodeJS.Signals | number) => {
|
||||
if (pid === 5555 && signal === 0) {
|
||||
throw new Error("ESRCH");
|
||||
}
|
||||
return true;
|
||||
}) as typeof process.kill);
|
||||
|
||||
await withPlatform("linux", async () => {
|
||||
killProcessTree(5555, { graceMs: 10, detached: false });
|
||||
await vi.advanceTimersByTimeAsync(10);
|
||||
|
||||
// Direct pid kill is fine. Group kill (`-pid`) is FORBIDDEN here because
|
||||
// when the child wasn't spawned detached, its process group is the
|
||||
// gateway's group — `-pid` would SIGTERM the gateway itself.
|
||||
expect(killSpy).toHaveBeenCalledWith(5555, "SIGTERM");
|
||||
expect(killSpy).not.toHaveBeenCalledWith(-5555, "SIGTERM");
|
||||
expect(killSpy).not.toHaveBeenCalledWith(-5555, "SIGKILL");
|
||||
});
|
||||
});
|
||||
|
||||
it("on Unix uses group kill by default (detached:true preserved as the existing behavior)", async () => {
|
||||
killSpy.mockImplementation(((pid: number, signal?: NodeJS.Signals | number) => {
|
||||
if (pid === -6666 && signal === 0) {
|
||||
throw new Error("ESRCH");
|
||||
}
|
||||
if (pid === 6666 && signal === 0) {
|
||||
throw new Error("ESRCH");
|
||||
}
|
||||
return true;
|
||||
}) as typeof process.kill);
|
||||
|
||||
await withPlatform("linux", async () => {
|
||||
killProcessTree(6666, { graceMs: 10 });
|
||||
await vi.advanceTimersByTimeAsync(10);
|
||||
|
||||
expect(killSpy).toHaveBeenCalledWith(-6666, "SIGTERM");
|
||||
});
|
||||
});
|
||||
});
|
||||
|
||||
@@ -11,8 +11,16 @@ const MAX_GRACE_MS = 60_000;
|
||||
*
|
||||
* This gives child processes a chance to clean up (close connections, remove
|
||||
* temp files, terminate their own children) before being hard-killed.
|
||||
*
|
||||
* When the child was spawned with `detached: false` (e.g. service-managed
|
||||
* runtime under launchd/systemd), pass `detached: false` to skip the Unix
|
||||
* `process.kill(-pid, ...)` group-kill — it would otherwise target the
|
||||
* gateway's own process group and SIGTERM the gateway itself. (#71662)
|
||||
*/
|
||||
export function killProcessTree(pid: number, opts?: { graceMs?: number }): void {
|
||||
export function killProcessTree(
|
||||
pid: number,
|
||||
opts?: { graceMs?: number; detached?: boolean },
|
||||
): void {
|
||||
if (!Number.isFinite(pid) || pid <= 0) {
|
||||
return;
|
||||
}
|
||||
@@ -24,7 +32,7 @@ export function killProcessTree(pid: number, opts?: { graceMs?: number }): void
|
||||
return;
|
||||
}
|
||||
|
||||
killProcessTreeUnix(pid, graceMs);
|
||||
killProcessTreeUnix(pid, graceMs, opts?.detached !== false);
|
||||
}
|
||||
|
||||
function normalizeGraceMs(value?: number): number {
|
||||
@@ -43,12 +51,23 @@ function isProcessAlive(pid: number): boolean {
|
||||
}
|
||||
}
|
||||
|
||||
function killProcessTreeUnix(pid: number, graceMs: number): void {
|
||||
// Step 1: Try graceful SIGTERM to process group
|
||||
try {
|
||||
process.kill(-pid, "SIGTERM");
|
||||
} catch {
|
||||
// Process group doesn't exist or we lack permission - try direct
|
||||
function killProcessTreeUnix(pid: number, graceMs: number, useGroupKill: boolean): void {
|
||||
// Step 1: Try graceful SIGTERM. Prefer process-group kill (`-pid`) when the
|
||||
// child was spawned detached so it has its own group; otherwise stick to the
|
||||
// direct pid to avoid SIGTERMing our own process group (the gateway). (#71662)
|
||||
if (useGroupKill) {
|
||||
try {
|
||||
process.kill(-pid, "SIGTERM");
|
||||
} catch {
|
||||
// Process group doesn't exist or we lack permission - try direct
|
||||
try {
|
||||
process.kill(pid, "SIGTERM");
|
||||
} catch {
|
||||
// Already gone
|
||||
return;
|
||||
}
|
||||
}
|
||||
} else {
|
||||
try {
|
||||
process.kill(pid, "SIGTERM");
|
||||
} catch {
|
||||
@@ -59,7 +78,7 @@ function killProcessTreeUnix(pid: number, graceMs: number): void {
|
||||
|
||||
// Step 2: Wait grace period, then SIGKILL if still alive
|
||||
setTimeout(() => {
|
||||
if (isProcessAlive(-pid)) {
|
||||
if (useGroupKill && isProcessAlive(-pid)) {
|
||||
try {
|
||||
process.kill(-pid, "SIGKILL");
|
||||
return;
|
||||
|
||||
@@ -122,10 +122,46 @@ describe("createChildAdapter", () => {
|
||||
|
||||
adapter.kill();
|
||||
|
||||
expect(killProcessTreeMock).toHaveBeenCalledWith(4321);
|
||||
// Detachment flag is now passed to killProcessTree so it knows whether
|
||||
// it can safely group-kill via -pid. (#71662)
|
||||
const expectedDetached = process.platform !== "win32" && !process.env.OPENCLAW_SERVICE_MARKER;
|
||||
expect(killProcessTreeMock).toHaveBeenCalledWith(4321, { detached: expectedDetached });
|
||||
expect(killMock).toHaveBeenCalledWith("SIGKILL");
|
||||
});
|
||||
|
||||
it("passes detached:false to killProcessTree when spawn fell back to no-detach (#71662 follow-up)", async () => {
|
||||
// Simulate the fallback scenario: spawnWithFallback retried with
|
||||
// detached:false because the initial detached spawn failed. The kill
|
||||
// closure must NOT group-kill since the child shares the gateway's group.
|
||||
const { child, killMock } = createStubChild(8888);
|
||||
spawnWithFallbackMock.mockResolvedValue({
|
||||
child,
|
||||
usedFallback: true,
|
||||
fallbackLabel: "no-detach",
|
||||
});
|
||||
const adapter = await createChildAdapter({
|
||||
argv: ["node", "-e", "setTimeout(() => {}, 1000)"],
|
||||
stdinMode: "pipe-open",
|
||||
});
|
||||
|
||||
adapter.kill();
|
||||
|
||||
expect(killProcessTreeMock).toHaveBeenCalledWith(8888, { detached: false });
|
||||
expect(killMock).toHaveBeenCalledWith("SIGKILL");
|
||||
});
|
||||
|
||||
it("passes detached:false in service-managed mode where useDetached is false from the start (#71662)", async () => {
|
||||
process.env.OPENCLAW_SERVICE_MARKER = "1";
|
||||
try {
|
||||
const { adapter, killMock } = await createAdapterHarness({ pid: 9999 });
|
||||
adapter.kill();
|
||||
expect(killProcessTreeMock).toHaveBeenCalledWith(9999, { detached: false });
|
||||
expect(killMock).toHaveBeenCalledWith("SIGKILL");
|
||||
} finally {
|
||||
delete process.env.OPENCLAW_SERVICE_MARKER;
|
||||
}
|
||||
});
|
||||
|
||||
it("uses direct child.kill for non-SIGKILL signals", async () => {
|
||||
const { adapter, killMock } = await createAdapterHarness({ pid: 7654 });
|
||||
|
||||
|
||||
@@ -283,11 +283,20 @@ export async function createChildAdapter(params: {
|
||||
return waitPromise;
|
||||
};
|
||||
|
||||
// The actual detachment of the spawned child can differ from `useDetached`:
|
||||
// when the detached spawn fails, `spawnWithFallback` retries with the
|
||||
// `no-detach` fallback (detached:false). In that case the child shares the
|
||||
// gateway's process group regardless of intent, so the kill must avoid
|
||||
// group-kill. (#71662 follow-up — caught by Greptile review)
|
||||
const childIsDetached = useDetached && !spawned.usedFallback;
|
||||
const kill = (signal?: NodeJS.Signals) => {
|
||||
const pid = child.pid ?? undefined;
|
||||
if (signal === undefined || signal === "SIGKILL") {
|
||||
if (pid) {
|
||||
killProcessTree(pid);
|
||||
// Pass through whether the child is actually detached. Without this,
|
||||
// `killProcessTree` group-kills via `-pid` and takes out the gateway's
|
||||
// own process group along with the child. (#71662)
|
||||
killProcessTree(pid, { detached: childIsDetached });
|
||||
}
|
||||
try {
|
||||
child.kill("SIGKILL");
|
||||
|
||||
Reference in New Issue
Block a user