mirror of https://github.com/openclaw/openclaw.git
synced 2026-05-06 10:30:44 +00:00
fix: prevent channel login exec wedges
@@ -45,6 +45,8 @@ Docs: https://docs.openclaw.ai
 - Discord: own the Carbon interaction listener and hand off Discord slash/component handling asynchronously, so compaction or long session locks no longer trip `InteractionEventListener` listener timeouts. Fixes #73204. Thanks @slideshow-dingo.
 - Compaction/diagnostics: keep unknown compaction failure classifications stable while logging sanitized detail for unclassified provider errors such as missing Ollama provider adapters. Thanks @gzsiang.
 - Models/fallbacks: record first-class `model.fallback_step` trajectory events with from/to models, failure detail, chain position, and final outcome so support exports preserve the primary model failure even when a later fallback also fails. Fixes #71744. Thanks @nikolaykazakovvs-ux.
+- Gateway/agents: block agent `exec` from launching interactive `openclaw channels login` flows and abort active agent runs after invalid-config recovery restores last-known-good config, preventing known channel-login and reload paths from wedging replies. Refs #72338. Thanks @midhunmonachan.
+- Gateway/diagnostics: emit payload-free liveness warnings with event-loop delay, event-loop utilization, CPU-core ratio, and active-session counts so live-but-stalled Gateways capture CPU-spin context in stability bundles. Refs #72338. Thanks @midhunmonachan and @DougButdorf.
 - Gateway/startup: keep value-option foreground starts on the gateway fast path and skip proxy bootstrap unless proxy env is configured, reducing normal gateway startup RSS and avoiding full CLI graph loading. Thanks @vincentkoc.
 - Heartbeat/models: show heartbeat model bleed guidance on context-overflow resets when the last runtime model matches configured `heartbeat.model`, so smaller local heartbeat models point users to `isolatedSession` or `lightContext` instead of only compaction-buffer tuning. Fixes #67314. Thanks @Knightmare6890.
 - Subagents/models: persist `sessions_spawn.model` and configured subagent models as child-session model overrides before the first turn, so spawned subagents actually run on the requested provider/model instead of reverting to the target agent default. Fixes #73180. Thanks @danielzinhu99.
@@ -92,6 +92,7 @@ openclaw channels logout --channel whatsapp

 - `channels login` supports `--verbose`.
 - `channels login` and `logout` can infer the channel when only one supported login target is configured.
+- Run `channels login` from a terminal on the gateway host. Agent `exec` blocks this interactive login flow; channel-native agent login tools, such as `whatsapp_login`, should be used from chat when available.

 ## Troubleshooting

@@ -71,6 +71,12 @@ export keeps only that a message was omitted and the byte count.
 The Gateway records a bounded, payload-free stability stream by default when
 diagnostics are enabled. It is for operational facts, not content.

+The same diagnostic heartbeat records liveness warnings when the Gateway keeps
+running but the Node.js event loop or CPU looks saturated. These
+`diagnostic.liveness.warning` events include event-loop delay, event-loop
+utilization, CPU-core ratio, and active/waiting/queued session counts. They do
+not restart the Gateway by themselves.
+
 Inspect the live recorder:

 ```bash
@@ -24,7 +24,7 @@ Short guide to verify channel connectivity without guessing.
 - Creds on disk: `ls -l ~/.openclaw/credentials/whatsapp/<accountId>/creds.json` (mtime should be recent).
 - Session store: `ls -l ~/.openclaw/agents/<agentId>/sessions/sessions.json` (path can be overridden in config). Count and recent recipients are surfaced via `status`.
 - Relink flow: `openclaw channels logout && openclaw channels login --verbose` when status codes 409–515 or `loggedOut` appear in logs. (Note: the QR login flow auto-restarts once for status 515 after pairing.)
-- Diagnostics are enabled by default. The gateway records operational facts unless `diagnostics.enabled: false` is set. Memory events record RSS/heap byte counts, threshold pressure, and growth pressure. Oversized-payload events record what was rejected, truncated, or chunked, plus sizes and limits when available. They do not record the message text, attachment contents, webhook body, raw request or response body, tokens, cookies, or secret values. The same heartbeat starts the bounded stability recorder, which is available through `openclaw gateway stability` or the `diagnostics.stability` Gateway RPC. Fatal Gateway exits, shutdown timeouts, and restart startup failures persist the latest recorder snapshot under `~/.openclaw/logs/stability/` when events exist; inspect the newest saved bundle with `openclaw gateway stability --bundle latest`.
+- Diagnostics are enabled by default. The gateway records operational facts unless `diagnostics.enabled: false` is set. Memory events record RSS/heap byte counts, threshold pressure, and growth pressure. Liveness warnings record event-loop delay, event-loop utilization, CPU-core ratio, and active/waiting/queued session counts when the process is running but saturated. Oversized-payload events record what was rejected, truncated, or chunked, plus sizes and limits when available. They do not record the message text, attachment contents, webhook body, raw request or response body, tokens, cookies, or secret values. The same heartbeat starts the bounded stability recorder, which is available through `openclaw gateway stability` or the `diagnostics.stability` Gateway RPC. Fatal Gateway exits, shutdown timeouts, and restart startup failures persist the latest recorder snapshot under `~/.openclaw/logs/stability/` when events exist; inspect the newest saved bundle with `openclaw gateway stability --bundle latest`.
 - For bug reports, run `openclaw gateway diagnostics export` and attach the generated zip. The export combines a Markdown summary, the newest stability bundle, sanitized log metadata, sanitized Gateway status/health snapshots, and config shape. It is meant to be shared: chat text, webhook bodies, tool outputs, credentials, cookies, account/message identifiers, and secret values are omitted or redacted. See [Diagnostics Export](/gateway/diagnostics).

 ## Health monitor config
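The liveness bullet in the health notes above names three saturation signals. As a rough sketch (not the Gateway's own code), the decision can be pictured as a pure threshold comparison. The threshold values mirror the `DEFAULT_LIVENESS_*` constants added later in this commit; the type and function names here are hypothetical and exist only for illustration.

```typescript
// Hypothetical names; thresholds copied from the DEFAULT_LIVENESS_* constants in this commit.
type LivenessReason = "event_loop_delay" | "event_loop_utilization" | "cpu";

interface LivenessMetrics {
  eventLoopDelayP99Ms: number;
  eventLoopDelayMaxMs: number;
  eventLoopUtilization: number; // fraction of time the loop was busy, 0..1
  cpuCoreRatio: number; // CPU milliseconds consumed per wall-clock millisecond
}

const EVENT_LOOP_DELAY_WARN_MS = 1_000;
const EVENT_LOOP_UTILIZATION_WARN = 0.95;
const CPU_CORE_RATIO_WARN = 0.9;

// Returns every signal that crossed its threshold; an empty list means "no warning".
function classifyLiveness(metrics: LivenessMetrics): LivenessReason[] {
  const reasons: LivenessReason[] = [];
  if (
    metrics.eventLoopDelayP99Ms >= EVENT_LOOP_DELAY_WARN_MS ||
    metrics.eventLoopDelayMaxMs >= EVENT_LOOP_DELAY_WARN_MS
  ) {
    reasons.push("event_loop_delay");
  }
  if (metrics.eventLoopUtilization >= EVENT_LOOP_UTILIZATION_WARN) {
    reasons.push("event_loop_utilization");
  }
  if (metrics.cpuCoreRatio >= CPU_CORE_RATIO_WARN) {
    reasons.push("cpu");
  }
  return reasons;
}
```

A CPU-spinning but responsive process (low delay, high utilization and CPU ratio) would report two reasons here, while an idle process reports none and therefore produces no warning.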
@@ -78,6 +78,7 @@ Notes:
 - Host execution (`gateway`/`node`) rejects `env.PATH` and loader overrides (`LD_*`/`DYLD_*`) to
   prevent binary hijacking or injected code.
 - OpenClaw sets `OPENCLAW_SHELL=exec` in the spawned command environment (including PTY and sandbox execution) so shell/profile rules can detect exec-tool context.
+- `openclaw channels login` is blocked from `exec` because it is an interactive channel-auth flow; run it in a terminal on the gateway host, or use the channel-native login tool from chat when one exists.
 - Important: sandboxing is **off by default**. If sandboxing is off, implicit `host=auto`
   resolves to `gateway`. Explicit `host=sandbox` still fails closed instead of silently
   running on the gateway host. Enable sandboxing or use `host=gateway` with approvals.
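To make the `exec` note above concrete, here is a deliberately simplified sketch of how a command string might be recognized as a channel login. It is not the commit's implementation: it assumes plain whitespace tokenization (no quoting, no `sudo`/shell wrappers, no `npx`/`bunx` flags), and the names `baseName` and `isChannelsLogin` are hypothetical.

```typescript
// Simplified, hypothetical sketch: whitespace tokenization only.
const PACKAGE_RUNNERS = new Set(["pnpm", "npm", "yarn"]);
const RUNNER_SUBCOMMANDS = new Set(["exec", "dlx", "run"]);

// Lowercased final path segment without a Windows launcher suffix.
function baseName(token: string | undefined): string {
  if (!token) {
    return "";
  }
  const last = token.split(/[\\/]/).pop() ?? "";
  return last.toLowerCase().replace(/\.(?:cmd|exe)$/, "");
}

function isChannelsLogin(command: string): boolean {
  let argv = command.trim().split(/\s+/);
  if (PACKAGE_RUNNERS.has(baseName(argv[0]))) {
    // Drop "pnpm" or "pnpm exec" so argv starts at the openclaw binary.
    argv = argv.slice(RUNNER_SUBCOMMANDS.has(argv[1] ?? "") ? 2 : 1);
  }
  return (
    baseName(argv[0]) === "openclaw" &&
    (argv[1] === "channels" || argv[1] === "channel") &&
    argv[2] === "login"
  );
}
```

Matching on the parsed argument vector rather than a substring keeps commands such as `openclaw channels status --deep` runnable while still catching package-runner forms of the login command.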
@@ -24,6 +24,7 @@ const isWin = process.platform === "win32";

 const describeNonWin = isWin ? describe.skip : describe;
 const describeWin = isWin ? describe : describe.skip;
+const parseOpenClawChannelsLoginShellCommand = __testing.parseOpenClawChannelsLoginShellCommand;
 const validateExecScriptPreflight = __testing.validateScriptFileForShellBleed;
 const createPreflightTool = () =>
   createExecTool({ host: "gateway", security: "full", ask: "on-miss" });
@@ -66,6 +67,35 @@ async function expectSymlinkSwapDuringPreflightToAvoidErrors(params: {
   });
 }

+describe("exec interactive OpenClaw channel login guard", () => {
+  it("recognizes direct and package-runner channel login commands before execution", () => {
+    expect(
+      parseOpenClawChannelsLoginShellCommand("openclaw channels login --channel whatsapp"),
+    ).toBe(true);
+    expect(
+      parseOpenClawChannelsLoginShellCommand(
+        "pnpm exec openclaw channels login --channel whatsapp --verbose",
+      ),
+    ).toBe(true);
+    expect(parseOpenClawChannelsLoginShellCommand("openclaw channels status --deep")).toBe(false);
+  });
+
+  it("blocks interactive channel login commands from exec", async () => {
+    const tool = createPreflightTool();
+
+    await expect(
+      tool.execute("call-openclaw-channel-login", {
+        command: "openclaw channels login --channel whatsapp --verbose",
+      }),
+    ).rejects.toThrow(/exec cannot run interactive OpenClaw channel login commands/);
+    await expect(
+      tool.execute("call-wrapped-openclaw-channel-login", {
+        command: "sudo -u openclaw bash -lc 'openclaw channels login --channel whatsapp'",
+      }),
+    ).rejects.toThrow(/exec cannot run interactive OpenClaw channel login commands/);
+  });
+});
+
 describeNonWin("exec script preflight", () => {
   it("blocks shell env var injection tokens in python scripts before execution", async () => {
     await withTempDir("openclaw-exec-preflight-", async (tmp) => {
@@ -1074,7 +1074,69 @@ function parseExecApprovalShellCommand(raw: string): ParsedExecApprovalCommand |
   };
 }

-function rejectExecApprovalShellCommand(command: string): void {
+function normalizeCommandBaseName(token: string | undefined): string {
+  if (!token) {
+    return "";
+  }
+  const base = normalizeLowercaseStringOrEmpty(token.split(/[\\/]/u).at(-1));
+  return base.replace(/\.(?:cmd|exe)$/u, "");
+}
+
+function stripOpenClawPackageRunner(argv: string[]): string[] {
+  const commandName = normalizeCommandBaseName(argv[0]);
+  if (commandName === "openclaw") {
+    return argv;
+  }
+  if (
+    (commandName === "pnpm" || commandName === "npm" || commandName === "yarn") &&
+    normalizeCommandBaseName(argv[1]) === "openclaw"
+  ) {
+    return argv.slice(1);
+  }
+  if (
+    (commandName === "pnpm" || commandName === "npm" || commandName === "yarn") &&
+    (argv[1] === "exec" || argv[1] === "dlx" || argv[1] === "run") &&
+    normalizeCommandBaseName(argv[2]) === "openclaw"
+  ) {
+    return argv.slice(2);
+  }
+  if (commandName === "npx" || commandName === "bunx") {
+    let idx = 1;
+    while (idx < argv.length) {
+      const token = argv[idx];
+      if (token === "--") {
+        idx += 1;
+        break;
+      }
+      if (!token.startsWith("-") || token === "-") {
+        break;
+      }
+      idx += 1;
+      if ((token === "-p" || token === "--package") && idx < argv.length) {
+        idx += 1;
+      }
+    }
+    if (normalizeCommandBaseName(argv[idx]) === "openclaw") {
+      return argv.slice(idx);
+    }
+  }
+  return argv;
+}
+
+function parseOpenClawChannelsLoginShellCommand(raw: string): boolean {
+  const argv = splitShellArgs(raw);
+  if (!argv) {
+    return false;
+  }
+  const openclawArgv = stripOpenClawPackageRunner(argv);
+  return (
+    normalizeCommandBaseName(openclawArgv[0]) === "openclaw" &&
+    (openclawArgv[1] === "channels" || openclawArgv[1] === "channel") &&
+    openclawArgv[2] === "login"
+  );
+}
+
+function rejectUnsafeControlShellCommand(command: string): void {
   const isEnvAssignmentToken = (token: string): boolean =>
     /^[A-Za-z_][A-Za-z0-9_]*=.*$/u.test(token);
   const shellWrappers = new Set(["bash", "dash", "fish", "ksh", "sh", "zsh"]);
@@ -1295,15 +1357,22 @@ function rejectExecApprovalShellCommand(command: string): void {
     return argv ? buildCandidates(argv) : [line];
   });
   for (const candidate of candidates) {
-    if (!parseExecApprovalShellCommand(candidate)) {
-      continue;
+    if (parseExecApprovalShellCommand(candidate)) {
+      throw new Error(
+        [
+          "exec cannot run /approve commands.",
+          "Show the /approve command to the user as chat text, or route it through the approval command handler instead of shell execution.",
+        ].join(" "),
+      );
+    }
+    if (parseOpenClawChannelsLoginShellCommand(candidate)) {
+      throw new Error(
+        [
+          "exec cannot run interactive OpenClaw channel login commands.",
+          "Run `openclaw channels login` in a terminal on the gateway host, or use the channel-specific login agent tool when available (for WhatsApp: `whatsapp_login`).",
+        ].join(" "),
+      );
     }
-    throw new Error(
-      [
-        "exec cannot run /approve commands.",
-        "Show the /approve command to the user as chat text, or route it through the approval command handler instead of shell execution.",
-      ].join(" "),
-    );
   }
 }

@@ -1532,7 +1601,7 @@ export function createExecTool(
         const rawWorkdir = explicitWorkdir ?? defaultWorkdir ?? process.cwd();
         workdir = resolveWorkdir(rawWorkdir, warnings);
       }
-      rejectExecApprovalShellCommand(params.command);
+      rejectUnsafeControlShellCommand(params.command);

       const inheritedBaseEnv = coerceEnv(process.env);
       const hostEnvResult =
@@ -1813,5 +1882,6 @@ export function createExecTool(
 export const execTool = createExecTool();

 export const __testing = {
+  parseOpenClawChannelsLoginShellCommand,
   validateScriptFileForShellBleed,
 };
51
src/gateway/server-reload-handlers.test.ts
Normal file
@@ -0,0 +1,51 @@
+import { afterEach, describe, expect, it, vi } from "vitest";
+import {
+  __testing as embeddedRunTesting,
+  clearActiveEmbeddedRun,
+  setActiveEmbeddedRun,
+  type EmbeddedPiQueueHandle,
+} from "../agents/pi-embedded-runner/runs.js";
+import { __testing } from "./server-reload-handlers.js";
+
+describe("gateway reload recovery handlers", () => {
+  afterEach(() => {
+    embeddedRunTesting.resetActiveEmbeddedRuns();
+  });
+
+  it("aborts active agent runs after last-known-good config recovery", () => {
+    const sessionId = "config-recovery-session";
+    const sessionKey = "agent:main:telegram:direct:123";
+    let handle!: EmbeddedPiQueueHandle;
+    handle = {
+      abort: vi.fn(() => {
+        clearActiveEmbeddedRun(sessionId, handle, sessionKey);
+      }),
+      isCompacting: () => false,
+      isStreaming: () => false,
+      queueMessage: async () => {},
+    };
+    const logReload = { info: vi.fn(), warn: vi.fn() };
+    setActiveEmbeddedRun(sessionId, handle, sessionKey);
+
+    __testing.abortActiveAgentRunsAfterConfigRecovery({
+      reason: "invalid-config",
+      logReload,
+    });
+
+    expect(handle.abort).toHaveBeenCalledOnce();
+    expect(logReload.warn).toHaveBeenCalledWith(
+      "config recovery aborted active agent run(s) after reload-invalid-config",
+    );
+  });
+
+  it("does not warn when config recovery has no active agent runs to abort", () => {
+    const logReload = { info: vi.fn(), warn: vi.fn() };
+
+    __testing.abortActiveAgentRunsAfterConfigRecovery({
+      reason: "invalid-config",
+      logReload,
+    });
+
+    expect(logReload.warn).not.toHaveBeenCalled();
+  });
+});
@@ -1,6 +1,7 @@
 import { resetModelCatalogCache } from "../agents/model-catalog.js";
 import { disposeAllSessionMcpRuntimes } from "../agents/pi-bundle-mcp-tools.js";
 import { getActiveEmbeddedRunCount } from "../agents/pi-embedded-runner/run-state.js";
+import { abortEmbeddedPiRun } from "../agents/pi-embedded-runner/runs.js";
 import { getTotalPendingReplies } from "../auto-reply/reply/dispatcher-registry.js";
 import type { CliDeps } from "../cli/deps.types.js";
 import { isRestartEnabled } from "../config/commands.flags.js";
@@ -59,6 +60,23 @@ const MCP_RUNTIME_RELOAD_DISPOSE_TIMEOUT_MS = 5_000;
 const CHANNEL_RELOAD_DEFERRAL_POLL_MS = 500;
 const CHANNEL_RELOAD_STILL_PENDING_WARN_MS = 30_000;

+function abortActiveAgentRunsAfterConfigRecovery(params: {
+  reason: string;
+  logReload: GatewayReloadLog;
+}) {
+  const aborted = abortEmbeddedPiRun(undefined, { mode: "all" });
+  if (!aborted) {
+    return;
+  }
+  params.logReload.warn(
+    `config recovery aborted active agent run(s) after reload-${params.reason}`,
+  );
+}
+
+export const __testing = {
+  abortActiveAgentRunsAfterConfigRecovery,
+};
+
 async function disposeMcpRuntimesWithTimeout(params: {
   dispose: () => Promise<void>;
   timeoutMs: number;
@@ -418,6 +436,7 @@ export function startManagedGatewayConfigReloader(params: ManagedGatewayConfigRe
       await params.recoverSnapshot({ snapshot, reason: `reload-${reason}` }),
     promoteSnapshot: async (snapshot, _reason) => await params.promoteSnapshot(snapshot),
     onRecovered: ({ reason, snapshot, recoveredSnapshot }) => {
+      abortActiveAgentRunsAfterConfigRecovery({ reason, logReload: params.logReload });
       enqueueConfigRecoveryNotice({
         cfg: recoveredSnapshot.config,
         phase: "reload",
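The recovery hook above aborts every active run and warns only when something was actually aborted. A self-contained sketch of that pattern follows; the `ActiveRuns` registry and `abortAfterRecovery` helper are hypothetical stand-ins, not the real `abortEmbeddedPiRun` API, whose internals this commit does not show.

```typescript
// Hypothetical registry standing in for the embedded-run bookkeeping.
interface RunHandle {
  abort(): void;
}

class ActiveRuns {
  private readonly runs = new Map<string, RunHandle>();

  set(id: string, handle: RunHandle): void {
    this.runs.set(id, handle);
  }

  clear(id: string): void {
    this.runs.delete(id);
  }

  // Aborts every registered run and reports whether any run was aborted.
  abortAll(): boolean {
    // Snapshot first: abort() may remove the run from the map while we iterate.
    const handles = Array.from(this.runs.values());
    for (const handle of handles) {
      handle.abort();
    }
    return handles.length > 0;
  }
}

function abortAfterRecovery(
  runs: ActiveRuns,
  reason: string,
  warn: (message: string) => void,
): void {
  if (!runs.abortAll()) {
    return; // Nothing was running, so stay quiet.
  }
  warn(`config recovery aborted active agent run(s) after reload-${reason}`);
}
```

Keeping the warning conditional means a recovery with no in-flight work leaves the reload log untouched, which is the behavior the second test case in the new test file checks.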
@@ -166,6 +166,24 @@ export type DiagnosticHeartbeatEvent = DiagnosticBaseEvent & {
   queued: number;
 };

+export type DiagnosticLivenessWarningReason = "event_loop_delay" | "event_loop_utilization" | "cpu";
+
+export type DiagnosticLivenessWarningEvent = DiagnosticBaseEvent & {
+  type: "diagnostic.liveness.warning";
+  reasons: DiagnosticLivenessWarningReason[];
+  intervalMs: number;
+  eventLoopDelayP99Ms?: number;
+  eventLoopDelayMaxMs?: number;
+  eventLoopUtilization?: number;
+  cpuUserMs?: number;
+  cpuSystemMs?: number;
+  cpuTotalMs?: number;
+  cpuCoreRatio?: number;
+  active: number;
+  waiting: number;
+  queued: number;
+};
+
 export type DiagnosticToolLoopEvent = DiagnosticBaseEvent & {
   type: "tool.loop";
   sessionKey?: string;
@@ -441,6 +459,7 @@ export type DiagnosticEventPayload =
   | DiagnosticLaneDequeueEvent
   | DiagnosticRunAttemptEvent
   | DiagnosticHeartbeatEvent
+  | DiagnosticLivenessWarningEvent
   | DiagnosticToolLoopEvent
   | DiagnosticToolExecutionStartedEvent
   | DiagnosticToolExecutionCompletedEvent
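The `cpuCoreRatio` field in the event type above is a plain ratio of CPU time to wall-clock time. A small worked sketch, assuming CPU deltas in microseconds (the unit Node's `process.cpuUsage()` reports) and a hypothetical helper name, shows the arithmetic; the commit's sampler additionally rounds the intermediate millisecond values, which this sketch skips.

```typescript
// Hypothetical helper; input deltas are in microseconds, like process.cpuUsage().
interface CpuDelta {
  user: number;
  system: number;
}

function cpuCoreRatio(delta: CpuDelta, intervalMs: number): number {
  const totalMs = (delta.user + delta.system) / 1_000; // microseconds to milliseconds
  const ratio = totalMs / Math.max(1, intervalMs); // guard against a zero-length interval
  return Math.round(ratio * 1_000) / 1_000; // keep three decimal places
}

// 29 s of user time plus 1 s of system time over a 30 s heartbeat is one full core.
const fullCore = cpuCoreRatio({ user: 29_000_000, system: 1_000_000 }, 30_000);
```

A ratio near 1 means the process kept roughly one core busy for the whole interval; values above 1 are possible when worker threads or native code use more than one core.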
@@ -45,6 +45,10 @@ export type DiagnosticStabilityEventRecord = {
   thresholdBytes?: number;
   rssGrowthBytes?: number;
   windowMs?: number;
+  eventLoopDelayP99Ms?: number;
+  eventLoopDelayMaxMs?: number;
+  eventLoopUtilization?: number;
+  cpuCoreRatio?: number;
   ageMs?: number;
   queueDepth?: number;
   queueSize?: number;
@@ -266,6 +270,19 @@ function sanitizeDiagnosticEvent(event: DiagnosticEventPayload): DiagnosticStabi
       record.waiting = event.waiting;
       record.queued = event.queued;
       break;
+    case "diagnostic.liveness.warning":
+      record.level = "warning";
+      record.durationMs = event.intervalMs;
+      record.count = event.reasons.length;
+      assignReasonCode(record, event.reasons[0]);
+      record.eventLoopDelayP99Ms = event.eventLoopDelayP99Ms;
+      record.eventLoopDelayMaxMs = event.eventLoopDelayMaxMs;
+      record.eventLoopUtilization = event.eventLoopUtilization;
+      record.cpuCoreRatio = event.cpuCoreRatio;
+      record.active = event.active;
+      record.waiting = event.waiting;
+      record.queued = event.queued;
+      break;
     case "tool.loop":
       record.toolName = event.toolName;
       record.level = event.level;
@@ -191,7 +191,7 @@ describe("stuck session diagnostics threshold", () => {
           enabled: true,
         },
       },
-      { emitMemorySample },
+      { emitMemorySample, sampleLiveness: () => null },
     );

     vi.advanceTimersByTime(30_000);
@@ -203,6 +203,93 @@ describe("stuck session diagnostics threshold", () => {
     expect(emitMemorySample).toHaveBeenLastCalledWith({ emitSample: true });
   });

+  it("emits idle liveness warnings into the stability recorder", () => {
+    const emitMemorySample = createEmitMemorySampleMock();
+    const events: string[] = [];
+    const unsubscribe = onDiagnosticEvent((event) => events.push(event.type));
+
+    try {
+      startDiagnosticHeartbeat(
+        {
+          diagnostics: {
+            enabled: true,
+          },
+        },
+        {
+          emitMemorySample,
+          sampleLiveness: () => ({
+            reasons: ["cpu"],
+            intervalMs: 30_000,
+            eventLoopDelayP99Ms: 12,
+            eventLoopDelayMaxMs: 22,
+            eventLoopUtilization: 0.99,
+            cpuUserMs: 29_000,
+            cpuSystemMs: 1_000,
+            cpuTotalMs: 30_000,
+            cpuCoreRatio: 1,
+          }),
+        },
+      );
+
+      vi.advanceTimersByTime(30_000);
+    } finally {
+      unsubscribe();
+    }
+
+    expect(events).toContain("diagnostic.liveness.warning");
+    expect(emitMemorySample).toHaveBeenLastCalledWith({ emitSample: true });
+    expect(getDiagnosticStabilitySnapshot({ limit: 10 }).events).toContainEqual(
+      expect.objectContaining({
+        type: "diagnostic.liveness.warning",
+        level: "warning",
+        reason: "cpu",
+        durationMs: 30_000,
+        count: 1,
+        eventLoopDelayP99Ms: 12,
+        eventLoopDelayMaxMs: 22,
+        eventLoopUtilization: 0.99,
+        cpuCoreRatio: 1,
+        active: 0,
+        waiting: 0,
+        queued: 0,
+      }),
+    );
+  });
+
+  it("throttles repeated liveness warnings", () => {
+    const events: string[] = [];
+    const unsubscribe = onDiagnosticEvent((event) => events.push(event.type));
+
+    try {
+      startDiagnosticHeartbeat(
+        {
+          diagnostics: {
+            enabled: true,
+          },
+        },
+        {
+          emitMemorySample: createEmitMemorySampleMock(),
+          sampleLiveness: () => ({
+            reasons: ["event_loop_delay"],
+            intervalMs: 30_000,
+            eventLoopDelayP99Ms: 1_500,
+            eventLoopDelayMaxMs: 2_000,
+          }),
+        },
+      );
+
+      vi.advanceTimersByTime(30_000);
+      vi.advanceTimersByTime(90_000);
+      expect(events.filter((event) => event === "diagnostic.liveness.warning")).toHaveLength(1);
+
+      vi.advanceTimersByTime(30_000);
+    } finally {
+      unsubscribe();
+    }
+
+    expect(events.filter((event) => event === "diagnostic.liveness.warning")).toHaveLength(2);
+  });
+
   it("does not start the heartbeat when diagnostics are disabled by config", () => {
     const emitMemorySample = createEmitMemorySampleMock();

@@ -1,9 +1,11 @@
+import { monitorEventLoopDelay, performance } from "node:perf_hooks";
 import { getRuntimeConfig } from "../config/config.js";
 import type { OpenClawConfig } from "../config/types.openclaw.js";
 import {
   areDiagnosticsEnabledForProcess,
   emitDiagnosticEvent,
   isDiagnosticsEnabled,
+  type DiagnosticLivenessWarningReason,
 } from "../infra/diagnostic-events.js";
 import { emitDiagnosticMemorySample, resetDiagnosticMemoryForTest } from "./diagnostic-memory.js";
 import {
@@ -44,11 +46,18 @@ const DEFAULT_STUCK_SESSION_WARN_MS = 120_000;
 const MIN_STUCK_SESSION_WARN_MS = 1_000;
 const MAX_STUCK_SESSION_WARN_MS = 24 * 60 * 60 * 1000;
 const RECENT_DIAGNOSTIC_ACTIVITY_MS = 120_000;
+const DEFAULT_LIVENESS_EVENT_LOOP_DELAY_WARN_MS = 1_000;
+const DEFAULT_LIVENESS_EVENT_LOOP_UTILIZATION_WARN = 0.95;
+const DEFAULT_LIVENESS_CPU_CORE_RATIO_WARN = 0.9;
+const DEFAULT_LIVENESS_WARN_COOLDOWN_MS = 120_000;
 let commandPollBackoffRuntimePromise: Promise<
   typeof import("../agents/command-poll-backoff.runtime.js")
 > | null = null;

 type EmitDiagnosticMemorySample = typeof emitDiagnosticMemorySample;
+type EventLoopDelayMonitor = ReturnType<typeof monitorEventLoopDelay>;
+type EventLoopUtilization = ReturnType<typeof performance.eventLoopUtilization>;
+type CpuUsage = ReturnType<typeof process.cpuUsage>;

 type DiagnosticWorkSnapshot = {
   activeCount: number;
@@ -56,6 +65,35 @@ type DiagnosticWorkSnapshot = {
   queuedCount: number;
 };

+type DiagnosticLivenessSample = {
+  reasons: DiagnosticLivenessWarningReason[];
+  intervalMs: number;
+  eventLoopDelayP99Ms?: number;
+  eventLoopDelayMaxMs?: number;
+  eventLoopUtilization?: number;
+  cpuUserMs?: number;
+  cpuSystemMs?: number;
+  cpuTotalMs?: number;
+  cpuCoreRatio?: number;
+};
+
+type SampleDiagnosticLiveness = (
+  now: number,
+  work: DiagnosticWorkSnapshot,
+) => DiagnosticLivenessSample | null;
+
+type StartDiagnosticHeartbeatOptions = {
+  getConfig?: () => OpenClawConfig;
+  emitMemorySample?: EmitDiagnosticMemorySample;
+  sampleLiveness?: SampleDiagnosticLiveness;
+};
+
+let diagnosticLivenessMonitor: EventLoopDelayMonitor | null = null;
+let lastDiagnosticLivenessWallAt = 0;
+let lastDiagnosticLivenessCpuUsage: CpuUsage | null = null;
+let lastDiagnosticLivenessEventLoopUtilization: EventLoopUtilization | null = null;
+let lastDiagnosticLivenessWarnAt = 0;
+
 function loadCommandPollBackoffRuntime() {
   commandPollBackoffRuntimePromise ??= import("../agents/command-poll-backoff.runtime.js");
   return commandPollBackoffRuntimePromise;
@@ -87,6 +125,159 @@ function hasRecentDiagnosticActivity(now: number): boolean {
   return lastActivityAt > 0 && now - lastActivityAt <= RECENT_DIAGNOSTIC_ACTIVITY_MS;
 }

+function roundDiagnosticMetric(value: number, digits = 3): number {
+  if (!Number.isFinite(value)) {
+    return 0;
+  }
+  const factor = 10 ** digits;
+  return Math.round(value * factor) / factor;
+}
+
+function nanosecondsToMilliseconds(value: number): number {
+  return roundDiagnosticMetric(value / 1_000_000, 1);
+}
+
+function formatOptionalDiagnosticMetric(value: number | undefined): string {
+  return value === undefined ? "unknown" : String(value);
+}
+
+function startDiagnosticLivenessSampler(): void {
+  lastDiagnosticLivenessWallAt = Date.now();
+  lastDiagnosticLivenessCpuUsage = process.cpuUsage();
+  lastDiagnosticLivenessEventLoopUtilization = performance.eventLoopUtilization();
+  lastDiagnosticLivenessWarnAt = 0;
+
+  if (diagnosticLivenessMonitor) {
+    diagnosticLivenessMonitor.reset();
+    return;
+  }
+
+  try {
+    diagnosticLivenessMonitor = monitorEventLoopDelay({ resolution: 20 });
+    diagnosticLivenessMonitor.enable();
+    diagnosticLivenessMonitor.reset();
+  } catch (err) {
+    diagnosticLivenessMonitor = null;
+    diag.debug(`diagnostic liveness monitor unavailable: ${String(err)}`);
+  }
+}
+
+function stopDiagnosticLivenessSampler(): void {
+  diagnosticLivenessMonitor?.disable();
+  diagnosticLivenessMonitor = null;
+  lastDiagnosticLivenessWallAt = 0;
+  lastDiagnosticLivenessCpuUsage = null;
+  lastDiagnosticLivenessEventLoopUtilization = null;
+  lastDiagnosticLivenessWarnAt = 0;
+}
+
+function sampleDiagnosticLiveness(now: number): DiagnosticLivenessSample | null {
+  if (
+    !diagnosticLivenessMonitor ||
+    !lastDiagnosticLivenessCpuUsage ||
+    !lastDiagnosticLivenessEventLoopUtilization ||
+    lastDiagnosticLivenessWallAt <= 0
+  ) {
+    startDiagnosticLivenessSampler();
+    return null;
+  }
+
+  const intervalMs = Math.max(1, now - lastDiagnosticLivenessWallAt);
+  const cpuUsage = process.cpuUsage(lastDiagnosticLivenessCpuUsage);
+  const currentEventLoopUtilization = performance.eventLoopUtilization();
+  const eventLoopUtilization = performance.eventLoopUtilization(
+    currentEventLoopUtilization,
+    lastDiagnosticLivenessEventLoopUtilization,
+  ).utilization;
+  const eventLoopDelayP99Ms = nanosecondsToMilliseconds(diagnosticLivenessMonitor.percentile(99));
+  const eventLoopDelayMaxMs = nanosecondsToMilliseconds(diagnosticLivenessMonitor.max);
+  diagnosticLivenessMonitor.reset();
+  lastDiagnosticLivenessWallAt = now;
+  lastDiagnosticLivenessCpuUsage = process.cpuUsage();
+  lastDiagnosticLivenessEventLoopUtilization = currentEventLoopUtilization;
+
+  const cpuUserMs = roundDiagnosticMetric(cpuUsage.user / 1_000, 1);
+  const cpuSystemMs = roundDiagnosticMetric(cpuUsage.system / 1_000, 1);
+  const cpuTotalMs = roundDiagnosticMetric(cpuUserMs + cpuSystemMs, 1);
+  const cpuCoreRatio = roundDiagnosticMetric(cpuTotalMs / intervalMs, 3);
+  const eventLoopUtilizationRatio = roundDiagnosticMetric(eventLoopUtilization, 3);
+  const reasons: DiagnosticLivenessWarningReason[] = [];
+
+  if (
+    eventLoopDelayP99Ms >= DEFAULT_LIVENESS_EVENT_LOOP_DELAY_WARN_MS ||
+    eventLoopDelayMaxMs >= DEFAULT_LIVENESS_EVENT_LOOP_DELAY_WARN_MS
+  ) {
+    reasons.push("event_loop_delay");
+  }
+  if (eventLoopUtilizationRatio >= DEFAULT_LIVENESS_EVENT_LOOP_UTILIZATION_WARN) {
+    reasons.push("event_loop_utilization");
+  }
+  if (cpuCoreRatio >= DEFAULT_LIVENESS_CPU_CORE_RATIO_WARN) {
+    reasons.push("cpu");
+  }
+  if (reasons.length === 0) {
+    return null;
+  }
+
+  return {
+    reasons,
+    intervalMs,
+    eventLoopDelayP99Ms,
+    eventLoopDelayMaxMs,
+    eventLoopUtilization: eventLoopUtilizationRatio,
+    cpuUserMs,
+    cpuSystemMs,
+    cpuTotalMs,
+    cpuCoreRatio,
+  };
+}
+
+function shouldEmitDiagnosticLivenessWarning(now: number): boolean {
+  if (
+    lastDiagnosticLivenessWarnAt > 0 &&
+    now - lastDiagnosticLivenessWarnAt < DEFAULT_LIVENESS_WARN_COOLDOWN_MS
+  ) {
+    return false;
+  }
+  lastDiagnosticLivenessWarnAt = now;
+  return true;
+}
+
+function emitDiagnosticLivenessWarning(
+  sample: DiagnosticLivenessSample,
+  work: DiagnosticWorkSnapshot,
+): void {
+  diag.warn(
+    `liveness warning: reasons=${sample.reasons.join(",")} interval=${Math.round(
+      sample.intervalMs / 1000,
+    )}s eventLoopDelayP99Ms=${formatOptionalDiagnosticMetric(
+      sample.eventLoopDelayP99Ms,
+    )} eventLoopDelayMaxMs=${formatOptionalDiagnosticMetric(
+      sample.eventLoopDelayMaxMs,
+    )} eventLoopUtilization=${formatOptionalDiagnosticMetric(
+      sample.eventLoopUtilization,
+    )} cpuCoreRatio=${formatOptionalDiagnosticMetric(sample.cpuCoreRatio)} active=${
+      work.activeCount
+    } waiting=${work.waitingCount} queued=${work.queuedCount}`,
+  );
+  emitDiagnosticEvent({
+    type: "diagnostic.liveness.warning",
+    reasons: sample.reasons,
+    intervalMs: sample.intervalMs,
+    eventLoopDelayP99Ms: sample.eventLoopDelayP99Ms,
+    eventLoopDelayMaxMs: sample.eventLoopDelayMaxMs,
+    eventLoopUtilization: sample.eventLoopUtilization,
+    cpuUserMs: sample.cpuUserMs,
+    cpuSystemMs: sample.cpuSystemMs,
+    cpuTotalMs: sample.cpuTotalMs,
+    cpuCoreRatio: sample.cpuCoreRatio,
+    active: work.activeCount,
+    waiting: work.waitingCount,
+    queued: work.queuedCount,
+  });
+  markActivity();
+}
+
 export function resolveStuckSessionWarnMs(config?: OpenClawConfig): number {
   const raw = config?.diagnostics?.stuckSessionWarnMs;
   if (typeof raw !== "number" || !Number.isFinite(raw)) {
@@ -393,7 +584,7 @@ let heartbeatInterval: NodeJS.Timeout | null = null;

 export function startDiagnosticHeartbeat(
   config?: OpenClawConfig,
-  opts?: { getConfig?: () => OpenClawConfig; emitMemorySample?: EmitDiagnosticMemorySample },
+  opts?: StartDiagnosticHeartbeatOptions,
 ) {
   if (!areDiagnosticsEnabledForProcess() || !isDiagnosticsEnabled(config)) {
     return;
@@ -403,6 +594,7 @@ export function startDiagnosticHeartbeat(
   if (heartbeatInterval) {
     return;
   }
+  startDiagnosticLivenessSampler();
   heartbeatInterval = setInterval(() => {
     let heartbeatConfig = config;
     if (!heartbeatConfig) {
@@ -416,8 +608,11 @@ export function startDiagnosticHeartbeat(
     const now = Date.now();
     pruneDiagnosticSessionStates(now, true);
     const work = getDiagnosticWorkSnapshot();
+    const livenessSample = (opts?.sampleLiveness ?? sampleDiagnosticLiveness)(now, work);
+    const shouldEmitLivenessWarning =
+      livenessSample !== null && shouldEmitDiagnosticLivenessWarning(now);
     const shouldRecordMemorySample =
-      hasRecentDiagnosticActivity(now) || hasOpenDiagnosticWork(work);
+      shouldEmitLivenessWarning || hasRecentDiagnosticActivity(now) || hasOpenDiagnosticWork(work);
     (opts?.emitMemorySample ?? emitDiagnosticMemorySample)({
       emitSample: shouldRecordMemorySample,
     });
@@ -426,6 +621,10 @@ export function startDiagnosticHeartbeat(
       return;
     }

+    if (shouldEmitLivenessWarning && livenessSample) {
+      emitDiagnosticLivenessWarning(livenessSample, work);
+    }
+
     diag.debug(
       `heartbeat: webhooks=${webhookStats.received}/${webhookStats.processed}/${webhookStats.errors} active=${work.activeCount} waiting=${work.waitingCount} queued=${work.queuedCount}`,
     );
@@ -471,6 +670,7 @@ export function stopDiagnosticHeartbeat() {
     clearInterval(heartbeatInterval);
     heartbeatInterval = null;
   }
+  stopDiagnosticLivenessSampler();
   stopDiagnosticStabilityRecorder();
   uninstallDiagnosticStabilityFatalHook();
 }
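The heartbeat changes above throttle repeated liveness warnings with a 120-second cooldown. The idea can be sketched as a small closure-based gate; `createCooldownGate` is a hypothetical name, and the sketch uses `null` instead of the commit's `0` sentinel for "never warned".

```typescript
// Hypothetical helper illustrating the cooldown idea, not the commit's function.
function createCooldownGate(cooldownMs: number): (now: number) => boolean {
  let lastAllowedAt: number | null = null;
  return (now: number): boolean => {
    if (lastAllowedAt !== null && now - lastAllowedAt < cooldownMs) {
      return false; // Still inside the cooldown window: suppress.
    }
    lastAllowedAt = now;
    return true;
  };
}

// Heartbeat ticks every 30 s while the process stays saturated.
const gate = createCooldownGate(120_000);
const ticks = [30_000, 60_000, 90_000, 120_000, 150_000];
const emittedAt = ticks.filter((tick) => gate(tick));
// emittedAt is [30_000, 150_000]: the first tick warns, the next three are suppressed.
```

This matches the shape of the "throttles repeated liveness warnings" test: one warning after the first 30 seconds, none during the following 90 seconds, and a second warning once a full 120 seconds have elapsed since the first.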